Automatic scoring for answers to Arabic test questions
Wael Hassan Gomaa a, Aly Aly Fahmy b
a Modern Academy for Computer Science & Management Technology, Computer Science Department, 304 St., New Maadi, Saqr Qrysh, Postal Code 11913, Cairo, Egypt
b Faculty of Computers and Information, Cairo University, 5 Ahmed Zoweil St., Dokki, Postal Code 12613, Giza, Egypt

Received 1 March 2013; received in revised form 19 September 2013; accepted 18 October 2013




Abstract
Most research on the automatic assessment of free-text answers written by students addresses the English language. This paper handles the assessment task in the Arabic language, focusing on applying multiple similarity measures separately and in combination. Several techniques are introduced that depend on translation to overcome the lack of text processing resources in Arabic, such as extracting model answers automatically from an already built database and applying K-means clustering to scale the obtained similarity values. Additionally, this research presents the first benchmark Arabic data set, which contains 610 students' short answers together with their English translations.
© 2013 Elsevier Ltd. All rights reserved.

Keywords: Short answer scoring; Text similarity; Semantic similarity; Arabic corpus




1. Introduction

The rapidly growing educational community, both electronic and traditional, with its enormous number of tests, has created a need for automatic scoring systems. Automatic Scoring (AS) systems evaluate a student's answer by comparing it to one or more model answers. AS technology handles different types of student responses, such as writing, speaking and mathematics. Writing assessment comes in two forms: Automatic Essay Scoring (AES) and Short-Answer Scoring. Speaking assessment includes low- and high-entropy spoken responses, while mathematical assessment includes textual, numeric or graphical responses. AS systems are easily implemented for certain types of questions, such as Multiple Choice, True–False, Matching and Fill-in-the-Blank. Implementing an automatic scoring system for questions that require free-text answers is more difficult because students' answers demand complicated text understanding and analysis. In this research, short-answer scoring is handled through an approach that treats students' answers holistically and depends on text similarity measures (Mohler and Mihalcea, 2009; Mohler et al., 2011). Three types of text similarity measures are handled: String similarity, Corpus-based similarity and Knowledge-based similarity. String similarity measures operate on string sequences and character composition.
Corpus-based similarity depends on information derived from large corpora, while Knowledge-based similarity uses semantic networks (Mihalcea et al., 2006; Budanitsky and Hirst, 2001; Gomaa and Fahmy, 2013).
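
To make the distinction concrete, the following toy sketch (ours, not the authors' code) computes a string-level similarity between a student answer and a model answer using Python's standard difflib; corpus-based and knowledge-based measures would additionally require a large corpus or a semantic network such as WordNet, and are therefore omitted here.

```python
# Illustrative sketch only (not the paper's implementation): a string-level
# similarity between a student answer and a model answer. Corpus-based and
# knowledge-based measures would additionally need a corpus or a semantic
# network (e.g., WordNet) and are omitted from this sketch.
from difflib import SequenceMatcher

def string_similarity(student: str, model: str) -> float:
    """Character-sequence similarity in [0, 1] based on difflib's ratio."""
    return SequenceMatcher(None, student.lower(), model.lower()).ratio()

if __name__ == "__main__":
    model_answer = "A stack is a last-in first-out data structure."
    student_answer = "A stack stores elements in last in, first out order."
    print(round(string_similarity(student_answer, model_answer), 3))
```
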
This research presents a system for short-answer scoring in the Arabic language. Arabic is a widespread language that is spoken by approximately 300 million people around the world. From a natural language processing point of view, Arabic is characterized by high ambiguity, rich morphology, complex morpho-syntactic agreement rules and a large number of irregular forms (Habash, 2010). Our system focuses mainly on measuring the similarity between the student and the model answers using a bag of words (BOW) model, while disregarding complex Arabic computational linguistics tasks.
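
The following minimal sketch shows the kind of bag-of-words comparison the system relies on: both answers are reduced to unordered word counts and compared with cosine similarity. It is an illustration under our own simplifying assumptions (whitespace tokenisation, no stemming or Arabic-specific preprocessing), not the authors' code.

```python
# Minimal bag-of-words (BOW) sketch: answers become unordered word counts and
# are compared with cosine similarity. Whitespace tokenisation and the absence
# of any language-specific preprocessing are simplifying assumptions.
import math
from collections import Counter

def bow_cosine(student: str, model: str) -> float:
    a, b = Counter(student.lower().split()), Counter(model.lower().split())
    dot = sum(a[w] * b[w] for w in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

print(round(bow_cosine("the cpu executes the instructions",
                       "instructions are executed by the cpu"), 3))
```
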
The system translates students' responses into English to overcome the lack of text processing resources for the Arabic language. Although machine translation is sub-optimal, it is still helpful for the scoring task, as the experiments in the following sections show. Different methods of scaling the similarity values into the same range as the manual scores are presented and tested. Multiple text similarity measures were combined using supervised and unsupervised methods; this combination improved the obtained results.
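
As one plausible illustration of the scaling step (the abstract mentions K-means clustering for this purpose), the sketch below clusters raw similarity values and maps each cluster, ordered by its centre, onto a grade. The number of grades and the toy data are assumptions made purely for this example, not the authors' exact procedure.

```python
# Hedged sketch of K-means-based scaling: cluster raw similarity values and map
# each cluster, ordered by its centre, to a grade level. The grade range (0-5)
# and the toy similarity values are assumptions for illustration only.
import numpy as np
from sklearn.cluster import KMeans

def kmeans_scale(similarities, n_grades=6, random_state=0):
    sims = np.asarray(similarities, dtype=float).reshape(-1, 1)
    km = KMeans(n_clusters=n_grades, n_init=10, random_state=random_state).fit(sims)
    order = np.argsort(km.cluster_centers_.ravel())          # lowest centre first
    grade_of_cluster = {int(c): grade for grade, c in enumerate(order)}
    return [grade_of_cluster[int(label)] for label in km.labels_]

raw = [0.12, 0.20, 0.35, 0.40, 0.50, 0.55, 0.60, 0.74, 0.80, 0.88, 0.91, 0.95]
print(kmeans_scale(raw))
```
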
Additionally, the system includes a module that searches for a model answer in an already built database that is aligned with the curriculum.
This paper is organized as follows: Section 2 presents related work on automatic short-answer scoring systems.
Section 3 introduces the three main categories of Similarity Algorithms used in this research. Section 4 presents the
first Arabic data set to be used for benchmarking short-answer scoring systems. In Section 5, the proposed system is
illustrated with a walk-through example. Section 6 shows the experiment results, and finally, Section 7 presents the
conclusions of the research.

2. Related work

A substantial amount of work has recently been performed on short-answer grading in SemEval-2013 task #7: The Joint Student Response Analysis and 8th Recognizing Textual Entailment Challenge (Dzikovska et al., 2013). This task offered three problems: a 5-way task with five different answer judgments, and 3-way and 2-way tasks, which conflate more judgment categories each time. Two corpora, Beetle and SciEntsBank, were labeled with the following five labels: Correct, Partially correct incomplete, Contradictory, Irrelevant and Non Domain, as described in Dzikovska et al. (2012). The ETS system (Heilman and Madnani, 2013) presented a short-answer grading approach that uses stacking (Wolpert, 1992) and domain adaptation (Daumé and Marcu, 2007) to support the integration of various types of task-specific and general features. The full system included many features, such as baseline, intercept, word-based n-gram, character-based n-gram and text similarity features. The evaluation results indicate that the system achieves relatively high agreement with human scores compared to the other systems submitted to the shared task.
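
As a toy illustration of one feature family named above, the sketch below computes a character n-gram overlap between a student answer and a reference answer; the real ETS system combines many more features, and this single function is our own simplification.

```python
# Toy sketch of a character n-gram overlap feature (Jaccard over character
# 4-grams). The ETS system combines many such features; this function is a
# simplification written for illustration only.
def char_ngrams(text: str, n: int = 4) -> set:
    text = text.lower()
    return {text[i:i + n] for i in range(max(len(text) - n + 1, 0))}

def ngram_overlap(student: str, reference: str, n: int = 4) -> float:
    s, r = char_ngrams(student, n), char_ngrams(reference, n)
    return len(s & r) / len(s | r) if s | r else 0.0

print(round(ngram_overlap("photosynthesis produces oxygen",
                          "oxygen is produced during photosynthesis"), 3))
```
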
The SOFTCARDINALITY (Jimenez et al., 2013) system utilized text overlap based on soft cardinality (Jimenez et al., 2010) plus a machine learning classifier. Soft cardinality is a general model for object comparison that has been tested on text applications. The system performed well, especially on the "unseen domain" instances, which formed the more challenging test set. It also obtained 1st place in the 2-way task and 2nd place in the 3-way and 5-way tasks, considering the overall accuracy across all of the data sets and test sets. The CNGL (Biçici and van Genabith, 2013) system was based on referential translation machines (RTMs), a computational model for identifying translation acts between any two data sets with respect to a reference corpus selected from the same domain, which can be used for automatically grading student answers. RTMs provide a clean and intuitive computational model for grading student answers automatically by measuring the acts of translation involved, and CNGL was found to be the 2nd-best system on some tasks of the Student Response Analysis challenge. EHU-ALM (Aldabe et al., 2013) is a 5-way supervised system based on syntactic-semantic similarity features. The model deploys text overlap measures, WordNet-based lexical similarities, graph-based similarities, corpus-based similarities, syntactic structure overlap and predicate-argument overlap measures. The results showed that the system is above the median and the mean on all of the evaluation scenarios of the task. The UKP-BIU (Zesch et al., 2013) system was based on training a supervised model (Naive Bayes) using Weka (Hall et al., 2009), with feature extraction based on ClearTK (Ogren et al., 2008). The features used were BOW, syntactic, basic similarity, semantic similarity, spelling and entailment features. The UKP-BIU results showed that the Correct category was classified quite reliably but that the Irrelevant category was especially hard. The LIMSIILES (Gleize and Grau, 2013) system modeled the task as a
paraphrase identification problem based on substitution with Basic English variants. Basic English paraphrases were acquired from the Simple English Wiktionary. Substitutions are applied to both the model and student answers to reduce the diversity of their vocabulary and map them to a common vocabulary. The evaluation showed promising results, and this work is a first step toward an open-domain system that would be able to exhibit deep text understanding capabilities.
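
To give a rough sense of the soft cardinality model used by SOFTCARDINALITY, the sketch below "softens" a token set's cardinality so that similar tokens count less than one each, and then plugs the soft cardinalities into a Jaccard-style coefficient. The trigram word similarity, the exponent p and the tokenisation are our own assumptions rather than that system's exact choices.

```python
# Hedged sketch of the soft cardinality idea: a set's cardinality is "softened"
# so that similar elements count less than one each. The character-trigram word
# similarity, the exponent p and whitespace tokenisation are assumptions.
def trigrams(word):
    word = f"  {word} "          # padding so short words still yield trigrams
    return {word[i:i + 3] for i in range(len(word) - 2)}

def word_sim(a, b):
    ta, tb = trigrams(a), trigrams(b)
    return len(ta & tb) / len(ta | tb) if ta | tb else 0.0

def soft_cardinality(tokens, p=2.0):
    return sum(1.0 / sum(word_sim(t, u) ** p for u in tokens) for t in tokens)

def soft_jaccard(text_a, text_b, p=2.0):
    a, b = text_a.lower().split(), text_b.lower().split()
    card_a, card_b = soft_cardinality(a, p), soft_cardinality(b, p)
    card_union = soft_cardinality(a + b, p)
    card_inter = card_a + card_b - card_union
    return card_inter / card_union if card_union else 0.0

print(round(soft_jaccard("plants produce oxygen", "oxygen produced by plants"), 3))
```
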
In addition to the systems described in SemEval-2013 task #7, a more detailed overview of related work, covering systems such as CarmelTC, C-Rater, Intelligent Assessment Technologies (IAT), Oxford-UCLES and Texas, can be found in Ziai et al. (2012). CarmelTC (Rosé et al., 2003) is a Virtual Learning Environment system that was developed at the University of Pittsburgh and has been used in the tutorial dialog system Why2-Atlas (VanLehn et al., 2002). The system can assign scores to students' answers and detect which correct features are present in the student essays. It combines machine learning classification methods using features extracted from both Carmel's linguistic analysis of the text and the Rainbow Naive Bayes classifier. The student's answer is first broken into a set of sentences that are passed to a Bayesian network to extract the correct features that represent each sentence. These features are used to generate a vector that indicates the presence or absence of each correct feature. Finally, the ID3 tree learning algorithm is applied to the feature vectors to create the rules for identifying the sentence classes. The system was tested with 126 physics essays, and the results were 90% precision, 80% recall and an 8% false alarm rate.
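
The final classification step described for CarmelTC can be pictured with the hedged sketch below: binary vectors marking which "correct features" a sentence contains are fed to a decision tree that induces classification rules. scikit-learn's CART tree stands in for ID3 here, and the tiny data set is invented purely for illustration.

```python
# Hedged sketch of a CarmelTC-style final step: binary presence/absence vectors
# of "correct features" are fed to a decision tree that learns rules for the
# sentence classes. scikit-learn's CART tree stands in for ID3, and the feature
# vectors and labels below are invented for illustration only.
from sklearn.tree import DecisionTreeClassifier

# Each row marks the presence (1) or absence (0) of three hypothetical features.
X = [[1, 1, 1], [1, 1, 0], [1, 0, 0], [0, 0, 0], [0, 1, 0], [1, 0, 1]]
y = ["full", "partial", "partial", "wrong", "wrong", "partial"]

clf = DecisionTreeClassifier(random_state=0).fit(X, y)
print(clf.predict([[1, 1, 1], [0, 0, 1]]))
```
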
C-Rater was released by the Educational Testing Service (ETS) (Leacock and Chodorow, 2003). It used gold-standard model patterns to score student answers according to their syntactic structure. These patterns are built semi-automatically by converting each answer into a set of one or more predicate-argument tuples. C-Rater reported an accuracy of between 81% and 90% when used by the National Assessment of Educational Progress agency. More recent work on C-Rater (Sukkarieh and Blackmore, 2009; Sukkarieh and Stoyanchev, 2009) treats the grading task more like a textual entailment task. It analyzed 100–150 graded student answers to create a set of concepts, each of which is represented by a set of sentences supplemented by a lexicon. Scoring is based on the presence or absence of these concepts. In a further development of C-Rater, the student answers are parsed to extract a predicate-argument structure that is then categorized as present, absent or negated for each concept, using a maximum-entropy-based matching algorithm. The reported agreement (per concept match) was 84.8%, compared to an annotator agreement of 90.3%.
Intelligent Assessment Technologies (IAT) is a scoring system presented in Mitchell et al. (2002). It depends on information extraction templates that are manually created with a special-purpose authoring tool that explores a sample of students' responses. A student's answer is then compared to the templates that correspond to the question. The system was applied to a progress test taken by medical students, in which 800 students went through 270 test items. According to the authors, the system reached 99.4% accuracy on the full data set after manual adjustment of the templates via the moderation process, and they reported an error of "between 5 and 5.5%" in inter-grader agreement.
Oxford-UCLES (Pulman and Sukkarieh, 2005) is an information-extraction short-answer scoring system developed at Oxford University to meet the needs of the University of Cambridge Local Examinations Syndicate (UCLES). The system relies on pattern matching for scoring. A human expert devises information extraction patterns, and each set of patterns is associated with a corresponding question. This set is then split into equivalence classes, where the members of the same equivalence class convey the same message and/or information. The scoring algorithm matches the student answers against the equivalence classes and assigns scores according to the number of matches. The latest version of the Oxford-UCLES system was evaluated using approximately 260 answers for each of nine questions taken from a UCLES GCSE biology exam. The full mark for these questions ranged from 1 to 4. Two hundred marked answers were used as the training set to extract the patterns, while 60 unmarked answers were kept for the testing phase. The average percentage agreement between the system's grade and the human expert's grade was 84%.
A Text-to-Text (Texas) system was introduced in Mohler and Mihalcea (2009). Here, the score is assigned according to the semantic similarity between a student answer and a model answer, computed with several measures, including knowledge-based and corpus-based ones. The system was applied to a computer science data set that contains 21 questions and 610 student responses, where the best Pearson correlation between the automatic and manual scores was r = 0.47. An enhanced version of the Texas system, introduced in Mohler et al. (2011), used dependency graph alignments generated by machine learning. The data set used in this version contained 80
