Proceedings of the 12th Conference on Language Resources and Evaluation (LREC 2020), pages 6812–6818, Marseille, 11–16 May 2020. © European Language Resources Association (ELRA), licensed under CC-BY-NC.




Predicting Item Survival for Multiple Choice Questions in a High-stakes Medical Exam

Victoria Yaneva¹, Le An Ha², Peter Baldwin¹, Janet Mee¹
¹ National Board of Medical Examiners, Philadelphia, USA
² Research Institute in Information and Language Processing, University of Wolverhampton, UK
{vyaneva, pbaldwin, jmee}@nbme.org; l.a.ha@wlv.ac.uk

Abstract

One of the most resource-intensive problems in the educational testing industry relates to ensuring that newly-developed exam questions can adequately distinguish between students of high and low ability. The current practice for obtaining this information is the costly procedure of pretesting: new items are administered to test-takers and then the items that are too easy or too difficult are discarded. This paper presents the first study towards automatic prediction of an item's probability to "survive" pretesting (item survival), focusing on human-produced MCQs for a medical exam. Survival is modeled through a number of linguistic features and embedding types, as well as features inspired by information retrieval. The approach shows promising first results for this challenging new application and for modeling the difficulty of expert-knowledge questions.

Keywords: Multiple Choice Questions, Difficulty Prediction, Educational Applications


1. Introduction

Large-scale testing relies on a pool of test questions, which must be replenished, updated, and expanded over time¹. Writing high-quality test questions is challenging, as they must satisfy certain quality standards before they can be used to score examinees. These standards are based on statistical criteria and ensure that: i) items are not too easy or too difficult for the intended examinee population, and ii) the probability of success on each item is positively related to overall examinee performance (Section 3). While the exact thresholds vary, most exam programs have such a requirement. Even when item writers are well-trained and adhere to industry best practices, it has generally not been possible to identify which items will satisfy the various statistical criteria without first obtaining examinee responses through pretesting. Pretesting involves embedding new items within a standard live exam and, based on the collected responses, making a determination about whether or not a given item satisfies conditions i) and ii). Items that meet the criteria are considered to have "survived" pretesting and can later be used to score examinees. The proportion of surviving items varies across programs; however, Brennan (2006) recommends pretesting at least twice the number of items needed.

While necessary, the enterprise of pretesting is costly. Scored items compete with pretest items for exam space, the scarcity of which can create a bottleneck. As a result, it is sometimes not possible to pretest as many new items as needed, and some exam programs may not be able to afford pretesting at all. This problem is expected to grow with advances in automatic question generation (Gierl et al., 2018), where large numbers of new questions are generated but there are no criteria for evaluating their suitability for live use. Conceivably, having advance knowledge of an item's probability to survive would allow the available pretesting slots to be used for items that are more likely to pass the thresholds. To address these issues, we present a method for modeling item survival within a large-scale, real-world data set of multiple choice questions (MCQs) for a high-stakes medical exam.

Contributions: i) The paper introduces a new practical application area of NLP related to predicting item survival for improving high-stakes exams. ii) The developed models outperform three baselines with a statistically significant difference, including a strong baseline of 113 linguistic features. iii) Owing to the generic nature of the features, the presented approach is generalizable to other MCQ-based exams. iv) We make our code available² at: https://bit.ly/2EaTFNN.

¹ This constant need for new test questions arises as the population of test-takers grows, new topics for exam content are identified, item exposure threatens exam security, etc.
² The questions cannot be released because of test security.

2. Related Work

Predicting item survival from item text is a new application area for NLP and, to the best of our knowledge, there is no prior work investigating this specific issue. The problem is, however, related to the limited available research on predicting question difficulty, with the important difference that predicting survival involves predicting an additional item parameter that captures the relation between the probability of success on the individual item and overall examinee performance (Section 3).

With regard to estimating question difficulty for humans, the majority of studies focus on applying readability metrics to language comprehension tests, where the comprehension questions refer to a given piece of text and, therefore, there is a relationship between the difficulty of the two (Huang et al., 2017; Loukina et al., 2016). For example, Loukina et al. (2016) investigate the extent to which the difficulty of listening items in an English language proficiency test can be predicted from the textual properties of the prompt, using text complexity features (e.g., syntactic complexity, cohesion, academic vocabulary). In another study, Beinborn et al. (2015) rank the suitability and complexity of individual words as candidates for a fill-in-the-blanks test, and this ranking is used to estimate the difficulty of the particular example.
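As an illustration of what such text complexity features can look like in practice, the sketch below computes a handful of simple surface features for an item text. These are illustrative stand-ins only, not the exact feature set used in the cited studies or in the 113-feature baseline described in Section 4.

```python
import re

def surface_complexity_features(text: str) -> dict:
    """Compute a few illustrative surface-level complexity features.

    Simple proxies for the kinds of text complexity signals used in
    difficulty-prediction work; real feature sets are much richer
    (syntactic parses, cohesion measures, frequency norms, etc.).
    """
    # Crude sentence split on terminal punctuation.
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    words = re.findall(r"[A-Za-z']+", text.lower())
    if not words:
        return {}
    return {
        "n_words": len(words),
        "n_sentences": len(sentences),
        "avg_sentence_length": len(words) / max(len(sentences), 1),
        "avg_word_length": sum(len(w) for w in words) / len(words),
        "type_token_ratio": len(set(words)) / len(words),
    }

# Example: features for a short stem like the one in Table 1.
stem = ("A 55-year-old woman with small cell carcinoma of the lung "
        "is admitted to the hospital to undergo chemotherapy.")
print(surface_complexity_features(stem))
```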

A 55-year-old woman with small cell carcinoma of the lung is admitted to the hospital to undergo chemotherapy. Six days after treatment is started, she develops a temperature of 38°C (100.4°F). Physical examination shows no other abnormalities. Laboratory studies show a leukocyte count of 100/mm³ (5% segmented neutrophils and 95% lymphocytes).

Which of the following is the most appropriate pharmacotherapy to increase this patient's leukocyte count?

(A) Darbepoetin
(B) Dexamethasone
(C) Filgrastim
(D) Interferon alfa
(E) Interleukin-2 (IL-2)
(F) Leucovorin

Table 1: An example of a practice item



A slightly different approach to predicting test difficulty is presented in Padó (2017), where each question is manually annotated and labelled with the cognitive activities and knowledge necessary to answer it based on Bloom's Taxonomy of Educational Objectives (Bloom and others, 1956). The results indicate that questions that are low in Bloom's hierarchy of skills are easier to answer than ones high in the hierarchy. Nadeem and Ostendorf (2017) approach the same problem from the opposite direction: they aim to predict the skills required to solve assessment questions using a convolutional neural network (CNN). The ultimate goal of their experiments is to use annotated data with labels of such skills in order to automatically populate a Q-matrix of skills used in education to determine how questions should be graded (e.g., more points should be awarded for solving questions that require more skill).

Alsubait et al. (2013) show that the difficulty of newly generated questions can be manipulated by changing the similarity between item components, e.g., the distractors and the correct answer, the question and the distractors, the question and the correct answer, etc. This assumption is later used by Ha and Yaneva (2018) in automatic distractor generation for multiple choice questions, where the system can rank distractors based on various similarity metrics.

In our prior work we predict MCQ difficulty and mean response times using a large number of linguistic features, in addition to embeddings (Ha et al., 2019; Baldwin et al., 2020). The results presented in Ha et al. (2019) show that the proposed approach predicts the difficulty of the questions with a statistically significant improvement over several baselines. As will be seen in Section 4, we use the full list of linguistic features to obtain a strong baseline prediction for item survival. More details on the individual features and their explanations can be found in Section 4.

3. Data

The data comprises 5,918 pretested MCQs from the Clinical Knowledge component of the United States Medical Licensing Examination (USMLE®). An example of a test item is shown in Table 1. The part describing the case is referred to as the stem, and the incorrect answer options are known as distractors. All items tested medical knowledge and were written by experienced item-writers following a set of guidelines stipulating adherence to a standard structure. These guidelines required avoidance of "window dressing" (extraneous material not needed to answer the item), "red herrings" (information designed to mislead the test-taker), and grammatical cues (e.g., correct answers that are longer or more specific than the other options). Item writers had to ensure that the produced items did not have flaws related to various aspects of validity. For example, flaws related to irrelevant difficulty include: stems or options that are overly long or complicated; numeric data not stated consistently; and language or structure of the options that is not homogeneous. Flaws related to "testwiseness" include: grammatical cues; a correct answer that is longer, more specific, or more complete than the other options; and a word or phrase that is included both in the stem and in the correct answer. The goal of standardizing items in this manner is to produce items that vary in their difficulty and discriminating power due only to differences in the medical content they assess.

The items were administered within a standard nine-hour exam, and test-takers had no way of knowing that they would not be scored on these items. Each nine-hour exam contained approximately 40 pretest items, and the data was collected by embedding the items in different live exam forms for four consecutive years (2012-2015). On average, each item was answered by 328 examinees (SD = 67.17). Examinees were medical students from accredited³ US and Canadian medical schools taking the exam for the first time as part of a multistep examination sequence required for medical licensure in the US.

To survive, items had to satisfy two criteria:

• A proportion of correct answers between .30 and .95, i.e., the item had to be answered correctly by no fewer than 30% and no more than 95% of test-takers. Within the educational-testing literature, this proportion of correct answers is commonly referred to as a P-value. We adopt this convention here, but care should be taken not to confuse this usage with a p-value indicating statistical significance. The P-value is calculated in the following way:

$P_i = \frac{\sum_{n=1}^{N} U_n}{N}$,

where $U_n$ is 1 if examinee $n$ answered item $i$ correctly and 0 otherwise, and $N$ is the total number of examinees who responded to the item.

³ Accredited by the Liaison Committee on Medical Education (LCME).
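In code, the P-value computation is simply the mean of the binary response outcomes. A minimal sketch (variable and function names are illustrative):

```python
from typing import Sequence

def p_value(responses: Sequence[int]) -> float:
    """Proportion-correct P-value for a single item.

    `responses` holds U_n for each examinee n: 1 if examinee n
    answered the item correctly, 0 otherwise.
    Returns P_i = (sum over n of U_n) / N.
    """
    if not responses:
        raise ValueError("need at least one response")
    return sum(responses) / len(responses)

# Example: 7 of 10 examinees answer correctly, so P_i = 0.7,
# which falls inside the survival band [.30, .95].
u = [1, 1, 0, 1, 0, 1, 1, 1, 0, 1]
p = p_value(u)
print(p, 0.30 <= p <= 0.95)
```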


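The excerpt is truncated before the second survival criterion is stated, but Section 1 describes it as requiring that the probability of success on an item be positively related to overall examinee performance. One common way to operationalize such a requirement is an item-total (point-biserial) correlation; the sketch below uses that statistic as an assumed stand-in, not necessarily the exact criterion applied to this exam.

```python
import statistics
from typing import Sequence

def point_biserial(item_scores: Sequence[int],
                   total_scores: Sequence[float]) -> float:
    """Pearson correlation between binary item scores (0/1) and overall
    test scores; positive values mean high scorers tend to answer the
    item correctly, i.e., the item discriminates by ability."""
    n = len(item_scores)
    mean_i = sum(item_scores) / n
    mean_t = sum(total_scores) / n
    cov = sum((i - mean_i) * (t - mean_t)
              for i, t in zip(item_scores, total_scores)) / (n - 1)
    return cov / (statistics.stdev(item_scores) * statistics.stdev(total_scores))

def survives(item_scores: Sequence[int],
             total_scores: Sequence[float]) -> bool:
    # Criterion i): P-value within [.30, .95].
    # Criterion ii): assumed here to be a positive item-total correlation.
    p = sum(item_scores) / len(item_scores)
    return 0.30 <= p <= 0.95 and point_biserial(item_scores, total_scores) > 0

# Example: an item answered correctly mostly by high overall scorers.
items = [1, 1, 0, 1, 0, 1]
totals = [82.0, 75.0, 54.0, 90.0, 60.0, 70.0]
print(survives(items, totals))
```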
