Examining the validity of an analytic rating scale for a Spanish test
for academic purposes using the argument-based approach to
validation
Arturo Mendoza a,⁎, Ute Knoch b

a Department of Applied Linguistics, School of Languages, Linguistics and Translation, Universidad Nacional Autónoma de México, Circuito interior s/n, CP 04510, Mexico City, Mexico
b Director, Language Testing Research Centre, University of Melbourne, Parkville 3010, Victoria, Australia

⁎ Corresponding author at: Department of Applied Linguistics, School of Languages, Linguistics and Translation, Universidad Nacional Autónoma de México, Circuito interior s/n, CP 04510, Mexico City, Mexico.
E-mail addresses: a.mendoza@enallt.unam.mx (A. Mendoza), uknoch@unimelb.edu.au (U. Knoch).
ARTICLE INFO

Keywords: Analytic rating scales; Writing assessment for academic purposes; Argument-based approach to validation; Many-facet Rasch measurement

ABSTRACT

Rating scales are used to assess the performance of examinees presented with open-ended tasks. Drawing on an argument-based approach to validation, this study reports on the development of an analytic rating scale designed for a Spanish test for academic purposes. The study is one of the first that sets out the detailed scale development and validation activities for a rating scale for Spanish as a second language. The rating scale was grounded in a communicative competence model and developed and validated over two phases. The first version was trialed by five raters, and its quality was analyzed by means of many-facet Rasch measurement. Based on the raters' experience and on the statistical results, the rating scale was modified and a second version was trialed by six raters. After the rating process, raters were sent an online questionnaire in order to collect their opinions and perceptions of the rating scale, the training and the feedback provided during the rating process. The results suggest the rating scale was of good quality and raters' comments were generally positive, although they mentioned that more samples and training were needed. The study has implications for rating scale development and validation for languages other than English.
1. Introduction
Students wishing to study at a university where the medium of instruction is different from their mother tongue are often required to prove their proficiency by taking a language test for academic purposes. These tests are considered high-stakes because their results are used to make decisions with important consequences for students' lives (Bachman & Palmer, 2010; Kane, 2013). To guarantee that scores are fair, language tests must be carefully scrutinized and validated to ensure that the scores, and the interpretations based on them, are valid and fair. When examinees are presented with open-ended writing tasks, the scripts they produce are usually assessed by trained raters who use a rating scale to assign a score to the examinee's performance. Rating such performances is a complex undertaking. A score on such a writing test is not always purely a reflection of the writer's performance, but the outcome of the interaction between the rater, the rating scale and the script (Crusan, 2014; McNamara, 1996; Weigle, 2002). This interaction can introduce undesired sources of variability that threaten the reliability of the exam and its results (East, 2009). This is why rater training and monitoring are essential (Knoch, 2009, 2011; Weigle, 2002), as is studying the quality of the scoring process and
the scores (Montee & Malone, 2014).
While publications on rating processes, including scale development and rater functioning, are abundant in the assessment of English as a second or foreign language, very little has been written about similar endeavors in the assessment of other languages, for example Spanish (but see, e.g., Ducasse & Hill, 2015, for the development of a rating scale to assess the writing of Spanish-speaking graduate students). In this paper, we describe the development and validation of a rating scale for Spanish for academic purposes. This study is important as it sets out in detail the kind of procedures that other researchers involved in developing rating scales for languages other than English may want to follow or adapt. In particular, we argue that rating scales for languages other than English cannot simply be adapted from scales developed for English in such contexts, as there are clear differences in the languages and in the way second language ability develops. In the literature review that follows, we describe the existing literature on rating scale development and validation, the assessment of language for academic purposes more generally, and the assessment of Spanish for academic purposes. We then describe the context of the study and the current project in more detail.
1.1. Rating scale development and validation
The development and validation of rating scales for academic writing is no simple undertaking. Scales should be conceived and designed with the purpose of the assessment in mind (Crusan, 2014; Fulcher, 2010; Knoch, 2009; Montee & Malone, 2014; Weigle, 2002) and should be a good representation of the construct of the assessment (McNamara, 2002). In the Anglophone context, rating scales are often adapted or adopted from existing scales (Becker, 2011). For instance, in an academic setting, rating scales might be derived from those used in large-scale language tests for academic purposes. However, East (2009) cautions against the perils of adapting rating scales from existing similar ones, especially across languages, arguing that rating scales should take the target language into account.
Rating scale developers have a number of decisions to make in the development process, all of which have been described in detail in the literature. The type of rating scale selected (e.g., holistic, analytic, checklist) needs to closely reflect the purpose of the test (Crusan, 2014; Hamp-Lyons, 1991; Montee & Malone, 2014; Weigle, 2002) and the outcome reported to users (Knoch, 2009). The criteria in a scale are usually a reflection of the test construct and can either be based on a theory of language learning or development, or on a careful empirical analysis of written data produced by students (Fulcher, 2010; Knoch, 2009, 2011; Montee & Malone, 2014). Scale designers also need to ensure that a scale is not so context-dependent that it cannot be generalized to other testing contexts (Fulcher, 2010). Further decisions involve the number of band levels included in a scale (see, e.g., Alderson, Clapham, & Wall, 1995; Attali, Lewis, & Steier, 2012).
Rating scale validation is often not clearly articulated in scale development reports, which makes it difficult to conduct comparisons between studies or replication research. Scale validation projects are also rarely framed within a theoretical model of scale validation or of validation in assessment. A brief review of recently published scale development and validation studies in language assessment shows that very few of these studies were grounded within a theoretical model of validation (but see Deygers & Van Gorp, 2015; Janssen, Meier, & Trace, 2015; Knoch, 2009; Lallmamode, Daud, & Kassim, 2016; Youn, 2015). In a recent paper integrating rating processes into an argument-based framework for validation, Knoch and Chapelle (2017) put forward a range of warrants, assumptions and possible sources of backing, many of which are directly relevant to the validation of rating scales. Drawing on Kane's (2001, 2006, 2013) conceptualization of inferences, warrants and assumptions, they were able to show that rating processes are not only located within the evaluation inference, as commonly conceptualized, but have relevance throughout most inferences described in validation work. The warrants and assumptions relating to rating scales thus focus not only narrowly on the scoring inference (as previously conceptualized), but show that rating scales relate more broadly to all inferences in an argument-based approach to validation, including the explanation inference (which examines the theoretical construct underlying the test and the scale), as well as test consequences and decisions. Their framework provides a useful starting point for rating scale validation, and we draw on it to situate our validation work, as outlined in the description of the current study below. Due to the scope of this study, we focus only on parts of the evaluation and explanation inferences in this paper; however, in the final section of the paper we also provide suggestions for future work to broaden the validation activities. The specific warrants and assumptions for which we sought backing in this study are listed in Table 1 in the methodology section.
1.2. Language tests for academic purposes
Tests designed for academic purposes should authentically reflect the writing skills needed by students for academic success (Cumming, 2013, 2014). These skills vary from field to field, making the selection of writing tasks a difficult endeavor. Studies conducted in Anglophone contexts have shown the diversity of genres and writing tasks required of university students across academic disciplines (Canseco & Byrd, 1989; Cooper & Bikowski, 2007; Gardner & Nesi, 2012; Hale et al., 1996; Horowitz, 1986). Research has also been conducted with faculty members and students regarding the importance of different academic writing skills (Rosenfeld, Courtney, & Fowles, 2004; Rosenfeld, Leung, & Oltman, 2001), and these studies have highlighted skills such as paraphrasing and the ability to cite appropriately from a range of sources. For Spanish, such studies are few, but they largely reflect what has been found in Anglophone contexts (Castelló et al., 2012; Hernández & Castelló, 2014; Mendoza, 2014). Without a careful examination of the setting under assessment – in this case, academic writing – there is a risk of under-representing or ill-defining the construct (Cumming, 2014).