Organizational Research Methods, Volume 11, Number 4, October 2008, 815-852
© 2008 Sage Publications, DOI: 10.1177/1094428106296642
http://orm.sagepub.com, hosted at http://online.sagepub.com

Answers to 20 Questions About Interrater Reliability and Interrater Agreement
James M. LeBreton
Purdue University
Jenell L. Senter
Wayne State University
The use of interrater reliability (IRR) and interrater agreement (IRA) indices has increased
dramatically during the past 20 years. This popularity is due, at least in part, to the
increased role of multilevel modeling techniques (e.g., hierarchical linear modeling and
multilevel structural equation modeling) in organizational research. IRR and IRA indices are
often used to justify aggregating lower-level data in composition models. The purpose
of the current article is to expose researchers to the various issues surrounding the use of IRR
and IRA indices in conjunction with multilevel models. To achieve this goal, the
authors adopt a question-and-answer format and provide a tutorial in the appendices illustrat-
ing how these indices may be computed using the SPSS software.
Keywords: interrater agreement; interrater reliability; aggregation; multilevel modeling
As the use of multilevel modeling techniques has increased in the organizational
sciences, the uses (and the potential for misuses) of interrater reliability (IRR) and
interrater agreement (IRA) indices (often used in conjunction with multilevel modeling)
have also increased. The current article seeks to provide answers to common questions
pertaining to the use and application of IRR and IRA indices. Our hope is that this discus-
sion will serve as a guide for researchers new to these indices and will help expand
research possibilities for those already using these indices in their work.
Our article has three main objectives. First, we synthesize and integrate various defini-
tional issues concerning the concepts of IRR and IRA and the indices most commonly used
to assess these concepts. In doing so, we both recapitulate previous work and offer our own
extensions and interpretations of this work. Second, we recognize that a number of provo-
cative questions exist about the concepts of IRR and IRA and the primary indices used to
assess these concepts. This is especially true of researchers being exposed to multilevel
modeling for the first time. Thus, we also provide answers to some of the more common
questions associated with using these indices when testing multilevel models. Some of
these questions have been previously addressed, whereas some have not. The purpose of
the article is to draw together, in a single resource, answers to a number of common questions pertaining to the use of IRR and IRA indices. Finally, we demonstrate the principles discussed in our answers via empirical tutorials contained in an appendix. The purpose of the last objective is to provide new researchers with concrete examples that will enable them to integrate their conceptual grasp of IRR and IRA with the technical skills necessary to answer their research questions (i.e., guidance using SPSS software). All of the data analyzed in the current article are presented in the appendix and are also available from either of the authors.

Authors’ Note: We would like to thank Paul Bliese, Rob Ployhart, and three anonymous reviewers for their constructive comments and feedback on earlier versions of this article. An earlier version of this article was presented at the 66th annual meeting of the Academy of Management in Atlanta, Georgia. Correspondence concerning this article should be addressed to James M. LeBreton, Department of Psychological Sciences, Purdue University, 703 Third St., West Lafayette, IN 47907-2081; e-mail: lebreton@psych.purdue.edu.
Definitional Questions About IRR and IRA
What is meant by IRR and IRA, and how are these concepts similar to and different
from one another? How are IRR and IRA related to discussions of multilevel modeling?
Such questions are often asked by researchers, both faculty and students, who are under-
taking their first multilevel project. How one goes about answering these questions has
a profound impact on (a) the approach one takes when estimating IRR and IRA, (b) the
conclusions one will draw about IRR and IRA, and (c) the appropriateness of conducting
a multilevel analysis. Thus, we address these definitional questions below. Throughout
our article, we use the following notation:
X = an observed score, typically measured on an interval scale of measurement,
S_X^2 = the observed variance on X,
J = the number of items ranging from j = 1 to J,
K = the number of raters or judges ranging from k = 1 to K, and
N = the number of targets ranging from i = 1 to N.
Question 1: What is meant by IRR and IRA, and how are these concepts similar to and differ-
ent from one another?
IRR refers to the relative consistency in ratings provided by multiple judges of multiple
targets (Bliese, 2000; Kozlowski & Hattrup, 1992; LeBreton, Burgess, Kaiser, Atchley, &
James, 2003). Estimates of IRR are used to address whether judges rank order targets in a
manner that is relatively consistent with other judges. The concern here is not with the
equivalence of scores but rather with the equivalence of relative rankings. In contrast,
IRA refers to the absolute consensus in scores furnished by multiple judges for one or
more targets (Bliese, 2000; James, Demaree, & Wolf, 1993; Kozlowski & Hattrup, 1992;
LeBreton et al., 2003). Estimates of IRA are used to address whether scores furnished by
judges are interchangeable or equivalent in terms of their absolute value.
The concepts of IRR and IRA both address questions concerning whether or not ratings
furnished by one judge are ‘‘similar’’ to ratings furnished by one or more other judges
(LeBreton et al., 2003). These concepts simply differ in how they go about defining inter-
rater similarity. Agreement emphasizes the interchangeability or the absolute consensus
between judges and is typically indexed via some estimate of within-group rating disper-
sion. Reliability emphasizes the relative consistency or the rank order similarity between
judges and is typically indexed via some form of a correlation coefficient. Both IRR and
IRA are perfectly reasonable approaches to estimating rater similarity; however, they are
designed to answer different research questions. Consequently, researchers need to make
sure their estimates match their research questions.
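To make the contrast concrete, consider the following small illustration (not part of the original article or its SPSS tutorials; the data, the two-judge design, and the 5-point scale are hypothetical, and Python is used here purely for exposition). Two judges rate five targets, and Judge B is always exactly two scale points higher than Judge A, so the judges are perfectly consistent in rank order (high IRR) yet never furnish interchangeable scores (imperfect IRA).

```python
import numpy as np

# Hypothetical ratings of N = 5 targets on a 5-point scale.
judge_a = np.array([1, 2, 3, 1, 2])
judge_b = np.array([3, 4, 5, 3, 4])  # always 2 points higher than judge_a

# IRR emphasizes relative consistency (rank-order similarity),
# typically indexed via some form of correlation coefficient.
irr = np.corrcoef(judge_a, judge_b)[0, 1]
print(f"Correlation between judges (IRR-type index): {irr:.2f}")  # 1.00

# IRA emphasizes absolute consensus (interchangeability),
# typically indexed via within-target rating dispersion.
within_target_var = np.var(np.vstack([judge_a, judge_b]), axis=0, ddof=1)
print(f"Mean within-target variance (IRA-type dispersion): {within_target_var.mean():.2f}")  # 2.00
```

Here the judges' ratings correlate perfectly, yet every target shows nonzero between-judge variance; this is precisely the situation in which reliability and agreement lead to different conclusions about rater similarity.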
Question 2: How are IRR and IRA related to discussions of multilevel modeling?
The basic idea underlying multilevel modeling is that there are variables measured
at different levels of analysis (e.g., individuals, work groups, work divisions, different
organizations) that affect dependent variables, typically measured at the lowest level of
analysis (e.g., individuals). In some instances, the higher-level variables are actually mea-
sured at a higher level of analysis (e.g., organizational net profits). However, in other
instances, higher-level variables are composites of lower-level variables (e.g., aggregated
individual-level measures of affect used to measure group affective tone; George, 1990).
Depending on the theoretical nature of the aggregated construct, it may (or may not)
be necessary to demonstrate that the data collected at a lower level of analysis (e.g.,
individual-level climate perceptions) are similar enough to one another prior to aggregat-
ing those data as an indicator of a higher-level construct (e.g., shared climate perceptions
within work teams). For example, Kozlowski and Klein (2000) discussed two approaches
to bottom-up processing (where individual- or lower-level data are combined to reflect a
higher-level variable): composition and compilation approaches. Chan (1998) and Bliese
(2000) reviewed various composition and compilation models and concluded that IRA
and IRR are important when using composition models but less so for compilation
models.
Compilation processes rest on the assumption that there are apparent differences
between aggregated and nonaggregated data. Therefore, it is not necessary that individual-
or lower-level data demonstrate consensus prior to aggregation. For example, additive
models rely on a simple linear combination of lower-level data and do not require the
demonstration of within-group agreement (Chan, 1998). In contrast, composition processes
are often based on the assumption that individual- or lower-level data are essentially
equivalent with the higher-level construct. Therefore, to justify aggregating lower-level
data to approximate a higher-level construct, it is necessary to demonstrate that the lower-
level data are in agreement with one another (e.g., individuals within a work group have
highly similar or interchangeable levels of affect that are different from individuals’ affect
levels in another work group, and, thus, each work group has a unique affective tone).
Because such composition models focus on the interchangeability (i.e., equivalence) of
lower-level data, estimates of IRA are often used to index the extent of agreement, or lack
thereof, among lower-level observations. The equivalence of lower-level data may be
demonstrated via estimates of IRA or IRR + IRA. When only a single target is assessed,
the empirical support needed to justify aggregation may be acquired via IRA indices such
as rWG (e.g., direct consensus models and referent-shift consensus models; Chan, 1998).
When multiple targets are assessed, the empirical support needed to justify aggregation
may be acquired via IRA indices such as rWG and via IRR + IRA indices such as intra-
class correlation coefficients (ICCs). In sum, when lower-level data are aggregated to form
a higher-level variable, estimates of IRA or IRR + IRA are often invoked to aid in justify-
ing this aggregation.
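As an illustrative aside (the article's own tutorials use SPSS), the sketch below shows the logic just described for a composition model: individual-level ratings are aggregated to a group mean only when a within-group agreement estimate clears a cutoff. The data, group labels, 5-point scale, uniform null variance, and .70 cutoff are all assumptions made for this sketch rather than prescriptions taken from the article.

```python
import numpy as np
import pandas as pd

# Hypothetical individual-level climate ratings (5-point scale) nested in work groups.
df = pd.DataFrame({
    "group":   ["g1"] * 5 + ["g2"] * 5,
    "climate": [4, 4, 5, 4, 4,      # g1: nearly interchangeable ratings
                1, 5, 2, 5, 3],     # g2: widely dispersed ratings
})

A = 5                            # number of response options (assumed)
SIGMA2_E = (A**2 - 1) / 12.0     # uniform ("rectangular") null variance, one possible choice
CUTOFF = 0.70                    # illustrative agreement cutoff chosen by the researcher

def rwg(ratings, null_var):
    """Single-item rWG: 1 minus the ratio of observed to null (error) variance."""
    return 1.0 - ratings.var(ddof=1) / null_var

summary = df.groupby("group")["climate"].agg(
    group_mean="mean",
    rwg=lambda x: rwg(x, SIGMA2_E),
)
summary["justify_aggregation"] = summary["rwg"] >= CUTOFF
print(summary)
# Only groups whose members show sufficient consensus would be represented by
# their mean as a higher-level score; note that rwg can fall below zero when the
# observed within-group variance exceeds the null variance.
```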
Question 3: Okay, so how do I figure out which form of interrater similarity is relevant to my
research question?
The form of interrater similarity used to justify aggregation in multilevel modeling
should depend mainly on one’s research question and the type of data that one has col-
lected. Estimates of IRA tend to be more versatile because they can be used with one or
more targets, whereas estimates of IRR or IRR + IRA necessitate having multiple targets
(e.g., organizations). However, it should be mentioned that because our discussion pertains
to multilevel modeling and the need to provide sufficient justification for aggregation, esti-
mates of both IRA and IRR + IRA are typically used. This is because justification of
aggregating lower-level data is predicated on the consensus (i.e., interchangeability) among
judges furnishing scores on these lower-level data, and estimates of IRR only measure con-
sistency. Consequently, pure measures of IRR are rarely used in multilevel modeling
because justification of aggregation is typically not predicated on the relative consistency
of judges’ ratings irrespective of their absolute value. The remainder of our article
addresses questions primarily associated with estimating IRA or IRR + IRA.
Question 4: What are the most commonly used techniques for estimating IRA, IRR, and
IRR + IRA?
Measures of IRA
rWG indices. Table 1 summarizes the most commonly used indices of IRA, IRR, and IRR
+ IRA. Arguably, the most popular estimates of IRA have been James, Demaree, and
Wolf’s (1984, 1993) single-item rWG and multi-item rWG(J) indices. The articles introdu-
cing these indices have been cited more than 700 times in fields ranging from strategic
management to nursing. When multiple judges rate a single target on a single variable
using an interval scale of measurement, IRA may be assessed using the rWG index, which
defines agreement in terms of the proportional reduction in error variance,
$$r_{WG} = 1 - \frac{S_X^2}{\sigma_E^2} \qquad (1)$$

where S_X^2 is the observed variance on the variable X (e.g., leader trust and support) taken
over K different judges or raters and σ_E^2 is the variance expected when there is a complete
lack of agreement among the judges. This is the variance obtained from a theoretical null
distribution representing a complete lack of agreement among judges. As discussed under
Questions 9 and 10, determining the shape of this distribution is one of the factors that
most complicates the use of rWG. Basically, it is the variance one would expect if all of
the judges responded randomly when evaluating the target. Thus, it is both a theoretical
(i.e., it is not empirically determined) and conditional (i.e., assumes random responding)
distribution.
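As a minimal numerical sketch of Equation 1 (again offered in Python for illustration; the article's own tutorials use SPSS), the snippet below computes single-item rWG for one target rated by K judges. The ratings are hypothetical, and the null variance here assumes a uniform (rectangular) null distribution over an A-point scale, σ_E^2 = (A^2 − 1)/12, which is only one possible choice; alternatives are weighed under Questions 9 and 10.

```python
import numpy as np

# Hypothetical: K = 6 judges rate a single target (e.g., leader trust and support)
# on a 5-point interval scale.
ratings = np.array([4, 5, 4, 4, 3, 4])

A = 5                                # number of scale points (assumed)
sigma2_e = (A**2 - 1) / 12.0         # uniform-null error variance = 2.0
s2_x = ratings.var(ddof=1)           # observed variance across the K judges = 0.40

r_wg = 1.0 - s2_x / sigma2_e         # Equation 1
print(f"S_X^2 = {s2_x:.2f}, sigma_E^2 = {sigma2_e:.2f}, rWG = {r_wg:.2f}")  # rWG = 0.80
```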
The use of rWG is predicated on the assumption that each target has a single true score
on the construct being assessed (e.g., leader trust and support). Consequently, any var-
iance in judges’ ratings is assumed to be error variance. Thus, it is possible to index