Psychometric Methods: Theory Into Practice
Larry R. Price
This series provides applied researchers and students with analysis and research design books that
emphasize the use of methods to answer research questions. Rather than emphasizing statistical
theory, each volume in the series illustrates when a technique should (and should not) be used and
how the output from available software programs should (and should not) be interpreted. Common
pitfalls as well as areas of further development are clearly articulated.
Series Editor's Note
The term psychometrics has an almost mystical aura about it. Larry Price brings his vast
acumen as well as his kind and gentle persona to demystify for you the world of psycho-
metrics. Psychometrics is not just a province of psychology. In fact, the theory-to-practice
orientation that Larry brings to his book makes it clear how widely applicable the fun-
damental principles are across the gamut of disciplines in the social sciences. Because
psychometrics is foundationally intertwined with the measurement of intelligence, Larry
uses this model to convey psychometric principles for applied uses. Generalizing these
principles to your domain of application is extremely simple because they are presented
as principles, and not rules that are tied to a domain of inquiry.
Psychometrics is an encompassing field that spans the research spectrum from inspi-
ration to dissemination. At the inspiration phase, psychometrics covers the operational
characteristics of measurement, assessment, and evaluation. E. L. Thorndike (1918) once
stated, “Whatever exists at all exists in some amount. To know it thoroughly involves
knowing its quantity as well as its quality.” I interpret this statement as a callout to mea-
surement experts: Using the underlying principles of psychometrics, “figure out how to
measure it!” If it exists at all, it can be measured, and it is up to us, as principled psy-
chometricians, to divine a way to measure anything that exists. Larry’s book provides
an accessible presentation of all the tools at your disposal to figure out how to measure
anything that your research demands.
Thorndike’s contemporary, E. G. Boring (1923) once quipped, “Intelligence is what
the tests test.” Both Thorndike’s and Boring’s famous truisms have psychometrics at the
core of their intent. Boring’s remarks move us more from the basics of measurement to
the process of validation, a key domain of psychometrics. I have lost count of the many
different kinds of validities that have been introduced, but fortunately, Larry’s book enu-
merates the important ones and gives you the basis to understand what folks mean when
they use the word validity in any phase of the research process.
Todd D. Little
On the road in Corvallis, Oregon
References
Boring, E. G. (1923). Intelligence as the tests test it. New Republic, 36, 35–37.
Thorndike, E. L. (1918). The nature, purposes, and general methods of measurement of educa-
tional products. In S. A. Courtis (Ed.), The measurement of educational products (17th Year-
book of the National Society for the Study of Education, Pt. 2, pp. 16–24). Bloomington, IL:
Public School.
Contents
1 • Introduction 1
1.1 Psychological Measurement and Tests 1
1.2 Tests and Samples of Behavior 3
1.3 Types of Tests 3
1.4 Origin of Psychometrics 4
1.5 Definition of Measurement 5
1.6 Measuring Behavior 5
1.7 Psychometrics and Its Importance to Research and Practice 7
1.8 Organization of This Book 9
Key Terms and Definitions 10
5 • Scaling 141
5.1 Introduction 141
5.2 A Brief History of Scaling 142
5.3 Psychophysical versus Psychological Scaling 144
5.4 Why Scaling Models Are Important 146
5.5 Types of Scaling Models 146
5.6 Stimulus-Centered Scaling 147
5.7 Thurstone’s Law of Comparative Judgment 148
5.8 Response-Centered Scaling 150
5.9 Scaling Models Involving Order 150
5.10 Guttman Scaling 151
5.11 The Unfolding Technique 153
5.12 Subject-Centered Scaling 156
5.13 Data Organization and Missing Data 160
5.14 Incomplete and Missing Data 162
5.15 Summary and Conclusions 162
Key Terms and Definitions 162
6 • Test Development 165
6.1 Introduction 165
6.2 Guidelines for Test and Instrument Development 166
6.3 Item Analysis 182
6.4 Item Difficulty 182
6.5 Item Discrimination 184
6.6 Point–Biserial Correlation 186
6.7 Biserial Correlation 188
6.8 Phi Coefficient 189
6.9 Tetrachoric Correlation 190
6.10 Item Reliability and Validity 190
6.11 Standard Setting 193
6.12 Standard-Setting Approaches 194
6.13 The Nedelsky Method 195
6.14 The Ebel Method 196
6.15 The Angoff Method and Modifications 196
6.16 The Bookmark Method 198
6.17 Summary and Conclusions 199
Key Terms and Definitions 199
7 • Reliability 203
7.1 Introduction 203
7.2 Conceptual Overview 204
7.3 The True Score Model 206
7.4 Probability Theory, True Score Model, and Random Variables 207
8 • Generalizability Theory 257
8.1 Introduction 257
8.2 Purpose of Generalizability Theory 258
8.3 Facets of Measurement and Universe Scores 259
8.4 How Generalizability Theory Extends Classical Test Theory 260
8.5 Generalizability Theory and Analysis of Variance 260
8.6 General Steps in Conducting a Generalizability Theory Analysis 263
8.7 Statistical Model for Generalizability Theory 263
8.8 Design 1: Single-Facet Person-by-Item Analysis 266
8.9 Proportion of Variance for the p × i Design 271
8.10 Generalizability Coefficient and CTT Reliability 273
8.11 Design 2: Single-Facet Crossed Design with Multiple Raters 274
8.12 Design 3: Single-Facet Design with the Same Raters
on Multiple Occasions 278
8.13 Design 4: Single-Facet Nested Design with Multiple Raters 279
8.14 Design 5: Single-Facet Design with Multiple Raters Rating
on Two Occasions 280
8.15 Standard Errors of Measurement: Designs 1–5 281
8.16 Two-Facet Designs 281
8.17 Summary and Conclusions 286
Key Terms and Definitions 287
9 • Factor Analysis 289
9.1 Introduction 289
9.2 Brief History 291
9.3 Applied Example with GfGc Data 292
9.4 Estimating Factors and Factor Loadings 294
9.5 Factor Rotation 301
9.6 Correlated Factors and Simple Structure 306
9.7 The Factor Analysis Model, Communality, and Uniqueness 309
9.8 Components, Eigenvalues, and Eigenvectors 312
9.9 Distinction between Principal Components Analysis
and Factor Analysis 315
9.10 Confirmatory Factor Analysis 319
9.11 Confirmatory Factor Analysis and Structural Equation Modeling 319
9.12 Conducting Factor Analysis: Common Errors to Avoid 322
9.13 Summary and Conclusions 325
Key Terms and Definitions 325
Introduction
This chapter introduces psychological measurement and classification. Psychological tests are
defined as devices for measuring human behavior. Tests are broadly defined as devices for
measuring ability, aptitude, achievement, attitudes, interests, personality, cognitive function-
ing, and mental health. Psychometrics is defined as the science of evaluating the charac-
teristics of tests designed to measure psychological attributes. The origin of psychometrics
is briefly described, along with the seminal contributions of Francis Galton. The chapter
ends by highlighting the role of psychological measurement and psychometrics in relation
to research in general.
During the course of your lifetime, most likely you have been affected by some form
of psychological measurement. For example, you or someone close to you has taken a
psychological test for academic, personal, or professional reasons. The process of psy-
chological measurement is carried out by way of a measuring device known as a test. A
psychological test is a device for acquiring a sample of behavior from a person. The term
test is used to broadly describe devices aimed toward measuring ability, aptitude, achieve-
ment, attitudes, interests, personality, cognitive functioning, and mental health. Tests are
often contextualized by way of a descriptor such as “intelligence,” “achievement,” or “per-
sonality.” For example, a well-known intelligence test is the Wechsler Adult Intelligence
Scale—Fourth Edition (WAIS-IV; 2008). A well-known achievement test is the Stanford
Achievement Test (SAT; Pearson Education, 2015), and the NEO Five-Factor Inventory
(NEO-FFI; Costa & McCrae, 1992) is a well-known instrument that measures
personality. Also, tests have norms (a summary of test results for a representative group
of subjects) or standards by which results can be used to predict other more important
behavior. Table 1.1 provides examples of common types of psychological tests.
Individual differences manifested by scores on such tests are real and often sub-
stantial in size. For example, you may have observed differences in attributes such as
personality, intelligence, or achievement based on the results you or someone close to
you received on a psychological test. Test results can and often do affect people’s lives in
important ways. For example, scores on tests can be used to classify a person as brain
damaged, weak in mathematical skills, or strong in verbal skills. Tests can also be used for
selection purposes in employment settings or in certain types of psychological counsel-
ing. Tests are also used for evaluation purposes (e.g., for licensure or certification in law,
medicine, and public safety professions).
Prior to examining the attributes of persons measured by tests, we must accurately
describe the attributes of interest. To this end, the primary goal of psychological measure-
ment is to describe the psychological attributes of individuals and the differences among them.
Describing psychological attributes involves some form of measurement or classification
scheme. Measurement is broadly concerned with the methods used to provide quanti-
tative descriptions of the extent to which persons possess or exhibit certain attributes.
Classification is concerned with the methods used to assign persons to one or another
of two or more different categories or classes (e.g., a major in college such as biology,
history, or English; diseased or nondiseased; biological sex [male or female]; or pass/fail
regarding mastery of a subject).
Tests measuring cognitive ability, cognitive functioning, and achievement are classified
as criterion-referenced or norm-referenced. For example, criterion-referenced tests are
used to determine where persons stand with respect to highly specific educational objec-
tives (Berk, 1984). In a norm-referenced test, the performance of each person is inter-
preted in reference to a relevant standardization sample (Peterson, Kolen, & Hoover,
1989). Turning to the measurement of attitudes, instruments are designed to measure the
intensity (i.e., the strength of a person’s feeling), direction (i.e., the positive, neutral, or
negative polarity of a person’s feeling), and target (i.e., the object or behavior with which
the feeling is associated; Gable & Wolfe, 1998). Tests or instruments may be used to
quantify the variability between people (i.e., interindividual differences) at a single point
in time or longitudinally (i.e., how a person’s attitude changes over time). Tests and other
measurement devices vary according to their technical quality. The technical quality of
a test is related to the evidence that verifies that the test is measuring what it is intended
to measure in a consistent manner. The science of evaluating the characteristics of tests
designed to measure psychological attributes of people is known as psychometrics.
Science is defined here as a systematic framework that allows us to establish and organize
knowledge in a way that provides testable explanations and predictions about psycho-
logical measurement and testing.
Charles Darwin’s On the Origin of Species (1859) advanced the theory that chance varia-
tions in species would facilitate selection or rejection by nature. Such chance variations
manifested themselves as individual differences. Specifically, Darwin was likely respon-
sible for the beginning of interest in the study of individual differences, as is seen in the
following quote from Origin of Species:
The many slight differences which appear in the offspring from the same parents . . . may be
called individual differences. . . . These individual differences are of the highest importance . . .
for they afford materials for natural selection to act on. (p. 125)
Previously, measurement was described as being concerned with the methods used to
provide quantitative descriptions of the extent to which persons possess or exhibit cer-
tain attributes. Following this idea, measurement is the process of assigning numbers
(i.e., quantitative descriptions) to persons in an organized manner, providing a way to
represent the attributes of the persons. Numbers are assigned to persons according to a
prescribed and reproducible procedure. For example, an intelligence test yields scores
based on using the same instructions, questions, and scoring rules for each person.
Scores would not be comparable if the instructions, questions, and scoring rules were
not the same for each person. In psychological measurement, numbers are assigned in
a systematic way based on a person’s attributes. For example, a score of 100 on an intel-
ligence test for one person and a score of 115 for another yields a difference of 15 points
on the attribute being measured—performance on an intelligence test. Another example
of measurement for classification purposes is based on a person’s sex. For example, the
biological sex of one person is female and the other is male, providing a difference in the
attribute of biological sex.
Measurement theory is a branch of applied statistics that describes and evaluates
the quality of measurements (including the response process that generates specific score
patterns by persons), with the goal of improving their usefulness and accuracy. Psycho-
metricians use measurement theory to propose and evaluate methods for developing
new tests and other measurement instruments. Psychometrics is the science of evaluating
the characteristics of tests designed to measure the psychological attributes of people.
Although our interest in this book is in psychological measurement, we begin with some
clear examples of measurement of observed properties of things in the physical world.
For example, if we want to measure the length of a steel rod or a piece of lumber, we
can use a tape measure. Things in the physical world that are not directly observable are
measured as well. Consider measurement of the composition of the air we breathe—
approximately 21% oxygen and 79% nitrogen. These two gases are invisible to the human
eye, yet devices or tests have been developed that enable us to measure the composition
of the air we breathe with a high degree of accuracy. Another example is a clock used to
measure time; time is not directly observable, but we can and do measure it daily. In psy-
chological measurement, some things we are interested in studying are directly observable
(e.g., types of body movements in relation to a certain person’s demeanor; reaction time
to a visual stimulus; or perhaps to evaluate someone’s ability to perform a task to a certain
level or standard). More often in psychological measurement, the things we are interested
in measuring are not directly observable. For example, intelligence, personality, cogni-
tive ability, attitude, and reading ability are unobservable things upon which people vary
(i.e., they individually differ). We label these unobservable things as constructs. These
unobservable things (i.e., constructs) are intangible and not concrete, although the people
we are measuring are very real. In this case, we call the variable an intellectual construct.
The quantitative reasoning test under the construct of fluid intelligence (see Table 1.1) is
a variable because people’s scores vary on the test.
In this book we use the construct of intelligence to illustrate the application of psy-
chometric methods to real data. The construct of intelligence is unobservable, so how can
we measure it? Although a number of theories of intelligence have been forwarded over
time, in this book we use a model based on the multifactor form of the general theory
of intelligence (GfGc theory; Horn, 1998), which includes fluid and crystallized compo-
nents of intelligence and a short-term memory component. Why use G or GfGc theory of
intelligence versus one of the other theories? First, psychometric methods and the theory
and measurement of intelligence share a long, rich history (i.e., over a century). Second,
the G theory of intelligence, and variations of it such as GfGc and other multiple-factor
models, boast a substantial research base (in terms of quantity and quality). The research
base on the theory of general intelligence verifies that any given sample of people pos-
sesses varying degrees of ability on cognitively demanding tasks. For example, if a person
excels at cognitively challenging tasks, we say that he or she has an above-average level of
general intelligence (Flynn, 2007). Furthermore, empirical research has established that
the cognitive components of G theory are correlated (Flanagan, McGrew, & Ortiz, 2000).
For instance, people measured according to G theory have patterns of (1) large vocabu-
laries, (2) large funds of general information, and (3) good arithmetic skills. The use of
G theory throughout this book is in no way intended to diminish the legitimacy of other
models or theories of intelligence, such as those that recognize an exceptional level of
musical ability (i.e., musical G) or a high level of kindness, generosity, or tolerance
(i.e., moral G; Flynn, 2007). Rather, use of G theory ideally provides a data
structure that enhances moving from measurement concepts to psychometric techniques
to application and interpretation.
Two components of G theory are crystallized and fluid intelligence (i.e., GfGc denotes
the fluid and crystallized components of G theory). To measure each component, we
use measurements of behavior that reflect certain attributes of intelligence as posited by
G theory. Specifically, we make inferences to the unobservable construct of intelligence
based on the responses to test items on several components of the theory. Table 1.2 pro-
vides each subtest that constitutes three components of the general theory of intelligence:
crystallized and fluid intelligence and short-term memory.
In Table 1.2, three components of the theory of general intelligence—fluid (Gf),
crystallized (Gc), and short-term memory (Gsm)—are used in examples throughout the
book to provide connections between a theoretical model and actual data. The related
dataset includes a randomly generated set of item responses based on a sample size N =
1,000 persons. The data file is available in SPSS (GfGc.sav), SAS (GfGc.sd7), or delimited
file (GfGc.dat) formats and are downloadable from the companion website (www.guilford.
com/price2-materials).
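For readers who wish to follow along, a minimal SPSS syntax sketch for opening the data file and inspecting its variables is shown below. The file location is assumed to be the folder from which SPSS is run, and because the variable names in GfGc.sav are not listed here, DISPLAY DICTIONARY is used to view them.
* Open the GfGc data file and list its variables (assumes the file is in the current folder).
GET FILE='GfGc.sav'.
DISPLAY DICTIONARY.
* Basic descriptive statistics for all numeric variables.
DESCRIPTIVES VARIABLES=ALL.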
In GfGc theory, fluid intelligence is operationalized as process oriented and crystal-
lized intelligence as knowledge or content oriented. Short-term memory is composed of
recall of information, auditory processing, and mathematical knowledge (see Table 1.1).
In Figure 1.1, GfGc theory is illustrated as a model, with the small rectangles on the far
right representing individual test items. The individual test items are summed to create
linear composite scores represented as the second larger set of rectangles. The ovals in
the diagram represent latent constructs as measured by the second- and first-level observed
variables. Table 1.2 provides an overview of the subtests, level of measurement, and
descriptions of the variables for a sample of 1,000 persons or examinees in Figure 1.1.
1.7 Psychometrics and Its Importance to Research and Practice
As previously noted, psychological measurement and testing affect people from all walks
of life. Psychological measurement also plays an important role in research studies of
all types—applied and theoretical. The role that measurement plays in the integrity
of a research study cannot be overstated. For this reason, understanding psychological
measurement is essential to your being able to evaluate the integrity and/or usefulness of
scores obtained from tests and other instruments. If you are reading this book, you may
be enrolled in a graduate program in school or clinical psychology that will involve you
making decisions based on scores obtained from a test, personality inventory, or other
FIGURE 1.1. GfGc theory illustrated as a model: general intelligence (G) comprises fluid intelligence (Gf), crystallized intelligence (Gc), and short-term memory (Stm). Each component is measured by several subtests (three fluid, four crystallized, and three short-term memory tests), and each subtest is composed of 10 to 25 individual items.
Psychological measurement (and test theory more specifically) has the most relevance
for points 2 through 4 above. However, the process of measurement must be
considered from the outset because the outcomes of the study are directly related to how
they are measured.
for establishing evidence of content validity. The final section of the chapter covers tech-
niques for establishing evidence of construct validity. Chapter 5 introduces scaling and
the fundamental role it plays in psychometrics. In Chapter 6, guidelines for test and
instrument development are introduced along with methods for evaluating the quality
of test items. Chapter 7 presents score reliability within the classical test theory (CTT)
framework. Chapter 8 introduces generalizability theory as an extension of the CTT
model for estimating the reliability of scores based on the scenario in which raters or
judges score persons. In Chapter 9, factor analysis is presented as an important tool for
studying the underlying structure of a test. Connections are made to the process of con-
struct validation (Chapter 4). Chapter 10 introduces item response theory and advanced
test theory, which are very useful for modeling a person’s true score (a.k.a. latent trait)
based on patterns of responses to test questions. The final chapter (11) covers the devel-
opment of norms and test equating. Examples of how standard scores (norms) are devel-
oped are provided, along with their utility in measurement and testing. The chapter
ends with an introduction to test score equating based on the linear, equipercentile, and
item response theory true score methods. Example applications are provided using three
equating designs. Now we turn to Chapter 2, on measurement and statistical concepts, to
provide a foundation for the material presented in subsequent chapters.
Classification. Concerned with the measurement methods used to assign persons to one
or another of two or more different categories or classes.
Composite score. A score created by summing the individual items on a test. Composite
scores may be equally weighted or unequally weighted.
Constructs. Unobservable things that are intangible and not concrete. For example, intel-
ligence is known as an intellectual construct.
Criterion-referenced test. Used to determine where persons stand with respect to highly
specific educational objectives.
Francis Galton. Known as the father of psychometrics due to his work in measurement of
human anthropometrics, differentiation, and abilities.
Measurement. The process of assigning numbers (i.e., quantitative descriptions) to per-
sons in an organized manner, providing a way to represent the attributes of the
persons.
Measurement theory. A branch of applied statistics that describes and evaluates the
quality of measurements with the goal of improving their usefulness and accuracy.
Norm-referenced test. A test where the performance of each person is interpreted in
reference to a well-defined standardization sample.
Psychological test. A device for acquiring a sample of behavior from a person.
Variable. Characteristics or qualities in which persons differ among themselves. The char-
acteristics or qualities are represented numerically. For example, a test score is a
variable because people often differ in their scores.
2
Measurement and Statistical Concepts
This chapter presents measurement and statistical concepts essential to understanding the
theory and practice of psychometrics. The properties of numbers are described, with an
explanation of how they are related to measurement. Techniques for organizing, summariz-
ing, and graphing distributions of variables are presented. The standard normal distribu-
tion is introduced, along with the role it plays in psychometrics and statistics in general.
Finally, correlation and regression are introduced, with connections provided relative to the
fundamental role each plays in the study of variability and individual differences.
2.1 Introduction
We begin our study of psychometrics by focusing on the properties of numbers and how
these properties work together with four levels of measurement. The four levels of mea-
surement provide a clear guide regarding how we measure psychological attributes. For
the more mathematically inclined or for those who want a more in-depth treatment of the
material in this chapter, see the Appendix. Reviewing the Appendix is useful in extend-
ing or refreshing your knowledge and understanding of statistics and psychometrics.
The Appendix also provides important connections between psychometrics and statistics
beyond the material provided in this chapter. Source code from SPSS and SAS is included
in the Appendix to carry out analyses.
FIGURE 2.1. The real number line, shown in half-unit and whole-unit increments, with corresponding IQ scores from 60 to 140.
Numbers are treated differently depending on their level or scale of measurement. They are used in
psychological measurement in two fundamental ways. First, numbers can be used to
categorize people. For example, for biological sex, the number “1” can be assigned to
reflect females and the number “2” males. Alternatively, the response to a survey ques-
tion may yield a categorical response (e.g., a person answers “Yes,” “No,” “Maybe,” or
“Won’t Say”). In the previous examples, there is no ordering, only categorization. A
second way numbers are useful to us in psychological measurement is to establish order
among people. For example, people can be ordered according to the amount or level of
psychological attribute they possess (e.g., the number “1” may represent a low level of
anxiety, and the number 5 may represent a high level of anxiety). However, in the order
property the size of the units between the score points is not assumed to be equal (e.g.,
the distance between 1 and 2 and the distance between 2 and 3 on a 5-point response
scale are not necessarily equal).
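In SPSS, for example, these two uses of numbers can be made explicit by declaring a variable's measurement level and attaching value labels. The sketch below uses hypothetical variable names (sex and anxiety) rather than variables from the GfGc dataset.
* Numbers used for categorization (nominal) versus order (ordinal).
VARIABLE LEVEL sex (NOMINAL).
VARIABLE LEVEL anxiety (ORDINAL).
VALUE LABELS sex 1 'Female' 2 'Male'.
VALUE LABELS anxiety 1 'Low anxiety' 5 'High anxiety'.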
When we use real numbers, we enhance our ability to measure attributes by defin-
ing the basic size of the unit of measurement for a test. Real numbers are also continuous
because they represent any quantity along a number line (Figure 2.1). Because they lie
on a number line, their size can be compared. Real numbers can be positive or negative
and have decimal places after the point (e.g., 3.45, 10.75, or –25.12). To this end, a real
number represents an amount of something in precise units. For example, if a person
scores 100 on a test of general intelligence and another person scores 130, the two people
are precisely 30 IQ points apart (Figure 2.1).
A final point about our example of real number data expressed as a continuous
variable is that in Figure 2.1, although there are intermediate values between the whole
numbers, it is only the whole numbers that are used in analyses and reported.
Each level of measurement includes criteria or rules for how numbers are assigned to persons in
relation to the attribute being measured. Also, the different levels of measurement convey
different amounts of information.
Harvard University psychologist S. S. Stevens conducted the most extensive experi-
mentation on the properties and systems of measurement. Stevens’s work produced a useful
definition of measurement and levels of measurement that are currently the most widely
used in the social and behavioral sciences. Stevens defines measurement as “the assign-
ment of numerals to objects or events according to rules” (1951b, p. 22). Stevens does not
mention the property of the numbers (i.e., identity, order, equal intervals, absolute zero);
instead, his definition states that numbers are assigned to objects or events according to rules.
However, it is the rules that provide the operational link between the properties of numbers
and the rules for their assignment in the Stevens tradition. Figure 2.2 illustrates the link
between the properties of numbers and the rules for their assignment.
To illustrate the connection between Stevens’s work on numbers and the properties
of numerical systems, we begin with the property of identity. The property of identity
allows us to detect the similarity or differentness among people. We can consolidate these
contrasting terms into “distinctiveness.” The most basic level of measurement (nominal)
allows us to differentiate among categories of people according to their distinctiveness
(e.g., for two persons being measured, one is female and the other is male, or one person
has red hair and the other blonde hair). Notice that in the examples of the identity prop-
erty in combination with the nominal level of measurement, no ordering exists; there is
only classification on the basis of the distinctiveness of the attribute being measured (see
Table 2.1). As we see in Figure 2.3, when only the identity property exists in the mea-
surement process, the level of measurement is nominal. Another example of the identity
property and the nominal level of measurement is provided in Figure 2.4, where a person
responds to a survey question by selecting one of four options ("Yes," "No," "Maybe," or "Won't Say"). The options are discrete categories (i.e., only identity is established, not order of any kind).
FIGURE 2.4. Graphic illustrating no order in response alternatives. From de Ayala (2009, p. 239). Copyright 2009 by The Guilford Press. Reprinted by permission.
Next, if two persons share a common attribute, but one person has more of the attri-
bute than the other, then the property of order is established (i.e., the ordinal level of
measurement). Previously, in the nominal level of measurement, only identity or dis-
tinctiveness was a necessary property reflected by the numbers. However, in the ordinal
level of measurement, the properties of identity and quantity must exist. Figure 2.5 illus-
trates an ordinal scale designed to measure anxiety that captures the properties of identity
and order. On the scale in the figure, the number 1 identifies the lowest level of anxiety
expressed by the qualitative descriptor “never,” and the number 5 identifies the highest
level of anxiety expressed by the qualitative descriptor “always.”
Before continuing with properties of numbers and measurement levels, the following
section provides important information related to the quantity property of measurement
and its relationship to units of measurement.
Units of Measurement
The property of quantity requires that units of measurement be specifically defined. We are
familiar with how things are measured according to units in physical measurement. For
example, if you want to measure the length of a wall, you use a tape measure marked in inches
or centimeters. The length of the wall is measured by counting the number of units from one
end of the wall to the other. Consider the psychological attribute of intelligence—something
not physically observable. How can we measure intelligence (e.g., what are the units we can
use, and what do these units actually represent)? For example, the units are the responses
to a set of questions included on a test of verbal intelligence, but how sure are we that the
responses to the questions actually represent intelligence? Based on these ideas, you begin to
understand that the measurement of attributes that are not directly observable (in a physical
sense) presents one of the greatest challenges in psychometrics.
For the equal interval property to hold, a 10-point difference must be the same at different points
along the score scale. Finally, notice that when equal intervals exist, the property of order
is also met.
FIGURE 2.6. Length measured using two different measurement rules. Adapted from Glenberg
and Andrzejewski (2008, p. 11). Copyright 2008 by Lawrence Erlbaum Associates. Adapted by per-
mission. Application: The lengths of the bars can be directly compared by using, say, centimeters.
However, when the bars are measured using only ranks, we can only say that “bar A” is shorter
than “bar B.” In psychological measurement, we might say that a person ranked according to “bar
A” has “less” of some attribute than a person ranked according to “bar B.”
FIGURE 2.7. Three temperatures represented on the Celsius and Kelvin scales. From King and
Minium (2003). Copyright 2003 by Wiley. Reprinted by permission. Applications: The zero point
on the Celsius scale does not actually reflect a true absence of temperature (i.e., because we see that
a measurement of zero degree on the Celsius scale actually represents 300° Kelvin). However, the
difference between 0° and 50° Celsius reflects the same distance as 300° and 350° Kelvin. So, the
Kelvin and Celsius scales both exhibit the property of an interval scale, but only the Kelvin scale
displays the property of absolute zero.
case, absolute zero has meaning because of the psychophysical properties of visual perception
and how a person responds. The previous example was clear in part because the thing being
measured (sensory reaction to a visual stimulus) was directly observable and zero had an
absolute meaning. Using another example we are all familiar with, Figure 2.7 illustrates how
absolute and relative meanings of zero are used in the measurement of temperature.
Now we turn to the unobservable construct of intelligence for an example of the
meaning of zero being relative. Consider the case where a person scores zero on an intel-
ligence test. Does this mean that the person has a complete absence of intelligence (i.e.,
according to the absolute definition of zero)? The previous interpretation is likely untrue
since the person probably has some amount of intelligence. The point to understand is
that a score of zero is relative in this case; that is, the score is relative to the specific type of
intelligence information the test was designed to measure (i.e., according to a particular
theory of intelligence). This example does not mean that the same person would score
zero on a different test of intelligence that is based on a different theory.
2.4 Levels of Measurement
The levels of measurement proposed by S. S. Stevens (1946) that are widely used today are
nominal, ordinal, interval, and ratio. Notice that one can apply the previously mentioned
kinds of measurement in relation to Stevens’s levels of measurement for a comprehensive mea-
surement scheme. The defining elements, along with some commonly accepted conventions
or practical applications of Stevens’s levels of measurement, are presented in Table 2.1.
Nominal
The nominal scale represents the most unrestricted assignment of numerals to objects.
That is, numbers are used simply to label or classify objects. The appropriate statistic
to use with this scale is the number or frequency of “cases.” For example, the number
of cases may represent the number of students within a particular teacher’s class. Such
counts may be graphically displayed using bar graphs representing frequency counts of
students within the class or ethnic groups within a defined geographical region of a coun-
try. For example, a numerical coding scheme organized according to the nominal level of
measurement may be biological sex of female = 1 and male = 2.
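As a brief illustration, the SPSS syntax below produces the frequency counts and a bar chart appropriate at the nominal level; the variable name sex and its 1/2 coding follow the example in the text and are assumed rather than taken from the GfGc data.
* Frequency counts and a bar chart for a nominal variable (1 = female, 2 = male).
FREQUENCIES VARIABLES=sex
  /BARCHART
  /ORDER=ANALYSIS.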
Ordinal
The ordinal scale is derived from the rank ordering of scores. The scores or numbers in
an ordinal scale are not assumed to be real numbers as previously defined (i.e., there are no
equally spaced units of measurement between each whole number on the scale—more on
this later). Examples in the behavioral sciences include using a Likert-type scale to measure
attitude or a rating scale to measure a teacher’s performance in the classroom. Examples of
other constructs often measured on an ordinal level include depression, ability, aptitude,
personality traits, and preference. Strictly speaking, the permissible descriptive statistics to
use with ordinal scales do not include the mean and standard deviation, because these sta-
tistics mathematically imply more than mere rank ordering of objects. Formally, use of the
mean and standard deviation implies that mathematical equality of intervals between succes-
sive integers (real numbers) representing the latent trait of individuals on some construct is
present. However, if the empirical data are approximately normally distributed, and the number
of scale points exceeds four, treating ordinal data as interval produces statistically similar, if not
identical, results. Ultimately, researchers should be able to defend their actions (and any con-
clusions they draw from them) mathematically, philosophically, and psychometrically.
Interval
The interval scale represents a scale whose measurements possess the characteristic of
“equality of intervals” between measurement points. For example, on temperature scales,
equal intervals of temperature are derived by noting equal volumes of gas expansion. An
arbitrary or relative zero point is established for a particular scale (i.e., Celsius or Fahr-
enheit), and the scale remains invariant when a constant is added. The intelligence test
scores used throughout this book are based on an interval level of measurement.
Ratio
The ratio scale represents a scale whose measurements possess the characteristic of
“equality of intervals” between measurement points. For example, on temperature scales,
equal intervals of temperature are derived by noting equal volumes of gas expansion. An
absolute zero point exists for a particular scale (i.e., temperature measured in Kelvin),
and the scale remains invariant when it is multiplied by a positive constant. Ratio scales are uncommon
in psychological measurement because the complete absence of an attribute, expressed
as absolute zero, is uncommon. However, a ratio scale may be used in psychophysical
measurement when the scale is designed to measure response to a visual stimulus or
auditory stimulus. In this case, a measurement of zero will have a clear meaning.
Based on the evolution of measurement and scaling over the past half-century, Brennan
(1998) revisited Stevens’s (1946, 1951b) framework and provided a revised interpretation
of scaling and the levels of measurement to reflect what has been learned through practice.
Scaling is defined as “the mathematical techniques used for determining what numbers
should be used to represent different amounts of a property or attribute being measured”
(Allen & Yen, 1979, p. 179). Broadly speaking, Brennan argues that scaling is assumed by
many to be a purely objective activity when in reality it is subjective, involving value-laden
assumptions (i.e., scaling does not occur in a “psychometric vacuum”). These value-laden
assumptions have implications for the validity of test scores, a topic covered in Chapter
3. Brennan states that the rules of measurement and scaling methodology are inextricably
linked such that “the rules of measurement are generally chosen through the choice of
a scaling methodology” (1998, p. 8). Brennan maintains that measurement is not an end
unto itself, but rather is a means to an end—the end being sound decisions about what it is
that we are measuring (e.g., intelligence, student learning, personality, proficiency). Based
on Brennan’s ideas, we see that psychometrics involves both subjective and objective rea-
soning and thought processes (i.e., it is not a purely objective endeavor).
At the heart of the measurement of individuals is the concept of variability. For exam-
ple, people are different or vary on psychological attributes or constructs such as intel-
ligence, personality, or memory. Because of variability, in order to learn anything from
data acquired through measurement, the data must be organized. Descriptive statistical
techniques exist as a branch of statistical methods used to organize and describe data.
Descriptive statistical techniques include ways to (1) order and group scores into distribu-
tions that describe observations/scores, (2) calculate a single number that summarizes a
set of observations/scores, and (3) represent observations/scores graphically. Descriptive
statistical techniques can be applied to samples and populations, although most often
they are applied to samples from populations. Inferential statistical techniques are used
to make educated guesses (inferences) about populations based on random samples from
the populations. Inferential statistical techniques are the most powerful methods available
Measurements acquired on a variable or variables are part of the data collection process.
Naturally, these measurements will differ from one another. A variable refers to a property
whereby members of a group differ from one another (i.e., measurements change from one
person to another). A constant refers to a property whereby members of a group do not dif-
fer from one another (e.g., all persons in a study or taking an examination are female; thus,
biological sex is constant). Variables are defined as quantitative or qualitative and are related to
the levels of measurement presented in Tables 2.1 and 2.2. Additionally, quantitative variables
may be discrete or continuous. A discrete variable can take specific values only. For example,
the values obtained in rolling a die are 1, 2, 3, 4, 5, or 6. No intermediate or in-between values
are possible. Although the underlying variable measurements (the numbers observed in the
die-rolling example) may be theoretically continuous, all sets of real or empirical data in the
die example are discrete. A continuous variable may take any values within a defined range
of values. The possible range of values belongs to a continuous series. For example, between any
two values of the variable, an infinitely large number of in-between values may occur (e.g.,
weight, chronological time, height). In this book, the data used in examples are based on
discrete variables that are scores for a finite sample of 1,000 persons on an intelligence test.
Frequency Distributions
To introduce frequency distributions, suppose you are working on a study examining
the correlates of crystallized and fluid intelligence. As a first step, you want to know
how a group of individuals performed on the language development (vocabulary) subtest
of crystallized intelligence. The vocabulary subtest is one of four subtests comprising
crystallized intelligence in the GfGc dataset used throughout this book. Table 2.2 (intro-
duced in Chapter 1) provides the subtests in the GfGc dataset used throughout this book
(the shaded row is the language development/vocabulary test). We see that this subtest
is composed of 25 items scored as 0 = no credit, 1 = 1 point, 2 = 2 points. The scores/
points on each of the 25 items are summed for each person to create a total score on the
language development subtest for each person tested.
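A hypothetical SPSS sketch of this scoring step is shown below; the item names vocab_01 through vocab_25 are placeholders, since the actual item variable names in the GfGc file are not listed here.
* Sum the 25 vocabulary items (each scored 0, 1, or 2) to create a total score for each person.
COMPUTE vocab_total = SUM(vocab_01 TO vocab_25).
EXECUTE.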
The score data for 100 persons out of the total GfGc dataset of 1,000 persons on the language
development/vocabulary subtest are provided in Table 2.3. Before proceeding, an important
note on terminology when working with data and frequency distributions is provided to
help you avoid confusion. Specifically, the terms measurement observations and scores are
often used interchangeably and refer to a single value or datum in a cell.
percentages). We see in the table that the relative frequency for a score is derived by
taking the score value’s frequency and dividing it by the total number of measurements
(e.g., 100). For example, the score 34 has a relative frequency of 0.10 (10%) because a
score of 34 occurs 10 times out of 100 observations or measurements (i.e., 10/100 = 0.10;
0.10 × 100 = 10%). Also, note that the fourth column in Table 2.4 (i.e., the cumulative
frequency) sums to 100 (as it should since the column consists of proportions). Relative
frequency distributions provide more information than raw frequency distributions (e.g.,
only columns 1 and 2 in Table 2.4) and are often preferable since information about the
number of measurements is included with frequency of score occurrence. In random
samples, relative frequency distributions provide another advantage. For example, using
long-run probability theory (see the Appendix for more detail), we see that the proportion
of observations at a particular score level is an estimate of the probability of a particular
score occurring in the population. For this reason, in random samples, relative frequen-
cies are treated as probabilities (e.g., the probability that a particular score will occur in
the population is the score’s relative frequency).
The fifth column in Table 2.4 is the cumulative relative frequency distribution.
This distribution is created by tabulating the relative frequencies of all measurements at
or below a particular score. Cumulative relative frequency distributions are often used
for calculating percentiles, a type of information useful in describing the location of a
person’s score relative to others in the group.
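In SPSS output, the relative and cumulative relative frequencies correspond to the Percent and Cumulative Percent columns of a FREQUENCIES table, and selected percentiles can be requested directly. The sketch below assumes the variable name Score used in the syntax examples later in this chapter.
* Frequency table with percents, cumulative percents, and selected percentiles.
FREQUENCIES VARIABLES=Score
  /PERCENTILES=25 50 75 90
  /ORDER=ANALYSIS.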
The grouped frequency distribution is another form of frequency distribution
when there is a large number of different scores and when listing and describing indi-
vidual scores using the frequency distribution in Table 2.4 is less than ideal. Table 2.5
illustrates a grouped frequency distribution using the same data as in Table 2.4.
Examining the score data in Table 2.5, we are able to more clearly interpret the pat-
tern of scores. For example, we see that most of the individuals scored between 34 and
39 points on the vocabulary subtest (in fact, 48% of the people scored in this range!). We
also can easily see that 21 scores fell in the range of 34 to 36 and that this range of scores
contains the median or 50th percentile.
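One way to build a grouped frequency distribution in SPSS is to recode the raw scores into class intervals and then tabulate the recoded variable. The interval boundaries below are only a sketch: the 34–36 and 37–39 intervals mentioned in the text are included, and the keywords LO and HI stand in for the unspecified ends of the score range.
* Recode raw scores into class intervals, then tabulate the grouped variable.
RECODE Score (LO THRU 33=1) (34 THRU 36=2) (37 THRU 39=3) (40 THRU HI=4) INTO ScoreGroup.
EXECUTE.
FREQUENCIES VARIABLES=ScoreGroup
  /ORDER=ANALYSIS.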
FIGURE 2.8. Grouped relative frequency distribution histogram for 100 individuals (from Table
2.5 data). Application: The height of each bar represents a score’s relative frequency. When histo-
grams are used for grouped frequency distributions, the bar is located over each class interval. For
example, based on the data in Table 2.6, we see that the interval of scores 34 to 36 contains 21
observations or measurements. The width of the class interval (or of the bar) that includes 34 and
36 bisects the Y-axis at a frequency of 21.
FIGURE 2.9. Relative frequency polygon for 100 individuals (from Table 2.4 data).
SPSS syntax for frequency distribution and histogram for data in Table 2.5
FREQUENCIES VARIABLES=Score
/HISTOGRAM
/ORDER=ANALYSIS.
In Figure 2.8, the class interval width is set at 3 points. Figure 2.9 depicts a frequency
polygon. The relative frequency polygon maps the frequency count (vertical or Y-axis) by
the score in the distribution (horizontal or X-axis). The frequency polygon differs from
the histogram in that a “dot” or single point is placed over the midpoint so that the height
of the dot represents the relative frequency of the class interval.
The adjacent dots are connected to form a continuous distribution representing the
score data. The line represents a continuous variable. The height of the line at each point
represents the number of times a score value occurs. For example, a score of 31
occurs 6 times in the dataset in Table 2.4, and a score of 34 occurs 10 times in the dataset.
GRAPH
/LINE(SIMPLE)=COUNT BY Score.
The choice of which type of graph to use often depends on preference; however, the type and nature of the variable also serve as a guide for
when to use one type of graph rather than another. For example, when a variable is discrete,
score values can only take on whole numbers that can be measured exactly—and there are
no intermediate values between the score points. Even though a variable may be continuous in
theory, the process of measurement always reduces the scores on a variable to a discrete level
(e.g., a discrete random variable; see the Appendix for a rigorous mathematical treatment of
random variables and probability). In part, this is due to the accuracy and/or precision of the
instrumentation used and the integrity of the data acquisition/collection method. Therefore,
continuous scales are in fact discrete ones with varying degrees of precision or accuracy.
Returning to Figure 2.1, for our test of general intelligence any of the scores may appear
to be continuous but are actually discrete because a person can only obtain a numerical value
based on the sum of his or her responses across the set of items on a test (e.g., it is not pos-
sible for a person to obtain a score of 15.5 on a total test score). The frequency histogram
can also be used with variables such as zip codes or family size (i.e., categorical variables
with naturally occurring discrete structures). Alternatively, the nature of the frequency poly-
gon technically suggests that there are intermediary score values (and therefore a continuous
score scale) between the points and/or dots in the graph. The intermediary values on the
line in a polygon can be estimated using the intersection of the X- and Y-axes anywhere on
the line. An example of a continuously measured variable from psychological measurement
is reaction time to a visual stimulus. In this case, the score values can range upward from
zero (no reaction at all), and the differences between values can, theoretically, be infinitesimally small.
The previous section showed how to describe the shape of the distribution of a variable
using tabular (frequency table) and graphic (histogram and polygon) formats. This sec-
tion introduces central tendency and variability, two characteristics that describe the cen-
ter and width of a distribution expressed as how different the scores are from one another.
Before discussing these two concepts, an explanation is provided on the notation used in
psychometrics and statistics—summation or sigma notation.
Sigma notation is a form of notation used to sum an identified number of quantities
(e.g., scores or other measurements). To illustrate summation notation, we use the first
10 scores from our sample of 100 people in Table 2.3 from our test of language develop-
ment; here are the scores in ascending order:
Person:  1   2   3   4   5   6   7   8   9   10
Score:  19  23  23  26  26  26  27  27  27  27
In sigma notation, the instruction to sum the scores for these 10 people is written as ΣX (Expression 1), where the summation index runs from the first person to the tenth. The shorthand notation can be expanded by writing out an X for each value of the index, from the starting value to the final value, as illustrated in Equation 2.1:

ΣX = X1 + X2 + . . . + X10 = 19 + 23 + 23 + 26 + 26 + 26 + 27 + 27 + 27 + 27 = 251
Finally, the notation ΣX is defined as "the sum of all the measurements of X"; for our
example set of 10 scores, ΣX = 251.
Another frequent use of summation notation is provided next. In Expression 2, we
see that each score is squared as a first step, and then summation occurs.
ΣX² = 19² + 23² + 23² + 26² + 26² + 26² + 27² + 27² + 27² + 27² = 6363
Expression 3 provides yet another summation example often encountered in psychometrics and statistics—squaring the summed scores:

(ΣX)² = (19 + 23 + 23 + 26 + 26 + 26 + 27 + 27 + 27 + 27)² = (251)² = 63001

Notice that there is a clear distinction between Expressions 2 and 3; for example, the
"sum of squared scores" does not equal the "square of the summed scores." Remember the
order of operations rule to always conduct the operations within the parentheses first before
carrying out the operation outside of the parentheses (i.e., work from the inside to the outside).
Next, we turn to the situation where a constant is applied to our scores to see how
this is handled in summation notation. A constant is a value applied to each score that is
unchanging. Suppose we want to subtract a constant of 3 from each of our scores and then
sum the squares of the differences enclosed within parentheses. This is illustrated using
summation notation in Expression 4:

Σ(X − C)² = (19 − 3)² + (23 − 3)² + (23 − 3)² + (26 − 3)² + (26 − 3)² + (26 − 3)² + (27 − 3)² + (27 − 3)² + (27 − 3)² + (27 − 3)²
= 256 + 400 + 400 + 529 + 529 + 529 + 576 + 576 + 576 + 576 = 4947
Becoming familiar with sigma and summation notation requires a little practice. You
are encouraged to practice using the expressions and equations above with single integer
values. Given that sigma notation is used extensively in psychometrics and statistics,
familiarity with it is essential.
FIGURE 2.10. Six distribution shapes (panels A–F), each plotted as relative frequency (Y-axis) against score value (X-axis).
Skewed distributions are not able to be divided into mirror halves. Positively skewed distributions are those
with low frequencies that trail off with positive numbers to the right. The distributions in
the bottom half of Figure 2.10 (E, F) are positively skewed. If the tail of the distribution
is directed toward positive numbers, the skew is positive. For example, home prices in
major metropolitan cities are often positively skewed because professional athletes pur-
chase homes at very high prices, producing a positive skew in the distribution of home
prices. In Figure 2.10, panel D illustrates a negatively skewed distribution. Finally, the
modality is the number of clearly identifiable peaks in the distribution. Distributions
with a single peak are unimodal and distributions with two peaks are bimodal. For
example, in Figure 2.10, panels C and F illustrate bimodal distributions, whereas panels
A, D, and E illustrate unimodal distributions. Panel B in Figure 2.10 illustrates a rect-
angular distribution. Notice that this type of distribution does not have a well-defined
mode and includes a large number of score values at the same frequency. For example,
consider tossing a fair die. The relative frequency (i.e., probability) of rolling any value
(1, 2, 3, 4, 5, 6) on the die face is 1/6. This pattern of frequencies produces a rectangular
distribution because all of the relative frequencies are the same. Panel E in Figure 2.10
illustrates a type of distribution where the relative frequency of rare events occurs (e.g., in
occurrences of rare diseases). For example, most people will never contract an extremely
rare disease, so the relative frequency is greatest at a value of zero. However, some people
have contracted the disease, and these people create the long tail trending to the right.
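Skewness and modality can be checked both numerically and graphically. The SPSS sketch below again assumes the Score variable from the running example; the histogram is used to judge the number of peaks, and the skewness statistic (with its standard error) indexes the direction and degree of asymmetry.
* Skewness and kurtosis statistics plus a histogram for judging shape and modality.
FREQUENCIES VARIABLES=Score
  /FORMAT=NOTABLE
  /STATISTICS=SKEWNESS SESKEW KURTOSIS SEKURT
  /HISTOGRAM.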
Score distributions differ in terms of their central tendency and variability. For exam-
ple, score distributions can vary only in the central tendency (center) or only in their
variability (spread), or both. Examining the graphic display of distributions for different
groups of people on a score distribution provides an informative and intuitive way to
learn about how and to what degree groups of people differ on a score. Central tendency
and variability are described in the next section.
Central Tendency
Central tendency is the score value (a position along the X-axis) that marks the center of
a distribution of scores. Knowing the center of a distribution for a
set of scores is important for the following reasons. First, since a measure of central ten-
dency is a single number, it is a concise way to provide an initial view of the set of scores.
Second, measures of central tendency can quickly and easily be compared. Third, many
inferential statistical techniques use a measure of central tendency to test hypotheses of
various types. In this section we cover three measures of central tendency: the mean,
median, and mode.
Mean
The mean is a measure of central tendency appropriate for data acquired at an interval/
ratio level; it is equal to the sum of all the values (e.g., scores or measurements) of a vari-
able divided by the number of values (e.g., scores or measurements). The formula for the
mean is provided in Equations 2.2a and 2.2b.
A statistic is computed for measurements or scores in a sample; a parameter is a
value computed for scores or measurements in a population.
Median
The median is a measure of central tendency that is defined as the value of a variable
that is at the midpoint of the measurements or scores. For example, the median is the
value at which half of the scores or measurements on a variable are larger and half of the
scores are smaller. The median is appropriate for use with ordinal and interval/ratio-level
Equation 2.2a (population mean):  µ = ΣX/N
• ΣX = sum of measurements.
• N = total number of measurements in the population.

Equation 2.2b (sample mean):  X̄ = ΣX/n
• ΣX = sum of measurements.
• n = total number of measurements in the sample.
Equation 2.3 (median for grouped data):  MD = L_M + W[(N/2 − F_CUM)/F_M], where L_M is the lower limit of the interval containing the median, W is the interval width, F_CUM is the cumulative frequency below that interval, and F_M is the frequency within it.
measurements or scores. The median is also less sensitive to extreme scores or outliers
(e.g., when the distribution is skewed). For this reason, it better represents the middle of
skewed score distributions. Finally, the median is not appropriate for nominal measure-
ments because quantities such as larger and smaller do not apply to variables measured
on a nominal level (e.g., categorical data like political party affiliation or biological sex).
The formula for the median is provided in Equation 2.3.
To illustrate Equation 2.3 with the data from Table 2.5, we have the following result:
MD = L_M + W[(N/2 − F_CUM)/F_M] = 34 + 2.5[(50 − 29)/21] = 34 + 2.5 = 36.5
Mode
The mode is the score or measurement that occurs most frequently. For example, for
the data in Table 2.4, 37 is the score that most frequently occurs in the distribution (i.e.,
occurring 12 times).
Variability
Measures of variability for a set of scores or measurements provide a value of how spread
or dispersed the individual scores are in a distribution. In this section the variance and
standard deviation are introduced.
Variance
Variance is the degree to which measurements or scores differ from the mean of the
population or sample. The population variance is the average of the squared deviations of
each score from the population mean “mu” (µ). The symbol for the population variance
is sigma squared (σ²). Recall that Greek letters are used to denote population parameters.
Equation 2.4 provides an example of the population variance.
Equation 2.5 illustrates the sample variance. As you notice, there are only two
changes from Equation 2.4 for the population variance. The first change is that each
score is subtracted from the sample mean (rather than the population mean). The second
difference is that the sum of squares is divided by n – 1. Using n – 1 makes the sample
variance an unbiased estimate of the population variance—provided the sample is drawn
or acquired in a random manner (see the Appendix for more detail).
Finally, the population standard deviation (σ) is the square root of the population
variance; the sample standard deviation (s) is the square root of the sample variance.
To illustrate calculation of the population and sample variance, 10 scores are used
from the data in Table 2.4 and are provided in Table 2.6.
Equation 2.4 (population variance):  σ² = Σ(X − µ)²/N

Equation 2.5 (sample variance):  s² = Σ(X − X̄)²/(n − 1) = SS(X)/(n − 1)
• Σ(X − X̄)² = sum of the squared deviations of each score from the sample mean.
• n − 1 = total number of measurements or scores in the sample minus 1.
• SS(X) = sum of the squared deviations of each score from the mean (a.k.a. sum of squares).
For the data in Table 2.6, application of Equation 2.4 yields a population variance of
σ² = Σ(X − µ)²/N = SS(X)/N = 694/10 = 69.4
For the same data, application of Equation 2.5 yields a sample variance of:
s² = Σ(X − X̄)²/(n − 1) = SS(X)/(n − 1) = 694/9 = 77.11
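The two variance formulas can be sketched in Python as follows. The ten Table 2.6 scores are not reproduced here, so the final lines simply verify the reported values directly from the sum of squares (694):

def population_variance(scores):
    # Equation 2.4: average squared deviation from the population mean.
    mu = sum(scores) / len(scores)
    return sum((x - mu) ** 2 for x in scores) / len(scores)

def sample_variance(scores):
    # Equation 2.5: sum of squared deviations divided by n - 1.
    mean = sum(scores) / len(scores)
    return sum((x - mean) ** 2 for x in scores) / (len(scores) - 1)

# Any ten scores whose sum of squared deviations equals 694 reproduce the
# values in the text: 694/10 = 69.4 and 694/9 = 77.11.
ss = 694
print(ss / 10, ss / 9)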
Percentiles
In psychological measurement, individual measurements or scores are central to our
study of individual differences. Percentiles are used to provide an index of the relative
standing for a person with a particular score relative to the other scores (persons) in a
distribution. Percentiles reflect relative standing and are therefore classified as ordinal
values. Percentiles do not reflect how far apart scores are from one another. Scores that reside
at or near the top of the distribution are highly ranked or positioned, whereas scores
that reside at or near the bottom of the distribution exhibit a low ranking. For example,
consider the scores in Table 2.4 (on p. 25). The percentiles relative to the raw scores in
column 1 are located in the last column of the table. We see in the last column (labeled
cumulative relative frequency) that a person with a score of 36 is located at the 50th per-
centile (0.50) of the distribution. A person with a score of 43 is located at the 95th percentile.
The percentile rank is another term used to express the percentage of people scor-
ing below a particular score. For example, based on the same table, for a person scoring
43, his or her percentile rank is 95. The person’s standing is interpreted by stating that the
person scored higher or better than 95% of the examinees.
z-Scores
Raw scores in a distribution provide little information on their own because (1) they reveal
little about a person's standing relative to other persons' scores in the distribution and
(2) the meaning of zero changes from distribution to distribution. For example, in Table 2.3 (on p. 24) a score of 36 tells little about where
this person stands relative to others in the distribution. However, if we know that a
score of 47 is approximately two standard deviations above the mean, we know that this score is rela-
tively high in this particular distribution of scores. Turning to another example, con-
sider that a person takes three different tests (on the same subject) and that on each test
the points awarded for each item differ (see Table 2.7). The total number of points on
each test is 100, but the number of points awarded for a correct response to each item differs
by test (e.g., see row 4 of Table 2.7). It appears from the data in Table 2.7 that the person
is performing progressively worse on each test moving from test 1 to test 3. However,
this assumes that a score of zero has the same meaning for each test (and this may not
be the case). For example, assume that the lowest score on test 1 is 40 and that the
lowest score on test 2 is zero. Under these circumstances, zero on test 1 is interpreted
as 40 points below the lowest score. Alternatively, on test 2, zero is the lowest score.
The previous example illustrates why raw scores are not directly comparable. However,
if we rescale or standardize the value of zero so that it means the same thing in every dis-
tribution, we can directly compare scores from different tests with different distributions.
Transforming raw scores to z-scores accomplishes this task. Returning to Table 2.7, if we
create difference scores by subtracting the mean of the distribution from the person's score,
we see that if the raw score equals the mean, then the difference score equals zero; this is true
regardless of the distributional characteristics of the three tests. So, the difference score always
has the same meaning regardless of the characteristics of the distribution of scores. Notice
that based on difference scores, the person’s performance improved from test 1 to test 2,
even though the raw score is lower! Applying the z-score transformation in Equation 2.6a
to each of the score distributions of tests 1–3 yields the values in the last row of Table 2.7.
Alternatively, if one wants to obtain a raw score from a known z-score, Equation 2.6b
serves this purpose.
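A small Python sketch of the two transformations; the mean and standard deviation below are hypothetical values used only for illustration:

def to_z(x, mean, sd):
    # Equation 2.6a: z = (X - mean) / SD.
    return (x - mean) / sd

def to_raw(z, mean, sd):
    # The inverse transform (Equation 2.6b): X = mean + z * SD.
    return mean + z * sd

# Hypothetical test with mean 50 and SD 10: a raw score of 65 is 1.5 SDs
# above the mean, and transforming back recovers the raw score.
z = to_z(65, 50, 10)
print(z, to_raw(z, 50, 10))  # 1.5 65.0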
Finally, based on inspection of the raw and difference scores in Table 2.7, it may
appear that the person performed worse on test 3 than on test 2. However, this is not true
because each question is worth 1 point on test 3 but 5 points on test 2. So, relative to
the mean of each distribution, the person performed better on test 3 than on test 2. This
pattern or trend is captured in the z-scores in Table 2.7. For example, inspection of the
z-scores created using Equation 2.6a in the table illustrates that the person’s performance
improved or increased relative to others in the distribution of scores for each of the three
tests. To summarize, difference scores standardize the meaning of zero across distributions,
and z-scores standardize the unit of measurement.
To make the relationship between raw and z-scores clearer, Table 2.8 provides a set
of 20 scores from the language development/vocabulary test in the GfGc dataset. We treat
these scores as a population and apply the sigma notation to these data to illustrate deri-
vation of the mean, standard deviation, sum of squares, and the parallel locations in the
score scale between z- and raw scores. Figure 2.11 illustrates the position equivalence for
a raw score of 45 and a z-score of 1.14.
ΣX = 690                            Σz = 0
ΣX² = 25,430                        Σz² = 19
µ_X = ΣX/N = 690/20 = 34.5          µ_Z = Σz/N = 0/20 = 0
SS(X) = ΣX² − Nµ_X²                 SS(Z) = Σz² − Nµ_Z²
Normal Distributions
There are many varieties of shapes of score distributions (e.g., see Figure 2.10 for a
review). One commonly encountered type of distribution in psychological measurement
is the normal distribution. Normal distributions share three characteristics. First, they
are symmetric (the area to the left and right of the center of the distribution is the same).
Second, they are often bell-shaped (or a close approximation to a bell-type shape). When
the variance of the distribution is large, the height of the bell portion of the curve is lower
(i.e., the curve is much flatter) than an ideal bell-shaped curve. Similarly, when the vari-
ance is small, the height of the bell portion of the curve is more peaked and the width of
the curve is narrower than an ideal bell-shaped curve. Third, the tails of the distribution
extend to positive and negative infinity. Figure 2.12 illustrates three different varieties
of normal distributions. In the figure, the tails touch the X-axis, signifying that we have
discernible lower and upper limits to the distributions rather than the purely theoretical
depiction of the normal distribution where the tails never actually touch the X-axis.
z = (X − µ)/σ
[Axes in Figure 2.11: frequency versus score (X), mean = 34.5, σ = 9, and frequency versus z-score, mean = 0, σ = 1.]
FIGURE 2.11. Frequency polygons for raw and z-scores for Table 2.8 data. For example, a raw score of 45 equals a z-score of 1.14. Adapted from
Glenberg and Andrzejewski (2008). Copyright 2008 by Lawrence Erlbaum Associates. Adapted by permission.
[Figure 2.12 plots curve height (Y, the ordinate) against scores (X, the abscissa); the mean, median, and mode coincide at the center of each bell-shaped curve.]
FIGURE 2.12. Normal distributions with same mean but different variances.
Equation 2.7 (the normal curve):  u = [1/√(2πσ²)] e^[−(X − µ)²/(2σ²)]
• π = 3.1416.
• u = height of the normal curve at score value X.
• σ² = variance of the distribution.
• e = 2.7183.
• X = score value.
• µ = mean of the score distribution.
Any value can be inserted for the mean (µ) and variance (σ²), so an infinite number of
normal curves can be derived using Equation 2.7.
There are at least two reasons why normal distributions are important to psychological
measurement and the practice of psychometrics. First, the sampling distributions of many
statistics (e.g., the mean) are approximately normally distributed. For example, if the mean test score
is calculated based on a random sample from a population of persons, the sampling distri-
bution of the mean is approximately normal as well. Statistical inference is based on this fact.
Second, many variables in psychological measurement follow the normal distribution (e.g.,
intelligence, ability, achievement) or are an approximation to it. The term approximate means
that although, from a theoretical perspective, the tails of the normal distribution extend to
infinity on the left and right, when using empirical (actual) score data, there are in fact limits
in the upper and lower end of the score continuum. Finally, we can use z-scores in combina-
tion with normal distributions to answer a variety of questions about the distribution.
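For instance, the sketch below uses the standard normal distribution, via the error function in Python's math module, to find the proportion of scores falling below or between given z-scores; it illustrates the general idea rather than reproducing a routine from the text.

from math import erf, sqrt

def normal_cdf(z):
    # Proportion of a standard normal distribution falling below a given z-score.
    return 0.5 * (1.0 + erf(z / sqrt(2.0)))

# What proportion of examinees score below z = 1.0?  Between z = -1 and +1?
below = normal_cdf(1.0)                       # about .8413
between = normal_cdf(1.0) - normal_cdf(-1.0)  # about .6827
print(round(below, 4), round(between, 4))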
Correlation
A multitude of questions in psychology and other fields can be investigated using corre-
lation. At the most basic level, correlation is used to study the relationship between two
variables. For example, consider the following questions. Is there a relationship between
verbal ability and quantitative reasoning? Is there a relationship between dementia and
short-term memory? Is there a relationship between quantitative reasoning and math-
ematical achievement? The correlation coefficient provides an easily interpretable measure
of linear association to answer these questions. For example, correlation coefficients have
a specific range of −1 to +1. Using the correlation, we can estimate the strength and direction
of the relationship for the example questions above, which in turn helps us to understand
individual differences relative to these questions. In this section, we limit our discussion
to the linear relationship between two variables. The Appendix provides alternative measures
of correlation appropriate (1) for ranked or ordinal data, (2) when the relationship between
variables is nonlinear (i.e., curvilinear), and (3) for variables measured on the nominal level.
The Pearson correlation coefficient is appropriate for interval- or ratio-level vari-
ables, assumes a linear relationship, and is symbolized using r (for a statistic) and ρ
(rho) for a population parameter. The correlation coefficient (r or ρ) ranges from −1 to +1,
with the sign indicating the direction of the relationship and the absolute value indicating its strength.
Equations 2.8a and 2.8b illustrate the Pearson correlation coefficient using raw scores
from Table 2.9.
Figure 2.13 illustrates the X-Y relationship for the score data in Table 2.9. The
graph in Figure 2.13 is known as a scattergraph or scatterplot and is essential to
understanding the nature of the X-Y relationship (e.g., linear or nonlinear). Based on
Figure 2.13, we see that the X-Y relationship is in fact linear (i.e., it follows a straight
line: as each score value of X increases, there is a corresponding positive change in
Y-scores). Figure 2.14 illustrates the linear relationship between
X and Y.
r = [NΣXY − ΣXΣY] / √{[NΣX² − (ΣX)²][NΣY² − (ΣY)²]}
  = (291,300 − 274,320) / √{[136,680 − 129,600][622,820 − 580,644]}
  = 16,980 / √{[7,080][42,176]}
  = 16,980 / √298,606,080
  = 16,980 / 17,280.22
  = .983
FIGURE 2.13. Scatterplot of fluid and crystallized intelligence total scores. Correlation (r) is .983.
FIGURE 2.14. Scatterplot of fluid and crystallized intelligence with regression line. Correlation
(r) is .983; r² = .966.
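The raw-score computation above can be verified with a short Python sketch. The sums passed in are those implied by the worked computation (e.g., ΣX = 360 because (ΣX)² = 129,600 and NΣX² = 136,680 with N = 10); they are inferred rather than copied from Table 2.9, which is not reproduced here.

from math import sqrt

def pearson_r_from_sums(n, sum_x, sum_y, sum_x2, sum_y2, sum_xy):
    # Raw-score Pearson r:
    # r = (N*ΣXY - ΣX*ΣY) / sqrt([N*ΣX² - (ΣX)²][N*ΣY² - (ΣY)²])
    numerator = n * sum_xy - sum_x * sum_y
    denominator = sqrt((n * sum_x2 - sum_x**2) * (n * sum_y2 - sum_y**2))
    return numerator / denominator

# Sums implied by the worked example (N = 10):
# ΣX = 360, ΣY = 762, ΣX² = 13,668, ΣY² = 62,282, ΣXY = 29,130.
print(round(pearson_r_from_sums(10, 360, 762, 13668, 62282, 29130), 3))  # 0.983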
Covariance
The covariance is defined as the average cross product of two sets of deviation scores. The
covariance retains the original units of measurement for two variables and is expressed in
deviation score form or metric (a deviation score being a raw score minus the mean of the
distribution of scores). Because of its raw score metric, the covariance is an unstandardized
version of the correlation. The covariance is useful in situations when we want to conduct
an analysis and interpret the results in the original units of measurement. For example,
we may want to evaluate the relationships among multiple variables (e.g., three or more
variables), and using a standardized metric like the correlation would provide mislead-
ing results because the variables are not on the same metric or level of measurement. In
this case, using the covariance matrix consisting of more than two variables makes more
sense. In fact, the multivariate technique structural equation modeling (SEM; used in
a variety of psychometric analyses) typically employs the covariance matrix rather than
the correlation matrix in the analysis. Thus, SEM is also called covariance structure
modeling. The equation for the covariance using raw scores is provided in Equations
2.9a and 2.9b for the population and sample. The Appendix provides examples of how to
derive the covariance matrix for more than two variables.
An important link between the correlation coefficient r and the covariance is illus-
trated in Equations 2.10a and 2.10b.
Equation 2.9b (sample covariance) applied to the Table 2.9 data:

s_XY = Σ(X − X̄)(Y − Ȳ)/(n − 1) = 1698/9 = 188.66

• X, Y = raw scores on any two measures.
• X̄ = mean on measure X.
• Ȳ = mean on measure Y.
• (X − X̄), (Y − Ȳ) = deviation scores on measures X and Y.
• s_XY = covariance for measures X and Y.
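A Python sketch of the sample covariance and the standard link to the correlation coefficient, r = s_XY/(s_X · s_Y), which is presumably what Equations 2.10a and 2.10b express in population and sample form. The two score lists are illustrative only, not the Table 2.9 data.

from math import sqrt

def sample_covariance(x, y):
    # s_xy = Σ(X - X̄)(Y - Ȳ) / (n - 1)
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    return sum((xi - mx) * (yi - my) for xi, yi in zip(x, y)) / (n - 1)

def sample_sd(x):
    n = len(x)
    m = sum(x) / n
    return sqrt(sum((xi - m) ** 2 for xi in x) / (n - 1))

x = [2.0, 4.0, 6.0, 8.0]   # illustrative scores only
y = [1.0, 3.0, 2.0, 5.0]
r = sample_covariance(x, y) / (sample_sd(x) * sample_sd(y))
print(round(sample_covariance(x, y), 3), round(r, 3))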
Regression
Recall that use of the correlation concerned the degree or magnitude of relation between
variables. Sometimes the goal is to estimate or predict one variable from knowledge of
another (notice that this remains a relationship-based question, as was correlation). For
example, research may suggest that fluid intelligence directly affects crystallized intelli-
gence to some degree. Based on this knowledge, and using scores on the fluid intelligence
test in the GfGc data, our goal may be to predict the crystallized intelligence
score from the fluid intelligence score. To address problems of predicting one variable
from knowledge of another we use simple linear regression.
The rules of linear regression are such that we can derive the line that best fits our
data (i.e., best in a mathematical sense). For example, if we want to predict Y (crystal-
lized intelligence) from X (fluid intelligence), the method of least squares locates the
line in a position such that the sum of squares of distances from the points to the line
taken parallel to the Y-axis is at a minimum. Application of the least-squares crite-
rion yields a straight line through the scatter diagram in Figure 2.13 (illustrated in
Figure 2.14).
Using the foundations of plane geometry, we can define any straight line by specify-
ing two constants, called the slope of the line and its intercept. The line we are interested
in is the one that we will use to predict values of Y (crystallized intelligence) given values
of X (fluid intelligence). The general equation for a straight line is stated as: The height of
the line at any point X is equal to the slope of the line times X plus the intercept. The equation
for deriving the line of best fit is provided in Equation 2.11.
To apply the regression equation to actual data, we need values for the constants a
and b. Computing the constant for b is provided in Equations 2.12 and 2.13 using data
from Table 2.9.
The equation for calculating the intercept is illustrated in Equation 2.14 using data
from Table 2.9.
Equation 2.11:  Ŷ = bX + a
• Ŷ = predicted value of Y.
• b = slope of the regression line.
• a = intercept of the line.
Equation 2.12:  b = r(s_Y/s_X) = .983(20.54/8.41) = 2.40
• r = correlation between X and Y.
• s_Y = standard deviation of Y.
• s_X = standard deviation of X.
Equation 2.13:  b = [ΣXY − (ΣXΣY)/n] / [ΣX² − (ΣX)²/n]
• ΣXY = sum of the products of X times Y.
• ΣXΣY = sum of the X-scores times the sum of the Y-scores.
• ΣX² = sum of the squared X-scores.
• (ΣX)² = sum of the X-scores, then squared.
• n = sample size.
Equation 2.14:  a = Ȳ − bX̄
  = 76.2 − 2.40(36)
  = 76.2 − 86.43
  = −10.23
• Ȳ = mean of the Y-scores.
• X̄ = mean of the X-scores.
• b = slope of the regression line.
Returning to Equation 2.11, now that we know the constants a and b, the equation for
predicting crystallized intelligence (Ŷ) from fluid intelligence (X) is given in Equation 2.15a.
To verify that the equation is correct for a straight line, we can choose two values of
X and compute their respective Ŷ from the preceding regression equation, as follows in
Equations 2.15b and 2.15c.
Figure 2.15 illustrates the regression line for the data in Table 2.9. The “stars” rep-
resent predicted scores on crystallized intelligence for a person who scores (1) 30 on the
fluid intelligence test and (2) 43 on the fluid intelligence test.
Equation 2.15a:  Ŷ = bX + a = 2.40(X) − 10.23

Equation 2.15b:  Ŷ = 2.40(30) − 10.23 = 61.77

Equation 2.15c:  Ŷ = 2.40(43) − 10.23 = 92.97
FIGURE 2.15. Line of best fit (regression line) for Table 2.9 data. When fluid intelligence = 30,
crystallized intelligence is predicted to be 61.77. When fluid intelligence = 43, crystallized intel-
ligence is predicted to be 92.97. Highlighting these two points verifies that these two points do in
fact determine a line (i.e., the drawn regression line).
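The slope, intercept, and predicted values can be reproduced with a brief Python sketch using the summary values from the worked example (r = .983, s_Y = 20.54, s_X = 8.41, Ȳ = 76.2, X̄ = 36):

def slope_from_r(r, sd_y, sd_x):
    # Equation 2.12: b = r * (s_Y / s_X)
    return r * sd_y / sd_x

def intercept(mean_y, b, mean_x):
    # Equation 2.14: a = Ȳ - b * X̄
    return mean_y - b * mean_x

def predict(b, a, x):
    # Equation 2.11: Ŷ = bX + a
    return b * x + a

b = slope_from_r(.983, 20.54, 8.41)
a = intercept(76.2, b, 36)
print(round(b, 2), round(a, 2))             # about 2.40 and -10.23
print(round(predict(2.40, -10.23, 30), 2))  # 61.77
print(round(predict(2.40, -10.23, 43), 2))  # 92.97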
Error of Prediction
We call the difference between the actual value of Y and its predicted value Ŷ the error
of prediction (sometimes also called the residual). The symbol used for the error of
prediction is e. Thus, the error of prediction for the ith person is ei and is obtained by
Equation 2.16.
The errors of prediction are illustrated in Figure 2.16. The errors of prediction are
shown as arrows from the regression line to the data point.
Equation 2.16:  e_i = Y_i − Ŷ_i
• Y_i = observed score for person i on variable Y.
• Ŷ_i = predicted score for person i on variable Y.
[Figure 2.16: scatterplot with the regression line, fluid intelligence on the X-axis; arrows from each data point to the line mark the errors of prediction.]
For example, we see that a person with an observed score of 112 on crystallized intel-
ligence will have a predicted score of 102.57 based on our prediction equation previously
developed with a slope of 2.40 and intercept of −10.23. Note that negative errors are indi-
cated by points below the regression line and positive errors are indicated by points above
the regression line. So, errors of prediction are defined as the vertical distance between
the person’s data point and the regression line.
However, many different lines can make the sum of the errors of prediction zero. For example, any line that passes through the mean of Y
and mean of X will have errors of prediction summing to zero. To overcome this dilemma,
we apply the least-squares criterion to determine the best regression line. The best
regression line according to the least-squares criterion is a line (1) exhibiting a sum of the
errors of prediction being zero and (2) where the sum of the squared errors of prediction
is smaller than the sum of squared errors of prediction for any other possible line.
Coefficient of Determination
When using regression, we want an idea of how accurate our prediction is likely to be.
The square of the correlation coefficient (r2), known as the coefficient of determination,
measures the extent to which one variable determines the magnitude of another. Recall
that when the correlation coefficient is close to ±1, our prediction will be accurate or
good. Also, in this case, r² will be close to 1. Furthermore, when r² is close to 1,
1 − r² is close to zero. The relationship between r, r², 1 − r², and √(1 − r²) is provided in
Table 2.10.
Standard error of estimate:

s_e = √[Σ(e_i − ē)²/(n − 2)] = √[Σe_i²/(n − 2)] = √(145.278/8) = √18.159 = 4.26

     = s_Y √(1 − r²) √(9/8) = 20.54 √(1 − .966) √(9/8) = 20.54(.184)(1.06)
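A one-line check of the standard error of estimate in Python, using the reported sum of squared errors (145.278) and n = 10:

from math import sqrt

def standard_error_of_estimate(sum_squared_errors, n):
    # s_e = sqrt(Σe_i² / (n - 2))
    return sqrt(sum_squared_errors / (n - 2))

print(round(standard_error_of_estimate(145.278, 10), 2))  # 4.26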
[Table 2.10 lists values of r alongside the corresponding r², 1 − r², and √(1 − r²).]
2.11 Summary
Analysis of variance. A statistical method to test differences between two or more means.
Also used to test variables between groups.
Bimodal distribution. A distribution exhibiting two most frequently occurring scores.
Negatively skewed distribution. A distribution in which the tail slants to the left.
Nominal. A form of categorical data where the order of the categories is not significant.
Nominal scale. A measurement scale that consists of mutually exclusive and exhaustive
categories differing in some qualitative aspect.
Normal distribution. A mathematical abstraction based on an equation with certain
properties. The equation describes a family of normal curves that vary according to
the mean and variance of a set of scores.
Ordinal. Categorical data for which there is a logical ordering to the categories based
on relative importance or order of magnitude.
Measurement and Statistical Concepts 57
Ordinal scale. A scale exhibiting the properties of a nominal scale, but in addition the
observations or measurements may be ranked in order of magnitude (with nothing
implied about the difference between adjacent steps on the scale).
Parameter. A descriptive index of a population.
Pearson correlation coefficient. Measures the linear relationship between two variables
on an interval or ratio level of measurement.
Percentile. A point on the measurement scale below which a specified percentage of the
cases in a distribution falls.
Population. The complete set of observations about which a researcher wishes to draw
conclusions.
Positively skewed distribution. A distribution in which the tail slants to the right.
Random sample. Sample obtained in a way that ensures that all samples of the same
size have an equal chance of being selected from the population.
Ratio. Data consisting of ordered, constant measurements with a natural origin or zero
point.
Ratio scale. A scale having all the properties of an interval scale plus an absolute zero
point.
Real numbers. The size of the unit of measurement is specified, thereby allowing any
quantity to be represented along a number line.
Sample. A subset of a population.
Simple linear regression. The regression of Y on X (only one predictor and one outcome).
Slope of the line. Specifies the amount of increase in Y that accompanies one unit of
increase in X.
Standard error of the estimate. A measure of variability for the actual Y values around
the predicted value Ŷ .
Standard normal distribution. A symmetric, unimodal distribution that has a mean of 0
and a standard deviation of 1 (e.g., in a z-score metric).
Statistic. A descriptive index of a sample.
58 PSYCHOMETRIC METHODS
This chapter introduces validity, including the statistical aspects and the validation process.
Criterion, content, and construct validity are presented and contextualized within the com-
prehensive framework of validity. Four guidelines for establishing evidence for the validity
of test scores are (1) evidence based on test response processes, (2) evidence based on
the internal structure of the test, (3) evidence based on relations with other variables, and
(4) evidence based on the consequences of testing. Techniques of estimation and interpre-
tation for score-based criterion validity are introduced.
3.1 Introduction
FIGURE 3.1. Validity continuum. Bulleted information is from AERA, APA, and NCME (1999,
pp. 11–13).
To clarify the role validity plays in test score use and interpretation, consider the follow-
ing three scenarios based on the general theory of intelligence used throughout this book.
1. The test is designed to measure quantitative reasoning (a subtest contained in fluid intelligence), but the test items contain qualifiers that testwise examinees can exploit. Result: The test was made easier than intended due to the qualifiers in the test items. Therefore, the scores produced
by the measurement actually are indicative of testwiseness rather than inductive
quantitative reasoning skill.
2. The test is designed to measure communication ability (a subtest contained
in crystallized intelligence), but the test items require a high level of reading
skill and vocabulary. Result: The test was made harder than intended due to the
required level of reading skill and vocabulary in the test items. Therefore, the
scores produced by the measurement actually are indicative of reading skill and
vocabulary levels rather than communication ability. This problem may be fur-
ther confounded by educational access in certain sociodemographic groups of
children.
3. The test is designed to measure working memory (a subtest contained in
short-term memory) by requiring an examinee to complete a word-pair asso-
ciation test by listening to one word, then responding by providing a second
word that completes the word pair, but the test items require a high level of
reading skill and vocabulary. Result: The test was made harder than intended
due to the required level of vocabulary in the test items (i.e., the words pre-
sented to the examinee by the examiner). Therefore, the scores produced by
the measurement actually are indicative of vocabulary level rather than short-
term working memory.
From the scenarios presented, you can see how establishing evidence for the validity of
test scores relative to their interpretation can be substantially undermined. The points
covered in the three scenarios illustrate that establishing validity evidence in testing
involves careful attention to the psychometric aspects of test development and in some
instances the test administration process itself.
Recall that validity evidence involves establishing evidence based on (1) test
response processes, (2) the internal structure of the test, (3) relations with other vari-
ables, and (4) the consequences of testing. Each component of validity addresses dif-
ferent but related aspects in psychological measurement. However, the three types of
validity are not independent of one another; rather, they are inextricably related. The
degree of overlap among the components may be more or less, depending on (1) the
purpose of the test, (2) the adequacy of the test development process, and (3) subsequent
score interpretations.
Using the quantitative reasoning (i.e., inductive reasoning) test described in scenario
number 1 previously described, evaluation of content and criterion validity is concerned
with the question, “To what extent do the test items represent the traits being measured?”
A trait is defined as “a relatively stable characteristic of a person . . . which is manifested
to some degree when relevant, despite considerable variation in the range of settings and
circumstances” (Messick, 1989, p. 15). For example, when a person is described as being
sociable and another as shy, we are using trait names to characterize consistency within
individuals and also differences between them (Fiske, 1986). For additional background
on the evolution and use of trait theory, see McAdams & Pals (2007).
One example of the overlap that may occur between content and criterion-related
validity is the degree of shared relationship between the content (i.e., expressed in the test
items) and an external criterion (e.g., another test that correlates highly with the induc-
tive reasoning test). Construct validity addresses the question, “What traits are measured
by the test items?” In scenario number 1, the trait being measured is inductive reasoning
within the fluid reasoning component of general intelligence. From this example, you
can see that construct, criterion, and content validity concern the representativeness of
the trait relative to (1) trait theory, (2) an external criterion, and (3) the items comprising
the test designed to measure a specific trait such as fluid intelligence.
Criterion validity emerged first among the three components of validity. The criterion
approach to establishing validity involves using correlation and/or regression techniques
to quantify the relationship between test scores and a true criterion score. A true crite-
rion score is defined as the score on a criterion corrected for its unreliability. In criterion-
related validity studies, the process of validation involves addressing the question, “How
well do test scores estimate criterion scores?” The criterion can be performance on a task
(e.g., job performance—successful or unsuccessful), or the existence of a psychological
condition such as depression (e.g., yes or no), or academic performance in an educational
setting (e.g., passing or failing a test). The criterion may also be a matter of degree in the
previous examples (i.e., not simply a “yes” or “no” or “pass” or “fail” outcome); in such
situations, the criterion takes on more than two levels of the outcome. In the criterion
validity model, test scores are considered valid for any criterion for which they provide
accurate estimates (Gulliksen, 1987). To evaluate the accuracy of the criterion validity
approach, every examinee included in a validation study must have a value on the crite-
rion. Therefore, the goal in acquiring evidence of criterion-related validity is to estimate
an examinee’s score on the criterion as accurately as possible.
Establishing criterion validity evidence occurs either by the concurrent or the predic-
tive approach. In the concurrent approach, criterion scores are obtained at the same time
(or approximately) as the scores on the test under investigation. Critical to the efficacy of
the criterion validity model is the existence of a valid criterion. The idea is that if an accu-
rate criterion is available, then it serves as a valid proxy for the test currently being used.
Alternatively, in the predictive approach, the goal is to accurately estimate the future perfor-
mance of an examinee (e.g., in an employment, academic, or medical setting). Importantly,
if a high-quality criterion exists, then powerful quantitative methods can be used to estimate
a validity coefficient (Cohen & Swerdlik, 2010; Cronbach & Gleser, 1965). Given the
utility of establishing criterion-related validity evidence, what are the characteristics of a
high-quality criterion? I address this important question in the next section.
Selecting a high-quality criterion involves value judgments and consideration of the consequences for examinees
based on using the criterion for placement and/or selection decisions. Although the cri-
terion validity model provides important advantages to certain aspects of the validation
process, the fact remains that validating the criterion itself is often viewed as an inher-
ent weakness in the approach (Gregory, 2000; Ebel & Frisbie, 1991). To illustrate, in
attempting to validate a test to be used as a criterion, there must be another test that
can serve as a reference for the relevance of the criterion attempting to be validated; to
this end, a circular argument ensues. Therefore, validation of the criterion itself is the pri-
mary shortcoming of the approach. One strategy in addressing this challenge is to use the
content validity model supplemental to the criterion validity model. This strategy makes
sense because, as noted earlier, establishing a comprehensive validity argument involves all three
components—criterion, content, and construct, with construct validity actually subsuming
criterion and content aspects. The content validity model is discussed later in the chapter.
Next we turn to challenge 2—sample size.
Restriction of range may also occur when the predictor or criterion tests exhibit
floor or ceiling effects. A floor effect occurs when a test is exceptionally difficult result-
ing in most examinees scoring very low. Conversely, a ceiling effect occurs when a test is
exceptionally easy, resulting in most examinees scoring very high.
This section introduces the psychometric and statistical aspects related to the estimation of
criterion-related validity. To facilitate understanding, we use the GfGc data to merge theory
with application. Recall that one of the subtests of crystallized intelligence tests measures
the language development component according to the general theory of intelligence. Sup-
pose a psychologist wants to concurrently evaluate the criterion validity between the lan-
guage development component (labeled as “cri1_tot” in the GfGc dataset) of crystallized
intelligence and an external criterion. The criterion measure is called the Highly Valid Scale
of Crystallized Intelligence (HVSCI; included in the GfGc dataset). Furthermore, suppose
that based on published research, the HVSCI correlates .92 with the verbal intelligence
(VIQ) composite measure on the Wechsler Scale of Intelligence for Adults—Third Edition
(WAIS-III; Wechsler, 1997b). The validity evidence is firmly established for the WAIS-III
VIQ composite as reported by published research. To this end, the correlation evidence
between the HVSCI and the WAIS-III VIQ provides evidence that the HVSCI meets one
aspect of the criteria discussed earlier for a high-quality criterion. Although the general the-
ory of intelligence is different from the Wechsler theory underlying the WAIS-III (e.g., it is
based on a different theory and has different test items), the VIQ composite score provides
a psychometrically valid external criterion by which the language development test can be
evaluated. Finally, because there is strong evidence of a relationship between the HVSCI
and the WAIS-III VIQ (i.e., the correlation between the HVSCI and VIQ is .92), we will use
the HVSCI as our external criterion in the examples provided in this section.
The criterion validity of the language development test can be evaluated by calcu-
lating the correlation coefficient using examinee scores on the language development
subtest and scores on the HVSCI. For example, if we observe a large, positive correlation
between scores on the language development subtest and scores on the HVSCI, we have
evidence that scores on the two tests converge, thereby providing one source of validity
evidence within the comprehensive context of validity. The correlation between the lan-
guage development total score and examinee scores on the HVSCI is .85 (readers should
verify the value of .85 using the GfGc N = 1000 dataset). The .85 coefficient is a form of
statistical validity evidence referred to as the validity coefficient. The concurrent validity
coefficient provides one type of criterion-based evidence (i.e., statistical) in the approach
to evaluating the validity of scores obtained on the language development test.
Recall that test (score) reliability affects the value of the validity coefficient. One way
to conceptualize how score reliability affects score validity is based on the unreliability
(i.e., 1 – rxx) of the scores on the test. To deal with the influence of score reliability in vali-
dation studies, we can correct for the error of measurement (i.e., the unreliability of the
test scores). The upper limit of the validity coefficient is constrained by the square root of
the reliability of each test (i.e., the predictor and the criterion). By taking the square root
of each test’s reliability coefficient, we are using reliability indexes rather than reliability
coefficients (e.g., in Equation 3.1a on page 68). From classical test theory, the reliability
index is defined as the correlation between true and observed scores (e.g., see Chapter 7
for details). To apply this information to our example, the reliability coefficient is .84 for
crystallized intelligence test 1 and .88 for the HVSCI external criterion (readers should
verify this by calculating the coefficient alpha internal consistency reliability for each test
using the GfGc data). Knowing this information, we can use Equation 3.1a to estimate
the theoretical upper limit of the validity coefficient with one predictor.
Inserting the information for our two tests into Equation 3.1a provides the fol-
lowing result in Equation 3.1b.
Once the correlation between the two tests is corrected for their reliability, we see
that the upper limit of the validity coefficient is .86. However, this upper limit is purely
theoretical because in practice we are using fallible measures (i.e., tests that are not per-
fectly reliable). To further understand how the upper limit on the validity coefficient is
established, we turn to an explanation of the correction for attenuation.
In practice, we use tests that include a certain amount of error—a situation that
is manifested as the unreliability of test scores. For this reason, we must account for
the amount of error in the criterion when estimating the validity coefficient. To correct
validity coefficients for attenuation in the criterion measure only but not the predictor,
Equation 3.3a is used (Guilford, 1954, p. 401, 1978, p. 487; AERA, APA, and NCME,
1999, pp. 21–22). Equation 3.3b illustrates the application of Equation 3.3a with our example
reliability information.
Equation 3.3b:  r_xy/√r_yy = .85/√.88 = .85/.94 = .91
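Both computations, the theoretical upper limit of the validity coefficient and the correction for attenuation in the criterion only, can be sketched in Python with the reliabilities (.84 and .88) and the observed validity coefficient (.85) from the example:

from math import sqrt

def validity_upper_limit(rxx, ryy):
    # Theoretical ceiling on the validity coefficient: sqrt(r_xx * r_yy).
    return sqrt(rxx * ryy)

def correct_for_attenuation_in_criterion(r_xy, ryy):
    # Correction for unreliability in the criterion only: r_xy / sqrt(r_yy).
    return r_xy / sqrt(ryy)

print(round(validity_upper_limit(.84, .88), 2))                  # 0.86
print(round(correct_for_attenuation_in_criterion(.85, .88), 2))  # 0.91

These values match the .86 upper limit and the .91 corrected coefficient reported in the text.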
Effective use of the correction for attenuation requires accurate estimates of score reliability.
For example, if the reliability of the scores on a test or the criterion is underestimated, the cor-
rected coefficient will be overestimated. Conversely, if the reliability of the test or the criterion
is overestimated, the corrected coefficient will be underestimated. To err on the conservative
side, you can use reliability coefficients for the test and criterion that are overestimated in the
correction formula. The aforementioned point calls into question, “Which type of reliability
estimate is best to use when correction formulas are to be applied?” For example, when using
internal consistency reliability methods such as coefficient alpha, the reliability of true scores
is often underestimated. Because of the underestimation problem with coefficient alpha, alternate
forms of reliability estimates are recommended for use in attenuation correction formulas. Finally,
correlation coefficients fluctuate based on the degree of sampling and measurement error.
The following recommendations are offered regarding the use and interpretation
of correction attenuation formulas. First, when conducting validity studies researchers
should make every attempt to reduce sampling error by thoughtful sampling protocols
paired with rigorous research design (e.g., see the section on challenges to the criterion
validity model earlier in the chapter). Second, large samples are recommended since this
action aids in reducing sampling error. Third, corrected validity coefficients should be
interpreted with caution when score reliability estimates are low (i.e., the reliability of
either the predictor or criterion or both is low).
3.7 Estimating Criterion Validity with Multiple Predictors: Partial Correlation
Establishing validity evidence using the criterion validity model sometimes involves using
multiple predictor variables (e.g., several tests). Central to the multiple variable problem rela-
tive to test or score validity is the question, “Am I actually studying the relationships among
the variables that I believe I am studying?” The answer to this question involves thoughtful
reasoning to ensure that we are actually studying the relationships we believe we are studying.
We can employ statistical control to help answer the validity-related question, “Am I study-
ing the relationship among the variables that I believe I am studying?” In a validation study,
statistical control means controlling the influence of a “third” or “additional” predictor (e.g.,
test) by accounting for (partialling out) its relationship with the primary predictor (e.g., test)
of interest in order to more accurately estimate its effect on the criterion. The goal in statisti-
cal control is to (1) maximize the systematic variance attributable to the way examinees
respond to test items (e.g., artifacts of the test or testing conditions that cause examinees to
score consistently high or low); (2) minimize error variance (e.g., error attributable to the
content of the test or instrument or the research design used in a study); and (3) control
extraneous variance (e.g., other things that increase error variance such as elements specific
to the socialization of examinees). Chapter 7 on score reliability based on classical test theory
summarizes the issues that contribute to the increase in variability of test scores.
In validation studies, multiple predictor variables (tests) are often required in order to
provide a comprehensive view of the validity of test scores. For example, consider the sce-
nario where, in addition to the primary predictor variable (test), there is a second predictor
variable that correlates with the primary predictor variable and the criterion. To illustrate, we
use as the criterion the Highly Valid Scale of Intelligence (HVSCI) in the GfGc dataset and the
language development subtest component of crystallized intelligence as the primary predic-
tor. Suppose research has demonstrated that fluid intelligence is an important component
that is related to language development. Therefore, accounting for fluid intelligence provides
a more accurate picture of the relationship between language development and the HVSCI.
The result is an increase in the integrity of the validity study. Armed with this knowledge, the
graphic identification subtest of fluid intelligence (labeled “fi2_tot” in the GfGc dataset) is
introduced with the goal of evaluating the relationship between the criterion (HVSCI) and the
primary predictor for a group of examinees whose graphic identification scores are similar. To
accomplish our analytic goal, we use the first-order partial correlation formula illustrated
in Equation 3.4a.
The first-order partial correlation refers to one of two predictors being statistically
controlled and involves three variables (i.e., criterion, predictor 1, and predictor 2). Alter-
natively, the zero-order correlation (i.e., Pearson) involves only two variables (i.e., a
criterion and one predictor). To apply the first-order partial correlation, we return to our
GfGc data and use the HVSCI as the criterion (Y; labeled HVSCI), the language develop-
ment total score (based on the sum of the items on crystallized intelligence test 1) as
the primary predictor (X1; labeled cri1_tot), and a second predictor (X2; labeled fi2_tot)
based on a measure of fluid intelligence (i.e., the graphic identification subtest total score).
Equation 3.4b illustrates the use of the first-order partial correlation using the GfGc data;
we see that the result is .759. To illustrate how to arrive at this result using SPSS, syntax is
provided below (readers should conduct this analysis and verify their work with the par-
tial output provided in Table 3.1). The darkly shaded values in Table 3.1 are Pearson (zero-order)
correlations, and the lightly shaded values in the bottom of the table include the first-order
partial correlation.
PARTIAL CORR
/VARIABLES=HVSCI cri1_tot BY fi2_tot
/SIGNIFICANCE=TWOTAIL
/STATISTICS=CORR
/MISSING=LISTWISE.
Continuing with our example, we see from the SPSS output in Table 3.1 that language
development and graphic identification are moderately correlated (.39). Using this informa-
tion, we can answer the question, “What is the correlation (i.e., validity coefficient) between
language development (the primary predictor) and HVSCI (the criterion) given the exam-
inees’ scores (i.e., ability level) on graphic identification?” Using the results of the analysis,
we can evaluate or compare theoretical expectations based on previous research related to
the previous question (e.g., “Does our analysis concur with previous research or theoretical
expectations?”). As you can see, the partial correlation technique provides a way to evaluate
different and sometimes more complex score validity questions beyond the single-predictor
case. Inspection of Equation 3.4a and Table 3.1 reveals that when two predictors are highly
and positively correlated with the criterion, the usefulness of the second predictor diminishes
because the predictors are explaining much of the same thing.
The results of the partial correlation analysis can be interpreted as follows. Controlling
for examinee scores (i.e., their ability) on the graphic identification component of fluid intel-
ligence, we see that the correlation between HVSCI and language development is .759. Notice
that the zero-order correlation between HVSCI and language development is .799. By par-
tialing out or accounting for the influence of graphic identification, the correlation between
HVSCI and language development reduces to .759. Although the language development and
graphic identification tests are moderately correlated (.39), the graphic identification test
adds little to the relationship between language development and the HVSCI. To this end,
the graphic identification component of fluid intelligence contributes little above and beyond
what language development contributes alone to the HVSCI. However, the first-order partial
correlation technique allows us to isolate the contribution each predictor makes to the HVSCI
in light of the relationship between the two predictors.
Equation 3.5a illustrates another way to understand how we arrived at the result of
.759 in Equations 3.4a and 3.4b. Equation 3.5 illustrates the semipartial correlation and
allows for partitioning the correlation in a way that isolates the variance in the HVSCI
(Y) accounted for by language development (X1) after the effect of graphic identification
(X2) is partialed or controlled.
Applying the correlation coefficients from our example data, we have the result in
Equation 3.5b. Note that the result below agrees with Equation 3.4b. Therefore, we have
illustrated a second way to arrive at the same conclusion but the semipartial correlation
provides a slightly different way to isolate or understand the unique and nonunique rela-
tionships among the predictor variables in relation to the criterion.
Figure 3.2 provides a Venn diagram depicting the results of our analysis in Equation
3.5b.
FIGURE 3.2. Venn diagram illustrating the semipartial correlation. The circles represent percentages (e.g., each circle represents 100% of each
variable). This allows for conversion of correlation coefficients into the proportion of variance metric, r2. The r2 metric can then be converted to
percentages to aid interpretation.
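The partial and semipartial coefficients can be sketched directly from zero-order correlations in Python. The text reports r(HVSCI, cri1_tot) = .799 and r(cri1_tot, fi2_tot) = .39 but not r(HVSCI, fi2_tot), so the value used below for that correlation is an assumed, illustrative figure; the functions implement the standard first-order partial and semipartial formulas that Equations 3.4a and 3.5a take the form of.

from math import sqrt

def partial_corr(r_y1, r_y2, r_12):
    # First-order partial correlation of Y and X1, controlling X2.
    return (r_y1 - r_y2 * r_12) / sqrt((1 - r_y2**2) * (1 - r_12**2))

def semipartial_corr(r_y1, r_y2, r_12):
    # Semipartial correlation: X2 is partialled out of X1 only.
    return (r_y1 - r_y2 * r_12) / sqrt(1 - r_12**2)

r_hvsci_fi2 = 0.45  # assumed value for illustration; not reported in the text
print(round(partial_corr(0.799, r_hvsci_fi2, 0.39), 3))
print(round(semipartial_corr(0.799, r_hvsci_fi2, 0.39), 3))

With an assumed r(HVSCI, fi2_tot) near the true sample value, the partial correlation lands close to the .759 reported from the SPSS analysis.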
A final point relates to the size of the sample required to ensure adequate statistical
power for reliable results. A general rule of thumb regarding the necessary sample size
for conducting partial correlation and multiple regression analysis is minimally 15 sub-
jects per predictor variable when (1) there are between 3 and 25 predictors and (2) the
squared multiple correlation, R2 = .50 (Stevens, 2003, p. 143). The sample size require-
ment for partial correlation and regression analysis also involves consideration of (1) the
anticipated effect size and (2) the alpha-level used to test the null hypothesis that R2 =
0 in the population (Cohen, Cohen, West, & Aiken, 2003, pp. 90–95). Benchmarks for
effect size in terms of the proportion of variance accounted for, R² (see Figure 3.2 as an exam-
ple), are .02 (small), .13 (medium), and .26 (large) (Cohen et al., 2003, p. 93). These
sample guidelines are general, and a sample size/power analysis should be conducted as
part of a validation study to ensure accurate and reliable results. Finally, remember that,
in general, sampling error is reduced as sample size increases.
• r*_YX2·X1 = partial correlation corrected for attenuation.
We see in Equation 3.6b that correcting for the attenuation using all three variables
substantially changes the partial correlation for graphic identification’s predictive validity
to .18. In practical testing situations, the predictors will never be completely reliable (i.e.,
100% free from measurement error), so it is more reasonable to correct for attenuation in
the criterion only. The result of this approach is provided in Equation 3.6c.
As Equation 3.6c shows, correcting for attenuation in just the criterion makes a sub-
stantial change from the case where we corrected for attenuation in all three variables.
Specifically, the validity coefficient derived based on the partial correlation corrected for
attenuation in the criterion only is .36 (substantially higher than .18).
3.8 Estimating Criterion Validity with Multiple Predictors: Higher-Order Partial Correlation
The first-order partial correlation technique can be expanded to include more than a
single variable. For example, you may be interested in controlling the influence of an
additional predictor variable that is related to the primary predictor variable. In this sce-
nario, the higher-order partial correlation technique provides a solution. To illustrate,
consider the case where you are conducting a validity study with the goal of evaluating
the criterion validity of the HVSCI using a primary predictor of interest, but now you
have two predictors that previous research has indicated influence the criterion validity
of the HVSCI. Building on the first-order partial correlation technique, the equation for
higher-order partial correlation is presented in Equation 3.7.
PARTIAL CORR
/VARIABLES=HVSCI cri1_tot BY fi2_tot stm2_tot
/SIGNIFICANCE=TWOTAIL
/STATISTICS=DESCRIPTIVES CORR
/MISSING=LISTWISE.
Reviewing the results presented in Tables 3.2a and 3.2b, we see that the higher-order
partial correlation between the criterion HVSCI and language development (cri1_tot) is
.746 (lightly shaded) after removing the influence of graphic identification (fi2_tot) and
short-term memory (stm2_tot). In Chapter 2, correlation and regression were introduced
as essential to studying the relationships among variables. When we have more than
two variables, regression provides a framework for estimating validity. The next section
illustrates how using multiple regression techniques helps explain how well a criterion is
predicted from a set of predictor variables.
The total sum of squares for the criterion Y is

SS_Y = Σ(Y_i − Ȳ)²,  or computationally,  SS_Y = [nΣY_i² − (ΣY_i)²]/n,

and it is partitioned into regression and residual components, with degrees of freedom (n − 1) for the total, (k) for the regression, and (n − k − 1) for the residual.

• SS_Y = sum of the differences between each examinee's score on Y and the mean of Y, with each difference squared.
• (ΣY_i)² = the Y-scores summed, then the sum squared.
• Y_i′ = predicted score on Y for each examinee in a sample.
• Ȳ = mean of the criterion Y for the sample.
• Y_i = score on Y for an examinee.
• Σ = summation operator.
• df = degrees of freedom.
• n = sample size.
• k = number of predictors.

The squared multiple correlation can also be written as the sum of each standardized regression weight times the corresponding zero-order validity coefficient:

R²_Y·1,…,k = β₁(r_YX1) + β₂(r_YX2) + ⋯ + β_k(r_YXk)
As a prelude to the next section and to illustrate estimation of the partial and semipartial
correlations with our example data, the results of a multiple linear regression (MLR) analy-
sis are presented in Tables 3.3a, 3.3b, and 3.3c. The SPSS syntax that generated the output
tables is presented after Tables 3.3a through 3.3c. The mechanics of multiple regression analy-
sis are presented in more detail in the next section, but for now results are presented in order
Table 3.3b. ANOVA summary for the regression of HVSCI on three predictors

Source         Sum of Squares    df                Mean Square    F
Regression     314832.424        3 (k)             104944.141     638.65
Residual       163662.932        996 (n – k – 1)   164.320
Total          478495.356        999 (n – 1)
a. Predictors: (Constant=intercept), Gsm short-term memory: auditory and visual components
(stm3_tot), Gc measure of vocabulary (cri1_tot), Gf measure of graphic identification (fi2_tot).
b. Dependent (criterion) Variable: Highly valid scale of crystallized intelligence—external
criterion measure of crystallized IQ (HVSCI).
c. Degrees of freedom (df)-related information for sample size and predictor variables has
been added in parentheses to aid interpretation.
d. F = mean square regression divided by mean square residual (e.g., 104944.14/164.32 = 638.65).
to (1) provide a connection to the analysis of variance (ANOVA) via the sums of squares
(Table 3.3b) and (2) highlight the partial and semipartial correlation coefficients with our
example data. Table 3.3c is particularly relevant to the information on partial and semipartial
correlation coefficients and how multiple predictors contribute to the relationship with the
criterion—in light of their relationship to one another. As an instructive exercise, readers should
use the sums of squares in Table 3.3b and insert them into Equation 3.11 to verify the R2 value
presented in Table 3.3a produced by SPSS.
REGRESSION
/DESCRIPTIVES MEAN STDDEV CORR SIG N
/MISSING LISTWISE
/STATISTICS COEFF OUTS CI (95) R ANOVA ZPP
/CRITERIA=PIN (.05) POUT (.10)
/NOORIGIN
/DEPENDENT HVSCI
/METHOD=ENTER cri1_tot fi2_tot stm3_tot.
Note. The variable entry method is ENTER, where all predictors are entered into the equation at
the same time. Other predictor variable entry methods are discussed later in the chapter.
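As an instructive check (assuming Equation 3.11 is the usual R² = SS_regression/SS_total), the Table 3.3b sums of squares can be verified with a few lines of Python:

ss_total = 478495.356
ss_residual = 163662.932
ss_regression = ss_total - ss_residual      # 314832.424
k, n = 3, 1000

r_squared = ss_regression / ss_total        # about 0.658
f_ratio = (ss_regression / k) / (ss_residual / (n - k - 1))
# f_ratio is about 638.66; the 638.65 in the table notes reflects rounding
# of the mean squares before division.
print(round(r_squared, 3), round(f_ratio, 2))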
3.10 Estimating Criterion Validity with More Than One Predictor: Multiple Linear Regression
The correlation techniques presented so far are useful for estimating criterion validity by
focusing on the relationship between a criterion and one or more predictor variables—
when measured at the same point in time. However, many times the goal in validation
studies is to predict the outcome on some criterion in the future. For example, consider the
following question: “What will an examinee’s future score be on the HVSCI given our
knowledge of their scores on language development, graphic identification, and auditory
and visual short-term memory?” A related question is, “How confident are we about the
predicted scores for an examinee or examinees?” To answer questions like these, we turn
to multiple linear regression (MLR) introduced briefly in the last part of this section.
When tests are used for prediction purposes, the first step required is the devel-
opment of a regression equation (introduced in Chapter 2). In the case of multiple
predictor variables, a multiple linear regression equation is developed to estimate the
best-fitting straight line (i.e., a regression line) for a criterion from a set of predictor vari-
ables. The best-fitting regression line minimizes the sum of squared deviations from the
best-fitting straight line. For example, Figure 3.3a illustrates the regression line based on
a two-predictor multiple regression analysis using HVSCI as the criterion, and language
development (X1 − cri1_tot) and graphic identification (X2 − fi2_tot) as the predictor
variables.
Figure 3.3b illustrates the discrepancy between the observed HVSCI criterion scores
(i.e., the circular dots) for the 1,000 examinees and their predicted scores (i.e., the solid
straight line of best fit), based on the regression equation developed from our sample data.
FIGURE 3.3a. Regression line of best fit with 95% prediction interval. The dashed lines represent the 95% prediction interval based on the regression equation. The interval is interpreted to mean that in (1 – α), or 95%, of the intervals that would be formed from repeated random samples, the population mean value of Y for a given value of X will be included.
FIGURE 3.3b. Regression line of best fit with observed versus predicted Y values and the 95%
prediction interval.
The concepts we have developed in the previous sections (and in Chapter 2) provide
a solid foundation for proceeding to the development of a multiple linear regression
equation. However, before proceeding, we review the assumptions of the multiple linear
regression (presented in Table 3.4). Since the model is linear, several assumptions are
relevant to properly conduct an analysis. The model assumptions should be evaluated
with any set of data prior to conducting a regression analysis because violations of the
assumptions can yield inaccurate parameter estimates (i.e., intercept, regression slopes,
and standard errors of slopes). Moderate violations of the assumptions weaken the regres-
sion analysis but do not invalidate it completely. Therefore, researchers need a degree of
judgment specific to violations of the assumptions and their impact on the parameters to
be estimated in a regression analysis (e.g., see Tabachnick & Fidell, 2007, or Draper &
Smith, 1998, for detailed guidance).
For reasons of brevity and simplicity of explanation, we focus on the sample regres-
sion and prediction equations rather than the population equations. However, the
equation elements can be changed to population parameters under the appropriate cir-
cumstances (e.g., population focused and the design of the study includes randomization
in the sampling protocol and model cross validation). In the population equations, the
notation changes to Greek letters (i.e., for population parameters) rather than English
letters. The following sections cover (1) the unstandardized and standardized multiple
regression equations, (2) the coefficient of multiple determination, (3) multiple correla-
tion, and (4) tests of statistical significance. Additionally, the F-test for testing the signifi-
cance of the multiple regression equation is presented.
Testing the statistical significance of the overall regression equation involves the hypothesis in Equation 3.14 (i.e., the hypothesis that R² is zero in the population; note the Greek letter representing population parameters):

$$H_0\colon \rho^2_{Y\cdot 1,\ldots,m} = 0 \qquad H_1\colon \rho^2_{Y\cdot 1,\ldots,m} > 0$$

• H0 = null hypothesis.
• H1 = alternative hypothesis.
• ρ²Y·1,...,m = population coefficient of multiple determination (R² in the sample).

If the null hypothesis in Equation 3.14 is rejected, then at least one of the predictors is statistically significant. Conversely, if the hypothesis is not rejected, then the overall test indicates that none of the predictors plays a significant role in the equation. The statistical test of the hypothesis that R² is zero in the population is provided in Equation 3.15a:

$$F = \frac{R^2/m}{(1 - R^2)/(n - m - 1)}$$

• m = number of predictors.
• n = sample size.

To illustrate Equations 3.15a and 3.15b, the results of a multiple regression analysis using the SPSS syntax below are presented in Tables 3.5a and 3.5b.

REGRESSION
/DESCRIPTIVES MEAN STDDEV CORR SIG
/MISSING LISTWISE
/STATISTICS COEFF OUTS CI (95) R ANOVA COLLIN TOL CHANGE ZPP
/CRITERIA=PIN (.05) POUT (.10) CIN (95)
/NOORIGIN
/DEPENDENT HVSCI
/METHOD=ENTER cri1_tot fi2_tot stm3_tot

Note. The variable entry method is “ENTER” where all predictors are entered into the equation at the same time.

Inserting values from Tables 3.5a and 3.5b into Equation 3.15b, we see that the result concurs with the SPSS output:

$$F = \frac{.658/3}{(1 - .658)/(1000 - 3 - 1)} = \frac{.219}{.342/996} = \frac{.219}{.000343} \approx 638.65$$
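A minimal Python sketch of Equation 3.15b follows; the values of R², m, and n are those used above, and the small difference from 638.65 reflects rounding of R².

# F test of the hypothesis that the population R-squared is zero (Equation 3.15a).
r_squared = 0.658    # from Table 3.5a
m = 3                # number of predictors
n = 1000             # sample size

f_statistic = (r_squared / m) / ((1 - r_squared) / (n - m - 1))
print(round(f_statistic, 2))   # about 638.8; compare with the F value reported in Table 3.5b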
Regarding the statistics in Tables 3.5a and 3.5b, R is the multiple correlation (i.e., a single number representing the correlation between the weighted combination of the three predictors and the criterion). Notice that R is fairly large within the context of a correlation analysis (i.e., the
range of 0 to 1.0). Next, we see from Table 3.5b that the overall regression equation
is statistically significant at a probability of less than .001 (p < .001). The significant R
is interpreted as meaning that at least one variable is a significant predictor of HVSCI
(readers should verify this by referring to the F-table of critical values in a statistics
textbook). Table 3.5b provides the sum of squares (from Equations 3.9 and 3.11),
degrees of freedom, means square (explaining different parts of the regression model),
the F-statistic, and the significance (“Sig.” signifying the probability value associated
with the F-statistic).
The partial regression slopes in a multiple regression equation are directly related to the
partial and semipartial correlations presented earlier in the chapter. To illustrate, we use
our example data to calculate the partial slopes in our regression analysis. The equations
for estimating the partial slopes for b1 (the test of language development; cri1_tot) and b2
(the test of graphic identification) are derived in Equations 3.16a and 3.16b.
To determine if the partial slope(s) is(are) statistically significant from zero in the
population, the standard error of a regression slope is required. The hypothesis tested is
population-based and is provided in Equation 3.17.
The statistical significance of the partial regression slope is evaluated using critical
values of the t-distribution and associated degrees of freedom (where df = n – m – 1; n =
sample size and m = number of predictors). The standard error for the partial regression
coefficient is provided in Equation 3.18. Finally, the t-test for significance of the partial
regression slope is provided in Equation 3.19.
$$H_0\colon \beta_k = 0 \qquad H_1\colon \beta_k \neq 0$$

• H0 = null hypothesis; H1 = alternative hypothesis.
• βk = population regression coefficient for predictor k.

The standard error of a partial regression slope (Equation 3.18) is

$$s(b_k) = \frac{s_{\mathrm{residual}}}{\sqrt{(n - 1)\, s_k^2\, (1 - R_k^2)}}$$

where s_residual is the standard error of the estimate for the full model, s²k is the variance of predictor k, and R²k is the squared multiple correlation of predictor k with the remaining predictors. The t-test for the significance of the partial regression slope (Equation 3.19) is

$$t = \frac{b_k}{s(b_k)}$$
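The following Python sketch implements Equations 3.18 and 3.19 for a single partial slope. The standard error of the estimate (12.82) is the value reported later for the multiple regression model; the slope, predictor variance, and R²k values are hypothetical placeholders included only to show how the pieces combine.

import math

b_k = 0.75            # hypothetical partial regression slope for predictor k
s_residual = 12.82    # standard error of the estimate for the full model
n, m = 1000, 3        # sample size and number of predictors
s_k_squared = 4.0     # hypothetical variance of predictor k
r_k_squared = 0.30    # hypothetical R-squared of predictor k regressed on the other predictors

# Equation 3.18: standard error of the partial regression slope.
se_bk = s_residual / math.sqrt((n - 1) * s_k_squared * (1 - r_k_squared))

# Equation 3.19: t test with df = n - m - 1 degrees of freedom.
t = b_k / se_bk
print(round(se_bk, 3), round(t, 2), n - m - 1)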
The regression of Y on X (or multiple X’s) can also be expressed in a z-score metric by
transforming raw scores on Y and X (see Chapter 2 for a review of how to transform a
raw score to a z-score). After transformation, the means and variances are now expressed
on a 0 and 1 metric. A result of this transformation is that the regression slopes are now
standardized (i.e., the standardized regression slopes) and are equal to rXY, the Pearson
correlation. Since the scores on Y and the multiple X’s are standardized, no intercept is
required for the regression equation. Equations 3.20–3.22 illustrate (1) the standardized
prediction equation, (2) the unstandardized regression equation, and (3) the sample pre-
diction equation using the example intelligence test data.
Note. SPSS regression output yields a value of .855 for the intercept a. The difference is due to the number of decimal places carried in the hand calculations versus the SPSS calculations.
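For a single predictor, the claim that standardization removes the intercept and turns the slope into the Pearson correlation can be checked with the brief Python sketch below; the data are randomly generated for illustration rather than drawn from the GfGc file.

import numpy as np

rng = np.random.default_rng(1)
x = rng.normal(50, 10, size=500)
y = 0.6 * x + rng.normal(0, 8, size=500)      # criterion linearly related to the predictor

# Convert both variables to z-scores (mean 0, standard deviation 1).
zx = (x - x.mean()) / x.std()
zy = (y - y.mean()) / y.std()

# Least-squares slope and intercept for the standardized scores.
slope, intercept = np.polyfit(zx, zy, 1)
print(round(slope, 3), round(intercept, 3))    # slope equals Pearson r; intercept is essentially 0
print(round(np.corrcoef(x, y)[0, 1], 3))       # Pearson correlation, for comparison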
In Equation 3.21, consider the scenario where an examinee has a language develop-
ment score of 7 and a graphic identification score of 6. Using the results from the previous
calculations in Equation 3.21 for b1, b2, and a, we can calculate the examinee’s predicted
score on HVSCI as shown in Equation 3.22.
Using the following syntax (below) to run the regression analysis yields a predicted
value of 25.1 for our examinee. To have SPSS save the predicted values for every exam-
inee (and the 95% prediction interval), the “/SAVE PRED” line in the syntax provided
below is required. A comparison of the predicted value for this examinee reveals that our
regression coefficients and intercept are in agreement (within decimal places/rounding
differences). The correlation between the actual (observed) HVSCI scores and the
predicted scores for the 1,000 examinees is .81. To evaluate the degree of association
between the actual and predicted scores, you can run the correlation between the saved
(predicted) scores for all 1,000 examinees and their actual scores.
REGRESSION
/DESCRIPTIVES MEAN STDDEV CORR SIG N
/MISSING LISTWISE
/STATISTICS COEFF OUTS CI(95) R ANOVA COLLIN TOL CHANGE ZPP
/CRITERIA=PIN(.05) POUT(.10)
/NOORIGIN
/DEPENDENT HVSCI
/METHOD=ENTER cri1_tot fi2_tot
/SCATTERPLOT=(HVSCI ,*ZPRED)
/SAVE PRED.
Note. The variable entry method is “ENTER” where all predictors are entered into the equation at
the same time.
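The same check can also be scripted outside of SPSS. The Python sketch below fits the two-predictor regression, saves the predicted scores, and correlates them with the observed criterion, mirroring the .81 value reported above. The CSV file name is hypothetical and assumes the GfGc data have been exported with the variable names used in the syntax (HVSCI, cri1_tot, fi2_tot).

import numpy as np
import pandas as pd

# Hypothetical export of the GfGc data to a CSV file.
df = pd.read_csv("gfgc.csv")
y = df["HVSCI"].to_numpy(dtype=float)
X = np.column_stack([np.ones(len(df)), df["cri1_tot"], df["fi2_tot"]])

# Ordinary least squares: intercept and two partial slopes.
coefs, *_ = np.linalg.lstsq(X, y, rcond=None)
y_pred = X @ coefs

# Correlation between observed and predicted HVSCI scores (about .81 for these data).
print(round(np.corrcoef(y, y_pred)[0, 1], 2))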
Arguably, the most critical question related to predictive validity when using regression
analysis is, “How accurate is the regression equation in terms of observed scores versus
scores that are to be predicted by the equation?” Answering this question involves using
the standard error of the estimate (SEE), a summary measure of the errors of prediction
based on the conditional distribution of Y for a specific value of X (see Figure 3.4).
FIGURE 3.4. Conditional distribution of Y given specific values of predictor X. The criterion (Y)
HVSCI is regressed on the predictor (X; cri1_tot). Notice that the distribution appears the same for
each value of the predictor. The standard score (Z) to raw score equivalence on the language develop-
ment test (predictor X) is approximately: –3 = 8.0; –2 = 18.0; –1 = 26.0; 0 = 35.0; 1 = 44.0; 2 = 50.0.
Next, to illustrate the role of the SEE, we use the simple linear regression of Y
(HVSCI) on X (cri1_tot). For a sample with a single predictor, the standard error of the
estimate is provided in Equation 3.23a.
Using Equation 3.23a, we can calculate the standard error of the estimate for the simple linear regression model:

$$s_{Y\cdot X} = \sqrt{\frac{\sum (Y - Y')^2}{n - k - 1}} = \sqrt{\frac{SS_{\mathrm{residual}}}{n - k - 1}}$$

• sY·X = sample standard error of the estimate for the regression of Y on X.
• Y − Y′ = difference between an observed score (on the criterion) and the predicted score.
• Σ(Y − Y′)² = sum of the squared differences (errors of prediction).

To provide a connection with the output produced in SPSS, we conduct a regression analysis estimating the sample coefficients for the simple linear regression of HVSCI on cri1_tot using the following syntax.
REGRESSION
/DESCRIPTIVES MEAN STDDEV CORR SIG N
/MISSING LISTWISE
/STATISTICS COEFF OUTS CI(95) R ANOVA
/CRITERIA=PIN(.05) POUT(.10) CIN(95)
/NOORIGIN
/DEPENDENT HVSCI
/METHOD=ENTER cri1_tot
/SCATTERPLOT=(HVSCI ,*ZPRED) (*ZPRED ,*ZRESID)
/RESIDUALS DURBIN HISTOGRAM(ZRESID) NORMPROB(ZRESID)
/CASEWISE PLOT(ZRESID) OUTLIERS(3)
/SAVE PRED ZPRED MAHAL COOK ICIN RESID ZRESID.
Note. The variable entry method is “ENTER” where all predictors are entered into the equation at
the same time.
As an exercise, we can insert the information from Table 3.6a into Equation 3.23b and verify that the standard error of the estimate is 13.176, as provided in the output included in Table 3.6b.
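The verification takes only a few lines of Python, using the residual sum of squares and degrees of freedom reported in Table 3.6a.

import math

ss_residual = 173270.952   # from Table 3.6a
n = 1000                   # sample size
k = 1                      # one predictor (cri1_tot)

# Equation 3.23a: standard error of the estimate.
see = math.sqrt(ss_residual / (n - k - 1))
print(round(see, 3))       # 13.176, matching Table 3.6b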
Table 3.6a. Analysis of Variance Summary Table Providing the Sum of Squares (ANOVA)

Source        Sum of Squares    df     Mean Square    F           Sig.
Regression    305224.404        1      305224.404     1758.021    .000 (a)
Residual      173270.952        998    173.618
Total         478495.356        999

a. Predictors: (Constant), Gc measure of vocabulary
b. Dependent Variable: Highly valid scale of crystallized intelligence (external criterion measure of crystallized IQ; HVSCI)

Next, we calculate an examinee's predicted score using the regression coefficients and the intercept (in Table 3.6c). Specifically, we will predict a score on the HVSCI for an examinee whose actual score is 25 on the HVSCI and 12 on the language development (cri1_tot) test, using Equation 3.13. The sample prediction equation using information from the SPSS output (Table 3.6c) is given in Equation 3.24.
Because we created the predicted scores on HVSCI for all examinees in the GfGc data-
set (e.g., see the last line highlighted in the SPSS syntax “/SAVE PRED”), we can check if
the result in Equation 3.24 agrees with SPSS. The predicted score of 33.07 (now included
in the dataset as a new variable) is 8.07 points higher than the observed score of 25. This
discrepancy is due to the imperfect relationship (i.e., a correlation of .79) between language
development and HVSCI in our sample of 1,000 examinees. Finally, using the sum of
squares presented in Table 3.6a, we can calculate R2—the total variation in Y that is pre-
dictable using the predictor or predictors in a simple linear or multiple linear regression
equation. Recall that R² is calculated by dividing the sum of squares regression by the sum of squares total, both taken from Table 3.6a. Using those sums of squares, the result is 305224.404/478495.356 = .638. Notice that .638 matches the R² reported in Table 3.6b, the regression model summary table.
Recall from the previous section that the correlation between HVSCI (Y) and lan-
guage development (X) was .79, an imperfect relationship. Addressing the question of how accurate the predicted scores from the regression equation are requires application of the standard error of a predicted score, provided in Equation 3.25.
Equation 3.25 is also used to create confidence intervals for (1) predicted scores for
each examinee in a sample or (2) the mean predicted score for all examinees. The SPSS
syntax provided earlier in this section includes the options to produce the predicted
scores for each examinee and the associated 95% prediction intervals.
$$s_{Y'} = s_{Y\cdot X}\sqrt{1 + \frac{1}{n} + \frac{(X - \bar{X})^2}{\sum (X - \bar{X})^2}}$$

• sY′ = sample standard error of prediction for the regression of Y on X.
• sY·X = sample standard error of the estimate.
• X − X̄ = difference between an observed predictor score and the mean predictor score.
• (X − X̄)² = the squared difference between an observed predictor score and the mean predictor score.
• n = sample size.
• k = number of independent or predictor variables.
• Σ(X − X̄)² = sum of the squared deviations of the predictor scores about their mean.

As mentioned earlier, it is common to have multiple predictor variables in a predictive validity study. Estimation of the standard error of prediction for multiple regression is more complex and involves matrix algebra. Fortunately, the computer executes the calculations for us. To understand the details of how the calculations are derived, readers are referred to Pedhazur (1982, pp. 68–96) or Tabachnick and Fidell (2007). The standard error of prediction for multiple linear regression is provided in Equation 3.26 (Pedhazur, 1982, p. 145).
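For the single-predictor case, Equation 3.25 and the resulting 95% prediction interval can be sketched in Python as follows. The standard error of the estimate is the 13.176 obtained earlier, and the predictor mean of 35 approximates the language development mean implied by Figure 3.4; the sum of squared deviations is a hypothetical placeholder.

import math
from scipy import stats

s_yx = 13.176      # standard error of the estimate (HVSCI regressed on cri1_tot)
n = 1000           # sample size
x_new = 12.0       # examinee's language development score
x_mean = 35.0      # approximate predictor mean (see Figure 3.4)
ss_x = 90000.0     # hypothetical sum of squared deviations of the predictor

# Equation 3.25: standard error of a predicted score for a new observation.
s_pred = s_yx * math.sqrt(1 + 1 / n + (x_new - x_mean) ** 2 / ss_x)

# 95% prediction interval around the predicted score from Equation 3.24.
y_hat = 33.07
t_crit = stats.t.ppf(0.975, df=n - 2)
print(round(y_hat - t_crit * s_pred, 2), round(y_hat + t_crit * s_pred, 2))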
The SPSS syntax for conducting multiple linear regression provided next includes
the options to produce the predicted scores for each examinee and the associated 95%
prediction intervals.
REGRESSION
/DESCRIPTIVES MEAN STDDEV CORR SIG N
/MISSING LISTWISE
/STATISTICS COEFF OUTS CI(95) R ANOVA COLLIN TOL CHANGE ZPP
/CRITERIA=PIN(.05) POUT(.10) CIN(95)
/NOORIGIN
/DEPENDENT HVSCI
/METHOD=ENTER cri1_tot fi2_tot stm3_tot
/SCATTERPLOT=(HVSCI ,*ZPRED) (*ZPRED ,*ZRESID)
/RESIDUALS DURBIN HISTOGRAM(ZRESID) NORMPROB(ZRESID)
/CASEWISE PLOT(ZRESID) OUTLIERS(3)
/SAVE PRED ZPRED MAHAL COOK ICIN RESID ZRESID.
Note. The variable entry method is “ENTER” where all predictors are entered into the equation at
the same time.
Tables 3.7a through 3.7c provide a partial output from SPSS syntax for multiple lin-
ear regression analysis.
To make the information more concrete, we will calculate an examinee’s predicted
score using the regression coefficients and the intercept in Table 3.7c by inserting the
coefficients into Equation 3.27. Specifically, we will predict a score on the HVSCI for an
examinee whose actual score is 25 on the HVSCI, 12 on language development, 0.0 on graphic identification, and 9.0 on the short-term memory test, using Equation 3.27. The prediction equation takes the general form

$$Y_i' = a + b_1 X_{1i} + b_2 X_{2i} + b_3 X_{3i}$$

The sample prediction equation with the sample values applied to the regression coefficients from the SPSS output (Table 3.7c) is illustrated in Equation 3.27.
Notice that the predicted score of 30.33 is 2.73 points closer to the examinee’s
actual HVSCI score of 25 than the score of 33.07 predicted with the single predictor
equation. From Table 3.7a, we see that the standard error of the estimate for the mul-
tiple regression equation is 12.82 (compared to 13.17 in the single predictor model).
Therefore, by adding short-term memory (but not fluid intelligence–based graphic
identification because the examinee’s score was 0) to the multiple regression equa-
tion, predictive accuracy increased. Using multiple regression is often desirable in conducting validity studies, but how should you go about selecting the predictors to include in a regression model? The next section addresses this important question.
In behavioral research, many predictor variables are often available for use in constructing a regression equation for predictive validity purposes, and the predictors are typically correlated with one another in addition to being correlated with the criterion. The driving factor dictating variable selection should be substantive knowledge of the topic under study. Achieving model parsimony is also desirable: identifying the smallest number of predictor variables from the total set that provides the maximum variance explained in the criterion variable. Focusing on model parsimony also improves the sample size to predictor ratio, because the fewer the predictors, the smaller the sample size required for reliable results. Moreover, Lord and Novick (1968, p. 274) note that the addition of many predictor variables seldom improves the regression equation, because the incremental improvement in the variance accounted for by adding new variables is very low after a certain point.
When the main goal of a regression analysis is to obtain the best possible equation,
several variable entry procedures are available. These techniques include (1) forward
entry, (2) backward entry, (3) stepwise methods, and (4) all possible regressions opti-
mization. The goal of variable selection procedures is to maximize the variance explained
in the criterion variable by the set of predictors. The techniques may or may not be used
in consideration of theory (e.g., in a confirmatory approach). One other technique of
variable entry is the enter technique, where all predictors are entered into the model
simultaneously (with no predetermined order). This technique (used in the examples ear-
lier in the chapter) produces the unique contribution of each predictor with the criterion
in addition to the relationship among the predictors. For a review and application of these
techniques, readers are referred to Cohen et al. (2003, pp. 158–162), Draper and Smith
(1998), and Hocking (1976).
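As a rough illustration of the all-possible-regressions idea, the Python sketch below fits every subset of a small set of candidate predictors and reports the subset with the highest adjusted R². The data are artificially generated, and in practice substantive knowledge should constrain which subsets are even entertained.

from itertools import combinations
import numpy as np

rng = np.random.default_rng(0)
n = 500
X_all = rng.normal(size=(n, 3))                                       # three candidate predictors
y = 1.0 + 0.8 * X_all[:, 0] + 0.4 * X_all[:, 1] + rng.normal(size=n)

def adjusted_r2(X, y):
    """Fit ordinary least squares with an intercept and return adjusted R-squared."""
    design = np.column_stack([np.ones(len(y)), X])
    coefs, *_ = np.linalg.lstsq(design, y, rcond=None)
    residuals = y - design @ coefs
    r2 = 1 - residuals.var() / y.var()
    k = X.shape[1]
    return 1 - (1 - r2) * (len(y) - 1) / (len(y) - k - 1)

best = None
for size in range(1, X_all.shape[1] + 1):
    for subset in combinations(range(X_all.shape[1]), size):
        score = adjusted_r2(X_all[:, subset], y)
        if best is None or score > best[0]:
            best = (score, subset)
print(best)   # adjusted R-squared and the best-fitting predictor subset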
3.18 Summary
This chapter introduced validity, provided an overview of the validation process, and described statistical techniques for estimating validity coefficients. Validity was defined as a judgment or estimate of
how well a test or instrument measures what it is supposed to measure. For example, we
are concerned with the accuracy of answers regarding our research questions. Answering
research questions in psychological and/or behavioral research involves using scores obtained
from tests or other measurement instruments. To this end, the accuracy of the scores is cru-
cial to the relevance of any inferences made. Criterion, content, and construct validity were
presented and contextualized within the comprehensive framework of validity, with crite-
rion and content forms of score validity serving to inform construct validity. Four guidelines
for establishing evidence for the validity of test scores were discussed: (1) evidence based on
test response processes, (2) evidence based on the internal structure of the test, (3) evidence
based on relations with other variables, and (4) evidence based on the consequences of test-
ing. The chapter presented statistical techniques for estimating criterion validity, along with
applied examples using the GfGc data. Chapter 4 presents additional techniques for estab-
lishing score validity. Specifically, techniques for classification and selection and for content
and construct validity are presented together with applied examples.
Standard error of the estimate. A summary measure of the errors of prediction based
on the conditional distribution of Y for a specific value of X.
Standardized regression slope. The slope of a regression line that is in standard score
units (e.g., z-score units).
Statistical control. Controlling the variance by accounting for (i.e., partialing out) the
effects of some variables while studying the effects of the primary variable (i.e., test)
of interest.
Sum of squares regression. Sum of the squared differences between the mean and
predicted values of the dependent or criterion variable for all observations (Hair et
al., 1998, p. 148).
Sum of squares total. Total amount of variation that exists to be explained by the
independent or predictor variables. Created by summing the squared differences
between the mean and actual values on the dependent or criterion variables (Hair et
al., 1998, p. 148).
Systematic variance. An orderly progression or pattern, with scores obtained by an exam-
inee changing from one occasion to another in some trend (Ghiselli, 1964, p. 212).
t-distribution. A family of curves each resembling a variation of the standard normal
distribution for each possible value of the associated degrees of freedom. The
t-distribution is used to conduct tests of statistical significance in a variety of analysis
techniques.
Trait. A relatively stable characteristic of a person which is manifested to some degree
when relevant, despite considerable variation in the range of settings and circum-
stances (Messick, 1989, p. 15).
True criterion score. The score on a criterion corrected for its unreliability.
Unstandardized multiple regression equation. The best-fitting straight line for estimat-
ing the criterion from a set of predictors that are in the original units of measurement.
Validation. A process that involves developing an interpretative argument based on a
clear statement of the inferences and assumptions specific to the intended use of test
scores.
Validity. A judgment or statistical estimate based on accumulated evidence of how well
scores on a test or instrument measure what they are supposed to measure.
Validity coefficient. A correlation coefficient that provides a measure of the relationship
between test scores and scores on a criterion measure.
Zero-order correlation. The correlation between two variables (e.g., the Pearson cor-
relation based on X and Y ).
4
Statistical Aspects
of the Validation Process
This chapter continues with the topic of validity, including the statistical aspects and the
validation process. Statistical techniques based on classification and selection of individu-
als are presented within the context of predictive validity. Content validity is presented with
applications for its use in the validation process. Finally, construct validity is introduced
along with several statistical approaches to establishing construct evidence for tests.
Many, if not most, tests are used to make decisions in relation to some aspect of people’s
lives (e.g., selection for a job or classification into a diagnostic group). Related to the
criterion validity techniques already introduced in Chapter 3, another predictive valid-
ity technique is based on how tests are used to arrive at decisions about selection and/or
classification of individuals into selective groups. Examples include tests that are used for
the purpose of (1) predicting or distinguishing among examinees who will matriculate
to the next grade level based on passing or failing a prescribed course of instruction,
(2) making hiring decisions (personnel selection) in job settings, and (3) determining
which psychiatric patients require hospitalization. Tests used for selection and/or clas-
sification are based on decision theory. In the decision theory framework, a predictive
validation study has the goal of determining who will likely succeed or fail on some crite-
rion in the future. For example, examinees who score below a certain level on a predictor variable (test) can be screened out of employment or admission to an academic program of study, or can be placed into a treatment program based on a diagnostic outcome. Another use of decision-
classification validity studies is to determine if a test correctly classifies examinees into
appropriate groups at a current point in time. For example, a psychologist may need a
test that accurately classifies patients into levels of depression such as mild, moderate,
and severe in order to begin an appropriate treatment program; or the psychologist may
need a test that accurately classifies patients as being either clinically depressed or not. In
educational settings, a teacher may need a test that accurately classifies students as being
either gifted or not for the purpose of placing the students into a setting that best meets
their needs. Figure 4.1 illustrates the multivariate techniques useful for conducting pre-
dictive validation studies. Highlighted techniques in Figure 4.1 depict the techniques of
classification presented in this section.
Discriminant analysis (DA; Hair et al., 1998; Glass & Hopkins, 1996, p. 184) is a widely
used method for predicting a categorical outcome such as group membership consisting
of two or more categories (e.g., medical diagnosis, occupation type, or college major).
DA was originally developed by Ronald Fisher (1935) for the purpose of classifying
objects into one of two clearly defined groups (Pedhazur, 1982, p. 692). The technique
has been generalized to accommodate classification into any number of groups (i.e.,
multiple discriminant analysis, or MDA). The goal of DA is to find uncorrelated lin-
ear combinations of predictor variables that maximize the between- to within-subjects
variance as measured by the sum-of-squares and cross-products matrices (Stevens,
2003). The sum-of-squares and cross-products matrix is a precursor to the variance–
covariance matrix in which deviation scores are not yet averaged (see Chapter 2 and the
Appendix for a review of the variance–covariance matrix). The resulting uncorrelated
(weighted) linear combinations are used to create discriminant functions, which are
variates of the predictor variables selected for their discriminatory power used in the
prediction of group membership. The predicted value of a discriminant function for
each examinee is a discriminant z-score. The discriminant scores for examinees are
created so that the mean score on the discriminant variable for one group differs maxi-
mally from the mean discriminant score of the other group(s).
Given that the goal of DA is to maximize the between- to within-subjects vari-
ance, the procedure has close connections with multivariate analysis of variance
(MANOVA). In fact, DA is sometimes used in conjunction with MANOVA to study
group differences on multiple variables. To this end, DA is a versatile technique that gen-
erally serves two purposes: (1) to describe differences among groups after a multivari-
ate analysis of variance (MANOVA) is conducted (descriptive discriminant analysis
[DDA]; Huberty, 1994) and (2) to predict the classification of subjects or examinees into
groups based on a combination of predictor variables or measures (predictive discrimi-
nant analysis [PDA]; Huberty, 1994). Note that since DA is based on the general linear
model (e.g., multiple linear regression and MANOVA), the assumptions required for the
correct use of DA are the same. In this chapter, we focus on PDA because it aligns with
predictive validation studies. Also noteworthy is that if randomization is part of the
research design when employing DA, causal inference is justified, providing the proper
experimental controls are included.
DA assumes that multivariate normality exists for the sampling distributions of the
linear combinations of the predictor variables. For a detailed exposition of screening for
assumptions requisite to using DA, see Tabachnick and Fidell (2007). When the assump-
tions for MLR (and DA) are untenable (particularly multivariate normality), logistic
regression can be used instead to accomplish the same goal sought in DA or MDA.
The specific mathematical details of DA and MDA involve matrix algebra and are not
presented here due to space limitations; readers are referred to Pedhazur (1982, pp. 692–
710) and Huberty (1994) for a complete treatment and examples. Using DA to predict
which classification group subjects or examinees fall into based on an optimized linear
combination of predictor variables is the focus of the present section.
To illustrate the concepts and interpretation of DA specific to predictive validity
studies we will use the GfGc data in two examples. In our first example, suppose we want
to determine an examinee’s academic success measured as successful matriculation from
10th to 11th grade based on their scores on fluid, crystallized, and short-term memory
acquired at the start of their freshman year. When conducting a DA, the process begins
by finding the discriminant function with the largest eigenvalue, resulting in maximum
discrimination between groups (Huberty, 1994; Stevens, 2003). An eigenvalue represents
the amount of shared variance between optimally weighted dependent (criterion) and
independent (predictor) variables. The sum of the eigenvalues derived from a correlation
matrix equals the number of variables. If the DA (1) involves more than a small number
of predictors and/or (2) the outcome includes more than two levels, a second eigenvalue
is derived. The second eigenvalue results in the second most discriminating function
between groups. Discriminant functions 1 and 2 are uncorrelated with one another, thereby
providing unique components of the outcome variable. Application of DA and MDA requires
that scores on the outcome variable be available or known ahead of time. In our example,
the outcome is successful matriculation from 10th to 11th grade (labeled as “matriculate”
in the GfGc dataset). The optimal weights derived from the DA then serve as elements in a linear equation that is used to classify examinees for whom the outcome is not known.
Using the information on the outcome variable matriculate and scores on fluid,
crystallized, and short-term memory for examinees, we can derive an optimal set of
weights using DA and Equation 4.1a. The result of Equation 4.1a is the production of
the first discriminant function (recall that a second discriminant function is also cre-
ated based on a second equation). With the weights derived from fitting the equation
to the observed data, status on the outcome variable (Y; matriculation) in Equation
4.1a can be calculated for examinees whose status is unknown. You can see the utility of
this technique in predicting the outcome for examinees knowing certain characteristics
about them (e.g., information about different components of their intelligence). To
review, the difference between linear regression and discriminant analysis is that mul-
tiple linear regression (MLR) is used to predict an examinee’s future score on a criterion
measured on a continuous metric (such as intelligence or undergraduate grade point
average) from a set of predictors, whereas DA is used to predict the future classification
of examinees into distinct groups (e.g., for diagnostic purposes, education attainment,
or employment success).
Next, we can use the following SPSS syntax to conduct a discriminant analysis. Selected
parts of the output are used to illustrate how the technique works with fluid intelligence
total scores, crystallized intelligence total scores, and short-term memory total scores.
DISCRIMINANT
/GROUPS=matriculate(0 1)
/VARIABLES=fi_tot cri_tot stm_tot
/ANALYSIS ALL
/SAVE=CLASS SCORES PROBS
/PRIORS EQUAL
/STATISTICS=MEAN STDDEV UNIVF BOXM COEFF RAW TABLE CROSSVALID
/PLOT=COMBINED MAP
/CLASSIFY=NONMISSING POOLED.
The classification results (Table 4.1e) can be summarized in terms of four possible outcomes (Pedhazur & Schmelkin, 1991, p. 40): (1) valid positives (VP), (2) valid negatives (VN), (3) false
positives (FP), and (4) false negatives (FN). Valid positives and their percentages (Table
4.1e) are those examinees who were predicted to matriculate and did matriculate (i.e.,
VP summarized as 492; 97.4%). Valid negatives (Table 4.1e) are those examinees who
were predicted not to matriculate and did not matriculate (i.e., VN summarized as 448;
90.5%). False positives (Table 4.1e) are examinees who are predicted to matriculate but
did not actually matriculate (i.e., FP summarized as 47; 9.5%). False negatives consist of
examinees who were predicted not to matriculate but who did actually matriculate (i.e., FN summarized as 13; 2.6%). Figure 4.2 illustrates the information provided in the classification table by
graphing the relationship among the four possible outcomes in our example.
By creating the horizontal (X-axis) and vertical (Y-axis) lines in Figure 4.2, four areas
are represented (i.e., FN, VP, VN, and FP). Partitioning the relationship between crite-
rion and predictors allows for inspection and evaluation of the predictive efficiency of
a discriminant analysis. Predictive efficiency is an evaluative summary of the accuracy of
predicted versus actual performance of examinees based on using DA.
Wilks’ Lambda
Test of Function(s) Wilks’ Lambda Chi-square df Sig.
1 .397 921.265 3 .000
Note. This is the test of significance for the discriminant function.
Structure Matrix
Function
1
sum of crystallized intelligence tests 1 - 4 .999
sum of short term memory tests 1 - 3 .404
sum of fluid intelligence tests 1 - 3 .312
Pooled within-groups correlations between discriminating
variables and standardized canonical discriminant functions.
Variables ordered by absolute size of correlation within
function.
The default prior for group classification is .50/.50. The prior can be changed to
meet the requirements of the analysis.
Classification Function Coefficients

                                                successfully move from 10th to 11th grade
                                                        no            yes
sum of fluid intelligence tests 1 - 3                  .048           .051
sum of crystallized intelligence tests 1 - 4           .229           .405
sum of short term memory tests 1 - 3                   .426           .409
(Constant)                                          -14.280        -28.265
Notes. These are Fisher’s linear discriminant functions. This is a method
of classification in which a linear function is defined for each group.
Classification is performed by calculating a score for each observation on
each group’s classification function and then assigning the observation to
the group with the highest score.
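The classification rule described in the note can be sketched in a few lines of Python. The coefficients and constants below are those shown in the classification function table above, and the examinee's three total scores are hypothetical; a classification score is computed for each group and the examinee is assigned to the group with the larger score.

import numpy as np

# Fisher classification function coefficients (order: fluid, crystallized, short-term memory).
coef = {
    "no":  np.array([0.048, 0.229, 0.426]),
    "yes": np.array([0.051, 0.405, 0.409]),
}
constant = {"no": -14.280, "yes": -28.265}

# Hypothetical examinee total scores on the three predictors.
scores = np.array([45.0, 120.0, 40.0])

class_scores = {group: float(coef[group] @ scores + constant[group]) for group in coef}
predicted_group = max(class_scores, key=class_scores.get)
print(class_scores, predicted_group)   # assigned to the group with the highest score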
[Figure 4.2 plots the criterion (Y) against the predictor composite (X). The criterion cutting score Yc (above the line = successful matriculation = 1) and the predictor cutting score Xc (examinees scoring at or above fi2_tot = 31, cri1_tot = 82, and stm_tot = 34 fall to the right of Xc) divide the plot into four regions: A = VP (492, or 97.4%), D = FN (13, or 2.6%), C = VN (448, or 90.5%), and B = FP (47, or 9.5%). D + A = base rate (BR); C + B = 1 – BR. The plotted regression line has an intercept of –5.7.]
Figure 4.2. Predictive efficiency from discriminant analysis. FN, false negative; FP, false posi-
tive; VN, valid negative; VP, valid positive. Total N = 1000 examinees.
The selection ratio pertains to those examinees (i.e., those selected to the right of Xc on the X-axis) regardless of
their “true” status on the criterion. The base rate is the proportion of examinees who are
successful (i.e., above the horizontal Yc line on the Y-axis) regardless of their status (scores)
on the predictor(s). Taylor and Russell (1939) defined predictive efficiency as the number of valid positives divided by the total number of examinees selected (i.e., A/(A + B)). Using this formula in our example, we find that the predictive efficiency is 492/(492 + 47) = .91. Based on the bivariate normal distribution of the relationship (i.e., the correlation) between a predictor and criterion, Taylor
and Russell developed tables that provide a way to tabulate the selection ratio as a function
of the degree to which validity coefficients vary. To aid in planning validity studies, the Taylor
and Russell tables provide for calculation of the success ratio based on manipulating the fol-
lowing: (1) the selection ratio and base rate are held constant while the correlation between Y and X varies, (2) the selection ratio and the correlation between Y and X are held constant while the base rate varies, and (3) the base rate and the correlation between Y and X are held constant while the selection ratio varies (Pedhazur &
Schmelkin, 1991, p. 42). Table 4.2 illustrates a portion of the Taylor–Russell tables.
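These quantities follow directly from the classification counts reported above; a minimal Python sketch of the arithmetic is given below.

# Classification counts from the discriminant analysis example (Figure 4.2).
vp, fp = 492, 47    # valid positives (A) and false positives (B)
vn, fn = 448, 13    # valid negatives (C) and false negatives (D)
n = vp + fp + vn + fn                      # 1,000 examinees

base_rate = (vp + fn) / n                  # D + A: proportion who actually matriculated
selection_ratio = (vp + fp) / n            # A + B: proportion selected on the predictors
predictive_efficiency = vp / (vp + fp)     # A / (A + B)

print(round(base_rate, 3), round(selection_ratio, 3), round(predictive_efficiency, 2))
# 0.505 0.539 0.91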
As can be seen in Table 4.2, when conducting predictive validity studies where the
goal is selection or classification, relying only on the validity coefficient is insufficient.
For example, the complex interplay among false negative, valid positive, valid negative,
and false positive must be considered in relation to the goal of the validation study so
that any unintended consequences on examinees are minimized (e.g., see the AERA,
APA, & NCME, 1999, Standards for a review). Allen and Yen (1979, pp. 101–108) pro-
vide an excellent discussion with examples regarding the use of the Taylor and Russell
tables in relation to the four outcomes FN, VP, VN, and FP.
Finally, note that Table 4.1e refers to cross validation classification results. An
additional step in DA is a cross-validation analysis to evaluate the predictive accuracy of
the DA equation. The cross-validation procedure involves dividing the sample into two
parts: (1) the analysis sample used for estimating the discriminant function(s) or logistic
regression model and (2) the holdout sample used to validate the results. The purpose of
cross validation is to ensure that overfitting of the discriminant function has not occurred
by conducting a repeat analysis on a separate independent sample. The term overfitting
refers to the situation in which the solution from an analysis fits the sample data so well (capitalizing on chance characteristics of the sample) that it is unlikely to replicate in other samples from the population. To check whether overfitting is a problem, cross validation is often conducted using an independent random sample.
As mentioned previously, DA can be extended to the case where the outcome includes
multiple categories (i.e., MDA). To illustrate an MDA, the following SPSS syntax pro-
duces results displayed in Tables 4.3a–4.3e. The MDA analysis is performed by includ-
ing the “/GROUPS=depression(1 3)” line in the syntax. The only difference in the syntax between the MDA and the two-group DA is that the depression group now has three categories. The interpretation is much the same as that in the two-group classification example output tables, except that there are now two discriminant functions (see Table 4.3b). An additional feature of the MDA is the discriminant function plot of the group centroids to aid interpretation of the analysis (Figure 4.3).

Table 4.3a. Eigenvalue and Overall Test of Significance for the Discriminant Analysis

Wilks’ Lambda
Test of Function(s)    Wilks’ Lambda    Chi-square    df    Sig.
1 through 2            .446             804.727       6     .000
2                      1.000            .271          2     .873
DISCRIMINANT
/GROUPS=depression(1 3)
/VARIABLES=fi_tot cri_tot stm_tot
/ANALYSIS ALL
/PRIORS EQUAL
/STATISTICS=MEAN STDDEV UNIVF BOXM COEFF RAW TABLE CROSSVALID
/PLOT=COMBINED MAP
/CLASSIFY=NONMISSING POOLED.
[Figure: logistic curve plotting the probability of the event (dependent variable, 0 to 1.0) against the level of the independent variable (low to high).]
Also, the variance of the binomial distribution is not constant across the score scale (i.e.,
the homogeneity of variance assumption in MLR is violated when using a dichotomous
variable). Another useful feature of logistic regression is that the predictor variables can
be ordinal, interval, or a mixed level of measurement.
Recall that in MLR the method of least squares is used to estimate the regression
coefficients in an analysis by using the sum of squared differences from the mean and
individual scores (i.e., the sum of squares as the fundamental element for deriving param-
eter estimates). Estimating the model coefficients in logistic regression involves using
the maximum likelihood method (see the Appendix and Chapter 10 on item response
theory in this text). The method of maximum likelihood estimation is iterative, mean-
ing that the algorithm moves through a process whereby parameter estimates are refined
or improved up to a certain point where any further improvement is negligible. The
maximum likelihood estimation process results in regression parameter estimates that
are most likely (i.e., maximally likely) to result based on the observed data. The result
of maximum likelihood estimation is a likelihood value. Also, when using maximum
likelihood estimation, we evaluate the fit of the regression model to the data. Figures 4.5a
and 4.5b illustrate (1) the situation where the data to model fit is good and (2) a poor
model to data fit using logistic regression.
In conducting a logistic regression, we need to know whether an event has occurred
(e.g., matriculate or not from one grade to another in an educational setting or clinically
depressed or not clinically depressed in a psychological setting). Armed with this knowl-
edge, we can use this information as our dependent or criterion variable. Based on knowl-
edge of the outcome, the logistic regression procedure estimates the probability that an
event will or will not occur. If the probability is greater than .50, then the prediction is yes,
otherwise no. The logistic transformation is applied to the dichotomous dependent vari-
able and produces logistic regression coefficients according to Equation 4.2a.
To illustrate application of Equation 4.2a, as before in our discriminant analysis
example, we use a score of 30 (fluid intelligence total score), 48 (crystallized intelligence
total score), and 26 (short-term memory total score) for a single examinee in Equation
4.2b to predict successful matriculation from the 10th to 11th grade. The criterion vari-
able is labeled “matriculate” in the GfGc dataset. Figure 4.6 illustrates the location of the
examinee in relation to the logistic regression model.
The following syntax is used to conduct the logistic regression analysis. Tables 4.4a
through 4.4d provide the output from the analysis.
$$\mathrm{Prob(event)} = \hat{Y}_i = \frac{e^{B_0 + B_1X_1 + \cdots + B_mX_m}}{1 + e^{B_0 + B_1X_1 + \cdots + B_mX_m}}$$
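Equation 4.2a can be written as a small Python function. The coefficient values below are hypothetical placeholders (the actual estimates appear in Table 4.4d, which is not reproduced here); the crystallized coefficient of 0.22 was chosen only so that its odds ratio, exp(0.22) ≈ 1.25, matches the value discussed in the text. The sketch shows how the linear combination of scores is converted to a probability and then to a yes/no prediction at the .50 threshold.

import math

def logistic_probability(intercept, coefs, scores):
    """Equation 4.2a: convert a linear combination of predictors to a probability."""
    z = intercept + sum(b * x for b, x in zip(coefs, scores))
    return math.exp(z) / (1 + math.exp(z))

# Hypothetical logistic coefficients: intercept, fluid, crystallized, short-term memory.
intercept = -9.0
coefs = [0.02, 0.22, 0.01]

# Examinee scores used in the text: 30 (fluid), 48 (crystallized), 26 (short-term memory).
p = logistic_probability(intercept, coefs, [30, 48, 26])
print(round(p, 2), "matriculate" if p > 0.50 else "not matriculate")
print(round(math.exp(coefs[1]), 2))   # odds ratio, Exp(B), for the crystallized predictor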
Using the unstandardized weights in Table 4.4d and inserting these weights as illus-
trated in Equation 4.2b, we see that the result for the equation (i.e., the probability for
group membership = 1) for an examinee with this set of scores on the predictors is .97.
Furthermore, in Table 4.4d, we see that the only predictor variable that is statistically
significant is the crystallized intelligence test (cri1_tot; p < .001, odds ratio or Exp(B) =
1.25). The Wald test is similar to the t-test in MLR and is calculated as the square of the ratio of the regression coefficient to its standard error, that is, (B/SE)². The odds ratio, labeled as Exp(B), for cri1_tot is 1.25. To interpret, for each 1-point increase in the language development (cri1_tot) score, the odds of successfully matriculating increase by a factor of 1.25. An odds ratio of 1.0 indicates that a predictor has no effect on the odds of successful matriculation.
Finally, odds ratios of 2.0 or higher are recommended in terms of practical importance
Figure 4.6. Location of an examinee based on the logistic regression model. This figure is
based on an examinee who scores 30 (fluid intelligence total score), 48 (crystallized intelligence
total score), and 26 (short-term memory total score). Note that the examinee is located just to the
right of the probability = .50 vertical line, indicating that the student is predicted to successfully
matriculate from 10th to 11th grade.
(Tabachnick & Fidell, 2007; Hosmer & Lemeshow, 2000). Using this odds ratio guide-
line, the other two predictor variables are not practically important (and not statistically
significant). Next we turn to the situation where the outcome has more than two catego-
ries, an extension of logistic regression known as multinomial logistic regression.
The preceding example addresses the case where the criterion has only two possible
outcomes. The logistic model can be extended to the case where there are three or more
levels in the criterion. To illustrate, we use a criterion variable with three levels or pos-
sible outcomes (e.g., low, moderate, severe depression). The criterion variable is labeled
“depression” in the GfGc dataset. The logistic regression model that is analogous to the
multiple discriminant analysis presented earlier is provided in the SPSS syntax below
(Tables 4.5a–4.5f). Notice that SPSS uses multinomial regression to conduct the analy-
sis where the criterion has more than two levels of the outcome. Tables 4.5a through 4.5f
are interpreted as in the previous section, the only difference being in Tables 4.5e and 4.5f, where the parameter estimates and classification tables now include three levels of the criterion.
The –2 log likelihood statistic is the global model-fit index used in evaluating the ade-
quacy of the logistic regression model fit to the data. The –2 log likelihood represents
the sum of the probabilities associated with the predicted and actual outcomes for each
examinee or case in the dataset (Tabachnick & Fidell, 2007). A perfect model-data fit
yields a –2 log likelihood statistic of zero; therefore the lower the number, the better the
model-data fit. The chi-square statistic represents a test of the difference between the
intercept only model (i.e., in SPSS the “constant” only model) versus the model with one
or more predictors included. In our example (see Table 4.4a in the previous section), the
chi-square is significant (p < .001), meaning that our three-predictor model is better than
the intercept only model; the –2 log likelihood statistic is 448.726 (see Table 4.4a). As
in MLR, the decision regarding the method for entry of the predictors into the equation
depends on the goal of the study. Variable-entry options include enter or direct method
(all predictors enter the equation simultaneously), stepwise, forward, and backward selec-
tion. For guidance regarding the decision about using a particular variable-entry method,
see Tabachnick and Fidell (2007, pp. 454–456) or Hosmer and Lemeshow (2000). The Cox and Snell R² and Nagelkerke R² represent the proportion of variance accounted for
in the dependent variable by the predictors. For a comparison and interpretation of the
R2 statistics produced in logistic regression versus MLR, see Tabachnick and Fidell (2007,
pp. 460–461). As in MLR, larger values of R2 are desirable and reflect a better regression
model. Collectively, upon review of the results of the logistic regression analysis using
the same data as in the discriminant analysis example, we see a high degree of agreement.
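The −2 log likelihood just described can be computed directly from the observed outcomes and the model-predicted probabilities, as in the brief Python sketch below; the tiny data vectors are artificial and serve only to show the formula.

import math

# Artificial observed outcomes (1 = event occurred) and model-predicted probabilities.
observed = [1, 0, 1, 1, 0, 1, 0, 0]
predicted = [0.90, 0.20, 0.70, 0.60, 0.40, 0.85, 0.10, 0.30]

log_likelihood = sum(
    y * math.log(p) + (1 - y) * math.log(1 - p)
    for y, p in zip(observed, predicted)
)
print(round(-2 * log_likelihood, 3))   # lower values indicate better model-data fit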
We next turn to a type of validity evidence that depends on the information con-
tained in the items comprising a test—content validity. Specifically, the items comprising
a test reflect a representative sample of a universe of information in which the investiga-
tor is interested.
Under the content validity model, scores based on a representative sample of items or tasks from a content domain provide an estimate of performance in the domain. The previous statement holds if (1) the observed
scores are considered as being a representative sample from the domain, (2) the per-
formances are evaluated appropriately and fairly, and (3) the sample is large enough to
control for sampling error (Kane, 2006; Guion, 1977).
The content validity model of validation has been criticized on the grounds that it is
subjective and lends itself to confirmatory bias. The criticism of subjectivity stems from
the fact that judgments are made regarding the relevance and representativeness of
tasks to be included on a test (see Chapter 6 for a review of these issues). One attempt
to address the problem of subjectivity in the content validity model involves estimating
the content validity ratio (CVR; Lawshe, 1975). The CVR quantifies content valid-
ity during the test development process by statistically analyzing the performance of
expert judgments regarding how adequately a test or instrument samples behavior
from a universe of behavior it was designed to sample (Cohen & Swerdlik, 2010,
p. 173).
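Although the formula is not reproduced in the passage above, Lawshe's CVR is conventionally computed as (n_e − N/2) / (N/2), where n_e is the number of panelists rating an item essential and N is the total number of panelists; a minimal Python sketch under that convention follows.

def content_validity_ratio(n_essential, n_panelists):
    """Lawshe's CVR: ranges from -1 (no judge rates the item essential) to +1 (all do)."""
    half = n_panelists / 2
    return (n_essential - half) / half

# Example: 8 of 10 expert judges rate an item as essential.
print(content_validity_ratio(8, 10))   # 0.6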
The issue of confirmatory bias in the content validity model stems from the fact that
the process or exercise one goes through to establish evidence for content validity is
driven by a priori ideas about what the content of the test item or tasks should be. To
minimize confirmatory bias, multiple subject matter or content experts are used along
with rating scales to reduce subjectivity in the content validity model. This information
can be used to derive the CVR. Used in isolation, the content validity model is par-
ticularly challenged when applied to cognitive ability or other psychological processes
that require hypothesis testing. Based on the challenges identified, the role content
validity plays in relation to the three components of validity is “to provide support for
the domain relevance and representativeness of the test or instrument” (Messick, 1989,
p. 17). Next, we turn to arguably the most comprehensive explanation of validity—
construct validity.
Although criterion and content validity are important components of validity, neither one
provides a way to address the measurement of “complex, multifaceted and theory-based
attributes such as intelligence, personality, leadership,” to name a few examples (Kane,
2006, p. 20). In 1955, Cronbach and Meehl introduced an alternative to the criterion
and content approaches to validity that allowed for the situation where a test purports to
measure an attribute that “is not operationally defined and for which there is no adequate
criterion” (p. 282). A particularly important point Cronbach and Meehl argued was that
even if a test was initially validated using criterion or content evidence, developing a
Given that construct validity is complex and multifaceted, establishing evidence of its
existence requires a comprehensive approach. This section presents four types of studies
useful for establishing evidence of construct validity: (1) correlational studies, (2) group
difference studies, (3) factor-analytic studies, and (4) multitrait–multimethod (MTMM) studies. The section ends with an example that incorporates an application of the various components of validity.

[Figure 4.7 presents a 2 × 2 matrix. Evidential basis: construct validity; construct validity + relevance/utility. Consequential basis: value implications; social consequences.]

Figure 4.7. Messick's four facets of validity. From Messick (1988, p. 42). Copyright 1988 by Taylor and Francis. Republished with permission of Taylor and Francis.
What might a comprehensive and rigorous construct validation study look like?
Benson (1988) provides guidelines for conducting a rigorous, research-based construct
validation program of study. Benson’s guidelines (Table 4.6) propose three main compo-
nents: (1) a substantive stage, (2) a structural stage, and (3) an external stage. Finally,
Benson’s guidelines align with Messick’s (1995) unified conception of construct validity.
To illustrate how a researcher can apply the information in Table 4.6 to develop
a comprehensive validity argument, consider the following scenario. Suppose that aca-
demic achievement (labeled as X1; measured as reading comprehension) correlates .60
with lexical knowledge (Y) as one component of crystallized knowledge (i.e., in the GfGc
dataset). The evaluation of predictive validity is straightforward and proceeds by presen-
tation of the correlation coefficient and an exposition of the research design of the study.
The astute person will question whether there is another explanation for the correla-
tion of .60 between the crystallized intelligence subtest total score on lexical (i.e., word)
knowledge (Y) and reading comprehension (X1). This is a reasonable question since no
interpretation has occurred beyond presentation of the correlation (validity) coefficient.
A response to this question requires the researcher to identify and explain what additional
types of evidence are available to bolster their argument that crystallized intelligence is
related to an examinee’s academic achievement as measured by reading comprehension
ability. The explicative step beyond merely reporting the validity coefficient becomes
necessary when arguments are advanced that propose crystallized intelligence measures
academic achievement in a general or holistic sense. For example, the reading compre-
hension test (the proxy for academic achievement) may be measuring only an examinee’s
strength of vocabulary.
To this end, addressing alternative explanations involves inquiry into other kinds
of validity evidence (e.g., evidence provided from published validity-based studies). For
example, consider two other tests from the GfGc dataset: (1) language development and
(2) communication ability. Suppose that after conducting a correlation-based validity
study, we find that language development correlates .65 with reading comprehension and
communication ability correlates .40 with the comprehension test. Further suppose that
the mean reading comprehension score decreases for examinees who fail to produce pass-
ing scores on writing assignments in the classroom setting; and that writing assignments
involve correct application of the English language (call this measure X2). Also, say that
mean reading comprehension score increases for examinees on a measure of commu-
nication ability that was developed as an indicator of propensity to violence in schools
(call this measure X3). Under this scenario, a negative correlation between X1 (academic
achievement measured by reading comprehension) and X2 (in-class writing assignments)
eliminates it as a rival explanation for X1 (i.e., as a legitimate explanation for reading
achievement). Also, suppose that a negative correlation between X3 (number of violent
incidences by students on campus) and X1 (academic achievement measured by reading
comprehension) provides an additional aspect that deserves an explanation relative to
the word knowledge component of crystallized intelligence (Y).
In these examples, the correlation evidence exhibited in the two additional valid-
ity studies serves as evidence for eliminating measures X2 and X3 in the current study of
crystallized intelligence and academic achievement.
One way to establish construct validity evidence is to conduct a correlational study with
two goals in mind. The first goal is closely related to content validity and involves evalu-
ating the existence of item homogeneity (i.e., the items on the test tap a common trait or
attribute) for a collection of test items. If item homogeneity exists, then we have evidence
of a homogeneous scale. The second goal involves evaluating the relationship between
an existing criterion and the construct (represented by a collection of test items). From
the perspective of test users, the purpose of these approaches is to allow for the evaluation
of the quantity and quality of evidence relative to how scores on the test will be used. The
quantity and quality of evidence are evaluated by examining the following criteria:
1. The size of the correlation between each test item under study and the total test score (e.g., the point–biserial correlation between an item and the total score; see Chapter 6 and the Appendix for a review). A brief computational sketch follows this list.
2. The size of the correlation between the test under study and the criterion (for
criterion-related evidence).
3. Calculation of the proportion of variance (i.e., the correlation coefficient squared)
accounted for by the relationship between the test and the criterion.
4. Interpretation of the criterion validity coefficient in light of sampling error (e.g.,
the size and composition of the sample used to derive the correlation coefficients).
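To make criteria 1 through 3 concrete, here is a minimal computational sketch in Python (an illustration using assumed simulated data, not part of the original text): it computes the point–biserial correlation between a dichotomous item and the total score, the test–criterion correlation, and the squared correlation (the proportion of shared variance).

import numpy as np

rng = np.random.default_rng(0)
n = 500
ability = rng.normal(0, 1, n)

# Hypothetical dichotomous (0/1) item responses driven by a single ability
items = (rng.normal(0, 1, (n, 20)) < ability[:, None]).astype(int)
total = items.sum(axis=1)
criterion = 0.6 * ability + rng.normal(0, 0.8, n)   # hypothetical external criterion

# Criterion 1: point-biserial correlation of item 1 with the total score
r_pb = np.corrcoef(items[:, 0], total)[0, 1]

# Criteria 2 and 3: test-criterion correlation and proportion of shared variance
r_xy = np.corrcoef(total, criterion)[0, 1]
print(f"item 1 point-biserial r = {r_pb:.2f}")
print(f"test-criterion r = {r_xy:.2f}, r squared = {r_xy**2:.2f}")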
Ensuring that item homogeneity exists is an important first step in evaluating a test
for construct validity. However, when considered alone, it provides weak evidence. For
example, you may find through item analysis results from a pilot study that the items
appear to be appropriately related from a statistical point of view. However, relying on
item homogeneity in terms of the content of the items and the correlational evidence
between the items and the total score on the test can be misleading (e.g., the items may be
relatively inaccurate in terms of what the test is actually supposed to measure). Therefore,
a multifaceted approach to ensuring that test items accurately tap a construct is essential
(e.g., providing content plus construct validity evidence in a way that establishes a com-
plete argument for score validity; see Kline, 1986). A shortcoming of the correlational
approach to establishing construct validity evidence lies in the lack of uniformly accepted
criteria for what the size of the coefficient should be in order to provide adequate associa-
tional evidence. Also, the results of a correlational study must be interpreted in light of
previous research. For example, the range of correlation coefficients and proportions of
variance accounted for from previous studies should be provided to place any correlation
study in perspective.
Often, researchers are interested in how different groups of examinees perform relative
to a particular construct. Investigating group differences involves evaluating how scores on a criterion differ between groups of examinees who (1) differ on some sociodemographic variable or (2) received a treatment expected to affect their scores (e.g., in an experimental research study). Validity studies of group differences
posit hypothesized relationships in a particular direction (e.g., scores are expected to be
higher or lower for one of the groups in the validity study). If differences are not found,
one must explore the reasons for this outcome. For example, the lack of differences
between groups may be due to (1) inadequacy of the test or instrument relative to the
measurement of the construct of interest, (2) failure of some aspect of the research design
(e.g., the treatment protocol, sampling frame, or extraneous unaccounted for variables), or
(3) a flawed theory underlying the construct.
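As a minimal sketch of a group-differences validity check (assumed simulated data and Python code, not from the text), the example below compares criterion scores for two groups expected to differ and reports the mean difference with a standardized effect size (Cohen's d).

import numpy as np
from scipy import stats

rng = np.random.default_rng(42)
group_a = rng.normal(105, 15, 120)   # hypothetical scores, e.g., group that received a treatment
group_b = rng.normal(100, 15, 130)   # hypothetical comparison group

t, p = stats.ttest_ind(group_a, group_b, equal_var=False)
pooled_sd = np.sqrt((group_a.var(ddof=1) + group_b.var(ddof=1)) / 2)
d = (group_a.mean() - group_b.mean()) / pooled_sd   # Cohen's d
print(f"t = {t:.2f}, p = {p:.4f}, d = {d:.2f}")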
Factor analysis plays an important role in establishing evidence for construct validity. This
section presents only a brief overview to illustrate how factor analysis is used to aid in
construct validation studies. Chapter 9 provides a comprehensive foundation on the topic.
Factor analysis is a variable reduction technique with the goal of identifying the minimum
number of factors required to account for the intercorrelations among (1) a battery of items
comprising a single test (e.g., 25 items measuring the vocabulary component of verbal
intelligence of crystallized intelligence) or (2) a battery of tests theoretically representing
an underlying construct (e.g., the four subtests measuring crystallized intelligence in the
GfGc dataset). In this way, factor analysis is a variable reduction technique that takes a large
number of measured variables (e.g., items on tests or total scores on subtests) and reduces
them to one or more factors representing hypothetical unobservable constructs.
In psychometrics, factor analysis is used in either an exploratory or a confirmatory
mode. In the exploratory mode, the goal is to identify a set of factors from a set of test items
(or subtest total scores) designed to measure certain constructs manifested as examinee
attributes or traits. In exploratory factor analysis (EFA), no theory is posited ahead of
time (a priori); instead, the researcher conducts a factor analysis using responses to a large
set of test items (or subtests) designed to measure a set of underlying constructs (e.g., attri-
butes or traits of examinees manifested by their responses to test items). Exploratory factor
analysis is sometimes used as an analytic tool in the process of theory generation (e.g., in
the substantive and structural stages in Table 4.6 during the development of an instrument
targeted to measure a construct where little or no previous quantitative evidence exists).
In EFA, the analysis is typically based on the correlation matrix among the items or subtests, using statistical programs such as SPSS or SAS. Alternatively, one may use the variance–covariance
matrix when conducting factor analysis (e.g., when using structural equation modeling
[SEM], also known as covariance structure modeling). Using SEM to conduct factor analy-
sis requires using programs such as Mplus, LISREL, SPSS-AMOS, EQS, and SAS PROC
CALIS (to name only a few). Returning to our example, after running the factor-analysis
program, a table of factor loadings (Table 4.7) is produced, aiding in interpreting the
factorial composition of the battery of tests. A standardized factor loading is scaled on a correlation metric (ranging between –1.0 and +1.0) and represents the strength and direction of the relationship between an individual test and a factor. Below is the SPSS syntax that produces the factor loadings in Table 4.7.
FACTOR
  /VARIABLES stm1_tot stm3_tot stm2_tot cri1_tot cri2_tot cri3_tot cri4_tot fi1_tot fi2_tot fi3_tot
  /MISSING LISTWISE
  /ANALYSIS stm1_tot stm3_tot stm2_tot cri1_tot cri2_tot cri3_tot cri4_tot fi1_tot fi2_tot fi3_tot
  /PRINT UNIVARIATE CORRELATION SIG KMO EXTRACTION ROTATION
  /PLOT EIGEN ROTATION
  /CRITERIA FACTORS(3) ITERATE(25)
  /EXTRACTION PAF
  /CRITERIA ITERATE(25)
  /ROTATION PROMAX(4)
  /METHOD=CORRELATION.
Table 4.7. Factor Loadings for the 10 Subtests Comprising the GfGc Data

                                               Factor loading
Test                                            I      II     III
Gc—Vocabulary                                  .87    .49    .44
Gc—Knowledge                                   .83    .48    .43
Gc—Abstract Reasoning                          .83    .56    .62
Gc—Conceptual Reasoning                        .84    .45    .43
Gf—Graphic Orientation                         .51    .56    .82
Gf—Graphic Identification                      .53    .67    .80
Gf—Inductive & Deductive Reasoning             .06    .16    .26
Stm—Short-term Memory—visual clues             .69    .68    .54
Stm—Short-term Memory—auditory & visual        .44    .78    .53
Stm—Short-term Memory—math reasoning           .50    .80    .68

Note. Loadings are from the structure matrix produced from a principal axis factor analysis with promax (correlated factors) rotation. In a principal axis factor analysis with promax (correlated factors) rotation, only elements of a structure matrix may be interpreted as correlations with oblique (correlated) factors. See Chapter 9 on factor analysis for details.
To interpret the results of our example factor analysis, examine the pattern of loadings for each subtest comprising the total scores for examinees on crystallized intelligence, fluid intelligence, and short-term memory. In Table 4.7, we see that the crystallized intelligence subtests group together on factor I because their highest loadings across columns I–III fall in column I (.87, .83, .83, .84). No other subtest (Gf or Stm) exhibits a loading higher than those displayed in column I. The same pattern holds when you examine the size of the loadings for the Gf and Stm subtests. In summary, the subtests representing the Gc, Gf, and Stm composites (i.e., total scores) factor-analyze in line with GfGc theory, providing factor-analytic
evidence for the constructs of each type of intelligence. Also, produced in the results of
the factor analysis is the correlation between the three factors (Table 4.8). The correla-
tion coefficients between the composites are .61 between crystallized intelligence and
short-term memory; .59 between crystallized intelligence and fluid intelligence; and .72
between fluid intelligence and short-term memory. As expected from GfGc theory, these
factors are related.
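For readers working outside SPSS, the following rough Python analogue is offered as a sketch only; it assumes the third-party factor_analyzer package and a hypothetical data file of the ten GfGc subtest totals named in the SPSS syntax above.

import pandas as pd
from factor_analyzer import FactorAnalyzer

# Hypothetical file holding the ten GfGc subtest total scores (assumed name)
df = pd.read_csv("gfgc_subtest_totals.csv")
cols = ["stm1_tot", "stm3_tot", "stm2_tot", "cri1_tot", "cri2_tot",
        "cri3_tot", "cri4_tot", "fi1_tot", "fi2_tot", "fi3_tot"]

# Principal axis factoring with promax (oblique) rotation, three factors
fa = FactorAnalyzer(n_factors=3, method="principal", rotation="promax")
fa.fit(df[cols])

pattern = pd.DataFrame(fa.loadings_, index=cols)   # rotated pattern matrix
phi = fa.phi_                                      # factor intercorrelations
structure = pattern.values @ phi                   # structure matrix, as reported in Table 4.7
print(pd.DataFrame(structure, index=cols).round(2))

Because the rotation is oblique, the structure matrix (the pattern matrix postmultiplied by the factor correlation matrix) is the one comparable to the loadings reported in Table 4.7.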
Campbell and Fiske (1959) introduced a comprehensive technique for evaluating the
adequacy of tests as measures of constructs called multitrait–multimethod (MTMM).
The MTMM technique includes evaluation of construct validity while simultaneously
considering examinee traits and different methods for measuring traits. To review, a trait
is defined as “a relatively stable characteristic of a person . . . which is manifested
to some degree when relevant, despite considerable variation in the range of settings
and circumstances” (Messick, 1989, p. 15). Furthermore, interpretation of traits also
implies that a latent attribute or attributes accounts for the consistency in observed
patterns of score performance. For example, MTMM analysis is used during the struc-
tural and external stages of the validation process (e.g., see Table 4.6) in an effort to
evaluate (1) the relationship between the same construct and the same measurement
method (e.g., via the reliabilities along the diagonal in Table 4.9b); (2) the relationship
between the same construct using different methods of measurement (i.e., convergent
Generalizability theory (Cronbach, Gleser, Nanda, & Rajaratnam, 1972) provides another way to systematically study construct validity. Generalizability theory is covered in detail in Chapter 8 of this book. In generalizability theory, the analysis of variance is used to study the variance components (i.e., the errors of measurement due to specific sources) attributable to examinees' scores and the method of testing (Kane, 1982). Importantly, generaliz-
ability theory is not simply the act of using analysis of variance and calling it generalizability
theory. As Brennan (1983) notes, “there are substantial terminology differences, emphasis
and scope and the types of designs that predominate” (p. 2). Readers should refer to Chapter
8 to review the foundations of generalizability theory to understand the advantages it pro-
vides in conducting validation studies. Referring to Table 4.6, we see that a generalizability
theory validity study falls into the structural stage of the construct validation process.
In generalizability theory, the score obtained for each person is considered a ran-
dom sample from a universe of all possible scores that could have been obtained (Brennan,
1983, pp. 63–68). The universe in generalizability theory typically includes multiple
dimensions known as facets. In the study of score-based validity evidence, a facet in
generalizability theory can be represented as different measurement methods. In design-
ing a generalizability theory validation study, a researcher must consider (1) the theory
specific to the construct and (2) the universe to which score inferences are to be made.
To illustrate how generalizability theory works with our example intelligence test data,
we focus on the crystallized intelligence test lexical (word) knowledge. It is possible to
measure lexical knowledge using a variety of item formats. For example, in Table 4.9b,
we see that three types of item formats are used (multiple-choice, incomplete sentence,
and vignette). These item formats might be the focus of a generalizability-based validity
study where the question of interest is, “How generalizable are the results over differ-
ent item formats?” To answer this question, a researcher can design a G-study (i.e., a
generalizability study). To conduct a G-study focusing on the impact of item format, all
examinees are tested using the different item formats (i.e., every examinee is exposed to
every item format). Also, within the context of a G theory study, we assume that the item
formats are a random sample of all possible item formats from a hypothetical universe. In
this scenario, item format is a random facet within the G-study. The goal in our example
is to estimate the generalizability coefficient within the G-study framework. Equation 4.3
(Kane, 1982) provides the appropriate coefficient for the item format random facet.
\rho^2 = \frac{\sigma_P^2}{\sigma_P^2 + \sigma_{PF}^2 + \sigma_E^2}   (4.3)

where σ_P² is the variance component for persons, σ_PF² is the person-by-item-format interaction component, and σ_E² is the residual error component.
The validity coefficient in Equation 4.3 is interpreted as the average convergent coefficient based on randomly choosing different methods for measuring the same trait from a universe of possible methods (Kane, 1982). The astute reader may recognize that Equation 4.3 may also be used to estimate score reliability, a topic covered in Chapter 7. However, the difference between interpreting Equation 4.3 as a validity coefficient versus a reliability coefficient involves the assumptions applied. To interpret Equation 4.3 as a reliability coefficient, the item format facet must be fixed. In this way, the reliability coefficient is not based on randomly chosen methods but only represents score reliability specific to the methods included in the G-study design. For example, a researcher may want
to study the impact of different test forms (e.g., an evaluation of parallel test forms) using
the same item format. In this case, the study focuses on how score reliability changes
relative to the different test forms but with the item format fixed to only one type. As
you see from this brief overview, generalizability theory provides a comprehensive way to
incorporate validity and reliability of test scores into validation studies.
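The following minimal sketch (assumed simulated data, not from the text) illustrates how the variance components behind a coefficient like Equation 4.3 can be estimated from a persons-by-item-format G-study; with a single score per person per format, the person-by-format interaction and residual error are confounded.

import numpy as np

rng = np.random.default_rng(1)
n_persons, n_formats = 200, 3

# Hypothetical scores: rows = persons, columns = item formats
true_ability = rng.normal(50, 10, size=(n_persons, 1))
format_effect = rng.normal(0, 2, size=(1, n_formats))
noise = rng.normal(0, 5, size=(n_persons, n_formats))
scores = true_ability + format_effect + noise

grand = scores.mean()
person_means = scores.mean(axis=1, keepdims=True)
format_means = scores.mean(axis=0, keepdims=True)

# Mean squares for the two-way crossed design without replication
ss_p = n_formats * ((person_means - grand) ** 2).sum()
ss_res = ((scores - person_means - format_means + grand) ** 2).sum()
ms_p = ss_p / (n_persons - 1)
ms_res = ss_res / ((n_persons - 1) * (n_formats - 1))

# Variance components: persons, and person-by-format interaction confounded with error
var_p = max((ms_p - ms_res) / n_formats, 0)
var_pf_e = ms_res

# Generalizability coefficient for a single randomly chosen item format
g_coef = var_p / (var_p + var_pf_e)
print(f"sigma2_p = {var_p:.2f}, sigma2_pf,e = {var_pf_e:.2f}, rho2 = {g_coef:.3f}")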
Although there are several approaches to establishing evidence of construct validity
of a test and the scores it yields, the driving factor for selecting a technique depends on
the intended use of the test and any inferences to be drawn from the test scores. When
developing a test, researchers should therefore be sensitive regarding what type of evi-
dence is most useful for supporting the inferences to be made from the resulting scores.
This chapter extended the information presented in Chapter 3 on validity and the valida-
tion process. The information in this chapter focused on techniques for estimating and
interpreting content and construct validity. Establishing the validity evidence of tests and
test scores was presented as an integration of three components: criterion, content,
and construct validity. This idea aligns with Messick’s conceptualization of construct vali-
dation as a unified process. The four guidelines for establishing evidence for the validity
of test scores are: (1) evidence based on test response processes, (2) evidence based on
the internal structure of the test, (3) evidence based on relations with other variables, and
(4) evidence based on the consequences of testing. Content validity was introduced, and
examples were provided regarding the role it plays in the broader context of the validity
evidence. Construct validity was introduced as the unifying component of validity. Four
types of construct validation studies were introduced and examples provided. Ideally, the
information provided in Chapters 3 and 4 provides you with a comprehensive perspective
on validity as it relates to psychometric methods and research in the behavioral sciences
in general.
Multinomial regression. Type of regression where the dependent variable is not restricted
to only two categories.
Multiple discriminant analysis. Technique used to describe differences among multiple
groups after a multivariate analysis of variance (MANOVA) is conducted. MDA is
applicable to descriptive discriminant analysis and predictive discriminant analysis
(Huberty, 1994, pp. 25–30).
Multitrait–multimethod. An analytic method that includes evaluation of construct valid-
ity relative to multiple examinee traits in relation to multiple (different) methods for
measuring such traits (Campbell & Fiske, 1959).
Multivariate analysis of variance. Technique used to assess group differences across
multiple dependent variables on a continuous scale or metric level of measurement
(Hair et al., 1998, p. 327).
Odds ratio. The ratio of the odds of an event (i.e., the probability of the event occurring divided by the probability of it not occurring) under one condition or group to the odds under another; in logistic regression, the exponentiated regression coefficient is interpreted as an odds ratio.
Predictive discriminant analysis. Technique used to predict the classification of subjects
or examinees into groups based on a combination of predictor variables or measures
(Huberty, 1994, pp. 25–30).
Predictive efficiency. A summary of the accuracy of predicted versus actual performance
of examinees based on using discriminant analysis or other regression techniques.
Selection ratio. The proportion of examinees selected based on their scores on the crite-
rion being above an established cutoff.
Structural equation modeling. A multivariate technique that combines multiple regression
(examining dependence relationships) and factor analysis (representing unmeasured
concepts or factors comprised of multiple items) to estimate a series of interdependent
relationships simultaneously (Hair et al., 1998, p. 583).
Success ratio. The ratio of valid positives to all examinees who are successful on the
criterion.
Sum of squares and cross–products matrices. A row-by-column matrix where the diagonal
elements are sums of squares and the off-diagonal elements are cross-products.
Variate. Linear combination that represents the weighted sum of two or more independent
or predictor variables that comprise the discriminant function.
5
Scaling
This chapter introduces scaling and the process of developing scaling models. As a foun-
dation to modern psychometrics, three types of scaling approaches are presented along
with their application. The relationship between scaling and psychometrics is provided.
Finally, commonly encountered data layout structures are presented.
5.1 Introduction
In Chapters 3 and 4 establishing validity evidence for scores obtained from tests was
described as a process incorporating multiple forms of evidence (e.g., through criterion,
content, and construct components—with construct validity representing a framework that
is informed by criterion and content elements). In this chapter, scaling and scaling mod-
els are introduced as essential elements to the measurement and data acquisition pro-
cess. The psychological and behavioral sciences afford many interesting and challenging
opportunities to formulate and measure constructs. In fact, the myriad possibilities often
overwhelm researchers. Recall from Chapter 1 that the primary goal of psychological
measurement is to describe the psychological attributes of individuals and the differences
among them. Describing psychological attributes involves some form of measurement or
classification scheme. Measurement is broadly concerned with the methods used to pro-
vide quantitative descriptions of the extent to which persons possess or exhibit certain
attributes. The development of a scaling model that provides accurate and reliable acqui-
sition of numerical data is essential to this process.
The goal of this chapter is to provide clarity and structure for researchers as they
develop and use scaling models. The first section in this chapter introduces scaling as
a process, provides a short history, and highlights its importance. The second section
constitutes the majority of the chapter; it introduces three types of scaling models and
provides guidance on when and how to use them. The chapter closes with a brief discus-
sion of the type of data structures commonly encountered in psychometrics.
Scaling is the process of measuring objects or subjects in a way that maximizes preci-
sion, objectivity, and communication. When selecting a scaling method, order and equal-
ity of scale units are desirable properties. For example, the Fahrenheit thermometer is a
linear scale that includes a tangible graphic component—the glass tube containing mer-
cury sensitive to temperature change. Alternatively, measuring and comparing aspects of
human perception requires assigning or designating psychological objects (e.g., words,
sentences, names, and pictures), then locating individuals on a unidimensional linear
scale or multidimensional map. Psychological objects are often presented to respondents
in the form of a sentence or statement, and persons are required to rank objects in terms
of similarity, order, or preference. In Chapter 1, the development of an effective scaling
protocol was emphasized as an essential step in ensuring the precision, objectivity, and
effective communication of the scores obtained from the scale or instrument.
A scaling model provides an operational or relational framework for assigning num-
bers to objects, thereby facilitating the transformation from qualitative constructs into
measurable metrics. Scaling is the process of using the measurement model to produce
numerical representations of the objects or attributes being measured. The scaling pro-
cess includes a visual interpretation in the form of a unidimensional scale or multi-
dimensional map. For scaling to be effective, the researcher needs to utilize a process
known as explication. This process involves conceptualizing and articulating a new or
undefined concept based on identifying meaningful relations among objects or variables.
Related to explication, Torgerson (1958, pp. 2–15) cites three interrelated issues essential
to the scaling process:
Notice that in applied psychometric work these three points provide a unified
approach to measurement, scaling research design, and analysis. Attention to these issues
is crucial because the accuracy of the results obtained from the scaling process affects score
interpretation. For example, lack of careful attention to the first point directly affects score
interpretation and ultimately the validation process as discussed in Chapters 3 and 4.
History provides important insights regarding how scaling has proven integral to the
evolution of psychological measurement. Such a perspective is useful for providing a
foundation and frame of reference for work in this area. As a precursor to modern psycho-
metrics, Stanley Smith Stevens’s chapter “Mathematics, Measurement, and Psychophys-
ics” in the Handbook of Experimental Psychology (Stevens, 1951b) provides an extensive
and unified treatment of psychological scaling. Stevens’s seminal work provided a cogent
foundation for the emerging discipline of psychological measurement, today known as
psychometrics (i.e., mind or mental measurement). The term psychometrics (i.e., mind
measuring) is based on the relationship between φ (i.e., the magnitude of the stimulus) and ψ (i.e., the probability that a subject detects or senses the stimulus, as in Figure 5.1).
Figure 5.1 displays an absolute threshold measured by the method of constant stimuli
for a series of nine stimulus intensities. Stimulus intensity is plotted on the X-axis. In Fig-
ure 5.1, an absolute threshold intensity of 9.5 corresponds to the proportion of trials yield-
ing a “yes, I sense the stimulus” response 50% (i.e., probability of .50) of the time. That is,
to arrive at a proportion of “yes” responses occurring 50% of the time, cross-reference the
Y-axis with the X-axis and you see that a stimulus intensity of 9.5 corresponds to a prob-
ability on the Y-axis of 50%. Figure 5.2 illustrates the relationship between the psychomet-
ric function in Figure 5.1 and the normal curve (i.e., the standard normal distribution).
Stevens focused on the interconnectivity among science, mathematics, and psycho-
physics in modeling empirical (observable) events and relations using mathematical sym-
bols and rules in conjunction with well-conceived scales. Stevens’s work provided much of
the foundation for modern psychometrics and was based on the idea that “when descrip-
tion gives way to measurement, calculation replaces debate” (Stevens, 1951b, p. 1).
Psychometric methods have evolved substantially since Stevens’s time and now
include an expanded philosophical ideology that has moved far beyond classic psycho-
physics (i.e., the mathematical relationship between an observable physical stimulus and
a psychological response). In fact, psychometric methods now consist of a broad array
of powerful scaling, modeling, and analytic approaches that facilitate the investigation
Figure 5.1. The psychometric function: probability of a "yes" response (ψ) plotted against stimulus intensity (absolute threshold, φ).
Figure 5.2. Relationship between the psychometric function and the normal curve.
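As an illustrative sketch only (assumed data, not from the text), the psychometric function in Figures 5.1 and 5.2 can be fit as a cumulative normal curve, with the absolute threshold recovered as the intensity detected 50% of the time.

import numpy as np
from scipy.optimize import curve_fit
from scipy.stats import norm

# Hypothetical method-of-constant-stimuli data: nine stimulus intensities
# and the proportion of trials yielding a "yes, I sense it" response.
intensity = np.array([3, 5, 7, 8, 9, 10, 11, 13, 15], dtype=float)
p_yes = np.array([.02, .08, .20, .35, .45, .58, .72, .90, .98])

def psychometric(x, mu, sigma):
    """Cumulative normal: P(yes) at intensity x, given threshold mu and spread sigma."""
    return norm.cdf(x, loc=mu, scale=sigma)

(mu_hat, sigma_hat), _ = curve_fit(psychometric, intensity, p_yes, p0=[9.0, 2.0])
print(f"Estimated absolute threshold (P = .50): {mu_hat:.2f}")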
Formally, the term scaling is the process of measuring stimuli by way of a mathemati-
cal representation of the stimulus–response curve (Birnbaum, 1998; Guilford, 1954;
Torgerson, 1958). Once the transformation from qualitative constructs into measur-
able metrics is accomplished, developing a mathematical representation of the rela-
tionship between a stimulus and response is a crucial step, allowing measurements to
be used to answer research questions. Here the term stimulus broadly means (1) the
ranking of preference, (2) the degree of agreement or disagreement on an attitudinal
scale, or (3) a yes/no or ordered categorical response to a test item representing a con-
struct such as achievement or ability. In psychophysical scaling models, the goal is
to locate stimuli along a continuum, with the stimuli, not persons, being mapped onto
a continuum. For example, a stimulus is often directly measurable, with the response
being the sensory-based perception in either an absolute or a relative sense (e.g., reac-
tion time). Examples where psychophysical scaling models are useful include studies
of human sensory factors such as acoustics, vision, pain, smell, and neurophysiology.
Conversely, when people are the focus of scaling, the term psychological scaling is
appropriate. Psychological scaling models where people are the focus are classified
as response-centered (see Table 5.1). Some examples of how psychological scaling
occurs in measurement include tests or instruments used to measure a person’s ability,
achievement, level of anxiety or depression, mood, attitude, or personality. Next we
turn to a discussion of why scaling models are important to psychometrics specifically
and research in general.
Developing an effective scaling model is essential for the measurement and acquisition
of data. For a scaling model to be effective, accuracy, precision of measurement, and
objectivity are essential elements. A scaling model provides a framework for acquiring
scores (or numerical categories) on a construct acquired from a series of individuals,
objects, or events. Scaling models are developed based on (1) the type of measurement
(e.g., composites consisting of the sum of two or more variables, an index derived as a
linear sum of item-level responses, or fundamental meaning that values exhibit properties
of the real number system) and (2) the type of scale (i.e., nominal, ordinal, interval, or
ratio). Scaling methods that produce models are categorized as stimulus-, response-, or
subject-centered (Torgerson, 1958, p. 46; Crocker & Algina, 1986, pp. 49–50). Table 5.1
provides an overview of each type of scaling approach.
The process of developing a scaling model begins with a conceptual plan that produces
measurements of a desired type. This section presents three types of scaling models—
stimulus-centered, response-centered, and subject-centered (Nunnally & Bernstein,
1994; Torgerson, 1958, p. 46)—relative to the type of measurements they produce. Two
of the models, response-centered and stimulus-centered, provide a statistical framework
for testing the scale properties (e.g., if the scale actually conforms to the ordinal, interval,
or ratio level of measurement) based on the scores obtained from the model. Alterna-
tively, in the subject-centered approach, scores are derived by summing the number of
correct responses (e.g., in the case of a test of cognitive ability or educational achieve-
ment) or by averaging scores on attitudinal instruments (e.g., Likert-type scales). In the
subject-centered approach, test scores are composed of linear sums of items (producing
a total score for a set of items) and are assumed to exhibit properties of order and equal
intervals (e.g., see Chapter 2 for a review of the properties of measurement and associ-
ated levels of measurement). At this juncture, you may ask whether you should analyze
subject-centered data using ordinal- or interval-based techniques. The position offered
here is the same as the one Frederic Lord and Melvin Novick (1968, p. 22) provided:
If scores provide more useful information for placement or prediction when they are treated
as interval data, they should be used as such. On the other hand, if treating the scores as
interval-level measurements actually does not improve, or lessens their usefulness, only the
rank order information obtained from this scale should be used.
Both psychophysics and psychometrics involve some form of stimulus and response. Mosier (1940) suggested that the theorems of psy-
chophysics could be applied to psychometrics by means of transposing postulates and
definitions in a logical and meaningful way. For example, researchers in psychophysics
model the response condition as an indicator of a person’s invariant (i.e., unchanging)
attribute. Response conditions stem from sensory perception of a visual, auditory, or
tactile stimulus. These person-specific attributes vary in response to the stimulus but are
invariant or unchanging from person to person (Stevens, 1951a). Conversely, psychome-
tricians treat the response condition as indicative of an attribute that varies from person
to person (e.g., knowledge on an ability or achievement test). However, the critical con-
nection between psychophysics and psychometrics is the stimulus–response relationship in the
measurement of perceptions, sensation, preferences, judgments, or attributes of the persons
responding.
To summarize, the main difference between psychophysics and psychometrics is
in the manner each mathematically models the invariance condition. In psychophysics,
the attribute varies within persons for the stimulus presented but is invariant from per-
son to person. In psychometrics, responses are allowed to vary from person to person,
but the attribute is invariant in the population of persons. In the mid- to late 20th cen-
tury, psychometrics incorporated the fundamental principles of classic psychophysics
to develop person- or subject-oriented, response-based measurement models known as
item response theory or latent trait theory, which involves studying unobserved attri-
butes (see Chapter 10).
Louis Leon Thurstone (1887–1955) developed a theory for the discriminate modeling of
attitudes by which it is possible to construct a psychological scale. Thurstone’s law of
comparative judgment (1927) provided an important link between normal distribution
(Gaussian or cumulative normal density function) statistical theory and the psychophysi-
cal modeling tradition by defining a discriminal process as a reaction that correlates
with the intensity of a stimulus on an interval scale. Thurstone’s law uses the variability
of judgments to obtain a unit of measurement and assumes that the errors of observations
are normally distributed. The assumption of normality of errors allows for application
of parametric statistical methods to scaling psychological attributes. Although the com-
parative judgment model was formulated for use on preferential or paired comparison
data, it is applicable to any ordinal scaling problem. Thurstone’s method is different from
methods previously introduced in that it is falsifiable, meaning that the results are able to
be subjected to a statistical test of model-data fit. For example, responses by subjects to
stimuli must behave in a certain way (i.e., response patterns are expected to conform to a
particular pattern); otherwise the model will not “fit” the data. Application of Thurstone’s
law of comparative judgment requires that equally often noticed differences in stimuli by
persons are in fact equal. The law is provided in Equation 5.1.
S_1 - S_2 = x_{12}\sqrt{\sigma_1^2 + \sigma_2^2 - 2r\sigma_1\sigma_2}   (5.1)

• S1 – S2 = linear distance between two points on a psychological continuum.
• x12 = standard deviation of the observed proportion of judgments, P(R1 > R2).
• σ1² = relative discriminal dispersion of stimulus 1.
• σ2² = relative discriminal dispersion of stimulus 2.
• r = correlation between the two discriminal deviations involved in the judgments.
Under simplifying assumptions (equal discriminal dispersions and uncorrelated discriminal deviations, as in Thurstone's Case V), Equation 5.1 reduces to S_J - S_K = z_{JK}\sigma\sqrt{2}, allowing for a natural framework for measuring psychological attributes on a latent con-
tinuum. In Thurstone’s law, the process of response or discrimination functions indepen-
dently of stimulus magnitudes; therefore, there is no objective criterion for the accuracy
of each judgment. For example, the judgments are not proportions of correct judgments;
rather, they represent a choice between two stimuli. For an applied example of Thurstone's equal-interval approach to measuring attitudes, see Gable and Wolfe (1993, pp. 42–49). Their exposition covers item generation and selection through locating persons on a response continuum.
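A minimal sketch of Thurstone Case V scaling (assumed data and simplifying assumptions, not from the text): paired-comparison proportions are converted to normal deviates, and column means yield interval-scale values for the stimuli.

import numpy as np
from scipy.stats import norm

# p[j, k] = hypothetical proportion of judges preferring stimulus k over stimulus j
p = np.array([
    [0.50, 0.70, 0.80, 0.90],
    [0.30, 0.50, 0.65, 0.80],
    [0.20, 0.35, 0.50, 0.70],
    [0.10, 0.20, 0.30, 0.50],
])

z = norm.ppf(p)                      # normal deviates z_jk
scale_values = z.mean(axis=0)        # column means = Case V scale values
scale_values -= scale_values.min()   # anchor the lowest stimulus at zero
print(np.round(scale_values, 2))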
Ordinal scaling approaches involve rank-ordering objects or people from highest to low-
est (e.g., on measure of preference or on how similar pairs of objects are). The rank-
ordering approach to scaling provides data in the form of dominance. For example, in
preference scaling, a particular stimulus dominates over another for a respondent (i.e.,
a person prefers one thing over another). Therefore, in the rank-ordering approach,
dominance relative to one stimulus over another is dictated by greater than or less than
inequalities based on rank-order values. Rank-order approaches to scaling are ordinal in
nature, and two commonly used methods are (1) paired comparisons and (2) direct
rankings. The method of paired comparisons (Tables 5.2a and 5.2b) involves counting
the votes or judgments for each pair of objects by a group of respondents. For example,
objects may be statements that subjects respond to. Alternatively, subjects may rank-
order pairs of objects by their similarities. To illustrate, in Table 5.2a pairs of depres-
sion medications are presented and subjects are asked to rank-order the pairs from most
to least similar in terms of their effectiveness based on their experience. The asterisk
denotes the respondent’s preferred drug. The votes or judgments are inversely related to a
ranking; for example, the category or statement receiving the highest vote count receives the highest ranking (in traditional scaling methods a value of 1 is highest). The rankings are then compiled into a similarity matrix as shown in Table 5.2b.

Table 5.2b
Similarity matrix

            Prozac   Paxil   Zoloft   Cymbalta
Prozac
Paxil          5
Zoloft         4        1
Cymbalta       6        3       2
Direct ranking involves providing a group of people a set of objects or stimuli (e.g.,
pictures, names of well-known people, professional titles, words), and having the people
rank-order the objects in terms of some property (Tables 5.3a and 5.3b). The property
may be attractiveness, reputation, prestige, pay scale, or complexity. Table 5.4 extends
direct ranking to rating the similarity between pairs of words. For extensive details on
a variety of scaling approaches specific to order, ranking, and clustering, see Guilford
(1954) and Lattin, Carroll, and Green (2003). Additionally, Dunn-Rankin, Knezek, Wal-
lace, and Zhang (2004) provide excellent applied examples and associated computer pro-
grams for conducting a variety of rank-based scaling model analyses.
Next we turn to an important scaling model, the Guttman scaling model, whose
focus is on locating subjects along a continuum based on the strength of their response
to a stimulus (e.g., a test item). This model is one of the first to appear in psychological
measurement.
Table 5.3a

Pairs                Similarity rank
sister–brother             1
sister–niece               3
sister–nephew              6
brother–niece              5
brother–nephew             4
nephew–niece               2

Table 5.3b
Similarity matrix

            Sister   Brother   Niece   Nephew
Sister
Brother        1
Niece          3         5
Nephew         6         4        2
In response-centered scaling, persons are located along a continuum based on the strength of their responses to items. In turn, items are scaled based on the amount or magnitude of the trait manifested in persons. The Guttman (1941) scaling model (see also Aiken, 2002) was one of the first approaches that provided a unified response-scaling framework. In Guttman's
technique, statements (e.g., test items or attitudinal statements) are worded in a way that
once a person responds at one level of strength or magnitude of the attribute, the person
should (1) agree with attitude statements weaker in magnitude or (2) correctly answer
test items that are easier. Based on these assumptions, Guttman proposed the method of
scalogram analysis (Aiken, 2002, p. 36) for evaluating the underlying dimensionality of a
set of items comprising a cognitive test or attitudinal instrument. For example, the unidi-
mensionality and efficacy of a set of items can be evaluated based on a comparison of the
expected to actual response patterns to test items for a sample of subjects.
A result of applying the Guttman scaling approach is that persons are placed or
located in perfect order in relation to the strength of their responses. In practice, pat-
terns of responses that portray perfect Guttman scales are rare. For this reason, the
Guttman approach also provides an equation to derive the error of reproducibility
based on expected versus actual item response patterns obtained from a sample of
persons (i.e., a test of fit of the model based on the responses). Overall, the frame-
work underlying the Guttman approach is useful in the test or instrument develop-
ment process where person response profiles of attitude, ability, or achievement are of
interest relative to developing items that measure attributes of progressively increasing
degree or difficulty. For detailed information on Guttman scaling, see Guttman (1944),
Torgerson (1958), and Aiken (2002).
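The sketch below (assumed data, not from the text) illustrates the logic of scalogram analysis: order items by difficulty, generate each person's expected perfect Guttman pattern from his or her total score, and summarize departures from that pattern with a coefficient of reproducibility.

import numpy as np

# Hypothetical 0/1 responses: rows = persons, columns = items (ordered easy to hard)
X = np.array([
    [1, 1, 1, 1],
    [1, 1, 1, 0],
    [1, 1, 0, 0],
    [1, 0, 1, 0],   # one response pattern out of Guttman order
    [1, 0, 0, 0],
    [0, 0, 0, 0],
])

totals = X.sum(axis=1)
# Expected (perfect Guttman) pattern for each person given his or her total score
expected = (np.arange(X.shape[1])[None, :] < totals[:, None]).astype(int)
errors = np.abs(X - expected).sum()
rep = 1 - errors / X.size
print(f"Errors = {errors}, coefficient of reproducibility = {rep:.3f}")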
One of the most widely accepted models for scaling subjects (i.e., people) and items (i.e.,
stimuli) on object preference or similarity is the unidimensional unfolding technique
(Coombs, 1964). Unfolding was developed to study people’s preferential choice (i.e.,
behavior). Central to the technique is the focus on analysis of order relations in data that
account for as much information as possible. The order relations in unfolding techniques
are analyzed based on distances. By quantifying distances rigorously, interval level of
measurement is attained for nonmetric-type data. This approach differs from the scaling
of test scores based on an underlying continuous construct or trait (e.g., in intelligence
or achievement testing). The term preference refers to the manner in which persons prefer
one set of objects over another set modeled as an order relation on the relative proximity
of two points to the ideal point.
Unfolding is based on representational measurement, which is a two-way process
“defined by (1) some property of things being measured and (2) some property of the
measurement scale” (Dawes, 1972, p. 11). The goal of unfolding is to obtain an interval
scale from ordinal relations among objects. Unfolding theory is a scaling theory designed
to construct a space with two sets of points, one for persons and one for the set of
objects of choice. By doing so, unfolding uses all of the possible data in rank-order tech-
niques. To this end, the unfolding model is the most sophisticated approach to scaling
preference data. The "things" being measured in the unfolding model are objects that may be (1) physical, such as an image or picture, a weight, or actions or services; (2) sensory perceptions, such as smell or taste; or (3) psychological, such as word meanings or mathematical concepts. The "property of the scale" is distance
or location along a straight line. Taken together, a two-way correspondence model is
established: (1) the property of the things being measured (the empirical part) and (2)
the measurement scale (the formal relational system part). Based on such two-way cor-
respondence, the unfolding model qualifies as a formal measurement model residing
somewhere between an ordinal and interval level of measurement by Stevens’s (1951a)
classification system.
Figure 5.3. Unidimensional unfolding with four stimuli (A, B, C, D) on a J-scale: (b) letter pairs (AB, AC, AD, BC, BD, CD) indicating the location of the midpoints of stimuli on the J-scale; (d) folding the J-scale at point (person) X.
For example, in (d) in Figure 5.3, when the J-scale is "folded" up into an
“I”-axis (called the individual scale), we see the response pattern and relational proxim-
ity for person (X) located in region two of the J-scale. After folding the J-scale, the “I”
scale represents the final rank order of person X. This result is interpreted as the relative
strength of preference expressed for a particular object or pair of objects. Each person
mapped onto an unfolding model will have a location on the J-scale and will therefore
have a corresponding I-scale that provides a rank order. Finally, when there are more
than four objects and more than a single dimension is present (as is sometimes the case),
the unidimensional unfolding model has been extended to the multidimensional case
by Bennett and Hayes (1960) and Lattin et al. (2003). Readers interested in the multidi-
mensional approach to unfolding and extensions to nonmetric measurement and metric
multidimensional scaling (MDS) are referred to Lattin et al. (2003) for applied examples.
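A minimal sketch of the folding idea (assumed stimulus locations and ideal point, not from the text): a person's I-scale is simply the rank order of the stimuli by their distance from the person's ideal point on the J-scale.

# Python sketch of "folding" a J-scale at a person's ideal point
j_scale = {"A": 1.0, "B": 3.0, "C": 6.0, "D": 9.0}   # hypothetical stimulus locations
ideal_point = 3.8                                    # person X's location on the J-scale

distances = {k: abs(v - ideal_point) for k, v in j_scale.items()}
i_scale = sorted(distances, key=distances.get)       # closest = most preferred
print("I-scale for person X:", " > ".join(i_scale))  # e.g., B > C > A > D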
Figure (GfGc test structure). General intelligence (G) comprises fluid intelligence (Gf; tests 1–3), crystallized intelligence (Gc; tests 1–4), and short-term memory (Stm; tests 1–3), with each subtest composed of its constituent items.

Figure (intelligence continuum). Five persons (P1–P5) located along a continuum of intelligence ranging from lower (1) to higher (10).

never justified   1   2   3   4   5   always justified
Figure 5.7. A Likert-type item for the measurement of attitude toward the use of intelligence tests.
Figure 5.8 illustrates the semantic differential scale. This scale (Osgood, Tannenbaum,
& Suci, 1957) is an example of an ordered categorical scale (Figure 5.8). It measures a
person’s reaction to words and/or concepts by eliciting ratings on bipolar scales defined
with contrasting adjectives at each end (Heise, 1970). According to Heise, “Usually, the
position on the scale marked 0 is labeled ‘neutral,’ the 1 positions are labeled ‘slightly,’ the
2 positions ‘quite,’ and the 3 positions ‘extremely’” (p. 235).
Yet another type of ordered categorical scale is the behavior rating scale. Figure 5.9
illustrates a behavior rating scale item that measures student participation in class. We
see that we are measuring a student’s frequency of participation in class. The behavior we
are measuring is “class participation.” After acquiring data from a sample of students on
such a scale, we can evaluate individual differences among students according to their
participation behavior in class.
Figure 5.8. A semantic differential scale (concept rated: intelligence tests) for the measurement of attitude toward intelligence tests.
Figure 5.9. A behavior rating scale (rating points 5 to 1) for the measurement of student participation in class.
Ideally, items that comprise ordered categorical, summated rating, or Likert-type scales have been developed systematically by first ensuring that objective ratings of
similarity, order, and/or value exist for the set of items relative to the construct or attribute
being measured. Second, the unidimensionality of the set of items should be examined
to verify if the items actually measure a single underlying dimension (e.g., see Chapter
9). The step of verifying the dimensionality of a set of items usually occurs during some
form of pilot or tryout study. If a set of items exhibits multidimensionality (e.g., it taps
two dimensions rather than one), the analytic approach must provide for the multidi-
mensional nature of the scale. The topic of dimensionality and its implications for scale
analysis and interpretation will be covered in detail in Chapter 9 on factor analysis and in
Chapter 10 on item response theory. Finally, although the assumption of equal intervals
(i.e., widths between numbers on an ordinal scale) is often made in practice, this assump-
tion often cannot be substantiated from the perspective of fundamental measurement.
Given this apparent quandary, the question regarding how one should treat scores based
on index measurement—at an interval or ordinal level—often arises. Lord and Novick
(1968) provide an answer to this question by stating that one should treat scores acquired
from index-type measurement as interval level:
If scores provide more useful information for placement or prediction when they are treated
as interval data, they should be used as such. On the other hand, if treating the scores as
interval-level measurements actually does not improve, or lessen their usefulness, only the
rank order information obtained from this scale should be used. (p. 22)
Summated rating scales and Likert-type scales are not grounded in a formal mea-
surement model, so statistical testing of the scale properties of the index scores is not
possible (Torgerson, 1958). However, in using summated rating and Likert scaling pro-
cedures, the scaling model yields scores that are assumed to exhibit properties of order
and approximately equal units. Specifically, the following assumptions are applied: (1)
category intervals are approximately equal in length, (2) category labels are preset sub-
jectively, and (3) the judgment phase usually conducted during item or object devel-
opment as a precursor to the final scale is replaced by an item analysis performed on the
responses acquired from a sample of subjects. Therefore, Likert scaling combines the
steps of judgment scaling and preference scaling into a single step within an item
analysis. Importantly, such assumptions should be evaluated based on the distribu-
tional properties of the actual data. After assumptions are examined and substantiated,
subject-centered scaling models often provide useful scores for a variety of psychologi-
cal and educational measurement problems.
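As a minimal sketch of the item analysis step mentioned above (assumed simulated Likert responses, not from the text), the example computes summated total scores, corrected item-total correlations, and coefficient alpha.

import numpy as np

rng = np.random.default_rng(7)
n, k = 300, 5
trait = rng.normal(0, 1, n)

# Hypothetical 1-5 Likert responses driven by a single underlying trait
items = np.clip(np.round(3 + trait[:, None] + rng.normal(0, 1, (n, k))), 1, 5)

total = items.sum(axis=1)
for j in range(k):
    rest = total - items[:, j]                     # corrected total (item removed)
    r = np.corrcoef(items[:, j], rest)[0, 1]
    print(f"item {j + 1}: corrected item-total r = {r:.2f}")

item_var = items.var(axis=0, ddof=1).sum()
alpha = k / (k - 1) * (1 - item_var / total.var(ddof=1))
print(f"coefficient alpha = {alpha:.2f}")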
Organizing data in a way that is useful for analysis is fundamental to psychometric meth-
ods. In fact, without the proper organizational structure, any analysis of data will be
unsuccessful. This section presents several data structures that are commonly encoun-
tered and concludes with some remarks and guidance on handling missing data.
Scaling 161
The most basic data matrix consists of N persons/subjects (in the rows) by k stimuli/
items (in the columns). This two-way data matrix is illustrated in Table 5.7. The entire
matrix is represented symbolically using an uppercase bold letter X. The data (i.e., scalar values) may take the form of 1 or 0 (correct/incorrect), ordinal, multiple categorical (unordered), or interval on a continuous scale of, say, 1 to 100. The first
subscript denotes the row (i.e., the subject, person, or object being measured) and the
second subscript the column (e.g., an exam or questionnaire item or variable); that is, xij,
denotes the response of subject i to item j. Scalars are integers, and each scalar in a matrix
(rows × columns) is an element (Table 5.7).
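A minimal sketch (assumed values) of the basic N-by-k data matrix X, with x_ij denoting the response of person i to item j.

import numpy as np

X = np.array([
    [1, 0, 1, 1],   # person 1
    [0, 0, 1, 0],   # person 2
    [1, 1, 1, 1],   # person 3
])
print(X.shape)      # (N, k) = (3, 4)
print(X[1, 2])      # x_23 in the book's notation (Python indexing is 0-based)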
A more complex data arrangement is the two-dimensional matrix with repeated mea-
surement occasions (time points) (Table 5.8). Still another data matrix commonly encoun-
tered in psychometrics is a three-dimensional array. Matrices of this type are encountered
in the scaling and analysis of preferences where multiple subjects are measured on mul-
tiple attributes (e.g., preferences or attitudes) and multiple objects (e.g., products or ser-
vices). Using the field of market research as an example, when a company manufactures a
product or offers a service in a for-profit mode, we find that it is essential that the company
evaluate its marketing effectiveness related to its product or service. Such research informs
the research and development process, so that the company remains financially solvent. To
effectively answer the research questions and goals, some combination of two- and three-
dimensional matrices may be required for a thorough analysis. Usually, the type of data
matrix is multivariate and involves people’s or subjects’ judgment of multiple attributes
of the product or service in question. Such matrices include multiple dependent variables
and repeated measurements (e.g., ratings or responses on an attitude scale) on the part of
subjects who are acting as observers or judges.
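A minimal sketch (assumed values) of a three-dimensional data array of the kind described above: subjects by objects by attributes.

import numpy as np

rng = np.random.default_rng(0)
# 50 subjects rate 4 objects (e.g., products) on 3 attributes, using a 1-5 scale
ratings = rng.integers(1, 6, size=(50, 4, 3))
print(ratings[0, 2, :])   # subject 1's ratings of object 3 on all three attributes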
Incomplete data poses unique problems for researchers on the level of measurement, research
design, and statistical analysis. Regardless of the reason for the incomplete data matrix,
researchers have multiple decision points to consider regarding how to properly proceed.
The missing data topic is complex and beyond the scope of this text. Excellent information
and guidance on the topic is available in Enders (2011) and Peters and Enders (2002).
This chapter began with connecting ideas from Chapters 3 and 4 on validity and the
validation process to the role of scaling and developing scaling models. Also, we were
reminded that essential to any analytic process is ensuring the precision, objectivity, and
effective communication of the scores acquired during the course of instrument devel-
opment or use. The development of a scaling model that provides accurate and reliable
acquisition of numerical data is essential to this process. The goal of this chapter has been
to provide clarity and structure to aid researchers in developing and using scaling models
in their research. To gain perspective, a short history of scaling was provided. The chap-
ter focused on three types of scaling models, stimulus-, subject-, and response-centered.
Next, guidance on when and how to use these models was provided along with examples.
The chapter closed with a brief discussion of the type of data structures or matrices com-
monly encountered in psychometrics and a brief mention of the problem of missing data.
Difference limen. The amount of change in a stimulus required to produce a just notice-
able difference.
Direct rankings. Involve providing a group of people a set of objects or stimuli (e.g., pic-
tures, names of well-known people, professional titles, words) and having the people
rank-order the objects in terms of some property.
Discriminal process. A reaction that correlates with the intensity of a stimulus on an
interval scale.
Element. A scalar in a row-by-column matrix.
Error of reproducibility. An equation to test the Guttman scaling model assumptions that
is based on expected versus actual item response patterns obtained from a sample
of persons.
Index measurement. Measurement that focuses on the property of the attribute being
measured, resulting in a numerical index or scale score.
Item response theory. A theory in which fundamental principles of classic psychophysics
were used to develop person-oriented, response-based measurement.
Judgment scaling. Scaling that produces absolute responses to test items such as yes/
no or correct/incorrect.
Just noticeable difference. The smallest amount of change in stimulus intensity, above or below a reference level, that a subject can reliably detect.
Multidimensional map. Map used in multidimensional scaling to graphically depict
responses in three-dimensional space.
Nonmetric measurement. Categorical data having no inherent order that are used in
unidimensional and multidimensional scaling.
Paired comparisons. Involve counting the votes or judgments for each pair of objects by a
group of respondents. For example, objects may be statements that subjects respond to.
Alternatively, subjects may rank-order pairs of objects by their similarities.
Person response profiles. Used when, for example, the measurement of attitude, ability,
or achievement is of interest relative to developing items that measure attributes of
progressively increasing degree or difficulty.
Preference scaling. Scaling that involves the relative comparison of two or more attri-
butes such as attitudes, interests, and values.
Psychological objects. Words, sentences, names, pictures, and the like that are used to
locate individuals on a unidimensional linear scale or multidimensional map.
Psychological scaling. The case in which people are the objects of scaling, such as
where tests are developed to measure a person’s level of achievement or ability.
Psychometrics. A mind-measuring function based on the relationship between φ (i.e., the magnitude of the stimulus) and ψ (i.e., the probability that a subject detects or senses the stimuli).
Psychophysical scaling. Stimulus is directly measurable, with the response being the
sensory perception in either an absolute or relative sense.
Psychophysics. The study of dimensions of physical stimuli (usually intensity) and the
related response to such stimuli known as sensory perception or sensation.
Response-centered scaling. Response data are used to scale subjects along a psycholog-
ical continuum while simultaneously subjects are also scaled according to the strength
of the psychological trait they possess. Examples of scaling techniques include
Guttman scaling, unidimensional and multidimensional unfolding, item response theory,
latent class analysis, and mixture models.
Scaling. The process by which a measuring device is designed and calibrated and the
manner by which numerical values are assigned to different amounts of a trait or
attribute.
Scaling model. Scaling that begins with a conceptual plan that produces measurements
of a desired type. Scaling models are then created by mapping a conceptual frame-
work onto a numerical scale.
Sensory threshold. A critical point along a continuous response curve over a direct physical
dimension, where the focus of this relationship is often the production of scales of human
experience based on exposure to various physical or sensory stimuli.
Stimulus-centered scaling. Scaling that focuses on responses to physical stimuli in rela-
tion to the stimuli themselves. The class of research is psychophysics with problems
associated with detecting physical stimuli such as tone, visual acuity, brightness, or
other sensory perception.
Subject-centered scaling. Tests of achievement or ability or other psychological con-
structs where, for example, a subject responds to an item or statement indicating
the presence or absence of a trait or attribute. Attitude scaling includes a subject
responding to a statement indicating the level of agreement, as in a Likert scale.
Thurstone’s law of comparative judgment. Defines a discriminal process as a reaction that correlates with the intensity of a stimulus on an interval scale; uses the variability of judgments to obtain a unit of measurement; and assumes the phi-gamma hypothesis (i.e., normally distributed errors of observation).
Unidimensional scale. A set of items or stimuli that represent a single underlying con-
struct or latent dimension.
Unidimensional unfolding technique. A technique involving the representation of per-
sons (labeled as i) and stimuli or objects (labeled as j) in a single dimension repre-
sented on a number line. In psychological or educational measurement, data are
sometimes acquired based on respondents providing global responses to statements
such as (1) concept A is more similar to concept B than to C, or (2) rate the similarity
of word meanings A and B on a 10-point scale. Unfolding provides a way to repre-
sent people and stimuli jointly in space such that the relative distances between the
points reflect the psychological proximity of the stimuli to the people or their ideals in
a single dimension.
6
Test Development
This chapter provides foundational information on test and instrument development, item
analysis, and standard setting. The focus of this chapter is on presenting a framework and
process that, when applied, produces psychometrically sound tests, scales, and instruments.
6.1 Introduction
The following guidelines describe the major components and technical considerations for effective test and/or instrument construction (Figure 6.1). In addition to providing a coherent approach, applying this framework yields evidence that supports arguments about the adequacy of validity claims relative to the purported use of scores obtained from tests and/or measurement instruments.
Figure 6.1. Steps in test and instrument development, including conducting a pilot test with a representative sample and conducting item analyses and factor analysis.
Figure 6.2. Structure of the GfGc example used in this book: fluid intelligence (Gf) subtests 1–3, crystallized intelligence (Gc) subtests 1–4, and short-term memory (Stm) subtests 1–3, each composed of individually numbered items.
Group/class-level uses
Modification of instruction: pretest at outset of course (CR), which informs the instructional plan using student achievement scores.
Instructional value or success: posttest at end of course (CR), indicating the knowledge required for the standard of acceptable course attainment; and critical review and evaluation of the course for improvement (CR), permitting within- and between-school comparison of the course domain to courses in other schools.
Program value: evaluation of progress across courses in a subject-matter area (CR), reflecting educational achievement over time relative to established expectations of improvement or progress.
Note. CR, criterion-referenced; NR, norm-referenced.
Our GfGc example measures fluid intelligence, crystallized intelligence, and short-term memory (Figure 6.2). Table 6.2 (introduced in Chapter 1) provides a review of these constructs and their associated subtests.
Our next task is to specify how much emphasis (weight) to place on each subtest
within the total test structure. To accomplish this task, we use a test blueprint. Table 6.3 provides an example of such a blueprint, formally called a table of specifications, based on Figure 6.2. Note in Table 6.3 the two-way framework for specifying how the individual components work in unison in relation to the total test.
In Table 6.3, each of the subtests within these components of intelligence is clearly
identified, weighted by influence, and aligned with a cognitive skill level as articulated by
Bloom’s taxonomy (Bloom, Engelhart, Furst, Hill, & Krathwohl, 1956), Ebel and Frisbie’s
relevance guidelines (1991, p. 53), and Gagné and Driscoll’s (1988) learning outcomes
framework. For a comparison of the three frameworks, see Table 6.4.
Millman and Greene (1989, p. 309) provide one other approach to establishing a
clear purpose for a test that focuses on the testing endeavor as a process. Millman and
Greene’s approach includes consideration of the type of inference to be made (e.g., indi-
vidual attainment, mastery, or achievement) cross-referenced by the domain to which
score inferences are to be made.
Reviewing Table 6.1 reveals the myriad options and various decisions to be con-
sidered in test development. For example, test scores can be used to compare exam-
inees or persons to each other (e.g., a normative test) or to indicate a particular level
of achievement (e.g., on a criterion-based test). With regard to the test of intelligence
used throughout this book, scores are often used in a normative sense where a person is
indexed at a certain level of intelligence relative to scores that have been developed based
on a representative sample (i.e., established norms). This type of score information is
also used in diagnosis and/or placement in educational settings. Because placement and
selection are activities that have a high impact on people’s lives, careful consideration is
crucial to prevent misclassification. For example, consider a child who is incorrectly classified as learning disabled on the basis of his or her test score. Such a misclassification may result in the child being placed in a class at an inappropriate educational level.
Another (perhaps extreme) example in the domain of intelligence is the case where
an incarcerated adult may be incorrectly classified in a way that requires him or her to
remain on death row. If during the process of test development, inadequate attention is
paid to what criteria are important for accurate classification or selection, a person or
persons might be placed in an incorrect educational setting or be required to serve in a
capacity that is unfitting for their actual cognitive ability.
Table 6.4. Comparison of the Bloom, Ebel and Frisbie, and Gagné and Driscoll frameworks (categories include, e.g., analysis, synthesis, and motor skills). Note. From Ebel and Frisbie (1991, p. 53). Reprinted with permission from the authors.
Finally, careful selection of the domains that test scores are intended to be linked
to will minimize the risk of inappropriate inferences and maximize the appropriateness
of score inferences and use (an important issue in the process of test validation; see
Chapters 3 and 4).
The criterion score approach to using test scores is perhaps best exemplified in high-
stakes educational testing. For example, students must earn a certain score in order to qual-
ify as “passing,” resulting in their matriculating to the next grade level. This is an example
of using criterion-based test scores for absolute decisions. Notice that a student’s perfor-
mance is not compared to his or her peers, but is viewed against a standard or criterion.
Subject-matter experts such as clinical adult and/or school psychologists, licensed professional counselors, and others in psychiatry provide invaluable expert judgment based on their first-hand experience with the construct a test purports to measure. The actual manner of collecting
information from psychologists may involve numerous iterations of personal interviews,
group meetings to ensure adequate content coverage, or written survey instruments. The
input gleaned from subject-matter experts is an essential part of test development and the
validity evidence related to the scores as they are used following publication of the test.
The process of interviewing key constituents is iterative and involves a cyclical approach
(with continuous feedback among constituents) until no new information is garnered
regarding the construct of interest. Closely related to expert judgment is a comprehensive
review of the related literature; subject-matter experts make an important contribution
by providing their expertise on literature reviews.
Content analysis is sometimes used to generate categorical subject or topic areas.
Applying content analysis involves a brainstorming session in which questions are posed
to subject-matter experts and others who will ultimately be using the test with actual
examinees. The responses to the open-ended questions are used to identify and then cat-
egorize subjects or topics. Once the topic areas are generated, they are used to guide the
test blueprint (e.g., see Table 6.3).
Another approach to identifying attributes relative to a construct is to acquire information based on direct observations. For example, direct observations conducted by actively practicing clinical or school psychologists, professional counselors, and licensed behavioral therapists often provide a way to identify critical behaviors or incidents specific to the construct of interest. In this approach, extreme behaviors can be identified, offering valuable information at the extreme ends of the underlying psychological continuum that can then be used to develop the score range to be included on the
distribution of scores or normative information. Finally, instructional objectives serve an
important role in test development because they specify the behaviors that students are
expected to exhibit upon completion of a course of instruction. To this end, instructional
objectives link course content to observable measurable behaviors.
To acquire the sample, some form of sampling technique is required. There are two
general approaches to sampling—nonprobability (nonrandom) and probability (ran-
dom). In nonprobability sampling, there is no probability associated with sampling a
person or unit. Therefore, no estimation of sampling error is possible. Conversely, probability samples are those in which every element (i.e., person) has a nonzero chance of being selected and the elements are selected through a random process; each element (person) must have at least some chance of selection, although the chances are not required to be equal. By
instituting these two requirements, values for an entire population can be estimated with
a known margin of error. Two other types of sampling techniques (one nonprobability
and the other probability) are (1) proportionally stratified and (2) stratified random
sampling. In proportionally stratified sampling, subgroups within a defined population
are identified as differing on a characteristic relevant to a researcher or test developer’s
goal. Using a proportionally stratified sampling approach helps account for these char-
acteristics that differ among population constituents, thereby preventing systematic bias
in the resulting test scores. Using the stratified random sampling approach gives every
member in the strata of interest (e.g., the demographic characteristics) a proportionally
equal chance of being selected in the sampling process. The explicit details of conducting
the various approaches of random and nonrandom sampling protocols are not presented
here. Readers are referred to excellent resources such as Levy and Lemeshow (1991) and
Shadish, Cook, and Campbell (2002) to help develop an appropriate sampling strategy
tailored to the goal(s) of their work.
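As a brief illustration of the stratified approaches described above, the following Python sketch draws a proportionally allocated random sample within strata. The data frame, the stratum variable (region), and the sampling fraction are hypothetical; readers should substitute the characteristics relevant to their own norming goals. This is a minimal sketch, not a complete sampling plan.

import pandas as pd

# Hypothetical norming frame: one row per potential examinee, with a
# demographic stratum relevant to the test developer's goal.
population = pd.DataFrame({
    "person_id": range(1, 1001),
    "region": ["urban"] * 600 + ["suburban"] * 250 + ["rural"] * 150,
})

# Stratified random sampling with proportional allocation: sampling the same
# fraction within every stratum keeps each stratum represented in proportion
# to its share of the population.
sample = (
    population
    .groupby("region", group_keys=False)
    .sample(frac=0.10, random_state=42)
)

print(sample["region"].value_counts())  # roughly 60, 25, and 15 examinees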
eliciting responses specific to content or ability than others. A detailed presentation of the
numerous types of item formats is beyond the scope of this book. For a summary of the
types of multiple-choice item formats available and when they are appropriate for use, see
Haladyna (2004, p. 96). Table 6.5 provides Haladyna’s recommendations.
Haladyna (2004, p. 99) provides a general set of item-writing guidelines aided
by an extensive discussion with 31 guidelines. The guidelines are grouped according
to (1) content guidelines, (2) style and format concerns, (3) writing item stems, and
(4) writing choice options. Some important points highlighted by Haladyna include
the following:
1. Items should measure a single important content as specified in the test specifi-
cations or blueprint.
2. Each test item should measure a clearly defined cognitive process.
3. Trivial content should be avoided.
4. Items should be formatted (i.e., style considerations) in a way that is not distract-
ing for examinees.
5. Reading comprehension level should be matched to the examinee population.
6. Correct grammar is essential.
7. The primary idea of a question should be positioned on the stem rather than in
the options.
8. Item content must not be offensive or culturally biased.
The following items provide examples from the fluid and crystallized intelligence
subtests used throughout this book.
A sweater that normally sells for 90 dollars is reduced by 20% during a sale. What is the sale
price of the sweater?
A. 71 dollars
B. 75 dollars
C. 72 dollars
D. 76 dollars
Scoring rule: 1 point awarded for correct response, 0 points awarded for incorrect response.
Time limit is 30 seconds on this item.
Scoring rule: To earn 2 points, the following answer options are acceptable: (a) to describe, (b) to outline, (c) to explain in detail. To earn 1 point, the following answer options are acceptable: (a) to explain with accuracy, (b) to mark, (c) portray, (d) to characterize. The criteria for earning 0 points include the following answer options: (a) ambiguous, (b) to be vague, (c) nonsense, (d) to portray.
Note that the scoring rule produces a polytomous score of 0, 1, or 2 points for an exam-
inee, yielding an ordinal level of measurement (i.e., on the crystallized intelligence exam-
ple item). Also, in tests of cognitive ability, scoring rules are often more complex than the
preceding example. For example, there are additional scoring rule components: (a) dis-
continue rules specific to how many items an examinee fails to answer correctly in a row
(e.g., the examiner stops the test if the examinee earns 0 points on 5 items in a row), and
(b) reverse rules (e.g., a procedure for reversing the sequence of previously completed
items administered if an examinee earns a low score such as 0 or 1 on certain items that
subject-matter experts have deemed that the examinee should earn maximum points).
Item: 3-6-7-11-13-17-18
Scoring rule: 1 point awarded for correct response, 0 points awarded for incorrect response. To earn 1 point, the series of numbers must be repeated in exact sequence.
Example attitude items about intelligence tests appear here in the original: one item rated on a 5-point scale and one rated on a 10-point scale anchored from 1 (never justified) to 10 (always justified).
The following list includes important considerations when writing items to measure
attitude:
collectively responding to an item. Taken together, the item analysis and examinee feed-
back are the two most useful activities that should occur during the pilot test.
In test construction, the goal is to produce a test or instrument that exhibits adequate evi-
dence of score reliability and validity relative to its intended uses. Several item and total
test statistics are derived to guide the selection of the final set of items that will comprise
the final version of the test or instrument. Key statistics that are derived in evaluating
test items specifically include item-level statistics (e.g., proportion correct, item valid-
ity, and discrimination) and total test score parameters such as mean proportion correct
and variance. Item analysis of attitudinal or personality instruments includes many but
not necessarily all of the indexes provided here. The decision about which item analysis
indexes are appropriate is dictated by the purpose of the test and how the scores will be
used. Table 6.6a illustrates item-level statistics for crystallized intelligence test 2 (measur-
ing lexical reasoning) based on 25 items scored on a 0 (incorrect) and 1 (correct) metric
for the total sample of N = 1000 examinees. Item analyses are presented next based on
the SPSS syntax below.
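* Item analysis and coefficient alpha for the 25 dichotomous items of crystallized intelligence test 2 (N = 1000).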
RELIABILITY
/VARIABLES=cri2_01 cri2_02 cri2_03 cri2_04 cri2_05 cri2_06
cri2_07 cri2_08 cri2_09 cri2_10 cri2_11 cri2_12 cri2_13
cri2_14 cri2_15 cri2_16 cri2_17 cri2_18 cri2_19 cri2_20 cri2_21
cri2_22 cri2_23 cri2_24 cri2_25
/SCALE('ALL VARIABLES') ALL
/MODEL=ALPHA
/STATISTICS=DESCRIPTIVE SCALE
/SUMMARY=TOTAL MEANS VARIANCE.
Table 6.6a. Item Statistics (mean, SD, and N for each item; values not reproduced here).
Table 6.6b. Summary Item Statistics.
Table 6.6c. Total Scale Statistics.
Table 6.6d. Reliability Statistics: Cronbach's alpha = .891; Cronbach's alpha based on standardized items = .878; N of items = 25.
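For readers working outside SPSS, coefficient alpha can also be computed directly from the examinee-by-item score matrix. The sketch below is a minimal Python illustration with hypothetical, simulated data (the array name responses and the simulation itself are ours, not from the book); it is not a reproduction of the SPSS output above.

import numpy as np

def cronbach_alpha(responses: np.ndarray) -> float:
    """Coefficient alpha for an examinee-by-item score matrix."""
    k = responses.shape[1]                         # number of items
    item_vars = responses.var(axis=0, ddof=1)      # variance of each item
    total_var = responses.sum(axis=1).var(ddof=1)  # variance of total scores
    return (k / (k - 1)) * (1 - item_vars.sum() / total_var)

# Simulated 0/1 responses for 1,000 examinees and 25 items, correlated
# through a common "ability" factor so that alpha is appreciably positive.
rng = np.random.default_rng(0)
ability = rng.normal(size=(1000, 1))
responses = (rng.normal(size=(1000, 25)) < ability).astype(int)
print(round(cronbach_alpha(responses), 3))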
Relationship between item proportion correct and the maximum possible discrimination index (D)
Proportion correct: 1.00, .90, .80, .70, .60, .50, .40, .30, .20, .10, .00
Maximum D:          .00, .20, .40, .60, .80, 1.00, .80, .60, .40, .20, .00
Note. Assumes that examinees have been divided into upper and lower criterion groups of 50% each. Adapted from Sax (1989, p. 235). Copyright 1989 by Wadsworth Publishing Company. Adapted by permission.
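The pattern in the tabled values can be summarized compactly. When examinees are split into upper and lower halves, the maximum attainable discrimination index for an item with proportion correct p is given by the expression below; this formula is inferred from the tabled values rather than stated explicitly in the original.

D_{\max} = 2\min(p,\, 1 - p)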
The table above illustrates the relationship between item proportion correct and discrimination. Ebel and Frisbie (1991, p. 232) provide guidelines (Table 6.8) for screening test items based on the D-index.
On objective test items such as multiple choice, guessing is a factor that must be
considered. To establish an optimal proportion correct value that accounts for guessing,
the following information is required: (1) the chance level score based on the number of
response alternatives and (2) the number of items comprising the test. Consider the sce-
nario where the test item format is multiple choice; there are four response alternatives,
and a perfect score on the test is 1.0 (i.e., 100% correct). Equation 6.1 provides a way to
establish the optimal proportion correct value for a test composed of 30 multiple-choice
items with four response alternatives.
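Based on the worked description in the next paragraph, Equation 6.1 can be sketched as follows; this is a reconstruction, with the chance-level and optimal proportions labeled here for readability.

p_{\text{optimal}} = p_{\text{chance}} + \frac{1.0 - p_{\text{chance}}}{2} = .25 + \frac{1.0 - .25}{2} = .25 + .375 = .625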
In Equation 6.1, the chance score for our multiple-choice items with four response
alternatives is derived as 1.0 (perfect score) divided by 4, resulting in .25 (i.e., a 25%
chance due to guessing). Taking one-half of the difference between a perfect score and the chance score, (1.0 - .25)/2, yields a value of .375. Next, adding the chance-level value (.25) to .375 yields .625, or roughly 63%. That is, the optimal average proportion correct for the items on this test is about .63, so items should be written such that roughly 63% of examinees answer each item correctly. This approach
is less than optimal because it does not account for the differential difficulty of the indi-
vidual items comprising the total test. A revised approach presented by Fred Lord (1952)
accounts for differential difficulty among test items. Table 6.9 provides a comparison of
Lord’s work to the results obtained using Equation 6.1.
Correlation-based indexes of item discrimination are used more often in test con-
struction than the D-index. Correlation-based indexes are useful for test items that are
constructed on at least an ordinal level of measurement (e.g., Likert-type or ordered
categorical response formats) or higher (e.g., interval-level scores such as IQ scores). Foun-
dational to the correlation-based item discrimination indexes is the Pearson correlation
coefficient that estimates the linear relationship between two variables. For item dis-
crimination indexes, the two variables that are correlated include the response scores to
individual items and the total test score.
The point–biserial correlation is used to estimate the relationship between a test item
scored 1 (correct) or 0 (incorrect) and the total test score. The formula for deriving the
point–biserial correlation is provided in Equation 6.2 (see also the Appendix).
r_{\mathrm{pbis}} = \frac{\bar{X}_S - \bar{X}_{\mu}}{S_Y}\sqrt{\frac{p}{q}} \qquad (6.2)
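In Equation 6.2, \bar{X}_S is plausibly the mean total score of examinees who answered the item correctly, \bar{X}_{\mu} the mean total score of all examinees, S_Y the standard deviation of total scores, and p and q the proportions answering correctly and incorrectly; these readings of the symbols are inferred from the standard point–biserial formula rather than stated here. Because the point–biserial is simply the Pearson correlation between a 0/1 item and the total score, the formula can be checked directly, as in the following Python sketch with hypothetical variable names and simulated data.

import numpy as np

def point_biserial(item: np.ndarray, total: np.ndarray) -> float:
    """Point-biserial correlation between a 0/1 item and the total score."""
    p = item.mean()                 # proportion answering correctly
    q = 1.0 - p
    mean_correct = total[item == 1].mean()
    mean_all = total.mean()
    s_total = total.std(ddof=0)     # population SD of the total (criterion) score
    return (mean_correct - mean_all) / s_total * np.sqrt(p / q)

# The same value is obtained from the ordinary Pearson correlation.
rng = np.random.default_rng(1)
total = rng.normal(50, 10, size=1000)
item = (total + rng.normal(0, 10, size=1000) > 50).astype(int)
print(round(point_biserial(item, total), 3),
      round(np.corrcoef(item, total)[0, 1], 3))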
Table 6.10. BILOG-MG Point–Biserial and Biserial Coefficients for the 25-Item
Crystallized Intelligence Test 2
Name  N  #right  PCT  LOGIT  Pearson r (pt.–biserial)  Biserial r
ITEM0001 1000 0.00 0.00 99.99 0.00 0.00
ITEM0002 1000 995.00 99.50 –5.29 0.02 0.11
ITEM0003 1000 988.00 98.80 –4.41 0.09 0.30
ITEM0004 1000 872.00 87.20 –1.92 0.31 0.49
ITEM0005 1000 812.00 81.20 –1.46 0.37 0.54
ITEM0006 1000 726.00 72.60 –0.97 0.54 0.72
ITEM0007 1000 720.00 72.00 –0.94 0.57 0.76
ITEM0008 1000 826.00 82.60 –1.56 0.31 0.45
ITEM0009 1000 668.00 66.80 –0.70 0.48 0.62
ITEM0010 1000 611.00 61.10 –0.45 0.52 0.67
ITEM0011 1000 581.00 58.10 –0.33 0.51 0.64
ITEM0012 1000 524.00 52.40 –0.10 0.55 0.69
ITEM0013 1000 522.00 52.20 –0.09 0.67 0.85
ITEM0014 1000 516.00 51.60 –0.06 0.62 0.77
ITEM0015 1000 524.00 52.40 –0.10 0.53 0.67
ITEM0016 1000 482.00 48.20 0.07 0.56 0.71
ITEM0017 1000 444.00 44.40 0.22 0.60 0.76
ITEM0018 1000 327.00 32.70 0.72 0.57 0.74
ITEM0019 1000 261.00 26.10 1.04 0.49 0.66
ITEM0020 1000 241.00 24.10 1.15 0.46 0.64
ITEM0021 1000 212.00 21.20 1.31 0.53 0.75
ITEM0022 1000 193.00 19.30 1.43 0.47 0.68
ITEM0023 1000 164.00 16.40 1.63 0.46 0.69
ITEM0024 1000 122.00 12.20 1.97 0.37 0.59
ITEM0025 1000 65.00 6.50 2.67 0.34 0.65
Note. No point–biserial/biserial coefficient is provided for item 1 because all examinees responded correctly to the
item. LOGIT, logistic scale score based on item response theory; PCT, percent correct.
The biserial correlation coefficient is used when both variables are on a continuous metric
and are normally distributed but one variable has been artificially reduced to two discrete
categories. For example, the situation may occur where a cutoff score or criterion is used to
separate or classify groups of people on an attribute (e.g., mastery or nonmastery). An unde-
sirable result that occurs when using the Pearson correlation on test scores that have been
dichotomized for purposes of classifying masters and nonmasters (e.g., when using a cutoff
score) is that the correlation estimates and associated standard errors are incorrect owing to
the truncated nature of the dichotomized variable. To address this problem, mathematical
corrections are made for the dichotomization of the one variable, thereby resulting in a correct
Pearson correlation coefficient. Equation 6.3 provides the formula for the biserial correlation.
r_{\mathrm{bis}} = \frac{\bar{X}_S - \bar{X}_{\mu}}{S_Y} \cdot \frac{pq}{z} \qquad (6.3)
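Here z is plausibly the ordinate (height) of the standard normal density at the point that divides the distribution into the proportions p and q; this reading is an assumption based on the usual biserial formula rather than a definition given in the original. A convenient and widely used equivalent in practice is the identity r_bis = r_pbis * sqrt(pq) / y, where y is that normal ordinate. The Python sketch below applies this identity with hypothetical function names.

import numpy as np
from scipy.stats import norm

def biserial_from_point_biserial(r_pbis: float, p: float) -> float:
    """Convert a point-biserial correlation to a biserial correlation.

    p is the proportion answering the item correctly; y is the ordinate of the
    standard normal density at the threshold separating proportions p and q.
    """
    q = 1.0 - p
    y = norm.pdf(norm.ppf(p))   # height of N(0, 1) at the p/q split
    return r_pbis * np.sqrt(p * q) / y

# Item 4 of crystallized intelligence test 2 (Table 6.10): r_pbis = .31, p = .872
print(round(biserial_from_point_biserial(0.31, 0.872), 3))
# close to the 0.49 reported in Table 6.10 (the inputs here are rounded)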
If a researcher is tasked with constructing a test for mastery decisions (as in the case of a
criterion-referenced test), the phi coefficient (see the Appendix) can be used to estimate
the discriminating power of an item. For example, each item score (0 or 1) can be cor-
related with the test outcome (mastery or nonmastery) using cross tabulation or contin-
gency table techniques, as shown in Figure 6.4.
To illustrate how the table in Figure 6.4 works, if the masters largely answer the item correctly (the value in cell A is large) and the nonmasters largely answer incorrectly (the value in cell D is large), the item discriminates well between the levels of achievement specified.
This interpretation is directly related to the false-positive (i.e., the resulting probability of
a test that classifies examinees in the mastery category when they are in fact a nonmaster)
and false-negative (i.e., resulting in the probability of the test classifying examinees in
the nonmastery category when they are in fact a master) outcomes. (See the Appendix
for more information on using contingency table analysis in the situation of making deci-
sions based on group classification.)
Figure 6.4. Item score by mastery decision.
                      Mastery decision
                    Mastery    Nonmastery
Item score   1         A           B
             0         C           D
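From the cell counts in Figure 6.4, the phi coefficient can be computed with the familiar 2 x 2 formula. The short Python sketch below uses hypothetical cell counts for illustration.

import math

def phi_coefficient(a: int, b: int, c: int, d: int) -> float:
    """Phi coefficient for the 2 x 2 table in Figure 6.4.

    a = correct and mastery, b = correct and nonmastery,
    c = incorrect and mastery, d = incorrect and nonmastery.
    """
    denom = math.sqrt((a + b) * (c + d) * (a + c) * (b + d))
    return (a * d - b * c) / denom

# A discriminating item: masters mostly correct, nonmasters mostly incorrect.
print(round(phi_coefficient(a=80, b=15, c=20, d=85), 2))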
One strategy designed to improve test score reliability is to select items for the final test
form based on the item reliability, while also simultaneously considering the item valid-
ity index. Item reliability and validity indexes are each a function of an item's correlation with a criterion score and of the variability (i.e., standard deviation) of the item's scores for a sample of
examinees. The item reliability index is a statistic designed to provide an indication of
a test’s internal consistency, as reflected at the level of an individual item. For example,
the higher the item reliability index, the higher the reliability of scores on the total test.
When using the item reliability index to evaluate individual test items, the total test score
(that the item under review is a part of) serves as the criterion.
To calculate the item reliability index, one needs two components—the propor-
tion correct for the item and the variance of the item. The variance of a dichotomously
scored test item is the proportion correct times the proportion incorrect (p_i q_i). Using these components, we can derive the item reliability index as √(p_i q_i) r_iX, where p_i is the proportion correct for an item, q_i is 1 minus the proportion correct for an item, and r_iX is the point–biserial correlation between an item and the total test score. Remember that taking the square root of the variance (p_i q_i) yields the standard deviation, so the item
reliability index is weighted by the variability of an item. This fact is helpful in item
analysis because the greater the variability of an item, the greater influence it will have
on increasing the reliability of test scores. To illustrate, consider item 4 on crystallized intelligence test 2 (measuring lexical knowledge, the actual usage of a word in the English language). Using the values in Table 6.10 and multiplying the standard deviation for item 4 (.39) by the item point–biserial correlation (.31) results in an item reliability index of .12 (see the underlined values in Table 6.11).
Alternatively, the item validity index is expressed as s_i r_iY, where s_i is the standard deviation of an item and r_iY is the correlation between an item and an external criterion
(e.g., an outcome measure on a test of ability, achievement, or short-term memory). The
item validity index is a statistic reflecting the degree to which a test measures what it pur-
ports to measure as reflected at the level of an individual item—in relation to an external
measure (criterion). In item analysis, the higher the item validity index, the higher the
criterion-related validity of scores on the total test. Returning to the crystallized intel-
ligence test 2 (lexical knowledge), consider the case where a researcher is interested in
refining the lexical knowledge subtest in a way that maximizes its criterion validity in
relation to the external criterion of short-term memory. Again, using item number 4 on
Table 6.11. Item Reliability and Validity Indexes for Crystallized Intelligence Test 2
and the Total Score for Short-Term Memory Tests 1–3
Columns: item, mean, SD, point–biserial with short-term memory tests 1–3, point–biserial with crystallized intelligence test 2, item reliability index (a), item validity index (b).
item 1 1.00 0.07 0.03 0.00 0.00 0.00
item 2 .99 0.11 0.10 0.02 0.00 0.01
item 3 .87 0.33 0.18 0.09 0.03 0.06
item 4 .81 0.39 0.28 0.31 0.12 0.11
item 5 .73 0.45 0.40 0.37 0.17 0.18
item 6 .72 0.45 0.39 0.54 0.24 0.17
item 7 .83 0.38 0.23 0.57 0.22 0.09
item 8 .67 0.47 0.19 0.31 0.15 0.09
item 9 .61 0.49 0.25 0.48 0.23 0.12
item 10 .58 0.49 0.25 0.52 0.26 0.12
item 11 .52 0.50 0.32 0.51 0.26 0.16
item 12 .52 0.50 0.39 0.55 0.28 0.19
item 13 .52 0.50 0.33 0.67 0.34 0.16
item 14 .52 0.50 0.30 0.62 0.31 0.15
item 15 .48 0.50 0.36 0.53 0.27 0.18
item 16 .44 0.50 0.27 0.56 0.28 0.13
item 17 .33 0.47 0.29 0.60 0.28 0.14
item 18 .26 0.44 0.36 0.57 0.25 0.16
item 19 .24 0.43 0.31 0.49 0.21 0.13
item 20 .21 0.41 0.21 0.46 0.19 0.09
item 21 .19 0.40 0.37 0.53 0.21 0.15
item 22 .16 0.37 0.23 0.47 0.17 0.09
item 23 .12 0.33 0.28 0.46 0.15 0.09
item 24 .07 0.25 0.21 0.37 0.09 0.05
item 25 .03 0.17 0.19 0.34 0.06 0.03
(a) Item reliability = the point–biserial correlation multiplied by the item standard deviation.
(b) Item validity = the point–biserial correlation, defined as the correlation between an item and the criterion score (the total score for short-term memory), multiplied by the item standard deviation.
the crystallized intelligence test 2, the item validity index is calculated by multiplying the
item standard deviation (.39) by the point–biserial correlation of the item with the short-
term memory total score (i.e., total score expressed as the sum of the three subtests). The
resulting item validity index is .11 (see the underlined values in Table 6.11).
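Because the two indexes are simple products, they are easy to reproduce from the quantities in Tables 6.10 and 6.11. The Python sketch below uses the rounded values for item 4 (proportion correct .81, point–biserial with the total test .31, point–biserial with the short-term memory criterion .28); the function names are ours, not the book's.

import math

def item_reliability_index(p: float, r_item_total: float) -> float:
    """sd_i * r_iX, where sd_i = sqrt(p * q) for a dichotomous item."""
    return math.sqrt(p * (1 - p)) * r_item_total

def item_validity_index(p: float, r_item_criterion: float) -> float:
    """sd_i * r_iY, with the criterion an external measure (here, short-term memory)."""
    return math.sqrt(p * (1 - p)) * r_item_criterion

# Item 4 of crystallized intelligence test 2 (Table 6.11): p = .81
print(round(item_reliability_index(0.81, 0.31), 2))   # 0.12
print(round(item_validity_index(0.81, 0.28), 2))      # 0.11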
Using the item reliability and validity indexes together is helpful in constructing a
test that meets a planned (e.g., in the test blueprint) minimum level of score variance
(and reliability), while also considering criterion-related validity of the test. An important
connection to note is that the total test score standard deviation equals the sum of the item reliability indexes (see Chapter 2 on the variance of a composite). To aid in test construction,
Figure 6.5 is useful because a researcher can inspect the items that exhibit optimal bal-
ance between item reliability and item validity. In this figure, the item reliability indexes
are plotted in relation to the item validity indexes.
For test development purposes, items farthest from the upper left-hand corner of the
graph in Figure 6.5 should be selected first for inclusion on the test. The remaining items
can be included, but their inclusion should be defensible based on the purpose of the test as
articulated in the test specifications (e.g., in consideration of content and construct validity).
The goal of the item analysis section of this chapter was to introduce the statistical
techniques commonly used to evaluate the psychometric contribution items make in
producing a test that exhibits adequate evidence of score reliability and validity relative
to its intended uses. To this end, several item and total test statistics were derived to guide
the selection of the final set of items that will comprise the final version of the test. Key
statistics derived in evaluating test items included item-level statistics such as the mean
and variance of individual items, proportion correct for items, item reliability, item valid-
ity, and item discrimination indexes.
Figure 6.5. Relationship between item validity and item reliability on crystallized intelligence
test 2.
Tests are sometimes used to classify or select persons on the basis of score performance
relative to a point along the score continuum. The point along the score continuum is
known as the cutoff score, whereas the practice of establishing the cutoff score is known
as standard setting. Establishing a single cutoff score results in the score distribution of
the examinees being divided into two categories. The practice of standard setting com-
bines judgment, psychometric considerations, and the practicality of applying cutoff
scores (AERA, APA, & NCME, 1999). Hambleton and Pitoniak (2006, p. 435) state that
“the word standards can be used in conjunction with (a) the content and skills candi-
dates are viewed as needing to attain, and (b) the scores they need to obtain in order to
demonstrate the relevant knowledge and skills.” Cizek and Bunch (2006) offer further
clarification by stating that
the practice of standards setting should occur early in the test development process so as to
(a) align with the purpose of the test, test items, task formats, (b) when there is opportunity
to identify relevant sources of evidence bearing on the validity of categorical classifications,
(c) when evidence can be systematically gathered and analyzed, and (d) when the standards
can meaningfully influence instruction, examinee preparation and broad understanding of
the criteria or levels of performance they represent. (p. 6)
For example, in educational achievement testing, students are required to attain a certain
level of mastery prior to matriculating to the next grade. In licensure or occupational test-
ing, examinees are required to meet a particular standard prior to a license or certification
being issued.
Central to all methods for establishing cutoff scores for determining a particular
level of mastery is the borderline examinee. A borderline examinee is defined as a hypo-
thetical examinee used by subject-matter experts as a reference point to make a judgment
regarding whether such an examinee of borderline ability or achievement would answer
the test item under review correctly. More recently, the establishment of the No Child
Left Behind program (NCLB, 2001) and the Individuals with Disabilities Education Act (IDEA,
1997) has resulted in multiple score performance categories (e.g., Basic, Proficient, and
Advanced). In this scenario, two cutoff scores are required to partition the score distribu-
tion into three performance categories.
An example of the impact standard setting has in relation to the decision-making
process is perhaps no more profound than when intelligence tests have been used as one
criterion in the decision to execute (or not) a person convicted of murder. For example,
in Atkins v. Virginia (2002), a person on death row was determined to have a full-scale
intelligence score of 59 (classifying him as mentally retarded) on the Wechsler Adult
Intelligence Scale—III (Wechsler, 1997b). Based on the person’s score, the Supreme Court
overturned the sentence by ruling that the execution of mentally retarded persons is
“cruel and unusual” and therefore prohibited by the Eighth Amendment to the United
States Constitution (Cizek & Bunch, 2006, p. 6).
Numerous schemes have been suggested regarding the classification of standard-setting meth-
ods. Standard-setting methods are classified as norm-referenced or criterion-referenced,
depending on the purpose and type of test being used. The norm-referenced approach is
a method of deriving meaning from test scores by evaluating an examinee’s test score and
comparing it to scores from a group of examinees (Cohen & Swerdlik, 2010, p. 656). For
example, a certification or licensure test may be administered on an annual or quarterly
basis, and the purpose of the test is to ensure that examinees meet a certain standard (i.e.,
a score level such as 80% correct) relative to one another. This approach is appropriate if
the examinee population is stable across time and the test meets the goals (i.e., is prop-
erly aligned with content-based standards of practice) for the certification or licensing
organization or entity.
Alternatively, criterion-referenced methods are absolute in nature because they focus
on deriving meaning from test scores by evaluating an examinee’s score with reference
to a set standard (Cohen & Swerdlik, 2010, p. 644). For example, the method is driven
according to the knowledge and skills an examinee must possess or exhibit in order
to pass a course of study. Central to the criterion-referenced method is the point that
adequate achievement by an examinee is based solely on the examinee and is in no way
relative to how other examinees perform.
Finally, the standards-referenced method, a modified version of criterion-
referenced method, has recently emerged in high-stakes educational achievement testing.
The standards-referenced method is primarily based on the criterion-referenced method
(e.g., examinees must possess a certain level of knowledge and skill prior to matriculating
to the next grade). Normative score information is also created by the testing organization
in charge of test development and scoring for (1) educational accountability purposes
(e.g., NCLB) and (2) statewide reporting of school district performance. The follow-
ing sections introduce four common approaches to establishing cutoff scores. Readers
seeking comprehensive information on the variety of approaches currently available for
specific testing scenarios should refer to Cizek and Bunch (2006) and Zieky, Perie, and
Livingston (2008).
The Nedelsky method (Nedelsky, 1954) is one of the first standard-setting methods intro-
duced, and was developed in an educational setting for setting cutoff scores on criterion-
referenced tests composed of multiple-choice items. However, because the method
focuses on absolute levels of performance, the Nedelsky method is also widely used in
setting standards in the area of certification and licensure testing. A useful aspect of the
method is that subject-matter experts (SMEs) must make judgments about the level of
severity of the incorrect response alternatives—in relation to how an examinee with border-
line passing ability will reason through the answer choices. A subject-matter expert partici-
pating in the standard-setting exercise is asked to examine a question and to eliminate the
wrong answers that an examinee of borderline passing ability would be able to recognize
as wrong. For example, the item below is an example of the type of item contained in a
test of crystallized intelligence. In this item, participants using the Nedelsky method are
asked to evaluate the degree of impact selecting a certain response alternative will have
relative to successfully finding their way back to safety.
If you lost your way in a dense forest in the afternoon, how might you
find your way back to a known area?
In the Nedelsky method, the subject-matter expert might decide that a borderline
examinee would be able to eliminate answer choices C and D because the options might
leave the stranded person lost indefinitely. Answer B is a reasonable option, but a path
may or may not be present, whereas option A, using the sun, is the best option—although
it is possible that the sun may not be shining. Establishing a cutoff score on a test for
an examinee based on a set of test items similar to the example above proceeds as fol-
lows. First, the probability of a correct response is calculated for each item on the test.
For example, the probability of a correct response by an examinee is 1 divided by the
number of remaining response alternatives—after the examinee has eliminated the wrong-
answer choices. So, in the example item above, the borderline examinee is able to elimi-
nate answer choices C and D, leaving the probability of a correct response as 1 divided
by 2 or 50%. After the probabilities for each test item are calculated, they are summed to
create an estimate of the cutoff score.
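A minimal sketch of the Nedelsky calculation follows. The input is, for each item, the number of response alternatives the borderline examinee could not eliminate; the values shown are hypothetical.

def nedelsky_cutoff(remaining_options: list[int]) -> float:
    """Sum of 1 / (options remaining after elimination) across items."""
    return sum(1.0 / k for k in remaining_options)

# Ten four-option items; the borderline examinee eliminates 0-3 options per item.
remaining = [2, 2, 3, 1, 4, 2, 2, 3, 2, 1]
print(round(nedelsky_cutoff(remaining), 2))  # expected number-correct cutoff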
The Nedelsky method has at least two drawbacks. First, if a borderline examinee can
eliminate all but two answer choices or perhaps all of the answer choices, then the prob-
ability of a correct response is either .5 or 1.0. No probabilities between .5 and 1.0 are
possible. Second, test item content can be substantially removed from what examinees are actually used to seeing in practice. For this reason, using actual item responses from a pilot test is very useful for grounding subject-matter experts in the procedure. The absence of such pilot data is a problem for any cutoff score technique that focuses only on test items, because such techniques give little consideration to practical reality.
In the Ebel method (Ebel & Frisbie, 1991), subject-matter experts classify test items into
groups based on each item’s difficulty (easy, medium, or hard) and relevance or impor-
tance (essential, important, acceptable, and questionable). Next, subject-matter experts
select the probability that a borderline examinee will respond to each item correctly.
The same probability is specified for all items within a given group. Cutoff scores are derived by taking each respective group of items (e.g., a group of 15 items) and multiplying the number of items in the group by the subject-matter expert's specified probability for that group. This step is repeated for each group of items, and the products are then summed across groups. To obtain the panel's cutoff score, the individual subject-matter experts' cutoff scores are averaged using the mean or, if desired, a trimmed mean. A disadvantage of the Ebel method is that a subject-matter expert makes only one probability judgment per group of items, regardless of the total number of test items, so differences among items within a group are not reflected in the cutoff. However,
a strength of the method is that subject-matter experts must consider the relevance and
difficulty of each test item.
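Stated as a calculation, each subject-matter expert's cutoff is the sum over item groups of (number of items in the group) times (judged probability for that group), and the panel cutoff is the mean of these values. The following sketch assumes hypothetical group sizes and probabilities.

def ebel_cutoff(groups: list[tuple[int, float]]) -> float:
    """Sum of (items in group) * (judged probability correct) over item groups."""
    return sum(n_items * prob for n_items, prob in groups)

# One rater's judgments: (number of items, probability a borderline examinee is correct)
rater_1 = [(15, 0.80), (10, 0.60), (5, 0.40)]
rater_2 = [(15, 0.75), (10, 0.65), (5, 0.35)]
panel_cutoff = (ebel_cutoff(rater_1) + ebel_cutoff(rater_2)) / 2
print(round(panel_cutoff, 1))  # expected number-correct cutoff on a 30-item test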
The Angoff method (Angoff, 1984) was introduced in the early 1980s and is the most
commonly used approach to standard setting, although it is used mainly for certifica-
tion and licensing tests. The Angoff method (and variations of it) is the most researched
standard-setting method (Mills & Melican, 1988). In this method, subject-matter experts
are asked to (1) review the test item content and (2) make judgments about the propor-
tion of examinees in a target population that would respond to a test item correctly. The
target population or examinee group of interest is considered to be minimally competent,
which means that they are perceived as being barely able to respond correctly to (or pass) a test item. This process is repeated for every item on the test. Finally, the sum of the
item scores represents the score for a minimally acceptable examinee. In a variation of
the Angoff method, for each test item, subject-matter experts are asked to state the prob-
ability that an acceptable number of persons (not just a single person) can be identified
as meeting the requisite qualifications as delineated by established standards for certi-
fication, licensure, or other type of credential. The probability is expressed as the pro-
portion of minimally acceptable examinees who respond correctly to each test item. In
Table 6.12. Modified Angoff Method with Eight Raters, Two Ratings Each
Item number
Rater 1 2 3 4 5 6 7 8 9 10 Mean SD
1a 100 90 100 100 90 80 80 80 70 70 86.00 11.14
1b 90 90 90 90 90 90 80 70 70 60 82.00 10.77
2a 100 100 100 90 90 90 80 80 80 70 88.00 9.80
2b 90 100 90 100 90 90 80 80 70 70 86.00 10.20
3a 90 100 90 90 90 80 90 70 80 80 86.00 8.00
3b 100 100 100 90 80 90 80 70 80 80 87.00 10.05
4a 100 90 100 90 90 80 80 80 80 70 86.00 9.17
4b 90 90 100 90 100 80 80 70 70 70 84.00 11.14
5a 90 100 90 100 100 90 90 80 80 80 90.00 7.75
5b 90 90 90 100 90 80 80 80 70 80 85.00 8.06
6a 100 100 100 90 90 80 90 80 80 70 88.00 9.80
6b 90 90 100 80 80 80 80 90 80 80 85.00 6.71
7a 90 90 90 90 90 80 80 70 70 70 82.00 8.72
7b 90 100 100 80 80 100 90 80 80 80 88.00 8.72
8a 90 90 80 90 90 80 80 70 70 70 81.00 8.31
8b 90 80 80 80 80 70 70 80 80 70 78.00 6.00
Mean(a) 95.00 95.00 93.75 92.50 91.25 82.50 83.75 76.25 76.25 72.50 85.88 8.25
Mean(b) 91.25 92.50 93.75 88.75 86.25 85.00 80.00 77.50 75.00 73.75 84.38 7.01
Note. Total number of items on crystallized intelligence test 2 is 25. Ratings are in 10-percentage point increments.
Totals in the shaded area represent rater average and standard deviation across 10 items.
preparing or training subject-matter experts to use the Angoff method, considerable time
is required to ensure that subject-matter experts thoroughly understand and can apply
the idea of a minimally acceptable examinee.
The modified Angoff method involves subject-matter experts contributing multiple
judgments over rounds or iterations of the exercise of assigning proportions of mini-
mally acceptable examinees. Table 6.12 provides an example of results based on the first
10 items on crystallized intelligence test 2.
To interpret Table 6.12, we can examine the rater averages across trials 1 and 2
(indexed as "a" and "b"). For example, using the trial 1 ratings, we observe a recommended average passing percentage of 85.88 across all raters, which corresponds to approximately 8.6 of the 10 items answered correctly. Finally, another adaptation of the Angoff method is available for standard setting based on constructed-response test items. Readers interested in this adaptation are encouraged to see Hambleton and Plake
(1995) for the methodological details.
The Angoff method proceeds by requesting subject-matter experts to assign a
probability to each item on a test that is expressed as the probability that a borderline
examinee will respond correctly to the item. If the test is composed of multiple-choice
items and a correct response yields 1 point, then the probability that an examinee will
respond correctly to an item is defined as the examinee’s expected score. By summing
the expected scores on all items on the test, one obtains the expected score for the
entire test. Using the probability correct for each item, one can find the expected score
for a borderline examinee on the total test. The subject-matter expert’s cutoff score is
determined by summing his or her judgments about the probability that a borderline
examinee will respond correctly to each item. The Angoff method is well established and
thoroughly researched. A disadvantage of the method is in not having actual pilot test
responses available to help subject-matter experts to become grounded in the practical
reality of examinees; as usual, judgments about examinee performance can be very dif-
ficult to estimate subjectively.
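Computationally, the Angoff cutoff is simply each subject-matter expert's summed item probabilities, averaged over experts. The short Python sketch below uses hypothetical ratings for a 10-item test.

def angoff_cutoff(ratings_by_rater: list[list[float]]) -> float:
    """Average, over raters, of each rater's summed item probabilities."""
    per_rater = [sum(ratings) for ratings in ratings_by_rater]
    return sum(per_rater) / len(per_rater)

# Each inner list holds one rater's probability that a borderline (minimally
# competent) examinee answers each of the 10 items correctly.
ratings = [
    [0.9, 0.9, 0.8, 0.8, 0.7, 0.7, 0.6, 0.6, 0.5, 0.5],
    [1.0, 0.9, 0.9, 0.8, 0.8, 0.7, 0.6, 0.5, 0.5, 0.4],
]
print(round(angoff_cutoff(ratings), 1))  # expected score for a borderline examinee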
The Angoff method is also applicable to constructed response items with a slight
modification. To illustrate, suppose a test item is of such a form that an examinee is
required to construct a response that is subsequently scored on a score range of 1–10
points. Next, subject-matter experts are asked to estimate the average score that a group
of borderline examinees would obtain on the item. Furthermore, the score can be a non-
integer (e.g., subject-matter experts may estimate that the average score for a group of
borderline examinees is 6.5 on a scale of 1 to 10). Another subject-matter expert might
estimate the average score to be 5.5. Deriving an estimate of the cutoff score proceeds by
first summing the cutoff scores of the individual subject-matter experts and then taking
the average of the group of subject-matter experts.
The bookmark method is used for test items scaled (scores) using item response theory
(IRT). (IRT is covered in detail in Chapter 7.) The protocol for establishing cutoff scores
using the bookmark method proceeds as follows. First, subject-matter experts are pro-
vided a booklet comprising test items that are ordered in difficulty from easy to hard. The
subject-matter expert’s task is to select the point in the progression of items where an
examinee is likely to respond correctly from a probabilistic standpoint. In the bookmark
method, the probability often used for the demarcation point where easy items shift to
hard items is .67. For example, the demarcation point establishes a set of easy items that
a borderline examinee answers correctly with a probability of at least .67. Conversely, the remaining "harder" group of items would each be answered correctly with a probability of less than .67. An advantage of IRT scoring is that item difficulty (expressed as a scale score) and
examinee ability (expressed as an ability scale score) are placed on a common scale. So,
once a bookmark point is selected, an examinee’s expected score at a cutoff point is easily
determined.
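Under a Rasch (one-parameter IRT) model, the mapping from a bookmark placement to a cutoff on the ability scale is direct: if the bookmark sits on an item with difficulty b and the response-probability criterion is .67, the cut score is the ability at which the probability of a correct response equals .67. The sketch below works under that assumed model with hypothetical difficulties; it is an illustration, not the full operational procedure.

import math

def bookmark_cut_theta(b_bookmark: float, rp: float = 0.67) -> float:
    """Ability at which P(correct) = rp for a Rasch item with difficulty b.

    Solves rp = 1 / (1 + exp(-(theta - b))) for theta.
    """
    return b_bookmark + math.log(rp / (1.0 - rp))

# Item difficulties ordered from easy to hard; the bookmark is placed on the 13th item.
difficulties = sorted([-2.1, -1.5, -1.2, -0.9, -0.6, -0.3, -0.1, 0.0,
                       0.2, 0.4, 0.6, 0.8, 1.0, 1.3, 1.7, 2.2])
print(round(bookmark_cut_theta(difficulties[12]), 2))  # cutoff on the theta (ability) scale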
An advantage of the bookmark method is that multiple cutoff scores can be set on the same set of test items (e.g., gradations of proficiency such as novice, proficient,
and advanced). Also, the method works for constructed response test items as well as for
multiple-choice items. Subject-matter experts often find that working with items ordered
by increasing difficulty makes their task more logical and manageable. Of course, all
of the test items must be scored and calibrated using IRT prior to establishing the cut-
off score. Therefore, a substantial pilot-testing phase of the items is necessary. Another
potential challenge of using this method is that subject-matter experts not familiar with
IRT will likely have difficulty understanding the relationship between the number of
items answered correctly and the cutoff score on the test. For example, if the bookmark is placed at item 19, one may think that the 18 preceding items must be answered correctly. However, the
relationship is different in IRT, where a transformation of item difficulty and person abil-
ity occurs and as a result the raw number correct cutoff score rarely matches the number
of questions preceding the bookmark.
Chapter 6 has reviewed several well-established methods for setting cutoff scores.
The information presented is a part of a basic overview of a body of work that is substan-
tial in breadth and depth. Readers desiring more information on setting cutoff scores and
standard setting more generally are encouraged to consult the book Standard Setting by
Cizek and Bunch (2006) and Cutscores: A Manual for Setting Standards of Performance on
Educational and Occupational Tests by Zieky et al. (2008).
This chapter presented three major areas of the test and instrument development pro-
cess: test construction, item analysis, and standard setting. The topic of test construction
includes establishing a set of guidelines that a researcher follows to sequentially guide
his or her work. Additionally, the information on test and instrument construction pro-
vided was aimed at guiding the effective production of tests and instruments that maxi-
mize differences between persons (i.e., interindividual differences). The second section
of this chapter provided details on various techniques used for item analysis with applied
examples. The utility of each item analysis technique was discussed. The third section
introduced the topic of standard setting and described the four approaches that have been
used extensively and that most closely align with the focus of this book.
Proportionally stratified sampling. Sampling in which subgroups within a defined population that differ on a characteristic relevant to the test developer's goal are sampled in proportion to their presence in the population. This helps account for characteristics that differ among population constituents, thereby preventing systematic bias in the resulting test scores.
Sampling. The selection of elements, following prescribed rules from a defined popula-
tion. In test development, the sample elements are the examinees or persons taking
the test (or responding to items on an instrument).
Score validity. A judgment regarding how well test scores measure what they purport to
measure. Score validity affects the appropriateness of the inferences made and any
actions taken.
Standard setting. The practice of establishing a cutoff score.
Standards-referenced method. A modified version of the criterion-referenced method
used in high-stakes educational achievement testing. The standards-referenced method
is primarily based on the criterion-referenced method (e.g., examinees must possess
a certain level of knowledge and skill prior to matriculating to the next grade).
Stratified random sampling. Every member in the stratum of interest (e.g., the demo-
graphic characteristics) has an equal chance of being selected (and is proportionally
represented) in the sampling process.
Subject-matter expert. A person who makes decisions about establishing a cutoff score
for a particular test within the context of a cutoff score study.
Table of specifications. A two-way grid used to outline the content coverage of a test.
Also known as a test blueprint.
Tetrachoric correlation. Useful in the test construction process when a researcher wants
to create artificial dichotomies from a variable (item) that is assumed to be normally
distributed (e.g., perhaps from a previously developed theory verified by empirical
research). This correlation has proven highly useful for factor analyzing a set of
dichotomously scored test items that are known to represent an underlying construct
normally distributed in the population.
7
Reliability
This chapter introduces reliability—a topic that is broad and has important implications
for any research endeavor. In this chapter, the classical true score model is introduced
providing the foundation for the conceptual and mathematical underpinnings of reliability.
After the foundations of reliability are presented, several approaches to the estimation of
reliability are provided. Throughout the chapter, theory is linked to practical application.
7.1 Introduction
Broadly speaking, the term reliability refers to the degree to which scores on tests or other
instruments are free of errors of measurement. The degree to which scores are free from
errors of measurement dictates their level of consistency or reliability. Reliability of mea-
surement is a fundamental issue in any research endeavor because some form of mea-
surement is used to acquire data. The process of data acquisition involves the issues of
measurement precision (or imprecision) and the manner by which it is reported in rela-
tion to test scores. As you will see, reliability estimation is directly related to measurement
precision or imprecision (i.e., error of measurement). Estimating the reliability of scores
according to the classical true score model involves certain assumptions about a person’s
observed, true, and error scores. This chapter introduces the topic of reliability in light of
the assumptions of the true score model, how it is conceptualized, requisite assumptions
about true and error scores, and how various coefficients of reliability are derived.
Two issues central to reliability are (1) the consistency or degree of similarity of
at least two scores on a set of test items and (2) the stability of at least two scores on a
set of test items over time. Different methods of estimating reliability are based on spe-
cific assumptions about true and error scores and, therefore, address different sources
of error. The assumptions explicitly made regarding true and error scores are integral to
correctly reporting and interpreting score reliability. Although the term reliability is used
in a general sense in many instances, reliability is clearly a property of scores rather than
measurement instruments or tests. It is the consistency or stability of scores that provides
evidence of reliability when using a test or instrument in a particular context or setting.
This chapter is organized as follows. First, a conceptual overview of reliability is
presented followed by an introduction to the classical true score model—a model that
serves as the foundation for classical test theory. Next, several methods commonly used
to estimate reliability are presented using the classical test theory approach. Specifically, we
present three approaches to estimating reliability: (1) the test–retest method for estimating
the stability of scores over time, (2) the internal consistency method based on the model of
randomly parallel tests, and (3) the split-half method, which is also related to the model of parallel tests. A subset of the dataset introduced in Chapter 2 that includes three components
of the theory of generalized intelligence—fluid (Gf ), crystallized (Gc), and short-term
memory (Gsm)—is used throughout the chapter in most examples. As a reminder, the
dataset used throughout this chapter includes a randomly generated set of item responses
based on a sample size N = 1,000 persons. For convenience, the data file is available in
SPSS (GfGc.sav), SAS (GfGc.sd7), or delimited file (GfGc.dat) formats and is download-
able from the companion website (www.guilford.com/price2-materials).
Table 7.1. General and Specific Origins of Test Score Variance Attributable
to Persons
General: Enduring traits or attributes
1. Skill in an area tested such as reading, mathematics, science
2. Test-taking ability such as careful attention to and comprehension of instructions
3. Ability to respond to topics or tasks presented in the items on the test
4. Self-confidence manifested as positive attitude toward testing as a way to measure ability, achieve-
ment, or performance
excessively high or low. In the physical sciences, consider the process of measuring the
precise amount of heat required to produce a chemical reaction. Such a reaction may be
affected systematically by an improperly calibrated thermometer being used to measure
the temperature—resulting in a systematic shift in temperature by the amount or degree
of calibration error. In the case of research conducted with human subjects, systematic
error may occur owing to characteristics of the person, the test, or both. For example,
in some situations persons’ test scores may vary in a systematic way that yields a consis-
tently lower or higher score over repeated test administrations. With regard to the crys-
tallized intelligence dataset used in the examples throughout this book, suppose that all
of the subtests on the total test were developed for a native English-speaking population.
X_i = T_i + E_i \qquad (7.1)
Although Equation 7.1 makes intuitive sense and has proven remarkably useful his-
torically, six assumptions are necessary in order for the equation to become practical
for use. Before introducing the assumptions of the true score model, some connections
between probability theory, true scores, and random variables are reviewed in the next
section (see the Appendix for comprehensive information on probability theory and ran-
dom variables).
Random variables are associated with a set of probabilities (see the Appendix). In the true
score model, test scores are random variables and, therefore, can take on a hypothetical
set of outcomes. The set of outcomes is expressed as a probability (i.e., expressed as a fre-
quency) distribution as illustrated in Table 7.2. For example, when a person takes a test,
the score he or she receives is considered a random variable (expressed in uppercase let-
ter X in Equation 7.1). The one time or single occasion a person takes the test, he or she
receives a score, and this score is one sample from a hypothetical distribution of possible out-
comes. Table 7.2 illustrates probability distributions based on a hypothetical set of scores
for three people. In the distribution of scores in Table 7.2, we assume that the same per-
son has taken the same test repeatedly and that each testing occasion is an independent
event. The result is a distribution of scores for each person with an associated probability.
The probabilities expressed in Table 7.2 are synonymous with the relative frequency for
a score based on the repeated testing occasions. The implication of Table 7.2 for the true
score model (classical test theory) is that the mean (or expectation) of a person's hypothetical observed score distribution, based on an infinitely repeated number of independent trials, represents his or her true score.
To clarify the role of the person-specific probability distribution, consider the follow-
ing example in Table 7.2. Multiplying each possible raw score (expressed as a random variable) by the probability of obtaining that score, and summing the products, demonstrates that person C appears to possess the highest level of
crystallized intelligence for the 25-item test. Furthermore, by Equation 7.6, person C’s
true score is 14.02. Notice that for person C the probability (i.e., expressed as the rela-
tive frequency) of scoring a 15 is .40—higher than the other two persons. Person A has
a probability of .40 scoring a 13. Person B has a probability of .45 scoring an 8. Clearly,
person C’s probability distribution is weighted more heavily toward the high end of the
score scale than person A or B.
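A minimal Python sketch of this idea follows; the probabilities below are hypothetical and are not the actual Table 7.2 entries, but the computation (the probability-weighted mean of the possible scores) is the same one that yields person C's true score of 14.02 in the text.

# Hypothetical person-specific probability distribution over possible test scores
# (illustration only; these are not the actual Table 7.2 values).
score_probs = {12: 0.10, 13: 0.20, 14: 0.25, 15: 0.40, 16: 0.05}

# The true score is the expectation (probability-weighted mean) of the distribution.
true_score = sum(score * p for score, p in score_probs.items())

print(sum(score_probs.values()))   # 1.0 (probabilities sum to one)
print(true_score)                  # 14.1 for this hypothetical distribution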
Although a person’s true score is an essential component of the true score model,
true score is only a hypothetical entity owing to the implausibility of conducting an infi-
nite number of independent testing occasions. True score is expressed as the expectation
of a person’s observed score over repeated independent testing occasions. Therefore, the
score for each person taking the test represents a different random variable regarding his or
her person-specific probability distribution (e.g., Table 7.2). The result is that such per-
sons have their own probability distribution—one that is specific to their hypothetical
distribution of observed scores (i.e., each person has an associated score frequency or
probability given their score on a test). In actual testing situations, the interest is usually
in studying individual differences among people (i.e., measurements over people rather
than on a single person). The true score model can be extended to accommodate the
study of individual differences by administering a test to a random sample of persons
from a population. Ideally, this process could be repeated an infinite number of times
(under standardized testing conditions), resulting in an observed score random variable
taking on specific values of score X. In the context described here, the error variance
over persons can be shown to be equal to the average, over persons (group-level), of
the error variance within persons (hypothetical repeated testing occasions for a single
person; Lord & Novick, 1968, p. 35). Formally, this is illustrated in Equation 7.5 in the
next section.
In the Appendix, equations for the expectation (i.e., the mean) of continuous and
discrete random variables are introduced along with examples. In the true score model,
total test scores for persons are called composite scores. Formally, such composite scores
are defined as the sum of responses (response to an item as a discrete number) to individ-
ual items. At this point, readers are encouraged to review the relevant parts of Chapter 2
and the Appendix before proceeding through this chapter; this will reinforce key founda-
tional information essential to understanding the true score model and reliability estima-
tion. Next, we turn to a presentation of the assumptions of the true score model.
In the true score model, the human traits or attributes being measured are assumed to
remain constant regardless of the number of times they are measured. Imagine for a
moment that a single person is tested an infinite number of times repeatedly. For exam-
ple, say Equation 7.1 is repeated infinitely for one person and the person’s true state of
knowledge about the construct remains unchanged (i.e., is constant). This scenario is
illustrated in Figure 7.1.
Table 7.3 illustrates observed, true, and error scores for 10 individuals. Given this
scenario, the person’s observed score would fluctuate owing to random measurement
error. The hypothetical trait or attribute that remains constant and that observed score
fluctuates about is represented as a person’s true score or T. Because of random error
during the measurement process, a person’s observed score X fluctuates over repeated
trials or measurement occasions. The result of random error is that differences between
a person’s observed score and true score will fluctuate in a way that some are positive
Figure 7.1. True score for a person: observed scores x_i and error scores e_i over repeated testing occasions, with the errors distributed about µ_error = 0. Adapted from Magnusson (1967, p. 63). Copyright 1967. Reprinted by permission of Pearson Education, Inc., New York, New York.
Table 7.3. Crystallized Intelligence Test Observed, True, and Error Scores
for 10 Persons
Person (i) Observed score (X) True score (T) Error score (E)
A 12.00 = 13.00 + –1.00
B 14.50 = 12.00 + 2.50
C 9.50 = 11.00 + –1.50
D 8.50 = 10.00 + –1.50
E 11.50 = 9.00 + 2.50
F 7.00 = 8.00 + –1.00
G 17.00 = 17.25 + –0.25
H 17.00 = 16.75 + 0.25
I 10.00 = 9.00 + 1.00
J 8.00 = 9.00 + –1.00
Mean 11.50 11.50 0.00
Standard deviation 3.43 3.11 1.45
Variance 11.75 9.66 2.11
Sum of cross products 96.50
Covariance 9.65
Note. Correlation of observed scores with true scores = .91. Correlation of observed scores with error scores = .42.
Correlation of true scores with error scores = 0. True score values are arbitrarily assigned for purposes of illustration.
Variance is population formula and is calculated using N. Partial credit is possible on test items. Covariance is the
average of the cross products of observed and true deviation scores.
and some are negative. Over an infinite number of testing occasions, the positive and nega-
tive errors cancel in a symmetric fashion, yielding an observed score equaling true score for a
person (see Equations 7.5 and 7.6).
Notice that in Table 7.4, all of the components are in place to evaluate the reliability
of scores based on errors of measurement.
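The quantities reported in Table 7.3 and its note can be reproduced directly from the observed and true scores. The short Python sketch below uses the table's values and the population (divide-by-N) formulas noted there; it is illustrative only and not part of the book's SPSS/SAS examples.

import numpy as np

# Observed (X) and true (T) scores from Table 7.3; error scores are E = X - T.
X = np.array([12.0, 14.5, 9.5, 8.5, 11.5, 7.0, 17.0, 17.0, 10.0, 8.0])
T = np.array([13.0, 12.0, 11.0, 10.0, 9.0, 8.0, 17.25, 16.75, 9.0, 9.0])
E = X - T

print("means:", X.mean(), T.mean(), E.mean())                  # 11.5, 11.5, 0.0
print("variances:", X.var(), T.var(), E.var())                 # ~11.75, 9.66, 2.11
print("cov(X, T):", ((X - X.mean()) * (T - T.mean())).mean())  # ~9.65
print("r(X, T):", np.corrcoef(X, T)[0, 1])                     # ~.91 (reliability index)
print("r(X, E):", np.corrcoef(X, E)[0, 1])                     # ~.42
print("r(T, E):", np.corrcoef(T, E)[0, 1])                     # ~0
print("reliability:", np.corrcoef(X, T)[0, 1] ** 2)            # ~.82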
In the situation where score changes or shifts occur systematically, the difference
between observed and true scores will be either systematically higher or lower by the fac-
tor of some constant value. For example, all test takers may score consistently lower on a
test because the examinees are non-English speakers, yet the test items were written and/
or developed for native English-speaking persons. Technically, such systematic influences
on test scores are not classified as error in the true score model (only random error is assumed
by the model). The error of measurement for a person in the true score model is illustrated
in Equation 7.2. Alternatively, in Figure 7.2, the relationship between observed and true
scores is expressed as the regression of true score on observed score (e.g., the correlation
between true and observed scores is .91, and .91² = .82, the reliability coefficient).

Equation 7.2. Error of measurement for a person:

E_i = X_i − T_i

Note (Table 7.4). r_TE = 0.0; r_OE = .42; r_OT = .91; r_XX′ = .82 (the reliability coefficient, expressed as the square of r_OT = .91). The correlation between true and error scores is actually .003 in the example.
Next, in Equation 7.3, the mean of the distribution of error is expressed as the
expected difference between the observed score and true score for a person over infinitely
repeated testing occasions (e.g., as in Table 7.3).
Because the expectation of X equals T in the true score model (i.e., the mean of a person's
observed score distribution over infinite occasions equals his or her true score), the
mean error over repeated testing occasions is also zero (Table 7.3; Figure 7.1; Equation 7.4;
Lord & Novick, 1968, p. 36; Crocker & Algina, 1986, p. 111).

Figure 7.2. Regression line and scatterplot of true scores (vertical axis) and observed scores (horizontal axis) for the data in Table 7.3.

Equation 7.3. Mean error as the expected difference between observed and true scores for a person over infinitely repeated testing occasions:

μ_Ei = ε(E_i) = ε(X_i − T_i)

Equation 7.4. Expected error over repeated trials:

ε(E_i) = 0

• ε = expectation operator.
• ε(E_i) = expected value of the random error score E_i over an indefinite number of repeated trials.

Also, since the error com-
ponent is random, then from classical probability theory (e.g., Rudas, 2008), the mean
error over repeated trials equals zero (Figure 7.1). Accordingly, the first assumption in
the true score model is that the mean error of measurement over repeated trials or test-
ing occasions equals zero (Equation 7.4). The preceding statement is true for (a) an infinite
number of persons taking the same test—regardless of their true score and (b) for a single
person’s error scores on an infinite number of parallel repeated testing occasions.
Assumption 1: The expectation (population mean) error for person i over an infinite
number of trials or testing occasions on the same test is zero.
Equation 7.5. Expected error over persons:

μ_E = ε_J[ε_X(E_XJ)]

AND

μ_E = ε_J(0) = 0
A main caveat regarding Equation 7.5 is that for a random sample of persons from a
population, the average error may not actually be zero. The discrepancy between true score
theory and applied testing settings may be due to sampling error or other sources of error.
Also, in the true score model, one is hypothetically drawing a random sample of error
scores from each person in the sample of examinees. The expected value or population
mean of these errors may or may not be realized as zero.
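A brief simulation sketch in Python (the error standard deviation of 1.5 is hypothetical) illustrates both Assumption 1 and the caveat just described about finite samples.

import numpy as np

rng = np.random.default_rng(7)

true_score = 14.0                 # one person's fixed true score (hypothetical)
n_trials = 100_000                # stand-in for "infinitely repeated" testing occasions

errors = rng.normal(0.0, 1.5, size=n_trials)     # random measurement errors
observed = true_score + errors                   # X = T + E on each occasion

print(round(errors.mean(), 4))     # close to 0 (Assumption 1)
print(round(observed.mean(), 4))   # close to the true score

# For a finite random sample of persons (one error score each), the average
# error is usually near zero but rarely exactly zero.
print(round(rng.normal(0.0, 1.5, size=25).mean(), 4))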
Assumption 2: True score for person i is equal to the expectation (mean) of their
observed scores over infinite repeated trials or testing occasions (Equation 7.6;
Table 7.2).
Equation 7.6. True score as the expected value of a person's observed scores:

T_i = ε(X_i) = μ_Xi
The fact that a person’s true score remains constant, yet unknown, over repeated
testing occasions makes using Equation 7.1 for the estimation of reliability with empiri-
cal data intractable because without knowing a person’s true score, deriving errors of
measurement is impossible. To overcome the fact that a person's true score cannot be known,
the items comprising a test are viewed as different parallel parts of the test, enabling estimation of
the reliability coefficient. Given that items serve as parallel components on a test, reli-
ability estimation proceeds in one of two ways. First, the estimation of reliability can
proceed by evaluating the internal consistency of scores by using a sample of persons
tested once, with test items serving as component pieces (each item being a “micro test”)
within the overall composite or total test score. Second, the estimation of reliability can
proceed by deriving the stability of scores as the correlation coefficient for a sample of
persons tested twice with the same instrument or on a parallel form of a test. Later in this
chapter, several methods for estimating the reliability of scores are presented based on the
true score model—all of which are based on the assumption of parallel tests.
Assumption 3: In the true score model, the correlation between true and error
scores on a test in a population of persons equals zero (Equation 7.8; Table 7.4;
Figure 7.3).
Equation 7.8. Correlation between true and error scores in the true
score model
ρ_TE = 0
A consequence of the absence of correlation between true and error scores (Assump-
tion 3, Equation 7.8) is that deriving the observed score variance is accomplished by
summing true score variance and error variance (as linear components in Equation 7.9).
This assumption implies that persons with low or high true scores do not exhibit system-
atically high or low errors of measurement because errors are randomly distributed (as in
Figure 7.3). To illustrate the relationships between true and error scores, we return to the
data in Table 7.3. In Table 7.4, we see that the correlation between true and error scores is
zero (readers should calculate this for themselves by entering the data into SPSS or Excel
and conducting a correlation analysis).

Figure 7.3. Correlation of true scores with error scores for the data in Table 7.3 (true score on the vertical axis).

Next, because true score and error scores are
uncorrelated, observed score variance is simply the sum of true and error score variance.
To verify this statement, return to Table 7.3 and add the variance of true scores (9.66) to
the variance of error scores (2.11) and you will see that the result is 11.75—the observed
score variance. Formally, the additive, linear nature of observed score variance in the true
score model is illustrated in Equation 7.9.
Equation 7.9. Observed score variance as the sum of true score variance and error variance:

σ²_X = σ²_T + σ²_E

Assumption 4: Errors of measurement on two parallel tests are uncorrelated (Equation 7.10).

Equation 7.10. Correlation between error scores on two parallel tests:

ρ_E1E2 = 0

• ρ_E1E2 = population correlation between random errors of measurement for test 1 and parallel test 2.
Intuitively, Assumption 4 should be clear to readers at this point based on the pre-
sentation thus far regarding the nature of random variables as having no relationship (in
this case zero correlation between errors of measurement on two parallel tests).
Assumption 5: Error scores on one test are uncorrelated with true scores on another
test (Equation 7.11). For example, the error component on one intelligence test is not
correlated with true score on a second, different test of intelligence.
Equation 7.11. Correlation between error scores on one test and true scores on a second test:

ρ_E1T2 = 0

• ρ_E1T2 = population correlation between the error score on test 1 and the true score on test 2.
Assumption 6: Two tests are exactly parallel if, for every population, their true
scores and error scores are equal (Lord & Novick, 1968; Equation 7.12). Further, all
items on a test are assumed to measure a single construct. This assumption of measur-
ing a single construct is called unidimensionality and is covered in greater detail in
Chapters 8 and 9 on factor analysis and item response theory. If two tests meet the
assumptions of parallelism, they should be correlated with other external or criterion-
related test scores that are parallel based on the content of the test. The parallel tests
assumption is difficult to meet in practical testing situations because in order for the
assumption to be tenable, the testing conditions that contribute to error variability pre-
sented in Table 7.1 (e.g., fatigue, environment, etc.) must vary in the same manner in
each of the testing scenarios. Also, part of Assumption 6 is that every population of
persons will exhibit equal observed score means and equal observed score variances on
parallel tests.
Equation 7.12. Definition of parallel tests:

X1 = T + E1
X2 = T + E2
σ²_E1 = σ²_E2
As previously stated, the model of parallel tests is important because it allows the
true score model to become functional with empirical data. In fact, without the model
of parallel tests, the true score model would be only theoretical because true scores are
not actually measurable. Also, without knowing true scores, calculation of error scores
would not be possible, making the model ineffective in empirical settings. To illustrate the
importance of the model of parallel tests relative to its role in estimating the coefficient
of reliability, consider Equations 7.13 and 7.14 (Crocker & Algina, 1986, pp. 115–116).
Equation 7.13. Correlation between scores on two parallel tests:

ρ_X1X2 = Σx1x2 / (N σ_X1 σ_X2)

• ρ_X1X2 = correlation between scores on two parallel tests.
• x1 = observed deviation score on test 1.
• x2 = observed deviation score on test 2.
• σ_X1 = observed score standard deviation on test 1.
• σ_X2 = observed score standard deviation on test 2.
• N = sample size.

Equation 7.14. Correlation between parallel tests expressed as the ratio of true score variance to observed score variance:

ρ_X1X2 = Σ(t1 + e1)(t2 + e2) / (N σ_X1 σ_X2) = σ²_T / σ²_X

• ρ_X1X2 = coefficient of reliability expressed as the correlation between parallel tests.
• t1 = true score on test 1 in deviation score form.
• t2 = true score on test 2 in deviation score form.
• x1 = observed score on test 1 in deviation score form.
• x2 = observed score on test 2 in deviation score form.
• σ_X1 = observed score standard deviation on test 1.
• σ_X2 = observed score standard deviation on test 2.
• N = sample size.
• σ²_T / σ²_X = the coefficient of reliability expressed as the ratio of true score variance to observed score variance.
The first two lines in Equation 7.12 can be substituted into the numerator of
Equation 7.13 yielding an expanded numerator in Equation 7.14. Notice in Equation
7.14 that x, t, and e are now lowercase letters in the numerator. The lowercase let-
ters represent deviation scores (as opposed to raw test scores). A deviation score is
defined as follows: x = X − X̄; t = T − T̄; e = E − Ē; that is, raw scores are subtracted from their respective means.
The final bullet point in Equation 7.14, the coefficient of reliability expressed as the
ratio of true score variance to observed score variance, is the most common definition of
reliability in the true score model.
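A simulation sketch in Python (the true score and error standard deviations are hypothetical) shows that the correlation between two parallel forms recovers the ratio of true to observed score variance, as in Equation 7.14.

import numpy as np

rng = np.random.default_rng(42)
n_persons = 50_000

sigma_T, sigma_E = 3.0, 1.5        # hypothetical true-score and error standard deviations
T = rng.normal(100.0, sigma_T, size=n_persons)
X1 = T + rng.normal(0.0, sigma_E, size=n_persons)   # parallel form 1
X2 = T + rng.normal(0.0, sigma_E, size=n_persons)   # parallel form 2

print(round(np.corrcoef(X1, X2)[0, 1], 3))                     # ~0.80
print(round(sigma_T ** 2 / (sigma_T ** 2 + sigma_E ** 2), 3))  # 0.80 = true/observed variance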
Returning to the example data in Table 7.3, notice that the assumption of exactly parallel
tests is not met because, although the true and observed score means are equivalent, their
standard deviations (and therefore variances) are different. This variation on the model
of parallel tests is called tau-equivalence, meaning that only the true (i.e., tau) scores are
equal (Lord & Novick, 1968, pp. 47–50). Essential tau-equivalence (Lord & Novick,
1968, pp. 47–50) is expressed by further relaxing the assumptions of tau-equivalence,
thereby allowing true scores to differ by an additive constant (Lord & Novick, 1968;
Miller, 1995). Including an additive constant in no way affects score reliability since
the reliability coefficient is estimated using the covariance components of scores and is
expressed in terms of the ratio of true to observed score variance (or as the amount of
variance explained as depicted in Figure 7.1).
Finally, the assumption of congeneric tests (Lord & Novick, 1968, pp. 47–50;
Raykov, 1997, 1998) is the least restrictive variation on the model of parallel tests because
the only requirement is that true scores be perfectly correlated on tests that are designed
to measure the same construct. The congeneric model also allows for either an additive
and/or a multiplicative constant between each pair of item-level true scores so that the
model is appropriate for estimating reliability in datasets with unequal means and vari-
ances. Table 7.5 summarizes variations on the assumptions of parallel tests within the
classical true score model.
To illustrate the relationship among observed, true, and error scores, we return to using
deviation scores based on a group of persons—a metric that is convenient for deriving
the covariance (i.e., the unstandardized correlation presented in Chapter 2) among these
score components. Recall that in Equation 7.1 the definition of observed score is the
sum of the true score and error score. Alternatively, Equation 7.15 illustrates the same
Equation 7.15. Observed deviation score as the sum of true and error deviation scores:

x = t + e

• x = observed score on a test derived as a raw score minus the mean of the group of observed scores.
• t = true score on a test derived as a true score minus the mean of the group of true scores.
• e = error score derived as an error score minus the mean of the group of error scores.
elements in Equation 7.1 as deviation scores. In the previous section, a deviation score
was defined as X - X ; T - T; E - E ; where raw scores are subtracted from their respective
means. An advantage of working through calculations in deviation score units is that the
derivation includes the standard deviations of observed, true, and error scores—elements
required for deriving the covariance among the score components. The covariance is
expressed as the sum of the cross products of observed and true deviation scores divided by the
sample size (N). For the data in Table 7.3, the covariance is 9.65: COV_OT = [Σ(X_O − X̄_O)(X_T − X̄_T)]/N
(as an exercise, you should use the data in Table 7.3 and apply it to the equation in
this sentence to derive the covariance between true and observed scores). Notice that in
Equation 7.14 the covariance is incorporated into the derivation of the reliability index by
including the standard deviations of observed and true scores in the denominator.
Next, recall that the true score model is based on a linear equation that yields a com-
posite score for a person. By extension and analogy, a composite score is also expressed as
the sum of the responses to individual test items (e.g., each test item is a micro-level test).
Working with the covariance components of total or composite scores (e.g., observed,
true, and error components) provides a unified or connecting framework for illustrating
how the true score model works regarding the estimation of reliability with individual and
group-level scores in the true score model and classical test theory.
The reliability index (Equation 7.16; Crocker & Algina, 1986, pp. 114–115; Kelley,
1927; Lord & Novick, 1968) is defined as the correlation between observed scores
and true scores. From the example data in Table 7.4 we see that this value is .91.
The square of the reliability index (.91) is .82—the coefficient of reliability (see
Table 7.4). Equation 7.16 illustrates the calculation of the reliability index working
with deviation scores. Readers can insert the score data from Table 7.3 into Equation
7.16, then work through the steps and compare the results reported in Table 7.4 pre-
sented earlier.
The observed score variance σ²_X can be expressed as the sum of the true score variance σ²_T
plus the error variance σ²_E. Computing the observed score variance as a linear sum of separate,
independent components is possible because true scores are uncorrelated with error scores.
Next, using the true score and error variance components, the coefficient of reliability can be
conceptually expressed in Equation 7.17 as the ratio of true score variance to observed score
variance.
Returning to the data in Table 7.3, we can insert the variance components from the
table in Equation 7.17 to calculate the reliability coefficient. For example, the true score
variance (9.66) divided by the observed score variance (11.75) equals .82, the coefficient
of reliability (Table 7.4). The type of reliability estimation just mentioned uses the vari-
ance to express the proportion of variability in observed scores explained by true scores.
To illustrate, notice that the correlation between true scores and observed scores in Table 7.4
is .91. Next, if we square .91, a value of .82 results, or the reliability coefficient. In linear
regression terms, the reliability (.82) is expressed as the proportion of variance in true
scores explained by variance in observed scores (see Figure 7.2).
Equation 7.16. The reliability index as the correlation between observed and true scores (in deviation score form):

ρ_XT = Σ(t + e)t / (N σ_X σ_T)
     = (Σt² + Σte) / (N σ_X σ_T)
     = Σt² / (N σ_X σ_T) + Σte / (N σ_X σ_T)

Because σ²_T = Σt²/N, and Σte/N = 0 (true and error scores are uncorrelated), then

ρ_XT = σ²_T / (σ_X σ_T), simplifying to

ρ_XT = σ_T / σ_X

Finally,

ρ²_XT = the index of reliability squared, which is the coefficient of reliability.

Equation 7.17. The coefficient of reliability as the ratio of true score variance to observed score variance:

ρ²_XT = σ²_T / σ²_X = σ²_T / (σ²_T + σ²_E)
Equation 7.17 illustrates that the squared correlation between true and observed scores is the
coefficient of reliability. Yet another way to think of reliability is in terms of the lack of error
variance, expressed as 1 − (σ²_E/σ²_O). Referring to the data in Table 7.3, this value is
1 − .18 = .82, or the coefficient of reliability. Finally, reliability may be described as the lack of
correlation between observed and error scores, or 1 − ρ²_OE, which, based on the data in Table
7.3, is .82, or the coefficient of reliability.
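The three equivalent expressions just described can be verified directly from the Table 7.3 data; a brief Python sketch (illustrative only) follows.

import numpy as np

X = np.array([12.0, 14.5, 9.5, 8.5, 11.5, 7.0, 17.0, 17.0, 10.0, 8.0])    # observed (Table 7.3)
T = np.array([13.0, 12.0, 11.0, 10.0, 9.0, 8.0, 17.25, 16.75, 9.0, 9.0])  # true (Table 7.3)
E = X - T

rel_ratio = T.var() / X.var()                     # true / observed variance
rel_error = 1 - E.var() / X.var()                 # 1 - (error / observed variance)
rel_corr = 1 - np.corrcoef(X, E)[0, 1] ** 2       # 1 - squared corr(observed, error)

print(round(rel_ratio, 2), round(rel_error, 2), round(rel_corr, 2))   # each ~0.82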
Earlier in this chapter it was stated that individual items on a test can be viewed as par-
allel components of a test. This idea is essential to understanding how reliability coeffi-
cients are estimated within the model of parallel tests in the true score model. Specifically,
test items serve as individual, yet parallel, parts of a test providing a way to estimate the
coefficient of reliability from a single test administration. Recall that a score on an indi-
vidual item is defined by a point value assigned based on a person’s response to an item
(e.g., 0 for incorrect or 1 for correct). In this sense, an item is a “micro-level” testing unit,
and an item score is analogous to a “micro-level test.” The variance of each item can be
summed to yield a total variance for all items comprising a test. Equations 7.18a and
7.18b illustrate how the variance and covariance of individual test items can be used to
derive the total variance of a test.
Based on Equation 7.18a, we see that total test variance for a composite is deter-
mined by the variance and covariance of a set of items. In Table 7.6, the total variance
is the sum of the variances for each item (1.53), plus 2 times the sum of the individual
covariance values (1.08), equaling a total test variance of 2.61.
Equation 7.18a. Total test (composite) variance from item variances and covariances:

σ²_TEST = Σσ²_I + 2Σ ρ_IK σ_I σ_K,  I > K

Equation 7.18b. Application to the data in Table 7.6:

σ²_TEST = 1.53 + 1.08
        = 2.61
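Because the individual Table 7.6 item values are not reproduced here, the short Python sketch below uses a hypothetical 3 × 3 item covariance matrix whose item variances sum to 1.53 and whose unique covariances sum to 0.54, so the totals quoted above are reproduced. It also illustrates that Equation 7.18a amounts to summing every element of the item variance–covariance matrix.

import numpy as np

# Hypothetical item variance–covariance matrix (not the actual Table 7.6 values).
cov = np.array([
    [0.50, 0.20, 0.15],
    [0.20, 0.60, 0.19],
    [0.15, 0.19, 0.43],
])

sum_item_variances = np.trace(cov)                       # 1.53
sum_unique_covs = (cov.sum() - np.trace(cov)) / 2.0      # 0.54

total_test_variance = sum_item_variances + 2 * sum_unique_covs   # Equation 7.18a
print(round(total_test_variance, 2), round(cov.sum(), 2))        # 2.61 2.61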
If we replace the “items” in Table 7.6 with “total test scores” (i.e., the total score being
based on the sum of items comprising a test), the same concept and statistical details will
apply regarding how to derive the total variance for a set of total test scores. Next, we
turn to the use of total test scores that are useful as individual components for deriving
a composite score.
In the true score model, total test scores are created by summing the item response
values (i.e., score values yielding points awarded) for each person. The total score for
a test derived in this manner is one form of a composite score. Another form of com-
posite score is derived by summing total test scores for two or more tests. In this case,
a composite score is defined as the sum of individual total test scores. Returning to the
data used throughout this book, suppose that you want to create a composite score
for crystallized intelligence by summing the total scores obtained on each of the four
subtests for crystallized intelligence. The summation of the four total test scores yields
a composite score that represents crystallized intelligence. Equation 7.19 illustrates
the derivation of a composite score for crystallized intelligence (labeled CIQ). The
composite score, CIQ, represents the sum of four subtests, each representing a different
measure of crystallized intelligence.
Given that composites are based on item total scores (for a single test) or total test
scores (for a linear composite comprised of two or more tests), these composites for-
mally serve as parallel components on a test. Applying the definition of parallel test com-
ponents, reliability estimation proceeds according to the technique(s) appropriate for
accurately representing the reliability of scores given the type of study. Specifically, the
estimation of reliability may proceed by one or more of the following techniques. First,
you may derive the stability of scores using the test–retest method. Second, you may
derive the equivalence of scores based on parallel test forms. Third, you may derive the
internal consistency of scores by using a sample of persons tested once with test items
serving as parallel pieces within the overall composite using the split-half reliability
method or by deriving the internal consistency of scores using the Küder–Richardson for-
mula 20 (KR20) or (21) or Cronbach’s coefficient alpha. Each of the internal consistency
techniques is based on there being as many parallel tests as there are items on the test. To
derive the variance of the composite score, Equation 7.20a is required. Equation 7.20b
illustrates the application of Equation 7.20a with data from Table 7.7.
Based on Equation 7.20b, the total variance of the composite using the data in Table
7.7 is 214.92.
To conclude this section, recall that earlier in this chapter individual test items com-
prising a test were viewed as parallel parts of a test. The requirements for parallel tests or
measurements include (1) equal mean true scores, (2) equal (item or test) standard devi-
ations, and (3) equal item (or test) variances. Specifically, test items (or total test scores)
Equation 7.20a. Variance of the composite score CIQ:

σ²_CIQ = σ²_CRYSTALLIZED1 + σ²_CRYSTALLIZED2 + σ²_CRYSTALLIZED3 + σ²_CRYSTALLIZED4 + Σ_{I≠J} ρ_IJ σ_I σ_J

• σ²_CIQ = variance of a composite score expressed as crystallized intelligence based on the sum of individual total test scores.
• σ²_CRYSTALLIZED1 = variance of crystallized intelligence test 1.
• σ²_CRYSTALLIZED2 = variance of crystallized intelligence test 2.
• σ²_CRYSTALLIZED3 = variance of crystallized intelligence test 3.
• σ²_CRYSTALLIZED4 = variance of crystallized intelligence test 4.
• Σ_{I≠J} ρ_IJ σ_I σ_J = sum of k(k − 1) covariance terms (i.e., k = intelligence tests 1–4), where i and j represent any pair of tests.

Equation 7.20b. Application of Equation 7.20a to the data in Table 7.7:

σ²_CIQ = 47.12 + 24.93 + 12.40 + 21.66 + 108.81
       = 214.92
Table 7.7. Variance–covariance matrix for the four crystallized intelligence subtests
47.12 28.64 15.93 25.48
— 24.93 11.69 14.71
— — 12.40 12.36
— — — 21.66
serve as individual, yet parallel, parts of a test, providing a way to estimate the coefficient
of reliability from a single test administration. Equation 7.21 provides a general form for
deriving true score variance of a composite. Equations 7.20a and 7.21 are general because
they can be used to estimate the variance of a composite when test scores exhibit unequal
standard deviations and variances (i.e., the equations allow for the covariation between
all items whether equal or unequal).
Equation 7.21. True score variance of a composite:

σ²_T_CIQ = σ²_TRUE_SCORE_CRYSTALLIZED1 + σ²_TRUE_SCORE_CRYSTALLIZED2 + σ²_TRUE_SCORE_CRYSTALLIZED3 + σ²_TRUE_SCORE_CRYSTALLIZED4 + Σ_{I≠J} ρ_IJ σ_I σ_J
Using the foundations of the CTT model, in the next section, we review several
techniques for estimating the coefficient of reliability in specific research or applied
situations.
Test–Retest Method

For personality, attitude, or interest inventories, test–retest coefficients are usually lower,
and the recommended range is between .80 and .90.
The final challenge to the test–retest method is related to chronological age. For
example, although research has established that adult intelligence is stable over time
(Wechsler, 1997b), this is not the case with the intelligence of children.
Split-Half Methods
Often it is not possible or desirable to compose and administer two forms of a test, as
discussed earlier. Here we describe a method for deriving the reliability of total test scores
based on parallel half tests. The split-half approach to reliability estimation involves
dividing a test composed of a set of items into halves that, to the greatest degree pos-
sible, meet the assumptions of exact parallelism. The resulting scores on the respective
half tests are then correlated to provide a coefficient of equivalence. The coefficient of
equivalence is actually the reliability based on one of the half tests. However, remember that
owing to the assumption of parallel test halves, we can apply a formula for deriving the
reliability of scores on the total test using the Spearman–Brown formula. For tests com-
posed of items with homogeneous content (a.k.a. item homogeneity; Coombs, 1950),
the split-half method proceeds according to the following steps. First, after scores on the
total test are obtained, items are assigned to each half test in either (a) a random fashion
or (b) according to order of item difficulty. This process yields one parallel subtest that is
composed of odd-numbered items, and a second half test is composed of even-numbered
items. The split-half technique described allows one to create two parallel half tests that
are of equal difficulty and have homogeneous item content.
Earlier it was stated that two parallel half tests can be created with the intent to tar-
get or measure the same true scores with a high degree of accuracy. One way to ascertain
if two tests are parallel is to ensure that the half tests have equal means and standard
deviations. Also, the test items in the two half tests should have the same content (i.e.,
exhibit item homogeneity). A high level of item homogeneity ensures that, as the corre-
lation between the two half tests approaches 1.0, the approximation to equal true scores
is as accurate as possible. If, however, the two half tests comprise items with partially
heterogeneous content, then certain parts of the two half tests will measure different
true scores. In this case, the two half tests should be created based on matching test halves,
where test items have been matched on difficulty and content. Table 7.8 provides example
data for illustrating the split-half and Guttman (1946) methods for estimating reliability
based on half tests. Rulon’s formula (1939) (equivalent to Guttman’s formula) does not
assume equal standard deviations (and variances) on the half test components. Finally,
when the variances on the half tests are approximately equal, the Rulon formula and
Guttman’s equation yield the same result as the split-half method with the Spearman–
Brown formula.
The SPSS syntax for computing the split-half reliability based on the model of paral-
lel tests (not strictly parallel) is provided below.
RELIABILITY
/VARIABLES=cri2_01 cri2_02 cri2_03 cri2_04 cri2_05 cri2_06
cri2_07 cri2_08 cri2_09 cri2_10 cri2_11 cri2_12 cri2_13 cri2_14
cri2_15 cri2_16 cri2_17 cri2_18 cri2_19 cri2_20 cri2_21 cri2_22
cri2_23 cri2_24 cri2_25
/SCALE('ALL VARIABLES') ALL
/MODEL=PARALLEL.
Equation 7.22. Spearman–Brown formula for the reliability of the total test based on the correlation between half tests:

ρ_XX′ = 2ρ_ii′ / (1 + ρ_ii′)

• ρ_ii′ = correlation between the half tests.
• ρ_XX′ = split-half reliability of the total test based on the Spearman–Brown formula.
Equation 7.23. Rulon’s formula for total test score reliability based
on the correlation between parallel split-halves
é æ s2 + sHALF
2
TEST2 ö
ù
rXX ¢ = 2 ê1 - ç HALF TEST12 ÷ ú
êë è sTOTAL TEST ø úû
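A hedged Python sketch (simulated half-test scores, not the cri2_* items) compares the Spearman–Brown estimate with the Rulon/Guttman form; as noted above, the two agree closely when the half-test variances are approximately equal.

import numpy as np

rng = np.random.default_rng(3)

# Hypothetical half-test total scores for 200 examinees: both halves measure the
# same true score with independent, equal-variance errors.
true = rng.normal(50.0, 8.0, size=200)
half1 = true + rng.normal(0.0, 4.0, size=200)
half2 = true + rng.normal(0.0, 4.0, size=200)
total = half1 + half2

r_halves = np.corrcoef(half1, half2)[0, 1]

spearman_brown = 2 * r_halves / (1 + r_halves)                   # full-test reliability
rulon = 2 * (1 - (half1.var() + half2.var()) / total.var())      # Rulon/Guttman form

print(round(spearman_brown, 3), round(rulon, 3))   # nearly identical values here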
The SPSS syntax for computing the Guttman model of reliability is as follows:
RELIABILITY
/VARIABLES=cri2_01 cri2_02 cri2_03 cri2_04 cri2_05 cri2_06
cri2_07 cri2_08 cri2_09 cri2_10 cri2_11 cri2_12 cri2_13 cri2_14
cri2_15 cri2_16 cri2_17 cri2_18 cri2_19 cri2_20 cri2_21 cri2_22
cri2_23 cri2_24 cri2_25
/SCALE('ALL VARIABLES') ALL
/MODEL=GUTTMAN.
The Guttman model provides six lower-bound coefficients (i.e., expressed as lambda
coefficients). The output for the Guttman reliability model is provided in Table 7.10. The
lambda 3 (L3) is based on estimates of the true variance of scores on each item and is also
expressed as the average covariance between items and is analogous to coefficient alpha.
Guttman’s lambda 4 is interpreted as the greatest split-half reliability.
Coefficient Alpha
The first and most general technique for the estimation of internal consistency reliability
is known as coefficient alpha and is attributed to L. J. Cronbach (1916–2001). In his work
(1951), Cronbach provided a general formula for deriving the internal consistency of scores.
Coefficient alpha is a useful formula because of its generality. For example, alpha is effective
for estimating score reliability for test items that are scored dichotomously (correct/incorrect),
or for items scored on an ordinal level of measurement (e.g., Likert-type or rating scale items)
and even for essay-type questions that often include differential scoring weights. For these
reasons, coefficient alpha is reported in the research literature more often than any other
coefficient. The general formula for coefficient alpha is provided in Equation 7.24. Table 7.11
includes summary data for 10 persons on the 25-item crystallized intelligence test 2 used in
the previous section on split-half methods.
The total test variance for the crystallized intelligence test 2 is 19.05 (defined as the
sum of the squared deviations from the mean) for 10 persons in this example data. Read-
ers are encouraged to conduct the calculation of coefficient alpha using the required parts
of Equation 7.24 by accessing the raw item-level Excel file: “Reliability_Calculation_
Examples.xlsx” on the companion website (www.guilford.com/price2-materials). Knowing
that the test is composed of 25 items, the total test variance is 19.05 and the sum of the
Equation 7.24. Coefficient alpha:

α̂ = [k / (k − 1)] [1 − (Σσ̂²_i / σ̂²_X)]

• α̂ = coefficient alpha.
• k = number of items.
• σ̂²_i = variance of item i.
• σ̂²_X = total test variance.
item-level variances is 4.13, we can insert these values into Equation 7.24 and derive the
coefficient alpha as .82.
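A brief Python check of this hand calculation, using only the summary quantities just quoted, is shown below (illustrative only; it does not replace the SPSS/SAS examples that follow).

# Coefficient alpha (Equation 7.24) from the summary quantities quoted above:
# k = 25 items, sum of item variances = 4.13, total test variance = 19.05.
k = 25
sum_item_variances = 4.13
total_test_variance = 19.05

alpha = (k / (k - 1)) * (1 - sum_item_variances / total_test_variance)
print(round(alpha, 2))   # 0.82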
The SPSS syntax and SAS source code that produces output using the data file .sav is
provided on the next page. The dataset may be downloaded from the companion website
(www.guilford.com/price2-materials).
SPSS program syntax for coefficient alpha using data file Coefficient_Alpha_
Reliability_N_10_Data.SAV
RELIABILITY
/VARIABLES=cri2_01 cri2_02 cri2_03 cri2_04 cri2_05 cri2_06
cri2_07 cri2_08 cri2_09 cri2_10 cri2_11 cri2_12 cri2_13
cri2_14 cri2_15 cri2_16 cri2_17 cri2_18 cri2_19 cri2_20 cri2_21
cri2_22 cri2_23 cri2_24 cri2_25
/SCALE('ALL VARIABLES') ALL
/MODEL=ALPHA
/STATISTICS=DESCRIPTIVE SCALE
/SUMMARY=TOTAL.
SAS source code for coefficient alpha using SAS data file alpha_reliability_data
The item-level portion of the output (the correlation of each item with the total and coefficient alpha if the item is deleted) is shown below.

Deleted        Correlation                    Correlation
Variable       with Total        Alpha        with Total        Alpha        Label
----------------------------------------------------------------------------------
CRI2_01 -.106391 0.824370 -.117827 0.821956 cri2_01
CRI2_02 0.201823 0.814864 0.176187 0.809214 cri2_02
CRI2_03 0.280976 0.812357 0.269409 0.805024 cri2_03
CRI2_04 0.765257 0.791034 0.766489 0.781409 cri2_04
CRI2_05 -.106391 0.824370 -.139250 0.822857 cri2_05
CRI2_06 0.765257 0.791034 0.766489 0.781409 cri2_06
CRI2_07 0.443376 0.805534 0.423210 0.797949 cri2_07
CRI2_08 0.690412 0.798951 0.664913 0.786412 cri2_08
CRI2_09 0.529629 0.800271 0.518662 0.793454 cri2_09
CRI2_10 -.273526 0.838547 -.252589 0.827561 cri2_10
CRI2_11 0.498087 0.802297 0.516984 0.793534 cri2_11
CRI2_12 0.322139 0.811395 0.307313 0.803299 cri2_12
CRI2_13 0.699294 0.794074 0.689362 0.785216 cri2_13
CRI2_14 0.765257 0.791034 0.766489 0.781409 cri2_14
CRI2_15 0.271781 0.814019 0.293075 0.803948 cri2_15
CRI2_16 0.846512 0.783933 0.851875 0.777130 cri2_16
CRI2_17 0.692078 0.791202 0.700026 0.784693 cri2_17
CRI2_18 0.611315 0.796471 0.625044 0.788351 cri2_18
CRI2_19 -.177627 0.834356 -.192228 0.825069 cri2_19
CRI2_20 0.199188 0.815987 0.203809 0.807980 cri2_20
CRI2_21 0.020153 0.825438 -.003166 0.817071 cri2_21
CRI2_22 0.443376 0.805534 0.470516 0.795731 cri2_22
CRI2_23 0.199188 0.815987 0.215163 0.807471 cri2_23
CRI2_24 0.280976 0.812357 0.297003 0.803769 cri2_24
CRI2_25 -.030557 0.822068 -.048887 0.819032 cri2_25
In reality, tests rarely meet the assumptions required of strictly parallel forms. Therefore,
a framework is needed for estimating composite reliability when the model of strictly
parallel tests is untenable. Estimating the composite reliability of scores in the case of
essentially tau-equivalent or congeneric tests is accomplished using the variance of the
composite scores and all of the covariance components of the subtests (or individual
items if one is working with a single test). An estimate is provided that is analogous to
coefficient alpha and is simply an extension from the item-level data to subtest level data
structures. Importantly, alpha provides a lower bound to the estimation of reliability in the
situation where tests are nonparallel. The evidence that coefficient alpha provides a lower
bound estimate of reliability is established as follows. First, there will be at least one
subtest of those comprising a composite variable that exhibits a variance greater than or
equal to its covariance with any other of the subtests. Second, for any two tests that are
not strictly parallel, the sum of their true score variances is greater than or equal to twice
their covariance. Finally, the sum of the true score variance for nonparallel tests (k) will
be greater than or equal to the sum of their k(k – 1) covariance components divided by
(k – 1). Application of the inequality yields Equation 7.25.
Equation 7.25. Lower-bound estimate of composite reliability:

ρ_CC′ ≥ [K / (K − 1)] [1 − (Σσ̂²_i / σ̂²_C)]

• ρ_CC′ = reliability of the composite.
• σ̂²_i = variance for subtest i.
• σ̂²_C = total composite test variance.
Equation 7.26. Küder–Richardson formula 20 (KR20):

KR20 = [K / (K − 1)] [1 − (Σpq / σ̂²_X)]

• K = number of items.
• σ̂²_X = total test score variance.
• Σpq = sum, over items, of the proportion of persons responding correctly to each item on the test multiplied by the proportion of persons responding incorrectly to each item on the test.

Comparing this with Equation 7.24 for coefficient alpha, we see that the numerator within
the brackets involves summation of the variance of all test items. The primary difference
between the two equations is that in KR20 the variance for dichotomous items is based
on multiplying proportions, whereas in coefficient alpha the derivation of item variance
is not restricted to multiplying the proportion correct times the proportion incorrect for
an item because items are allowed to be scored on an ordinal or interval level of mea-
surement (e.g., Likert-type scales or continuous test scores on an interval scale). Finally,
where all test items are of equal difficulty (e.g., the proportion correct is the same for all
items), the KR21 formula applies and is provided in Equation 7.27.
For a detailed exposition of the KR20, KR21, and coefficient alpha formulas with sam-
ple data, see the Excel file titled “Reliability_Calculation_Examples.xlsx” located on the
companion website (www.guilford.com/price2-materials).
Equation 7.27. Küder–Richardson formula 21 (KR21):

KR21 = [K / (K − 1)] [1 − μ̂(K − μ̂) / (K σ̂²_X)]

• K = number of items.
• μ̂ = mean total score on the test.
• σ̂²_X = total test score variance.
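A short Python sketch with a hypothetical 0/1 response matrix (not the crystallized intelligence data) illustrates both formulas.

import numpy as np

# Hypothetical dichotomous responses for 8 persons on 5 items (illustration only).
items = np.array([
    [1, 1, 1, 0, 1],
    [1, 0, 1, 0, 0],
    [1, 1, 1, 1, 1],
    [0, 0, 1, 0, 0],
    [1, 1, 0, 1, 1],
    [0, 1, 0, 0, 0],
    [1, 1, 1, 1, 0],
    [0, 0, 0, 0, 1],
])
k = items.shape[1]
totals = items.sum(axis=1)

p = items.mean(axis=0)            # proportion correct per item
q = 1 - p                         # proportion incorrect per item
var_total = totals.var()          # population (divide-by-N) variance of total scores

kr20 = (k / (k - 1)) * (1 - (p * q).sum() / var_total)
kr21 = (k / (k - 1)) * (1 - totals.mean() * (k - totals.mean()) / (k * var_total))

print(round(kr20, 2), round(kr21, 2))   # ~0.64 and ~0.62; KR21 <= KR20 unless item difficulties are equal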
Another useful and general approach to estimating the reliability of test scores is the
analysis of variance (Hoyt, 1941). Consider the formulas for coefficient alpha, KR20 and
KR21. Close inspection reveals that the primary goal of these formulas is the partitioning
of (1) variance attributed to individual items and (2) total variance collectively contrib-
uted by all items on a test. Similarly, in the analysis of variance (ANOVA), one can parti-
tion the variance among persons and items, yielding the same result as coefficient alpha.
The equation for the ANOVA method (Hoyt, 1941) is provided in Equation 7.28.
To illustrate Equation 7.28 using example data, we return to the data used in the
examples for coefficient alpha. Restructuring the data file as presented in Table 7.14
ensures the correct layout for running ANOVA in SPSS. Note that Table 7.14 only pro-
vides a partial listing of the data (because there are 25 items on the test) used in the
example results depicted in Table 7.15.
The data layout example in Table 7.14 continues until all persons, items, and scores
are entered. Next, the following SPSS syntax is used to produce the mean squares required
for calculation of the reliability coefficient.
Inserting the mean squares for persons and the person-by-items interaction yields a
reliability coefficient of .82—the same value obtained using the formula for
coefficient alpha. Applying the person and person-by-item mean squares to the ANOVA
approach yields ρ_XX′ = (.847 − .156)/.847 = .82.

Equation 7.28. Reliability coefficient based on the analysis of variance (Hoyt's method):

ρ_XX′ = (MS_PERSONS − MS_PERSONS×ITEMS) / MS_PERSONS
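Because the person-by-item data behind the quoted mean squares are not reproduced here, the Python sketch below defines a generic two-way ANOVA (without replication) computation and then simply plugs in the quoted mean squares to recover the reported value.

import numpy as np

def hoyt_reliability(scores):
    # scores: persons-by-items array; two-way ANOVA without replication.
    n, k = scores.shape
    grand = scores.mean()
    ss_persons = k * ((scores.mean(axis=1) - grand) ** 2).sum()
    ss_items = n * ((scores.mean(axis=0) - grand) ** 2).sum()
    ss_residual = ((scores - grand) ** 2).sum() - ss_persons - ss_items
    ms_persons = ss_persons / (n - 1)
    ms_residual = ss_residual / ((n - 1) * (k - 1))
    return (ms_persons - ms_residual) / ms_persons

# Plugging in the mean squares quoted in the text reproduces the reported value.
ms_persons, ms_interaction = 0.847, 0.156
print(round((ms_persons - ms_interaction) / ms_persons, 2))    # 0.82

# Applied to hypothetical correlated item scores, the function behaves like
# coefficient alpha computed on the same data.
rng = np.random.default_rng(11)
demo = rng.normal(size=(30, 6)) + rng.normal(size=(30, 1))     # shared person effect
print(round(hoyt_reliability(demo), 2))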
An important aspect of score reliability for certain types of research relates to how change over
time affects the reliability of scores (Linn & Slinde, 1977; Zimmerman & Williams, 1982;
Rogosa, Brandt, & Zimowski, 1982). For example, consider the case where a difference score
based on fluid intelligence and crystallized intelligence is of interest for diagnostic reasons.
Although the primary research question may be about whether the change in score level is
statistically different, a related question focuses on how reliability is affected by the change in
score level. To address the previous question, we consider the reliability of change scores as
a function of (1) the reliability of the original scores used for computation of the difference
score, and (2) the correlation between the scores obtained on the two tests. Based on these
two components, the usefulness of calculating the reliability of change scores depends on the
psychometric quality of the measurement instruments.
The research design of a study plays a crucial role in the application and interpreta-
tion of the reliability of change scores. For example, if groups of subjects selected for a
study are based on a certain range of pretest score values, then the difference score will
be a biased estimator of reliable change (e.g., due to restricted range of pretest scores).
Elements of the research design also play an important role when using change scores.
For example, random assignment to study groups provides a way to make inferential
statements that are not possible when studying intact groups. Equation 7.29 provides the
formula estimating the reliability of difference scores based on pretest to posttest change.
Note that Equation 7.29 incorporates all of the elements of reliability theory presented
thus far in this chapter. Within the true score model, one begins with the fact that it is
theoretically possible to calculate a difference score. Given this information, the usual
true score algebraic manipulation (i.e., true scores to observed scores) applies. Equation
7.29 illustrates the reliability of difference scores.
To illustrate the use of Equation 7.29, we use crystallized (serving as test 1) and fluid
intelligence (serving as test 2) subtest total scores. In Equation 7.30, the formula is applied to
our score data. The following information is obtained from the GfGc.sav dataset and is based
on the total sample (N = 1,000).
To ensure highly reliable difference scores, the following conditions should be present. Both
tests (i.e., sets of scores) should exhibit high reliability but be correlated with each other at a
low to moderate level (e.g., .30–.40). This combination produces difference scores with high
reliability. Finally, the psychometric quality of the tests used to derive dif-
ference scores for the analysis of change is crucial to produce reliable change scores. The
concept of the reliability of change scores over time can also be extended beyond the analysis
of discrepancy between different constructs (e.g., crystallized and fluid intelligence presented
here) or basic pretest to posttest analyses to analyze change over time. For example, analytic
techniques such as longitudinal item response theory (IRT; covered in Chapter 10) and hier-
archical linear and structural equation modeling provide powerful frameworks for the analy-
sis of change (Muthen, 2007; Zimmerman, Williams, & Zumbo, 1993; Raudenbush, 2001;
Card & Little, 2007).
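Because Equations 7.29 and 7.30 are not reproduced here, the Python sketch below implements the standard classical formula for the reliability of a difference between two observed scores (errors on the two tests assumed uncorrelated); the inputs are hypothetical rather than the GfGc.sav values, and the notation may differ in detail from the book's Equation 7.29.

def difference_score_reliability(rel_x, rel_y, r_xy, sd_x, sd_y):
    # Ratio of true difference-score variance to observed difference-score variance.
    true_var = rel_x * sd_x ** 2 + rel_y * sd_y ** 2 - 2 * r_xy * sd_x * sd_y
    obs_var = sd_x ** 2 + sd_y ** 2 - 2 * r_xy * sd_x * sd_y
    return true_var / obs_var

# Two highly reliable tests correlated at a moderate level yield reasonably
# reliable difference scores; the same tests correlated highly do not.
print(round(difference_score_reliability(0.90, 0.88, 0.35, 10.0, 10.0), 2))   # ~0.83
print(round(difference_score_reliability(0.90, 0.88, 0.80, 10.0, 10.0), 2))   # ~0.45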
The standard error of measurement (SEM; ŝE) provides an estimate of the discrepancy
between a person’s true score and observed score on a test of interest. Measurement error
for test scores is often expressed in standard deviation units, and the SEM indexes the stan-
dard deviation of the distribution of measurement error. Formally, the SEM (ŝE) is defined as
the standard deviation of the discrepancy between a person’s true score and observed score
over infinitely repeated testing occasions. Gulliksen (1950b, p. 43) offered an intuitive defi-
nition of the SEM as “the error of measurement made in substituting the observed score for
the true score.” Equation 7.31 illustrates the standard error of measurement.
Equation 7.31. Standard error of measurement:

σ̂_E = σ_X √(1 − ρ̂_XX′)
When applying Equation 7.31 to score data, sample estimates rather than population
parameters are typically used to estimate the SEM.
The SEM provides a single index of measurement error for a set of test scores. It can
be used for establishing confidence limits and developing a confidence interval around
a person’s observed score given the person’s estimated true score. Within classical test theory,
a person’s true score is fixed (or constant), and it is the observed and error scores that ran-
domly fluctuate over repeated testing occasions (Lord & Novick, 1968, p. 56). One can
derive confidence limits and an associated interval for observed scores using the SEM.
However, because a person’s true score is of primary interest in the true score model, one
should first estimate the true score for a person prior to using Equation 7.31 to derive
confidence intervals.
Two problems occur when not accounting for true score: (1) a regression effect (i.e.,
the imperfect correlation between observed and true scores, which produces a regression
toward the group mean), and (2) the impact of heteroscedastic (nonuniform) errors
across the score continuum (Nunnally & Bernstein, 1994, p. 240). Consequently, sim-
ply using the SEM has the effect of overcorrecting owing to larger measurement error in
observed scores as compared to true scores. Confidence intervals established without
estimating true scores will lack symmetry (i.e., lack the correct precision across the score
scale) around observed scores. To address the issue of regression toward the mean due
to errors of measurement, Stanley (1970), Nunnally and Bernstein (1994), and Glutting,
McDermott, and Stanley (1987) note that one should first estimate true scores for a per-
son and then derive estimated true score–based confidence intervals that can be used
with observed scores. This step, illustrated in Equation 7.32, overcomes the problem
of lack of symmetry from simply applying the SEM to derive confidence intervals for
observed scores.
Equation 7.32. Estimated true score for a person:

T̂ = ρ̂_XX′(X_i − X̄) + X̄

As an example, consider estimating a true score for a person who obtained an observed score
of 17. Returning to Tables 7.3 and 7.4, we see that the mean is 11.50, the standard deviation of
observed scores is 4.3, and the reliability is .82. Application of this information to Equation 7.33
provides the following result.

Equation 7.33. Application of Equation 7.32 to the example data:

T̂ = .82(17 − 11.5) + 11.5
  = .82(5.5) + 11.5
  = 4.51 + 11.5
  = 16.01
As noted earlier, lack of symmetry for confidence intervals derived with an SEM
without first estimating true scores neglects accounting for a regression effect. The regres-
sion effect causes biased scores either upward or downward, depending on their location
relative to the group mean. For example, high observed scores are typically further away
from the mean of the group (i.e., they exhibit an upward bias effect), and low scores are
typically biased downward lower than the actual observed score. For these reasons, it is
correct to establish confidence intervals or probable ranges for a person’s observed score given
their (fixed or regressed) true score. Using the estimated true score for a person, one can
apply Equation 7.33 to Equation 7.34a to derive a symmetric confidence interval for true
scores that can be applied to a person’s observed scores. Equation 7.34a can be expressed
as σ̂_X.T to show that applying the SEM to estimated true scores yields the prediction of
observed scores from true scores.

Equation 7.34a. Standard error of measurement for predicting observed scores from estimated true scores:

σ̂_X.T = σ_X √(1 − ρ̂_XX′)

Applied to the example data:

σ̂_X.T = 4.3 √(1 − .82)
      = 4.3(.42)
      = 1.82

The resulting confidence intervals will be symmetric
about a person’s true score but asymmetric about their observed score. This approach
to developing confidence intervals is necessary in order to account for regression toward the
mean test score.
Equation 7.35a provides the following advantages. First, Stanley’s method is based
on a score metric that is expressed in estimated true score units (i.e., Tˆ − T′, the T¢ = pre-
dicted true score) (Glutting et al., 1987). Second, as Stanley demonstrated (1970), his
method adheres to the classical true score model assumption that, for a population of
examinees, errors of measurement exhibit zero correlation with true scores.

Equation 7.35a. Confidence interval for a person's score based on the estimated true score (Stanley's method):

T̂ ± (z)(σ̂_X.T)(ρ̂_XX′)

Applied to the example data:

T̂ ± (1.96)(1.82)(.82)
  = 16.01 ± (1.96)(1.5)
  = 16.01 ± 2.94
  = 13.07 to 18.95
Interpretation
To facilitate understanding that a person’s true score will fall within a confidence interval
based on that person’s observed score, consider the following scenario. First, using the
previous example, let’s assume that a person’s true score is 16, the reliability is .82, and the
standard error of measurement is 1.82. Next, let’s assume that this person is repeatedly
tested 1,000 times. Of the 1,000 repeated testing occasions, 950 (95%) would lie within
2.94 points of their true score (e.g., between 13.07 and 18.95). Fifty scores would fall
outside of the interval 13.07 to 18.95. Finally, if a confidence interval is derived for each
of the person’s 1,000 observed scores, 950 of the intervals would be generated around
observed scores between 13.07 and 18.95 (each interval would contain the person’s true
score). From the previous explanation, we see that 5% of the time the person’s true score
would not fall within the interval 13.07 to 18.95. However, there is a 95% chance that the
confidence interval generated around the observed score of 16 will contain the person’s
true score.
A common alternate approach to establishing confidence limits and intervals offered
by Lord and Novick (1968, pp. 68–70) does not always meet the classical true score
model requirement of zero correlation between true and error scores—unless the reliabil-
ity of the test is perfect (i.e., 1.0). Lord and Novick’s (1968, p. 68) approach is expressed
in obtained score units (e.g., Tˆ − T) and is provided in Equation 7.36a.
Continuing with Lord and Novick’s approach, we will next illustrate the probability
that a person’s true score will fall within a confidence interval based on their observed
score. Again, we assume that a person’s true score is 16 and that the standard error of
measurement is 1.82. Next, let’s assume that this person is repeatedly tested 1,000 times.
Of the 1,000 repeated testing occasions, 950 (95%) would lie within 3.25 points of their
true score (e.g., between 12.76 and 19.26). Notice that the confidence interval is wider
in Lord and Novick’s method (see Equation 7.36a) because the product of the z-ordinate
and the estimated standard error is multiplied by the square root of the reliability. Fifty
scores would fall outside of the interval 12.76 to 19.26. Finally, if a confidence interval
was derived for each of the person’s 1,000 observed scores, 950 of the intervals would be
generated around observed scores between 12.76 and 19.26 (each interval would contain
the person’s true score). It is apparent from the previous explanation that 5% of the time
the person’s true score would not fall within the interval 12.76 to 19.26. However, there
is a 95% chance that the confidence interval generated around the observed score of 16
will contain the person’s true score.
Equation 7.36a. Lord and Novick's confidence interval for a person's score:

T̂ ± (z)(σ̂_X.T)√(ρ̂_XX′)

• T̂ = estimated true score.
• z = standard normal deviate (e.g., 1.96).
• σ̂_X.T = standard error of measurement as the prediction of observed score from true score.
• √ρ̂_XX′ = square root of the coefficient of reliability (i.e., the reliability index).

Applied to the example data:

T̂ ± (1.96)(1.82)(.91)
  = 16.01 ± (1.96)(1.66)
  = 16.01 ± 3.25
  = 12.76 to 19.26
The standard error of prediction is useful for predicting the probable range of scores on
one form of a test (e.g., Y), given a score on an alternate parallel test (e.g., X). For exam-
ple, using the crystallized intelligence test example throughout this chapter, one may be
interested in what score one can expect to obtain on a parallel form of the same test. To
derive an error estimate to address this question, Equation 7.37a is required.
Equation 7.37a. Standard error of prediction:

σ̂_Y.X = σ_Y √(1 − ρ²_XX′)

Equation 7.37b. Application to the example data:

σ̂_Y.X = 4.3 √(1 − .82²)
      = 4.3 √.327
      = 4.3(.572)
      = 2.46

Equation 7.37c. 95% confidence interval based on the standard error of prediction:

T̂ ± (1.96)(2.46)
  = 16.01 ± 4.82
  = 11.19 to 20.83
Applying the same example data as in Equations 7.32 and 7.33 to Equation 7.37a
yields the error estimate in Equation 7.37b. Next, we can apply the standard error of prediction
derived in Equation 7.37b, as shown in Equation 7.37c, to develop a 95% confidence interval.
Interpretation
Using the standard error of prediction, the probability that a person’s true score will fall
within a confidence interval based on that person’s observed score is illustrated next.
Again we assume that a person’s true score is 16, the standard deviation of test X is 4.3, and
the reliability estimate is .82. Next, we assume that this person is repeatedly tested 1,000
times. Of the 1,000 repeated testing occasions, 950 (95%) would lie within 4.82 points of
the person’s true score (e.g., between 11.19 and 20.83). Notice that the confidence interval
is wider in the previous examples. Fifty scores would fall outside of the interval 11.19 to
20.83. Finally, if a confidence interval was derived for each of the person’s 1,000 observed
scores, 950 of the intervals would be generated around observed scores between 11.19
and 20.83 (each interval would contain the person’s true score). It is apparent from the
previous explanation that 5% of the time the person’s true score would not fall within the
interval 11.19 to 20.83. However, there is a 95% chance that the confidence interval gener-
ated around the observed score of 16 will contain the person’s true score.
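The three interval computations developed above can be reproduced in a few lines of Python using the summary values from the running example (mean 11.5, observed-score standard deviation 4.3, reliability .82, observed score 17); small differences from the hand-rounded values in the text are expected.

import math

mean_x, sd_x, rel = 11.5, 4.3, 0.82
observed, z = 17.0, 1.96

t_hat = rel * (observed - mean_x) + mean_x        # estimated true score, ~16.01
sem = sd_x * math.sqrt(1 - rel)                   # standard error of measurement, ~1.82
se_pred = sd_x * math.sqrt(1 - rel ** 2)          # standard error of prediction, ~2.46

intervals = {
    "Stanley (SEM x reliability)": z * sem * rel,                        # ~2.9 (text: 2.94)
    "Lord & Novick (SEM x sqrt reliability)": z * sem * math.sqrt(rel),  # ~3.2 (text: 3.25)
    "Standard error of prediction": z * se_pred,                         # ~4.8 (text: 4.82)
}
for label, half_width in intervals.items():
    print(f"{label}: {t_hat:.2f} +/- {half_width:.2f} "
          f"[{t_hat - half_width:.2f}, {t_hat + half_width:.2f}]")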
a comprehensive (presented next in Chapter 8) framework that allows for many types of
applied testing scenarios. Reliability information may also be reported in terms of error
variance or standard deviations of measurement errors. For example, when test scores are
based on classical test theory, the standard error of measurement should be reported along
with confidence intervals for score levels. For IRT, information on functions should be
reported because they provide the magnitude of error across the score range. Also, when a
test is based on IRT, information on the individual item characteristic functions should be
reported along with the test characteristic curve. The item characteristic and test functions
provide essential information regarding the precision of measurement at various ability
levels of examinees. Item response theory will be covered thoroughly in Chapter 10.
Whenever possible, reporting conditional errors of measurement is also encouraged
because errors of measurement are not uniform across the score scale and this has implica-
tions for the accuracy of score reporting (AERA, APA, & NCME, 1999, p. 29). For approaches
to estimating conditional errors of measurement see Kolen, Hanson, and Brennan (1992),
and for conditional reliability, see Raju, Price, Oshima, and Nering (2007).
When comparing and interpreting reliability information obtained from using a test
for different groups of persons, consideration should be given to differences in variability
of the groups. Also, the techniques used to estimate the reliability coefficients should be
reported along with the sources of error. Importantly, it is essential to present the theo-
retical model by which the errors of measurement and reliability coefficients were derived
(e.g., classical test theory, IRT, or generalizability theory). This step is critical because
interpretation of reliability coefficients varies depending on the theoretical model used
for estimation.
Finally, test score precision should be reported according to the type of scale from which the scores were derived. For example, raw scores or IRT-based scores may reflect
different errors of measurement and reliability coefficients than standardized or derived
scores. This is particularly true at different levels of a person’s ability or achievement.
Therefore, measurement precision is substantially influenced by the scale in which the
test scores are reported.
Reliability refers to the degree to which scores on tests or other instruments are free from errors of measurement; this degree of freedom from error determines the consistency and repeatability of the scores.
Reliability of measurement is a fundamental issue in any research endeavor because some
form of measurement is used to acquire data—and no measurement process is error free.
Identifying and properly classifying the type and magnitude of error is essential to esti-
mating the reliability of scores. Estimating the reliability of scores according to the clas-
sical true score model involves certain assumptions about a person’s observed, true, and
error scores. Reliability studies are conducted to evaluate the degree of error exhibited
in the scores on a test (or other instrument). Reliability studies involving two separate
test administrations include the alternate form and test–retest methods or techniques.
The internal consistency approaches are based on covariation among or between test
item responses and involve a single test administration using a single form. The inter-
nal consistency approaches include (1) split-half techniques with the Spearman–Brown
correction formula, (2) coefficient alpha, (3) the Küder–Richardson 20 formula, (4) the
Küder–Richardson 21 formula, and (5) the analysis of variance approach. The reliability
of scores used in the study of change is an issue important to the integrity of longitudinal
research designs. Accordingly, a formula was presented that provides a way to estimate
the reliability of change scores.
It is also useful to view how “unreliable” test scores are. The unreliability of scores is
viewed as a discrepancy between observed scores and true scores and is expressed as the
error of measurement. Three different approaches to deriving estimates of errors of mea-
surement and associated confidence intervals were presented, along with the interpretation
of each using example data. The three approaches commonly used are (1) the standard
error of measurement, (2) the standard error of estimation, and (3) the standard error of
prediction.
Confidence limits. Either of two values that provide the endpoints of a confidence interval.
Congeneric tests. Axiom specifying that a person’s observed, true, and error scores on
two tests are allowed to differ.
Constant error. Error of measurement that occurs systematically and constantly due to char-
acteristics of the person, the test, or both. In the physical or natural sciences, this type of
error occurs when an improperly calibrated instrument is used to measure something
such as temperature. This results in a systematic shift based on a calibration error.
Deviation score. A raw score minus the mean of the set of scores.
Essential tau-equivalence. Axiom specifying that a person's true scores on two tests are allowed to differ, but only by an additive constant.
Generalizability theory. A highly flexible technique for studying measurement error that estimates the degree to which a particular set of measurements on an examinee generalizes to a more extensive set of measurements.
Guttman’s equation. An equation that provides a derivation of reliability estimation
equivalent to Rulon’s method that does not necessarily assume equal variances on the
half-test components. This method does not require the use of the Spearman–Brown
correction formula.
Heteroscedastic error. A condition in which nonuniform or nonconstant error is exhibited
in a range of scores.
Internal consistency. The degree to which several items that purport to measure the same general construct produce similar scores.
Item homogeneity. Test items composed of similar content as defined by the underlying
construct.
Küder–Richardson Formula 20 (KR-20). A special case of coefficient alpha that is
derived when items are measured exclusively on a dichotomous level.
Küder–Richardson Formula 21 (KR-21). A special case of coefficient alpha that is
derived when items are of equal difficulty.
Measurement precision. How close scores are to one another and the degree of measurement error on parallel tests.
Parallel tests. The assumption that two strictly parallel tests yield the same true score for every individual and have equal error variances.
Random error. Errors of measurement that function in a random or nonsystematic manner.
Reliability. The consistency of measurements based on repeated sampling of a sample
or population.
Reliability coefficient. The squared correlation between observed scores and true scores.
A numerical statistic or index that summarizes the properties of scores on a test or
instrument.
Reliability index. The correlation between observed scores and true scores.
Rulon’s formula. A split-half approach to reliability estimation that uses difference scores
between half tests and that does not require equal error variances on the half tests.
This method does not require the use of the Spearman–Brown correction formula.
Spearman–Brown formula. A method in which tests are correlated and corrected back
to the total length of a single test to assess the reliability of the overall test.
Split-half reliability. A method of estimation in which two parallel half tests are created,
and then the Spearman–Brown correction is applied to yield total test reliability.
Standard error of estimation. Used to estimate a person's true score from his or her observed score. Useful for establishing confidence intervals for true scores.
Standard error of measurement. The accuracy with which a single score for a person approximates the expected value of possible scores for the same person. It reflects the standard deviation of the errors of measurement for a group of examinees.
Standard error of prediction. Used to predict a person's score on one test (Y) from his or her score on an alternate parallel test (X). Useful for establishing confidence intervals for predicted scores.
Tau-equivalence. Axiom specifying that a person has equal true scores on parallel forms
of a test.
True score. Hypothetical entity expressed as the expectation of a person’s observed score
over repeated independent testing occasions.
True score model. A score expressed as the expectation of a person’s observed score
over infinitely repeated independent testing occasions. True score is only a hypo-
thetical entity due to the implausibility of actually conducting an infinite number of
independent testing occasions.
Validity. The degree to which evidence and theory support the interpretations of test
scores entailed by proposed use of a test or instrument. Evidence of test validity is
related to reliability, such that reliability is a necessary but not sufficient condition to
establish the validity of scores on a test.
8
Generalizability Theory
This chapter introduces generalizability theory—a statistical theory about the dependabil-
ity of measurements. In this chapter, the logic underlying generalizability is introduced
followed by practical application of the technique. Emphasis is placed on the advan-
tages generalizability theory provides for examining single and multifaceted measurement
problems.
8.1 Introduction
In Chapter 7, reliability was introduced within the classical test theory (CTT) frame-
work. In CTT, a person’s true score is represented by his or her observed score that is a
single measurement representative of many possible scores based on a theoretically infi-
nite number of repeated measurements. The CTT approach to reliability estimation is
based on the variation in persons’ (or examinees) observed scores (Xi) being partitioned
into true (Ti) and error (Ei) components. The true component is due to true differences
among persons, and the error part is an aggregate of variation due to systematic and random sources of error. In generalizability theory, a person's observed score, true score, and error score are expressed as Xpi, Tpi, and Epi, respectively, where p represents persons (examinees) and i represents items. For any person (p) and item (i), Xpi is a random variable whose expectation over replications (i.e., the long-run average over many repeated measurements) defines the person's true score.
Aggregating systematic and random sources of error in CTT is less than ideal because
we lose important information about the source of systematic and/or random error and
the impact each has on measurement precision. For example, variation (differences)
in item responses arise from (1) item difficulty, (2) person performance, and (3) the
interaction between persons and items confounded by other sources of systematic and
random error. Classical test theory provides no systematic way to handle these complexi-
ties. Another example where CTT is inadequate is when observers rate examinees on
their performance on a task. Typically, this type of measurement involves multiple rat-
ers on a single task or multiple tasks. As an example, consider the situation where test
items are used to assess level of performance on a written or constructed response using
a quality-based rating scale. In this case, it is the quality of the written response that is
being assessed. CTT does not provide a framework for teasing apart multiple sources
of error captured in (Ei). Generalizability theory extends CTT (Cronbach et al., 1972;
Brennan, 2010) by providing a framework for increasing measurement precision by esti-
mating different sources of error unique to particular testing or measurement conditions.
Generalizability theory is easily extended to complex measurement scenarios where CTT
is inadequate. Throughout this chapter the usefulness of generalizability theory is illus-
trated through examples.
The next section presents the ways that generalizability theory extends CTT and
introduces types of score-based decisions that are available when using generalizability
theory and the two types of studies involved: generalizability (G) and decision (D) studies.
Generalizability theory extends CTT in four ways. First, the procedure estimates the size
of each source of error attributable to a specific measurement facet in a single analysis.
By identifying specific error sources, the reliability or dependability of measurement can
be optimized using this information (e.g., score reliability in generalizability theory
is labeled a G coefficient). Second, generalizability theory estimates the variance com-
ponents that quantify the magnitude of error from each source. Third, generalizability
theory provides a framework for deriving relative and absolute decisions. Relative deci-
sions include comparing one person’s score or performance with others (e.g., as in ability
and achievement testing). Absolute decisions focus on an individual’s level of perfor-
mance regardless of the performance of his or her peers. For example, absolute decisions
implement a standard (i.e., a cutoff score) for classifying mastery and nonmastery, as in
certification and licensure examinations or achievement testing where a particular level
of mastery is required prior to progressing to a more challenging level. Fourth, generaliz-
ability theory includes a two-part analytic strategy; G-studies and D-studies. The purpose
of conducting a G-study is to plan a D-study that will have adequate generalizability to the
universe of interest. To this end, all of the relevant sources of measurement error are identi-
fied in a G-study. Using this information, a D-study is designed in a way that maximizes
the quality and efficiency of measurement and will accurately generalize to the target
universe. Finally, G-studies and D-studies (1) feature either nested or crossed designs and (2) may include random or fixed facets of measurement, or both, within a single analysis. This chapter focuses primarily on crossed designs illustrated with examples.
Additionally, descriptions of random and fixed facets are provided with examples of when
each is appropriate.
8.6 General Steps in Conducting a Generalizability Theory Analysis
The following general steps can be used to plan and conduct generalizability (G) and
decision (D) studies.
1. Decide on the goals of the analysis, including score-based decisions that are to be
made (e.g., relative or absolute) if applicable.
2. Determine the universe of admissible observations.
3. Select the G-study design that will provide the observed score variance compo-
nent estimates to generalize to a D-study.
4. Decide on random and fixed facets or conditions of measurement relative to the
goal(s) of the D-study.
5. Collect the data and conduct the G-study analysis using ANOVA.
6. Calculate the variance components and the generalizability (G) coefficient for
the G-study.
7. Calculate the proportion of variance for each facet (measurement condition) to
provide a measure of effect.
8. If applicable (e.g., for relative or absolute decisions) calculate the standard error
of measurement (SEM) for the G-study that can be used to derive confidence
intervals for scores in a D-study.
Recall that the fundamental unit of analysis in generalizability theory is the variance com-
ponent. The general linear equation (Equation 8.1; Brennan, 2010; Crocker & Algina, 1986,
p. 162) can be used to estimate the variance components for a generalizability theory analy-
sis. Notice that Equations 8.1 and 8.2 constitute a linear, additive model. This is convenient
because, with a linear, additive model, the individual sources of variation from persons, items, and raters can be summed to create a measure of total variation. To understand the components that the symbols in Equations 8.1 and 8.2 represent, we turn to Tables 8.1 and 8.2. Table
8.1 illustrates how deviation scores and the variance are derived for a single variable. Table 8.2
provides the item responses and selected summary statistics for our example for 20 persons
responding to the 10-item short-term memory test 2 of auditory memory in the GfGc data.
Next, the variance components must be estimated, and therefore we need the devia-
tion scores for persons from the grand mean (Equation 8.2; Brennan, 2010; Crocker &
Algina, 1986, p. 162).
Using Equation 8.2, we can obtain an effect for persons and items and a residual (error
component) that captures the error of measurement (random and systematic combined).
Next, we review how the variance is derived to aid an understanding of variance components.
The variance of a set of scores (see Chapter 2 for a review) is obtained by (1) deriving
the mean for a set or distribution of scores, (2) calculating the deviation of each person’s
score from the mean (i.e., deriving deviation scores), (3) squaring the deviation scores,
and (4) computing the mean of the squared deviations. Table 8.1 illustrates the sequen-
tial parts for estimating the variance using the total (sum) score on short-term memory
test 2 for 20 randomly selected persons representing the universe of persons on short-term
memory test 2. The sample of 20 persons is considered exchangeable with any other ran-
domly drawn sample of size 20 from this universe of scores. Therefore, the person facet
is considered random. The item facet in design 1 (illustrated in the next section) is fixed
(i.e., we are only interested in how this particular set of items functions with our random
sample of persons). An important point to note here is that both persons and items could be
random if we were also interested in generalizing to a larger set of items from a possible uni-
verse of items measuring short-term memory.
Table 8.1. Calculation of the Variance for Sample Data in Table 8.2
Score (X)   Mean (μ)   Deviation (X − μ)   Squared deviation
3 13.35 –10.35 107.1225
5 13.35 –8.35 69.7225
5 13.35 –8.35 69.7225
9 13.35 –4.35 18.9225
9 13.35 –4.35 18.9225
11 13.35 –2.35 5.5225
11 13.35 –2.35 5.5225
12 13.35 –1.35 1.8225
12 13.35 –1.35 1.8225
13 13.35 –0.35 0.1225
13 13.35 –0.35 0.1225
14 13.35 0.65 0.4225
14 13.35 0.65 0.4225
16 13.35 2.65 7.0225
16 13.35 2.65 7.0225
17 13.35 3.65 13.3225
20 13.35 6.65 44.2225
22 13.35 8.65 74.8225
22 13.35 8.65 74.8225
23 13.35 9.65 93.1225
Sum (ΣX) = 267          SS = Σ(X − X̄)² = 614.55
Grand mean (X̄; μ) = 13.35          Variance = σ² = 30.72
Standard deviation = σ = 5.54
Note. The denominator for the variance of this random sample is based on n = 20, not n − 1 = 19. The symbol s² denotes a sample variance; the symbol σ² denotes the population variance and is used throughout this chapter to represent the variance. Similarly, s is the standard deviation for a sample, and σ is the standard deviation for the population.
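As a quick check on Table 8.1, the minimal Python sketch below reproduces the same steps: deviation scores from the mean, the sum of squared deviations, and the variance with n (not n − 1) in the denominator. The sum scores are transcribed from the table; the snippet is illustrative only (the unrounded variance is 30.7275, which Table 8.1 reports as 30.72).

# Sum scores for the 20 persons in Table 8.1 (short-term memory test 2)
scores = [3, 5, 5, 9, 9, 11, 11, 12, 12, 13, 13, 14, 14, 16, 16, 17, 20, 22, 22, 23]

n = len(scores)
mean = sum(scores) / n                                 # grand mean = 13.35
squared_deviations = [(x - mean) ** 2 for x in scores]
ss = sum(squared_deviations)                           # sum of squared deviations = 614.55
variance = ss / n                                      # n in the denominator; about 30.73
sd = variance ** 0.5                                   # about 5.54

print(mean, ss, variance, sd)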
With an understanding of how the variance is derived using deviation scores, we are
in a position to estimate the variance components necessary for use in our first example
of generalizability theory analysis. Specifically, we need estimates of the following vari-
ance components based on the data in Table 8.2.
The next section illustrates our first generalizability theory analysis.
Table 8.2. Item Responses and Selected Summary Statistics for 20 Persons on the 10-Item Short-Term Memory Test 2
Person   Item: 1  2  3  4  5  6  7  8  9  10   Person mean   Person variance
1 3 0 0 0 0 0 0 0 0 0 0.3 0.90
2 3 1 1 0 0 0 0 0 0 0 0.5 0.94
3 3 2 0 0 0 0 0 0 0 0 0.5 1.17
4 3 3 3 0 0 0 0 0 0 0 0.9 2.10
5 3 3 3 0 0 0 0 0 0 0 0.9 2.10
6 3 3 2 1 1 0 1 0 0 0 1.1 1.43
7 3 3 2 1 1 0 1 0 0 0 1.1 1.43
8 3 3 3 1 1 0 1 0 0 0 1.2 1.73
9 3 3 3 1 1 0 1 0 0 0 1.2 1.73
10 3 3 3 1 1 1 1 0 0 0 1.3 1.57
11 3 3 3 1 1 1 1 0 0 0 1.3 1.57
12 3 3 2 2 2 0 2 0 0 0 1.4 1.60
13 3 3 2 2 2 0 2 0 0 0 1.4 1.60
14 3 3 3 2 2 1 2 0 0 0 1.6 1.60
15 3 3 3 2 2 1 2 0 0 0 1.6 1.60
16 3 3 3 2 2 2 2 0 0 0 1.7 1.57
17 3 3 3 3 3 2 3 0 0 0 2 2.00
18 3 3 3 3 3 1 3 1 1 1 2.2 1.07
19 3 3 3 3 3 1 3 3 0 0 2.2 1.73
20 3 3 3 3 3 3 3 2 0 0 2.3 1.57
Item mean 3 2.7 2.4 1.4 1.4 0.65 1.4 0.3 0.05 0.05 1.34
Item variance 0 0.64 0.99 1.19 1.19 0.77 1.19 0.64 0.05 0.05
In Design 1, we use short-term memory test 2 from our GfGc data measuring an audi-
tory component of memory. The range of possible raw scores is 0 to 3 points possible
for each item. Table 8.2 provides a random sample of 20 persons from the target uni-
verse of persons; these data will be used to illustrate Design 1, and the person facet is
random. In Design 1 (and in most G-studies), persons’ scores are the object of mea-
surement. In this example and for other designs throughout this chapter, we use the
mean score across the 10 items for the 20 persons as opposed to the sum or total score
mainly for convenience in explaining how a generalizability theory analysis works.
Additionally, using mean scores and the variance is consistent with ANOVA. Design 1
is known as a crossed design because all persons respond to all items. In Design 1, we
assume that the 10 items on the short-term memory test have been developed as one
representative set from a universe of possible items that measures this aspect of memory,
as posited by the general theory of intelligence. The item facet in this example is con-
sidered fixed (i.e., we are only interested in how the 10 items function for our random
sample of persons).
Returning to the person facet, if we are willing to assume that scores in Table 8.2
reflect universe scores accurately, we have a universe score for each person. Since the goal
in generalizability theory is to estimate the universe score for persons, we use persons’
observed score as representative of their universe score (i.e., the expectation of observed
score equals true score). Based on this assumption, our sample of 20 persons is consid-
ered exchangeable with any other random sample from the universe. Ultimately, we want
to know how accurate our score estimates are of the target universe.
To calculate the variance components using the data in Table 8.2, we can use the
mean square estimates from an ANOVA. Before proceeding to the ANOVA, Table 8.3
illustrates how to structure the data for the ANOVA analysis in this example. Using this
information, you should duplicate the results presented here to understand the process
from start to finish. The layout in Table 8.3 is for the first two items only from the data
in Table 8.2. The data layout in Table 8.3 is for a one-facet (p × i) analysis. Note that the
complete dataset for the example analysis will include 200 rows (20 persons × 10 items),
with the appropriate score assigned to each person and item row.
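As a sketch of the restructuring just described, the following Python fragment converts a small wide-format person-by-item matrix into the long layout (one row per person-item combination) used for the ANOVA. The three-person, three-item matrix shown is hypothetical and only illustrates the reshaping; the full example would yield 200 rows.

# Hypothetical fragment of the wide-format data: one row per person, one column per item
wide = [
    [3, 0, 0],  # person 1, items 1-3
    [3, 1, 1],  # person 2, items 1-3
    [3, 2, 0],  # person 3, items 1-3
]

# Long format used for the ANOVA: one row per person-item combination
long_rows = [
    (person + 1, item + 1, score)
    for person, row in enumerate(wide)
    for item, score in enumerate(row)
]

for person, item, score in long_rows:
    print(person, item, score)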
Next, we can conduct ANOVA in SPSS to estimate the variance components using
the SPSS program below.
$SS(i) = n_p \sum_i \bar{X}_i^2 - n_p n_i \bar{X}^2$
Note. Adapted from Brennan (2010, p. 26). Copyright 2010 by Springer. Adapted by permission. p, persons; i, items; pi, persons-by-items interaction; df, degrees of freedom; SS, sum of squared deviations from the mean; MS, mean squared deviation derived as SS divided by degrees of freedom; $\hat{\sigma}^2$, variance component estimate for a particular effect; e, residual for persons and items.
Figure 8.2. Variance components in a one-facet design: $\hat{\sigma}^2_p$ (persons), $\hat{\sigma}^2_i$ (items), and $\hat{\sigma}^2_{pi,e}$ (persons × items interaction plus error). Figure segments are not to scale.
SPSS syntax for estimating variance components using variance components procedure
Note. In the METHOD command, the ANOVA option with the desired sum of squares
(in parentheses) can also be used.
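Because the SPSS syntax itself is not reproduced above, the Python sketch below shows, as a cross-check, what the variance-components estimation does for the crossed p × i design: it computes the ANOVA mean squares from the Table 8.2 responses and converts them to variance components using the usual random-effects expected-mean-square equations. Run as written, it should closely reproduce the estimates used later in the chapter (approximately .285 for persons, 1.16 for items, and .389 for the residual).

# Item responses for the 20 persons x 10 items in Table 8.2
X = [
    [3, 0, 0, 0, 0, 0, 0, 0, 0, 0], [3, 1, 1, 0, 0, 0, 0, 0, 0, 0],
    [3, 2, 0, 0, 0, 0, 0, 0, 0, 0], [3, 3, 3, 0, 0, 0, 0, 0, 0, 0],
    [3, 3, 3, 0, 0, 0, 0, 0, 0, 0], [3, 3, 2, 1, 1, 0, 1, 0, 0, 0],
    [3, 3, 2, 1, 1, 0, 1, 0, 0, 0], [3, 3, 3, 1, 1, 0, 1, 0, 0, 0],
    [3, 3, 3, 1, 1, 0, 1, 0, 0, 0], [3, 3, 3, 1, 1, 1, 1, 0, 0, 0],
    [3, 3, 3, 1, 1, 1, 1, 0, 0, 0], [3, 3, 2, 2, 2, 0, 2, 0, 0, 0],
    [3, 3, 2, 2, 2, 0, 2, 0, 0, 0], [3, 3, 3, 2, 2, 1, 2, 0, 0, 0],
    [3, 3, 3, 2, 2, 1, 2, 0, 0, 0], [3, 3, 3, 2, 2, 2, 2, 0, 0, 0],
    [3, 3, 3, 3, 3, 2, 3, 0, 0, 0], [3, 3, 3, 3, 3, 1, 3, 1, 1, 1],
    [3, 3, 3, 3, 3, 1, 3, 3, 0, 0], [3, 3, 3, 3, 3, 3, 3, 2, 0, 0],
]
n_p, n_i = len(X), len(X[0])

grand = sum(sum(row) for row in X) / (n_p * n_i)
person_means = [sum(row) / n_i for row in X]
item_means = [sum(X[p][i] for p in range(n_p)) / n_p for i in range(n_i)]

# Sums of squares for the crossed persons x items design
ss_p = n_i * sum((m - grand) ** 2 for m in person_means)
ss_i = n_p * sum((m - grand) ** 2 for m in item_means)
ss_total = sum((X[p][i] - grand) ** 2 for p in range(n_p) for i in range(n_i))
ss_res = ss_total - ss_p - ss_i

# Mean squares
ms_p = ss_p / (n_p - 1)
ms_i = ss_i / (n_i - 1)
ms_res = ss_res / ((n_p - 1) * (n_i - 1))

# Variance components from the expected mean square equations
var_res = ms_res                      # persons x items interaction confounded with error
var_p = (ms_p - ms_res) / n_i         # persons
var_i = (ms_i - ms_res) / n_p         # items

print(round(var_p, 3), round(var_i, 3), round(var_res, 3))   # about .285, 1.161, .389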
Using Equation 8.5, we can derive the proportion of variance for the person effect. The
proportion of variance provides information about how much each facet explains in the
analysis. Using the proportion of variance is advantageous because it is a measure of
effect size expressed in a unit that is comparable across studies (or different designs).
Using the estimates from Table 8.4 or 8.6, we can derive the proportion of variance values
as follows.
We see from Equation 8.5 that the person effect accounts for approximately 16%
of the variability in memory scores. In our example the sample size is only 20 persons
(very small); the person variability may be much larger with an increased, more realistic
sample size.
$\dfrac{\hat{\sigma}^2_P}{\hat{\sigma}^2_P + \hat{\sigma}^2_I + \hat{\sigma}^2_{RES}} = \dfrac{.285}{.285 + 1.16 + .389} = \dfrac{.285}{1.83} = .16$
Next, in Equation 8.6 we calculate the proportion of variance for the item effect.
We see from Equation 8.6 that the item effect accounts for approximately 63% of the variability in memory scores; that is, the item effect is relatively large, and the items vary substantially in their level of difficulty. Next, we derive the residual variance in Equation 8.7.
The residual variance component is about one-third the size (21%) relative to the
item variance component (63%). Also, the variance component for persons (16%) is
small relative to the item variance component. The large variance component for items
indicates that the items do not discriminate equally and are therefore of unequal difficulty
across persons. In Table 8.2, we see that the range of the item means is .05 to 2.7 (range = 2.65). Also, we see that the range of the person means is .3 to 2.3 (range =
2.0), smaller than the range for items. This partially explains why the item variance
component is larger than the person variance component.
The final statistic that is calculated in a generalizability theory analysis is the coefficient of generalizability (i.e., the G coefficient); under certain conditions, the G coefficient is equivalent to coefficient alpha estimated under CTT.
Equation 8.6 (the proportion of variance for the item effect, referenced above):
$\dfrac{\hat{\sigma}^2_I}{\hat{\sigma}^2_P + \hat{\sigma}^2_I + \hat{\sigma}^2_{RES}} = \dfrac{1.16}{.285 + 1.16 + .389} = \dfrac{1.16}{1.83} = .63$
In CTT, under the assumption of strictly parallel tests, recall that item means and vari-
ances are equal for two parallel tests. In the language of generalizability theory, the result
of this assumption means that the item effect or variance component is zero. Because the
item effect is zero under the strictly parallel assumptions of CTT, the analysis resolves to the
individual differences among persons. Finally, since items are considered to be of equal
difficulty, the error component (.389) in the right-hand side of the denominator of Equation 8.8 is divided by the number of items (i.e., averaged over the items).
In Equation 8.8 we see that, using the variance components estimated from the variance components procedure but dividing the error by the number of items, we arrive at the same result that you would obtain by calculating the coefficient alpha (α) reliability estimate for this 10-item dataset (i.e., α = .88; you can verify this result for yourself in SPSS).
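The arithmetic of Equations 8.5 through 8.8 can be condensed into a few lines. The sketch below simply reuses the variance component estimates reported above and, as noted, yields a G coefficient matching coefficient alpha (.88) for the 20-person by 10-item data; it is an illustration, not the chapter's SPSS procedure.

# Variance component estimates reported for the one-facet (p x i) design
var_p, var_i, var_res = 0.285, 1.16, 0.389
n_items = 10

total = var_p + var_i + var_res
print(round(var_p / total, 2), round(var_i / total, 2), round(var_res / total, 2))  # .16, .63, .21

# G coefficient: the residual error is averaged over the items (Equation 8.8),
# which parallels coefficient alpha under CTT assumptions
g_coefficient = var_p / (var_p + var_res / n_items)
print(round(g_coefficient, 2))   # .88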
Next, we turn to a different design where the condition of measurement is ratings of
performance. For example, in Design 2 observers rate the performance of persons on test
items where performance can be rated on a scale based on gradations of quality.
In Design 2, the example research design again involves conducting a G-study and using the
results to plan a D-study. Design 2 is highly versatile because we can use the variance compo-
nents estimated from ANOVA to plan a variety of D-study scenarios where ratings are used.
In a D-study, raters are different from those used in the G-study. So, the question is: “How
generalizable are the results from our G-study with respect to planning our D-study?” Our
example for design 2 is based on ratings of person performance on an item from subtest 2 on
the auditory component of short-term memory. For clarity and ease of explanation, we use
a single item to illustrate the analysis. In Table 8.7 there are three different observers (raters)
providing ratings on each of the 20 persons for item number one. Notice that this is a crossed
design because all persons are rated by all raters. The ratings are based on a 10-point scale
with 1 = low to 10 = high. The ANOVA is used to estimate the necessary statistics for estimat-
ing the variance components in the G- and D-studies.
The variance components we need from the data in Table 8.7 for this analysis are those for persons ($\hat{\sigma}^2_P$), raters ($\hat{\sigma}^2_R$), and the residual error ($\hat{\sigma}^2_E$).
The following SPSS program provides the mean square statistics we need for calculat-
ing the variance components. The technique employed is a two-factor repeated measures
ANOVA model with a within-rater factor (because the repeated measures are based on the
ratings on 1 item for 20 persons from three different raters) and a between-subjects factor
(persons). For example, each person signifies one level of the person factor and allows us
to estimate the between-persons effect for the ratings. Each rater represents one level of
the rater factor, with each combination of rater and person contained within a single cell of
Table 8.7. Design 2 Data: Single Facet with 20 Persons and Three Raters
Person   Item   Rater 1   Rater 2   Rater 3   Person mean (X̄pI)
1 1 2 3 2 2.33
2 1 7 5 7 6.33
3 1 3 3 2 2.67
4 1 4 2 6 4.00
5 1 4 3 5 4.00
6 1 5 4 7 5.33
7 1 7 2 6 5.00
8 1 8 2 3 4.33
9 1 8 4 2 4.67
the data matrix. Given this ANOVA design, there is only one score for each rater–person
combination. Tables 8.8a–8.8b provide the results of the SPSS analysis.
SPSS program for repeated measures ANOVA for the person x rater design
Table 8.8a. Repeated Measures ANOVA Output for the Person × Rater Design
Tests of Within-Subjects Effects
Measure: MEASURE_1
Source   Type III Sum of Squares   df   Mean Square   F   Sig.
Table 8.8b. Repeated Measures ANOVA Output for the Person × Rater Design
Tests of Between-Subjects Effects
Measure: MEASURE_1
Transformed Variable: Average
Source Type III Sum of Squares df Mean Square F Sig.
Intercept 1118.017 1 1118.017 . .
persons 95.650 19 5.034 . .
Error .000 0 .
Next, the variance components are calculated using mean squares from the ANOVA
results. The variance component estimate for persons is provided in Equation 8.9, and
the estimate for raters is provided in Equation 8.10.
The variance component estimate for error (the residual) is provided in Equation 8.11.
$\hat{\sigma}^2_{RATERS} = \dfrac{MS_{RATERS} - MS_{RESIDUAL}}{N_P} = \dfrac{MS_{RATERS} - MS_{RESIDUAL}}{20} = .104$   (Equation 8.10)
$\hat{\sigma}^2_E = MS_{RESIDUAL} = .53$   (Equation 8.11)
To illustrate how the generalizability coefficient obtained in our G-study can be used within a D-study, let's assume that the raters used in our G-study are representative of the raters in the universe of generalization. Under this assumption, our best estimate is the average observed score variance for all the raters in the universe. The average score variance is captured in the sum $\sigma^2_P + \sigma^2_E$. Because we are willing to assume that our raters
are representative of the universe of raters we can estimate the coefficient of generaliz-
ability in Equation 8.12 from our sample data. An important point here is that raters are not
usually randomly sampled from all possible raters in the universe of generalization, leading
to one difficulty with this design.
The value of .90 indicates that the raters are highly reliable in their ratings. Using
this information, we can plan a D-study in a way that ensures that rater reliability will
be adequate by changing the number of raters. For example, if the number of raters is
reduced to two in the D-study, the variance component for persons changes to 2.25.
Using the new variance component for persons in Equation 8.13 yields a generalizability
coefficient of .81 (which is still acceptably high).
Next, we turn to the proportion of variance as illustrated in Equation 8.14 as a way
to understand the magnitude of the effects.
In G theory studies, the proportion of variance provides a measure of effect size
that is comparable across studies. The proportion of variance is reported for each
facet in a study. For example, the proportion of variance for persons is provided in
Equation 8.14.
Equation 8.14 shows that the person effect accounts for approximately 61% of the
variability in rating scores among persons. Next, in Equation 8.15 we calculate the pro-
portion of variance for the rater effect.
We see from Equation 8.15 that the rater effect accounts for approximately 32% of
the variability in memory score performance ratings. From this information we conclude
$\hat{\rho}^2_{RATERS^*} = \dfrac{\hat{\sigma}^2_P}{\hat{\sigma}^2_P + \hat{\sigma}^2_E} = \dfrac{5.03}{5.03 + .53} = \dfrac{5.03}{5.56} = .90$
Note. The asterisk (*) signifies that the G coefficient can be used
for a D-study with persons crossed with raters (i.e., the measure-
ment conditions). Notation is from Crocker and Algina (1986,
p. 167).
$\hat{\rho}^2_{RATERS^*} = \dfrac{\hat{\sigma}^2_P}{\hat{\sigma}^2_P + \hat{\sigma}^2_E} = \dfrac{2.25}{2.25 + .53} = \dfrac{2.25}{2.78} = .81$
Note. The asterisk (*) signifies that the G coefficient can be used
for a D-study with persons crossed with the average number of raters
(i.e., the measurement conditions).
$\dfrac{\hat{\sigma}^2_P}{\hat{\sigma}^2_P + \hat{\sigma}^2_R + \hat{\sigma}^2_{RESIDUAL}} = \dfrac{5.03}{5.03 + 2.62 + .53} = \dfrac{5.03}{8.18} = .61$
$\dfrac{\hat{\sigma}^2_R}{\hat{\sigma}^2_P + \hat{\sigma}^2_R + \hat{\sigma}^2_{RESIDUAL}} = \dfrac{2.62}{5.03 + 2.62 + .53} = \dfrac{2.62}{8.18} = .32$
that the rater effect is relatively small compared with the person effect (i.e., raters account for a smaller share of the variability in the ratings). Another way of interpreting this finding is that the raters are relatively similar, or consistent, in their ratings.
In Design 3, we cover a G-study where the ratings are averaged, a strategy used to reduce
the error variance in the measurement condition. We can average over raters because
the same observers are conducting the ratings on each occasion for persons (i.e., raters are
not different for persons). Averaging over raters involves dividing the appropriate error
component by the number of raters. For example, in Equation 8.16 the error variance
$\hat{\rho}^2_{RATERS^*} = \dfrac{\hat{\sigma}^2_P}{\hat{\sigma}^2_P + \hat{\sigma}^2_e / N'_{RATERS}} = \dfrac{5.03}{5.03 + .53/3} = \dfrac{5.03}{5.03 + .17} = .96$
Note. The asterisk (*) signifies that the G coefficient can be used
for a D-study with persons crossed with the average number of
raters (i.e., the measurement conditions). Capital notation for
RATERS signifies that the error variance is divided by 3, the number
of raters in a D-study. The symbol N′RATERS signifies the number of
ratings to form the average. Notation is from Crocker and Algina
(1986, p. 167).
component is divided by 3 (i.e., .53/3). In our example data, the change realized in the G
coefficient by averaging over raters is from .90 to .96 (Equation 8.16).
There is a substantial increase in the G coefficient (i.e., from .90 in Design 2 to .96
in Design 3), telling us that when it is reasonable to do so, averaging over raters is an
excellent strategy.
In Design 3, we illustrated the situation in which each person is rated by the same raters
on multiple occasions. In Design 4, each person has three ratings (on three occasions),
but each person is rated by a different rater. For example, this may occur in the event that
a large pool of raters is available for use in a G-study. In this scenario, raters are nested
within persons. Symbolically, this nesting effect is expressed as r : p or r(p). In this design,
differences among persons are influenced by (1) rater differences plus (2) universe score
differences for persons and (3) error variance. To capture this variance, the observed
score variance for this design is $\sigma^2_P + \sigma^2_{RATERS} + \sigma^2_E$, where the variance component symbols are the same as in Design 2. Using the same mean square information in Equations 8.9, 8.10, and 8.11, we find that the G coefficient for Design 4 is provided in Equation 8.17.
We see that there is substantial reduction in the G coefficient from .90 (Design 2) or
.96 (Design 3) to .70 (Design 4). Knowing this information about the reduction of the
G coefficient to an unacceptable level, we can plan accordingly by using Design 2 or 3
rather than Design 4.
$\hat{\rho}^2_{RATERS} = \dfrac{\hat{\sigma}^2_P}{\hat{\sigma}^2_P + \hat{\sigma}^2_{RATERS} + \hat{\sigma}^2_{RESIDUAL}} = \dfrac{1.5}{1.5 + .104 + .53} = \dfrac{1.5}{2.13} = .70$
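The G coefficients for Designs 2 through 4 follow directly from the variance estimates used in Equations 8.12, 8.16, and 8.17, and the short sketch below simply re-applies that arithmetic. The values are taken from the chapter's equations rather than re-derived from the rating data, and small differences from the reported coefficients reflect rounding.

# Values as reported in Equations 8.12, 8.16, and 8.17
var_p_crossed = 5.03      # person variance used in the crossed person x rater design
var_error = 0.53          # residual (error) variance
var_p_nested = 1.5        # person variance component used in the nested design
var_raters = 0.104        # rater variance component
n_raters = 3

g_design2 = var_p_crossed / (var_p_crossed + var_error)                  # single rating
g_design3 = var_p_crossed / (var_p_crossed + var_error / n_raters)       # average of three ratings
g_design4 = var_p_nested / (var_p_nested + var_raters + var_error)       # raters nested within persons

print(round(g_design2, 2), round(g_design3, 2), round(g_design4, 2))
# about .90, .97, and .70 (the text reports .96 for Design 3; the small difference is rounding)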
In Design 4, the scenario was illustrated where different raters rate each person and
each person is rated on two occasions. Our strategy in Design 5 with multiple raters and
occasions of measurement is to average over ratings. The G coefficient for Design 5 is
provided in Equation 8.18.
Table 8.9 summarizes the formulas for the four G coefficients based on the designs
covered to this point (excluding Design 5, which is a modification of Design 4).
$\hat{\rho}^2 = \dfrac{\hat{\sigma}^2_P}{\hat{\sigma}^2_P + \dfrac{\hat{\sigma}^2_{RATERS} + \hat{\sigma}^2_{ERROR}}{N'_{RATERS}}} = \dfrac{5.03}{5.03 + \dfrac{.104 + .53}{3}}$
Note. The word RATERS in capital letters signifies that the mea-
surement condition, ratings, are averaged over raters. The sym-
bol N′RATERS signifies the number of ratings to form the average.
Notation is from Crocker and Algina (1986, p. 167).
Table 8.9. Observed Score Variance and G Coefficients for Four Single-Facet Designs
1. p × i (crossed), 1 condition in the D-study; observed score variance $\sigma^2_P + \sigma^2_E$; $\rho^2_{I^*} = \dfrac{\sigma^2_P}{\sigma^2_P + \sigma^2_E}$
2. p × i (crossed), $n'_i$ conditions in the D-study; observed score variance $\sigma^2_P + \sigma^2_E/N'_I$; $\rho^2_{\bar{I}^*} = \dfrac{\sigma^2_P}{\sigma^2_P + \sigma^2_E/N'_I}$
3. i : p (nested), 1 condition in the D-study; observed score variance $\sigma^2_P + \sigma^2_I + \sigma^2_E$; $\rho^2_{I^*} = \dfrac{\sigma^2_P}{\sigma^2_P + \sigma^2_I + \sigma^2_E}$
4. i : p (nested), $n'_i$ conditions in the D-study; observed score variance $\sigma^2_P + (\sigma^2_I + \sigma^2_E)/N'_I$; $\rho^2_{\bar{I}^*} = \dfrac{\sigma^2_P}{\sigma^2_P + (\sigma^2_I + \sigma^2_E)/N'_I}$
Note. Adapted from Crocker and Algina (2006). Copyright 2006 by South-Western, a part of Cengage Learning, Inc. Adapted by permission. www.cengage.com/permissions. Crossed, all persons respond to all questions or are rated by all raters; nested, the condition of measurement is nested within persons (e.g., the condition may be the number of raters or occasions of ratings); $n_i$, the number of raters (or test items) in a G-study; $n'_i$, the number of raters in a D-study; $\bar{I}$, the score is an average over the raters. Note that the only difference between $\rho^2_{I^*}$ and $\rho^2_{\bar{I}^*}$ is that in $\rho^2_{\bar{I}^*}$ the error $\sigma^2_E$ is divided by the number of raters in the D-study.
In D-studies, the standard error of measurement (SEM) is used in a similar way as was pre-
sented in CTT. Recall that the SEM provides a single summary of measurement error with
which we can construct confidence intervals around observed scores. Recall also that
a person's true score (Tpi) on an item (or rating) is the expectation of the person's observed score (Xpi), and that this holds for every person in the sample. Finally, the error score for a person is Epi. Because the observed score (Xpi) serves as the estimate of the true score (Tpi), the confidence interval is interpreted with respect to a person's true score.
Symbolically, the confidence interval for a person’s score is Xpi ± (SEM). Using this nota-
tion, we can create a confidence interval for any observed score in a D-study. To construct
a confidence interval, we need the error variance for the design being used in a D-study.
For example, in Design 1 where persons and test items were crossed, the residual or error
variance was .389. To return to standard deviation units, we take the square root of the
variance yielding s = .623.
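As a brief illustration, the sketch below converts the Design 1 residual variance into a standard error of measurement and constructs a 95% confidence interval (z = 1.96) around one observed item-mean score. The choice of person 16's mean of 1.7 from Table 8.2 is arbitrary and serves only to show the mechanics.

import math

error_variance = 0.389                  # residual variance from the Design 1 G-study
sem = math.sqrt(error_variance)         # about .62, in the metric of item-mean scores

observed_mean = 1.7                     # person 16's mean score in Table 8.2 (illustration only)
lower = observed_mean - 1.96 * sem
upper = observed_mean + 1.96 * sem
print(round(sem, 2), round(lower, 2), round(upper, 2))   # 0.62, 0.48, 2.92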
This chapter concludes with an example of a two-facet design. Many measurement prob-
lems involve more complex scenarios than were presented in the previous section on
single-facet designs. To address increased measurement and/or design complexity, we can
use a two-facet G-study to estimate the necessary variance components. Two examples
are provided to illustrate two-facet G theory designs. In our first example, we use five per-
sons from the GfGc data to illustrate how to apply a two-facet G-study. Specifically, our
focus is on short-term memory as the broad construct of interest. In our first example,
short-term memory consists of the subtests auditory, visual, and working memory. Next,
ratings by three observers on auditory, visual, and working memory serve as our out-
come measures of interest. Ratings signify the quality (expressed as accuracy) of response
and are based on a 1–10 scale with 1 = low level of short-term memory and 10 = a high
level of short-term memory on each of the items (1–3). In this situation we have two
facets of measurement: an item (or in this case a test) facet and an observer (rater) facet.
In this example, persons are the object of measurement and are included as a random
effect. The design is crossed because all five persons are rated by all three observers on
the three memory subtests. The primary research question of interest for this analysis is
whether the persons elicited different mean ratings averaged across subtests and raters. In
ANOVA, the main effect for persons reflects differences among persons’ averages.
Table 8.10 illustrates the design structure and (1) the person means, (2) rater means,
and (3) grand mean for persons for this two-facet example.
The corresponding data file layout for an SPSS ANOVA analysis is illustrated in Table
8.11.
Next, the SPSS syntax is provided that yields the ANOVA results necessary for deriving
mean squares for estimating the generalizability coefficient for the two-facet generalizabil-
ity theory analysis. Table 8.12 provides the results of the ANOVA for the two-facet design.
/EMMEANS=TABLES(rater)
/EMMEANS=TABLES(person*item)
/EMMEANS=TABLES(person*rater)
/EMMEANS=TABLES(item*rater)
/EMMEANS=TABLES(person*item*rater)
/PRINT=DESCRIPTIVE
/CRITERIA=ALPHA(.05)
/DESIGN=person item rater item*person person*rater item*rater.
Table 8.13 provides the main effects and two-way interactions for our two-facet design.
Next, the equations for calculating variance components are presented in Tables
8.14 and 8.15.
Table 8.16 provides an interpretation of the results in Tables 8.13 and 8.15.
Table 8.15. Variance Component Estimates in the Person × Rater × Item Model
Effect: Variance component (proportion of total variance)
Person: $\sigma^2_P = (190.47 - 1.88 - 1.82 + 1.62)/(3 \times 3) = 20.93$ (.890)
Subtest: $\sigma^2_I = (7.4 - 1.82 - 2.2 + 1.62)/(5 \times 3) = .33$ (.014)
Rater: $\sigma^2_R = (7.8 - 1.88 - 2.2 + 1.62)/(5 \times 3) = .36$ (.015)
Person × subtest: $\sigma^2_{PI} = (1.82 - 1.62)/3 = .06$ (.002)
Person × rater: $\sigma^2_{PR} = (1.88 - 1.62)/3 = .09$ (.004)
Subtest × rater: $\sigma^2_{RI} = (2.2 - 1.62)/5 = .12$ (.005)
Residual: $\sigma^2_{RES} = 1.62$ (.068)
Total: 23.51 (1.00)
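The components in Table 8.15 can be verified directly from the mean squares the table uses (assumed here to come from the ANOVA in Table 8.12). The Python sketch below applies the same expected-mean-square equations shown in the table; apart from rounding, it reproduces the component estimates and the proportion-of-variance column.

# Mean squares used in Table 8.15 (assumed to come from the ANOVA in Table 8.12)
ms_person, ms_subtest, ms_rater = 190.47, 7.4, 7.8
ms_person_subtest, ms_person_rater, ms_subtest_rater = 1.82, 1.88, 2.2
ms_residual = 1.62
n_persons, n_subtests, n_raters = 5, 3, 3

components = {
    "person": (ms_person - ms_person_subtest - ms_person_rater + ms_residual) / (n_subtests * n_raters),
    "subtest": (ms_subtest - ms_person_subtest - ms_subtest_rater + ms_residual) / (n_persons * n_raters),
    "rater": (ms_rater - ms_person_rater - ms_subtest_rater + ms_residual) / (n_persons * n_subtests),
    "person x subtest": (ms_person_subtest - ms_residual) / n_raters,
    "person x rater": (ms_person_rater - ms_residual) / n_subtests,
    "subtest x rater": (ms_subtest_rater - ms_residual) / n_persons,
    "residual": ms_residual,
}

total = sum(components.values())
for effect, value in components.items():
    print(f"{effect}: {value:.2f} ({value / total:.3f} of total variance)")
print(f"total: {total:.2f}")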
Table 8.16. Interpretation of the Effects in the Two-Facet Design
Item. Items (subtests) were awarded different mean ratings averaged across persons and raters. Example: Item 2 has a higher average rating than item 1.
Rater. Raters provided different mean ratings averaged across persons and items. Example: Rater 3 provides higher average ratings than rater 2.
Person × item. Persons were ranked differently across items relative to their ratings averaged across raters. Example: On subtest 1, person X was rated higher than person Y, but on subtest 2, person X was rated lower than person Y.
Person × rater. Persons were ranked differently across raters relative to their ratings averaged across items. Example: Rater 1 rates person X higher than person Y, but rater 2 rates person X lower than person Y.
Item × rater. Items (subtests) were ranked differently by raters relative to the ratings averaged across persons. Example: Rater 1 rates item 1 (auditory memory) higher than item 2 (visual memory), but rater 2 rates item 1 (auditory memory) lower than item 2 (visual memory).
This chapter presented generalizability theory—a statistical theory about the depend-
ability of measurements useful for studying a variety of complex measurement problems.
In this chapter, the logic underlying generalizability was introduced followed by practi-
cal application of the technique under single facet and two-facet measurement designs.
Generalizability theory was discussed as providing a way to extend and improve upon
the classical test theory model for situations where measurement is affected by multiple
facets or conditions. Reliability of scores according to the generalizability theory was
discussed in relation to the CTT model, and the advantages of estimating score reliability
in generalizability theory were highlighted. Finally, emphasis was placed on the advan-
tages generalizability theory provides for examining single and multifaceted measure-
ment problems.
Fixed facet of measurement. Interest lies in the variance components of specific char-
acteristics of a particular facet (i.e., we will not generalize beyond the characteristics
of the facet).
G-study. A generalizability study with the purpose of planning, then conducting a D-study
that will have adequate generalizability to the universe of interest.
Generalizability coefficient. Synonymous with the estimate of the reliability coefficient
alpha (a) in CTT under certain measurement circumstances.
Generalizability theory. A highly flexible technique for studying measurement error that estimates the degree to which a particular set of measurements on an examinee generalizes to a more extensive set of measurements.
Item facet. Generalization from a set of items, defined under similar conditions of measurement, to a universe of items.
Measurement precision. How close scores are to one another and the degree of measurement error on parallel tests.
Nested design. A design in which the conditions of one facet appear only within particular levels of another facet (e.g., when each person is rated by a different set of raters, raters are nested within persons).
Object of measurement. The focus of measurement; usually persons but may also be
items.
Occasion facet. A generalization from one occasion to another from a universe of occa-
sions (e.g., days, weeks, or months).
Partially nested. Different raters rate different persons on two separate occasions.
9
Factor Analysis
This chapter introduces factor analysis as a technique for reducing multiple themes embed-
ded in tests to a simpler structure. An overview of the concepts and process of conducting
a factor analysis is provided as it relates to the conceptual definitions underlying a set of
measured variables. Additionally, interpretation of the results of a factor analysis is included
with examples. The chapter concludes by presenting common errors to avoid when con-
ducting factor analysis.
9.1 Introduction
where the basic principles of the FA approach to variable reduction is useful owing to the
nature of the complex correlational structure of psychological attributes and/or constructs.
This chapter presents an overview of the process of conducting FA and the mechanics of FA as
it relates to the conceptual definitions underlying a set of measured variables (Fabrigar &
Wegner, 2012, p. 144). The presentation here therefore focuses on the factor-analytic tradi-
tion to variable reduction targeting simple structure based on the common factor model.
Recall that throughout this book we have used data representing part of the gen-
eral theory of intelligence represented by the constructs crystallized intelligence, fluid
intelligence, and short-term memory. In our examples, we use score data on 10 subtests
acquired from a sample of 1,000 examinees. Chapters 3 and 4 introduced the issue of
score accuracy. For example, do examinee scores on tests really represent what they are
intended to represent? Establishing evidence that scores on subtests display patterns of
association in a way that aligns with a working hypothesis or existing theory is part of the
test or instrument validation process. The degree to which the subtests cluster in patterns
that align with a working hypothesis or theory provides one form of evidence that the
subtests actually reflect the constructs as they exist relative to a theoretical framework.
Therefore, an important question related to the validation of the general theory of intel-
ligence in our examples is whether scores on the items and subtests comprising each
theoretical construct reflect similar patterns. The number and composition of the subtest
clusters are determined by the correlations among all pairs of subtests.
To provide a conceptual overview of FA, we return to the GfGc data used through-
out this book. The relationships among the 10 subtests are summarized according to
their intercorrelations (e.g., in Table 9.2). Note that in this chapter, to help present concepts
involved in conducting factor analysis, we use subtest total score data rather than item-level
data. Alternatively, FA can also be conducted at the level of individual items comprising
a test. With regard to the correlation matrix in Table 9.2, although we see basic infor-
mation about the relationships among the subtests, it is difficult to identify a discernible
pattern of correlations. Using the correlation matrix as a starting point, FA provides a
way for us to identify order or relational structures among the correlations. In identifying
relational structures among our 10 subtests, FA can be used in an exploratory mode. For
example, exploratory factor analysis (EFA) is used in the early stages of test or instru-
ment development, and confirmatory factor analysis (CFA) is used to test or confirm
an existing theory on the basis of the tests.
We begin the chapter with a conceptual overview and brief history of FA. Next, an
example FA is presented using the GfGc data, with an emphasis on basic concepts. The
presentation aims to facilitate an understanding of FA by considering associated research
questions. Core questions common to correctly conducting and interpreting a factor ana-
lytic study (adapted from Crocker & Algina, 1986, p. 287) include:
1. What role does the pattern of intercorrelations among the variables or subtests
play in identifying the number of factors?
2. What are the general steps in conducting a factor-analytic study?
3. How are factors estimated?
4. How are factor loadings interpreted?
FA was created by Charles Spearman in 1904, related to his work on formulating a theory
of general intelligence (McArdle, 2007, p. 99). Spearman observed that variables from a
carefully specified domain (e.g., intelligence) are often correlated with each other. Since
variables are correlated with one another, they share information about the theory under
investigation. When variables in a domain are correlated, factor analysis is a useful tech-
nique for determining how variables work together in relation to a theory. The primary
goals of FA include (1) exploration and identification of a set of variables in terms of a
smaller number of hypothetical variables called factors, based on patterns of associa-
tion in the data (i.e., EFA; see Cattell, 1971; Mulaik, 1987; Fabrigar & Wegner, 2012);
(2) confirmation that variables fit a particular pattern or cluster to form a certain dimen-
sion according to a theory (i.e., CFA; see McDonald, 1999; Fabrigar & Wegner, 2012;
Brown, 2006); and (3) synthesis of information about the factors and their contribution
as reflected by examinee performance on the observed variables (e.g., scores on tests).
Additionally, when researchers conduct a CFA, they attempt to understand why the
variables are correlated and to determine the degree or level of accuracy the variables and
factors provide relative to a theory. Factor-analytic theory posits that variables (i.e., test
total scores or test items) correlate because they are determined in part from common but
unobserved influences. These common influences are due to common factors, meaning
that variables are correlated to some degree—thus the name common factor model. The
unobserved influences are manifested as a latent factor (or simply a factor) in FA.
Several approaches to FA are possible depending on the goal(s) of the research. The
most common type of FA is the R-type where the focus is on grouping variables (e.g.,
subtests in the GfGc data) into similar clusters that reflect latent constructs. R-Type FA
is used widely in test and scale development, and we use it in this chapter to illustrate
how it works with the GfGc data. Other variations of FA include Q-type (i.e., FA of
persons into clusters with like attributes; Kothari, 2006, p. 336; Thompson, 2000) and
P-type, which focuses on change within a single person or persons captured by repeated
measurements over time (Browne & Zhang, 2007; Molenaar, 2004). The reasons for con-
ducting an R-type factor analysis in the test development process include the following
(Comrey & Lee, 1992, pp. 4–5):
This section illustrates FA using the subtest total scores from the GfGc data. Recall that
four subtests measure crystallized intelligence, three subtests measure fluid intelligence, and
three subtests measure short-term memory. Table 9.1 (introduced in Chapter 1) provides the
details of each of the subtests that comprise the factors or constructs. Figure 9.2 (introduced
in Chapter 1) illustrates the conceptual (i.e., theoretical) factor structure for the GfGc data.
Conducting FA begins with inspection of the correlation matrix of the variables (or
in our example, the subtests) involved. Table 9.2 provides the intercorrelation matrix for
the 10 GfGc subtests used in the examples in this chapter.
Table 9.2 reveals that the correlations within and between the subtests crystallized
intelligence, fluid intelligence, and short-term memory do in fact correlate in a way that
supports conducting a factor analysis. For example, the variations in shading in Table
9.2 show that the clusters of subtests correlate moderately with one another. The excep-
tion to this pattern is in the short-term memory cluster where subtest 10 (inductive and
Figure 9.1. Guidelines for conducting FA: first select the type of factor analysis (exploratory, confirmatory, or structural equation modeling), then select the rotational method (orthogonal: varimax, equimax, quartimax; oblique: oblimin, promax, orthoblique). Adapted from Hair, Anderson, Tatham, and Black (1998, pp. 94, 101). Copyright 1998. Reprinted by permission of Pearson Education, Inc., New York, New York.
deductive reasoning) does not correlate at even a moderate level with graphic orientation
and graphic identification. Additionally, inspection of the unshaded cells in Table 9.2
reveals that the subtests in the theoretical clusters also correlate moderately (with the
exception of subtest 10 on inductive and deductive reasoning) with subtests that are not
part of their theoretical cluster.
At the heart of FA is the relationship between a correlation matrix and a set of factor
loadings. The intercorrelations among the variables and the factors share an intimate
relationship. Although factor(s) are unobservable variables, it is possible to calculate the
correlation between factors and variables (e.g., subtests in our GfGc example). The cor-
relations between the factors and the GfGc subtests are called factor loadings. For example,
consider questions 1–4 originally given in Section 9.1.
1. What role does the pattern of intercorrelations among the variables or subtests
play in identifying the number of factors?
2. What are the general steps in conducting a factor-analytic study?
3. How are factors estimated?
4. How are factor loadings interpreted?
Through these questions, we seek to know (1) how the pattern of correlations among
the variables inform what the factor loadings are, (2) how the loadings are estimated; and
Figure 9.2. General theory of intelligence. The smallest rectangles on the far right represent
items. The next larger rectangles represent subtests that are composed of the sum of the individual
items representing the content of the test. The ovals represent factors also known as latent or unob-
servable constructs posited by intelligence theory.
Table 9.2. Intercorrelations Among the 10 GfGc Subtests (first four subtests shown)
Subtest                                                  1       2       3       4   5   6   7   8   9   10
1. Short-term memory: based on visual cues               1       —       —       —   —   —   —   —   —   —
2. Short-term memory: auditory and visual components     .517**  1       —       —   —   —   —   —   —   —
3. Short-term memory: math reasoning                     .540**  .626**  1       —   —   —   —   —   —   —
4. Gc: measure of vocabulary                             .558**  .363**  .406**  1   —   —   —   —   —   —
(3) how to properly interpret the loadings relative to a theory or other context (e.g., an
external criterion as discussed in Chapter 3 on validity). To answer these questions, we
can examine the relationship between the correlation matrix and factor loadings. Recall
that a factor is an unobserved or a latent variable. A relevant question is, “How is a fac-
tor loading estimated since a factor is unobserved?” An answer to this question is found
in part by using the information given in Chapter 7 on reliability. For example, in fac-
tor analysis, factors or latent variables are idealized as true scores just as true score was
defined in Chapter 7 on reliability. Recall that in Chapter 7 we were able to estimate the
correlation between an unobservable true score and an observed score based on the axi-
oms of the classical test theory (CTT) model. Also recall that the total variance for a set
of test scores can be partitioned into observed, true, and error components. Later in this
chapter the common factor model is introduced, and parallels are drawn with the clas-
sical true score model. At this point, it is only important to know that we can estimate
factors and their loadings using techniques similar to those presented in Chapter 7.
Continuing with our example using the correlation matrix and how factor loadings
are estimated, we use the seven subtests representing the two factors, crystallized and
fluid intelligence. The correlation matrix for the seven crystallized and fluid intelligence
subtests is presented in Table 9.3.
Related to questions 1–4, we want to know (1) how the subtests relate to the hypo-
thetical factors of crystallized and fluid intelligence (i.e., the size of the factor loadings)
and (2) if a correlation between the factors exists. To illustrate how answers to these ques-
tions are obtained, a table of initial and alternate factor loadings is given for the seven sub-
tests measuring crystallized and fluid intelligence. Table 9.3 reveals that subtest 7, the Gf measure of inductive and deductive reasoning, is problematic: its correlations with the graphic identification (.19) and graphic orientation (.21) subtests under fluid intelligence are low, and its correlations with the four tests measuring crystallized intelligence are all below .10.
In practical terms, retaining this subtest in the GfGc theoretical model should be revisited
within the context of the proposed use of the test for a population of examinees (e.g., the
validity of test score use is another issue to consider here). Based on the initial informa-
tion, subtest 7 contributes little to the GfGc theoretical model. The loadings in Table 9.4
are produced by a process known as factor extraction. Several techniques are available to
extract the factors (e.g., see Fabrigar & Wegner, 2012, p. 40). In the current example, the
factor loadings were extracted using the principal axis factor (PAF) extraction technique
(e.g., see the shaded line in the SPSS syntax below that highlights EXTRACTION PAF).
Extraction techniques are reviewed in more detail later in this chapter.
FACTOR
/VARIABLES cri1_tot cri2_tot cri3_tot cri4_tot fi1_tot fi2_tot
fi3_tot
/MISSING LISTWISE
/ANALYSIS cri1_tot cri2_tot cri3_tot cri4_tot fi1_tot fi2_tot
fi3_tot
/PRINT UNIVARIATE INITIAL CORRELATION DET KMO AIC EXTRACTION
/FORMAT SORT
/PLOT EIGEN
/CRITERIA FACTORS(2) ITERATE(25)
/EXTRACTION PAF
/CRITERIA ITERATE(25)
/METHOD=CORRELATION.
Note. The first shaded line denotes the syntax required to produce the Measure of Sampling Ade-
quacy (MSA) and Bartlett’s tests. The second shaded area denotes the syntax required to yield load-
ings by the principal axis factor extraction technique.
The correlations in Table 9.4 are called factor loadings and are defined as the
correlation between factors and subtests. Factors are based on structures or patterns
produced by the covariation of the GfGc tests. For example, notice the pattern of cor-
relations within each factor for each subtest. We see that for the initial factor solution
the crystallized intelligence subtests correlate highly with factor 1 (i.e., all > .60). Conversely, the majority of the seven subtests (5 out of 7) display a low correlation with factor 2 (i.e., .30 or lower in absolute value). Two of the fluid intelligence subtests display
a high correlation with factor 1 (i.e., .64). However, as before, we see that subtest 7 is
problematic (i.e., a loading < .30). Taken together, the pattern of results in the left side
of Table 9.4 illustrate that there appears to be a single dominant factor represented by
six of the seven subtests. Additionally, we see that the graphic orientation and graphic identification subtests also display moderate positive loadings (.49 and .48) on factor 2.
Table 9.4. Initial Factor Loadings for Crystallized and Fluid Intelligence
Next, we use the two subtests from the previous example (i.e., the crystallized intelligence subtest word knowledge and the fluid intelligence subtest of graphic orientation) in Equation 9.2.
Inserting the initial factor loadings into Equation 9.2, we see nearly the same result
(.38; the difference is due to rounding error) as before in Equation 9.1b. These results illus-
trate an important point in factor analysis: that there are an infinite number of sets of factor
loadings that satisfy Equation 9.1a. This property is called factor indeterminacy.
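To make this concrete, here is a sketch of the arithmetic using the initial loadings reported in Table 9.5 for the crystallized intelligence subtest of word knowledge (the Gc measure of knowledge, with loadings of .78 and –.23) and the fluid intelligence subtest of graphic orientation (loadings of .64 and .49); the subscript-free layout below is introduced here for illustration and is not necessarily the notation of Equations 9.1a–9.2.

r(knowledge, graphic orientation) ≈ (.78)(.64) + (–.23)(.49)
                                  = .499 – .113
                                  ≈ .39

Within rounding of the tabled loadings, this agrees with the value of approximately .38 noted above.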
Table 9.4 presents alternate loadings created to illustrate the point that there is
always more than one factor solution that satisfies Equation 9.1a. The alternate loadings
(i.e., the right-hand side of Table 9.4) were derived using Equation 9.3 (Paxton, Curran,
Bollen, Kirby, & Chen, 2001; Crocker & Algina, 1986, p. 291).
Applying Equation 9.3 to create alternate factor loadings in Table 9.4 reveals two
points.

ρ_15 = a′_11 a′_51 + a′_12 a′_52,

where a′_jk denotes the alternate loading of subtest j on factor k. First, there appears to be a general factor underlying the seven subtests. This pattern of loadings supports at least two components of the general theory of intelligence. Second,
in Equation 9.3, F2 represents the difference between factor loadings on factors 1 and 2 (i.e.,
notice the sign of operations in each equation). The difference between the factor loadings
is tantamount to the idea that the two factors are tapping different parts of general intelli-
gence. The idea of two factors aligns with our example of crystallized and fluid intelligence.
Finally, Equation 9.1a can be modified to be applicable to any number of subtests as
expressed in Equation 9.4.
Recall that an infinite number of sets of factor loadings satisfy Equations 9.1a (i.e., for
two factors) and 9.3 (i.e., for more than two factors). Because (1) multiple sets of factor loadings are possible and (2) most factor extraction methods yield initial loadings that are not easily interpreted, an unrotated solution is often unclear. Fabrigar and Wegner (2012) and Kerlinger and Lee (2000) argue that it
is necessary to rotate factor matrices if they are to be adequately interpreted. Rotation is
helpful because original factor matrices are arbitrary inasmuch as any number of refer-
ence frames (i.e., factor axes) can be derived that reproduce any particular correlation
matrix. Factor rotation is the process of transforming the initial loadings using a set of
equations (such as in Equation 9.2) to achieve simple structure. The idea underlying
simple structure is to identify as pure a set of variables as possible (e.g., each variable or subtest loads on as few factors as possible, and the rotated factor matrix contains as many near-zero loadings as possible; see Kerlinger & Lee, 2000).
The guidelines for simple structure (based on Fabrigar & Wegner, 2012, p. 70, and
Kerlinger & Lee, 2000, p. 842) include the following:
1. Each row of the factor matrix should have at least one loading close to zero.
2. For each column of the factor matrix, there should be at least as many variables
with zero or near-zero loadings as there are factors.
3. For every pair of factors (columns), there should be several variables with load-
ings in one factor (column) but not in the other.
4. When there are four or more factors, a large proportion of the variables should
have negligible (close to zero) loadings on any pair of factors (columns).
5. For every pair of factors (columns) of the factor matrix, there should be only a
small number of variables with appreciable (nonzero) loadings in both columns.
For any rotation technique used, the original factor loadings are related by a math-
ematical transformation. Factor rotation is accomplished geometrically as illustrated
graphically in Figures 9.3 through 9.5. Importantly, when any two sets of factor load-
ings are obtained through the rotation process, the two sets contain loadings that reflect
the correlations among the subtests equally well. Although factor rotation techniques
produce loadings that represent the correlations among subtests equally well, the mag-
nitude or size of the factor loadings varies, and a different set of factors represent each set
of factor loadings. This final point means that interpretations of the factors differ based on
the rotational technique applied. There are two classes of rotational techniques: orthogonal and oblique (Brown, 2006, pp. 30–32; Lattin et al., 2003).

Figure 9.3. Unrotated factor loadings for crystallized and fluid intelligence, plotted with factor 1 on the X-axis and factor 2 on the Y-axis (the labeled points include the Gf graphic orientation, Gf graphic identification, and Gf inductive and deductive reasoning subtests).

Applying the orthogonal technique yields transformed factors that are uncorrelated (i.e., factors are oriented at 90°
angles in multidimensional space; see Figure 9.4). Applying the oblique technique yields
transformed factors that are correlated (i.e., the factor axes are permitted to form angles other than 90°). Figures 9.4
and 9.5 illustrate orthogonal and oblique rotations for the crystallized and fluid intel-
ligence subtests.
Table 9.5 provides a comparison of the initial factor loadings and the obliquely rotated
loadings. From this table we see that the two loading solutions reveal that two interpreta-
tions are plausible. First, in the unrotated or initial solution, we see that six out of seven
subtests exhibit high and positive loadings, suggesting a single dominant factor for the seven
subtests. Similarly, for factor 2 we see high and positive loadings for two of the three fluid intelligence subtests
(Figure 9.4 plots the rotated factor axes F1 and F2 at a 90° angle, with the Gc subtests (vocabulary, word knowledge, conceptual reasoning, and abstract reasoning) clustering along F1 and the Gf graphic orientation and graphic identification subtests along F2. The inset SPSS rotated factor matrix reports the rotated loadings to three decimals and notes the extraction method as principal axis factoring and the rotation method as Varimax with Kaiser normalization, with rotation converging in 3 iterations.)
Figure 9.4. Orthogonally rotated factor loadings for crystallized and fluid intelligence. Rotated
scale metric or perspective is only an approximation. In orthogonal rotation, the angle is constrained
to 90 degrees, meaning that the factor axes are uncorrelated in multidimensional space.
(Figure 9.5 plots the obliquely rotated factor axes F1 and F2 together with their reference axes F1′ and F2′, with the Gc subtests clustering along F1 and the Gf graphic orientation and graphic identification subtests along F2. The inset SPSS pattern matrix reports loadings on factors 1 and 2, respectively, of .957 and –.119 for the Gc measure of vocabulary, .888 and –.083 for conceptual reasoning, .845 and –.056 for knowledge, .702 and .221 for abstract reasoning, and .049 and .778 for the Gf measure of graphic orientation.)
Figure 9.5. Factor loadings for crystallized and fluid intelligence after oblique rotation.
In oblique rotation, the angle is less than 90 degrees, meaning that the factor axes are correlated
in multidimensional space.
(e.g., recall that subtest 7 has been consistently identified as problematic in our examples).
Negative loadings are interpreted as differences in abilities as measured by crystallized and fluid
intelligence. We interpret these loadings as differences in ability based on the fact that in the
original correlation matrix (Table 9.2) we see that the seven subtests all positively correlate.
Alternatively, inspection of the obliquely rotated factor loadings on the right side of Table 9.5
reveals a much clearer picture of how the seven subtests reflect the two factors.
The obliquely rotated factor loadings provide the clearest picture of the factor struc-
ture for the seven subtests. However, interpreting factor loadings from an oblique solu-
tion is slightly more complicated than interpreting loadings from an orthogonal rotation.
For example, the factor loadings obtained from an oblique rotation do not represent
simple correlations between a factor and an item or subtest (as is the case of loadings
in an orthogonal rotation) unless there is no overlap among the factors (i.e., the fac-
tors are uncorrelated). Specifically, because the factors correlate, the correlations between
Table 9.5. Unrotated and Obliquely Rotated Factor Loadings for Crystallized
and Fluid Intelligence Subtests

                                                          Initial (unrotated)   Obliquely rotated
                                                          factor loadings       factor loadings
Construct       Subtest                                      1       2             1       2
Crystallized    1. Gc measure of vocabulary                 .84    –.30           .96    –.12
intelligence    2. Gc measure of knowledge                  .78    –.23           .85    –.06
                3. Gc measure of abstract reasoning         .85    –.02           .70     .22
                4. Gc measure of conceptual reasoning       .80    –.26           .89    –.08
Fluid           5. Gf measure of graphic orientation        .64     .49           .05     .78
intelligence    6. Gf measure of graphic identification     .64     .48           .06     .77
                7. Gf measure of inductive and deductive
                   reasoning                               –.12     .26          –.15     .35
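Note. In the oblique solution, the correlation between factors 1 and 2 is .59.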
indicators (variables or tests) and factors may be inflated (e.g., a subtest may correlate
with one factor in part through its correlation with another factor). When interpreting
loadings from an oblique rotation, the contribution of a subtest to a factor is assessed
using the pattern matrix. The factor loadings in the pattern matrix represent the unique
relationship between a subtest and a factor while controlling for the influence of all the
other subtests. This unique contribution is synonymous with interpreting partial regres-
sion coefficients in multiple linear regression analysis (see the Appendix for a thorough
presentation of correlation and multiple regression techniques). One final point related to
regression and factor analysis is that the regression weights representing factor loadings
in an oblique solution are standardized regression weights.
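A compact way to state the pattern–structure distinction (standard factor-analytic notation offered here as a sketch, not a quotation of the chapter's equations): if P is the pattern matrix of standardized partial regression weights, Φ the matrix of factor correlations, and S the structure matrix of simple factor–subtest correlations, then

S = P Φ,

so the pattern and structure matrices coincide only when Φ is an identity matrix, that is, only when the factors are uncorrelated.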
In Table 9.6, we see that the orthogonally rotated factor loadings provide a clearer
picture of the factor structure than did the initial loadings, but not as clear as those
obtained from the oblique rotation. In an orthogonal rotation, the factors are constrained to
be uncorrelated (i.e., 90° in multidimensional space; see Figure 9.4). From a geometric
perspective, because the cosine (90°) of an angle is equal to zero, this amounts to saying
that the factors have no relationship to one another. One perceived advantage of using
Table 9.6. Unrotated and Orthogonally Rotated Factor Loadings for Crystallized
and Fluid Intelligence Subtests

                                                          Unrotated factor      Orthogonally rotated
                                                          loadings              factor loadings
Construct       Subtest                                      1       2             1       2
Crystallized    1. Gc measure of vocabulary                 .84    –.30           .88     .12
intelligence    2. Gc measure of knowledge                  .78    –.23           .79     .15
                3. Gc measure of abstract reasoning         .85    –.02           .76     .37
                4. Gc measure of conceptual reasoning       .80    –.26           .83     .14
Fluid           5. Gf measure of graphic orientation        .64     .49           .33     .73
intelligence    6. Gf measure of graphic identification     .64     .48           .34     .73
                7. Gf measure of inductive and deductive
                   reasoning                               –.12     .26          –.02     .28
orthogonal rotations is that the loadings between a factor and a subtest are interpreted as
a correlation coefficient, making interpretation straightforward. However, in an oblique
rotation, the pattern matrix provides loadings that are interpreted as standardized partial
regression coefficients (e.g., as are regression coefficients in multiple regression analy-
ses; for a review see Chapter 3 or the Appendix). Thus, the increase in simple structure
obtained by using an oblique rotation in conjunction with the availability and proper
interpretation of the pattern matrix is usually the best way to proceed—unless the factors
are uncorrelated by design (e.g., subtests comprising each factor are correlated with one
another but the factors are uncorrelated with each other). A variety of rotation techniques
have been developed; Table 9.7 provides an overview of the techniques. The most com-
monly used orthogonal technique is varimax, and the most used oblique techniques are
promax and oblimin.
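As a sketch of how these rotations might be requested in SPSS, the FACTOR syntax shown earlier for the seven crystallized and fluid intelligence subtests can be extended with a /ROTATION subcommand. The variable names follow that earlier syntax; the kappa value of 4 for promax is the SPSS default rather than a value taken from this chapter, so treat the specification as illustrative.

FACTOR
/VARIABLES cri1_tot cri2_tot cri3_tot cri4_tot fi1_tot fi2_tot
fi3_tot
/MISSING LISTWISE
/PRINT INITIAL EXTRACTION ROTATION
/CRITERIA FACTORS(2) ITERATE(25)
/EXTRACTION PAF
/ROTATION PROMAX(4)
/METHOD=CORRELATION.

Replacing /ROTATION PROMAX(4) with /ROTATION VARIMAX requests the orthogonal solution, and /ROTATION OBLIMIN(0) requests a direct oblimin solution.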
Recall that applying an orthogonal rotation results in factors being uncorrelated and that
applying an oblique rotation results in factors being correlated. In this section we examine
the role correlated factors play in the mechanics of factor analysis. To begin, we return
to Table 9.5, which provides the factor loadings for the oblique solution. Notice that the
correlation between factors 1 and 2 is .59 (see the note at the bottom of Table 9.5).
Recall that in an oblique rotation, the factor loadings do not represent correlations
between subtests and factors; rather, the loadings are standardized regression weights
(i.e., a unique relationship between a subtest and a factor while controlling for the influ-
ence of all the other subtests is based on partial regression coefficients). To illustrate how
the correlation between factors relates to the relationship between two subtests, consider
the crystallized intelligence subtest of abstract reasoning. In Table 9.5 we see that the
loading on factor 1 for abstract reasoning is .70 and .22 on factor 2. Because oblique
rotations allow factors to be correlated, after accounting for the statistical relationship
between factor 1 and abstract reasoning (i.e., partialing out factor 1), abstract reason-
ing and factor 2 are related by a standardized regression weight of .22. A loading of .22
in this context is a partial (standardized) regression weight—not simply the correlation
between a factor and a subtest. Also, a factor loading of .70 for the abstract reasoning
subtest on factor 1 indicates a strong relationship for this subtest on crystallized intel-
ligence (unique factor) after controlling for factor 2 (the graphic identification compo-
nent of fluid intelligence). Modifying Equation 9.1a, we can account for the correlation
between factors and the relationship between any two subtests by Equation 9.5a (Crocker
& Algina, 1986, p. 293).
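Writing a_jk for the loading of subtest j on factor k and φ for the correlation between the two factors, this relationship takes the standard two-factor form sketched below; the symbols, though not necessarily the exact layout of Equation 9.5a, are assumptions made here for illustration.

r_jk = a_j1·a_k1 + a_j2·a_k2 + a_j1·a_k2·φ + a_j2·a_k1·φ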
For example, we know from the results of the factor analysis with an oblique rota-
tion that the correlation between factors 1 and 2 is .59 (see the note in Table 9.5). To illus-
trate Equation 9.5a, we use the crystallized intelligence subtest abstract reasoning and the
fluid intelligence subtest graphic identification. Applying the factor loading values for
these subtests (from Table 9.5) into Equation 9.5a, we have Equation 9.5b.
Returning to Table 9.2, which presented the original correlation matrix for all seven
subtests, we can verify that Equation 9.5b holds by noting that the correlation between
the crystallized intelligence subtest abstract reasoning and fluid intelligence subtest
graphic identification is in fact .54.
r(abstract reasoning, graphic identification) = (.70)(.06) + (.22)(.77) + (.70)(.77)(.59) + (.22)(.06)(.59)
                                              = .54
Earlier in this chapter, Equations 9.1a and 9.2 illustrated how correlations between sub-
tests in the GfGc data are related to factor loadings. Equation 9.6 presents the factor
analysis model, a general equation that subsumes and extends Equations 9.1a and 9.2 in
a way that allows for the estimation of common factors and unique factors. A common
factor is one with which two or more subtests are correlated. These subtests are also cor-
related with one another to some degree. Conversely, a unique factor is correlated with
only one subtest (i.e., its association is exclusive to a single subtest). The common factor
model assumes that unique factors are uncorrelated (1) with each common factor and
(2) with unique factors for different tests. Thus, unique factors account for no correlation
between subtests in a factor analysis.
Related to the FA model are communality and uniqueness. One of the primary goals
of an FA is to determine the amount of variance that a subtest accounts for in relation to
a common factor. The communality of a subtest reflects the portion of the subtest’s vari-
ance that is associated with the common factor. For the case where the factors are uncor-
related, Equation 9.7 is applicable to estimation of the communality.
z_i = a_i1F_1 + . . . + a_ikF_k + E_i

• z_i = z-score on test i.
• a_ik = loadings of test i on factor k.
• F_k = scores on common factor k.
• E_i = scores on the factor unique to test i.

For example, consider the vocabulary subtest of crystallized intelligence. Using the orthogonally (uncorrelated) derived factor loadings provided in Table 9.6, we find that the loading for the vocabulary subtest is .88 on factor 1 and .12 on factor 2. Next, inserting these loadings into Equation 9.7 results in a communality of .78, as illustrated in Equation 9.8.
Raw scores (scores in their original units of measurement) are converted to z-scores
in factor analysis. As a result of this transformation, the variance of the scores equals 1.0.
The communality represents the portion of a particular subtest variance that is related
to the factor variance. The communality estimate is a number between 0 and 1 since any
distribution of z-scores has a mean of 0 and a standard deviation (and variance) of 1.0.
When two factors are correlated as illustrated in Table 9.8, Equation 9.8 is modified
as in Equation 9.9.
Applying Equation 9.9 to the obliquely estimated factor loadings in Table 9.8, we
have Equation 9.10.
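For reference, the communality of subtest i on two factors can be written in the following standard forms (a_ik again denotes a loading and φ the factor correlation). The first line is consistent with Equation 9.7 as applied in the text; the second is one standard way of extending it when the factors are allowed to correlate and may be parameterized differently from Equation 9.9.

Uncorrelated factors:  h_i² = a_i1² + a_i2²
Correlated factors:    h_i² = a_i1² + a_i2² + 2·a_i1·a_i2·φ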
Based on the results of Equation 9.10, we see that when factors 1 and 2 are corre-
lated, the communality of the vocabulary subtest is substantially lower than is observed
when the factors are constrained to be uncorrelated (i.e., for the orthogonal solution).
The unique part of the variance for any particular subtest is expressed in Equation
9.11.
Using Equation 9.11 and inserting the results of Equation 9.8 for the orthogonal
solution we have Equation 9.12a and for the oblique solution, Equation 9.12b.
u_i² = 1 − h_i²                                  (Equation 9.11)

u_i² = 1 − h_i² = 1 − .78 = .22                  (Equation 9.12a, orthogonal solution)

u_i² = 1 − h_i² = 1 − .54 = .46                  (Equation 9.12b, oblique solution)
1 = h_i² + s_i² + e_i²
Continuing with our illustration using the vocabulary subtest, we find that the unique variance can be partitioned into two components—specific variance (s_i²) and error variance (e_i²). The specific variance is the part of the vocabulary subtest true score variance
that is not related to the true score variance on any other subtest. At this point, you may
notice a connection to classical test theory by the mention of true score. In fact, the com-
mon factor model provides another framework for measurement theory as related to true,
observed, and error scores (e.g., see McDonald, 1999, for an excellent treatment). Based
on the definition of unique variance being the sum of specific variance and error variance,
we can express the reliability as the sum of two additive parts (the communality and the specific variance), as in Equation 9.13.
Earlier in the chapter we raised a question regarding how factors are estimated since
they are unobservable variables. Recall that in Chapter 7 the topic of reliability was intro-
duced within the framework of classical test theory. In the common factor analysis model,
the reliability of the test scores can be conceived as the sum of the communality and the
specific variance for a subtest as presented in Equation 9.13. Figure 9.6 illustrates how
the variance in a set of test scores is partitioned in CTT (top half) and FA (bottom half).
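In the same notation (h_i² for the communality, s_i² for specific variance, e_i² for error variance, and u_i² for the uniqueness), the relationships just described can be collected as follows; this is the sense in which Equation 9.13 is read here.

u_i² = s_i² + e_i²
reliability_i = h_i² + s_i² = 1 − e_i²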
The following conceptual connections can be made between the equations in Figure 9.6. In Equation 9.6, F_k plays the role of the true score T_i, z_i corresponds to the observed score X_i, and E_i corresponds to the error score E_i in the true score model. Using the assumptions of the
true score model and extending these ideas to factor analysis, we can estimate common
factors that are unobservable.
(Figure 9.6 shows, in its top half, the CTT decomposition X_O = X_T + X_E, with the variance from observed scores (V_O) partitioned into variance from true scores (V_T) and variance from error scores (V_E), illustrated with an 80%/20% split; in its bottom half it shows the further partition V_T = V_CO + V_SP + V_E into common, specific, and error components, with the common portion corresponding to the communality, h².)

Figure 9.6. Variance partition in classical test theory and factor analysis. V denotes the variance; CO signifies common factor.
In PCA, the goal is to identify a small number of components that account for a large proportion of the variance (e.g., > 75%) in the original variables (subtests). An important difference between PCA and com-
mon factor analysis is that in PCA, the correlation matrix is used during estimation of the
loadings (i.e., the main diagonal in the matrix contains 1's). Therefore, in PCA the unique variances of the measured variables are assumed to be zero. To this end, PCA is a variable reduc-
tion technique that provides an explanation of the contribution of each component to the
total set of variables. The fundamental unit of analysis in PCA is the correlation matrix. In a
PCA, all of the values along the diagonal in the correlation matrix to be analyzed are set to
unity (i.e., a value of 1). The intercorrelations for the 10 GfGc subtests are revisited in Table
9.8 (notice that all of the values along the diagonal or darkest shaded cells are set to 1.0).
Because all values along the diagonal of the correlation matrix are 1.0, all of the vari-
ance between the observed variables is considered shared or common. The components result-
ing from a PCA are related to the variables by way of the factor–component relationship.
The first component derived from a PCA is a linear combination of subtests that represents
the maximum amount of variance; the variance of this first component is equal to the
largest eigenvalue of the sample covariance matrix (Fabrigar & Wegner, 2012, pp. 30–35;
McDonald, 1999). The second principal component is a second linear combination of
the 10 subtests that is uncorrelated with the first principal component. For example, con-
sider the four subtests comprising crystallized intelligence in the GfGc dataset.

Table 9.8 (excerpt: subtests 7–10 of 10). Intercorrelations among the 10 GfGc subtests

                                             1       2       3       4       5       6       7       8       9      10
7. Gc: measure of conceptual reasoning     .548**  .319**  .365**  .749**  .694**  .677**    1      —       —       —
8. Gf: measure of graphic orientation      .420**  .407**  .545**  .391**  .394**  .528**  .377**    1      —       —
9. Gf: measure of graphic identification   .544**  .480**  .588**  .392**  .374**  .544**  .397**  .654**    1      —
10. Gf: measure of inductive/deductive
    reasoning                              .073*   .121**  .156**   0.01    0.04   .096**   0.03   .210**  .199**    1

Note. N = 1,000. **Correlation is significant at the 0.01 level (2-tailed). *Correlation is significant at the 0.05 level (2-tailed). Shaded cells highlight the intercorrelations among the subtests comprising each of the three areas of intelligence.

To obtain
examinee-level scores on a particular principal component, observed scores for the 1,000
examinees for the 10 intelligence subtests are optimally weighted to produce an optimal
linear combination. The weighted subtests are then summed yielding a principal compo-
nent. An eigenvalue (also called a latent or characteristic root) is a value resulting from the
consolidation of the variance in a matrix. Specifically, an eigenvalue is defined as the column
sum of squared loadings for a factor. An eigenvector is an optimally weighted linear combi-
nation of variables used to derive an eigenvalue. The coefficients applied to variables to form
linear combinations of variables in all multivariate statistical techniques originate from eigen-
vectors. The variance that the solution accounts for (e.g., the variance of the first principal
component) is directly associated with or represented by an eigenvalue.
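A small bookkeeping point may help in reading such output: with p standardized variables, the total variance to be accounted for equals p, so the proportion of variance attributable to a component is its eigenvalue divided by p. For the 10 GfGc subtests (p = 10), the first principal component reported later for Table 9.10 has an eigenvalue of 5.1 and therefore accounts for

5.1 / 10 = .51,

or 51% of the total variance.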
Using the intercorrelations in Table 9.1, we see that FA is used (1) to identify underlying
patterns of relationships for the 10 subtests and (2) to determine whether the information
can be represented by a factor or factors that are smaller in number than the total number
of observed variables (i.e., the 10 subtests in our example). A technique related to but
not the same as factor analysis, principal components analysis (Hotelling, 1933, 1936;
Tabachnick & Fidell, 2007), is used to reduce a large number of variables into a smaller
number of components. In PCA, the primary goal is to explain the variance in observed
measures in terms of a few (as few as possible) linear combinations of the original vari-
ables (Raykov & Marcoulides, 2011, p. 42). The resulting linear combinations in PCA
are identified as principal components. Each principal component does not necessarily
reflect an underlying factor because the goal of PCA is strictly variable reduction based
on all of the variance among the observed variables. PCA is a mathematical maximization
technique with mainly deterministic (descriptive) goals. Strictly speaking, PCA is not a
type of FA because its use involves different scientific objectives than factor analysis. In
fact, PCA is often incorrectly used as a factor-analytic procedure (Cudeck, 2000, p. 274).
For example, principal components analysis is not designed to account for the correla-
tions among observed variables, but instead is constructed to maximally summarize the
information among variables in a dataset (Cudeck, 2000, p. 275).
Alternatively, consider the goal of FA. One type of FA takes a confirmatory approach
where researchers posit a model based on a theory and use the responses (scores) of
examinees on tests based on a sample to estimate the factor model. The scores used in an
FA are evaluated for their efficacy related to the theory that is supposed to have gener-
ated or caused the responses. This type is confirmatory in nature (i.e., confirmatory factor
analysis). Recall that in CFA, researchers posit an underlying causal structure where one
or more factors exist rather than simply reducing a large number of variables (e.g., tests
or test items) to a smaller number of dimensions (factors).
Researchers also use FA in an exploratory mode (i.e., EFA). For example, suppose that
the theory of general intelligence was not well grounded empirically or theoretically. Using
the 10 subtests in the GfGc data, you might conduct an EFA requesting that three factors
be extracted and evaluate the results for congruence with the theory of general intelligence.
From a statistical perspective, the main difference between FA and PCA resides in
the way the variance is analyzed. For example, in PCA the total variance in the set of
variables is analyzed as compared to factor analysis where common and specific variance
is partitioned during the analysis. Figure 9.7 illustrates the way the variance is partitioned
in PCA versus FA. For an accessible explanation of how a correlation matrix containing a
set of observed variables is used in PCA versus common factor analysis, see Fabrigar and
Wegner (2012, pp. 40–84). Notice in the top portion of Figure 9.7 that there is no provi-
sion for the shared or overlapping variance among the variables. For this reason, PCA is
considered a variance-maximizing technique and uses the correlation matrix in the analy-
sis (e.g., 1’s on the diagonal and correlation coefficients on the off-diagonal of the matrix).
Conversely, in FA the shared variance is analyzed (see the lower half of Figure
9.7), thereby accounting for the shared relationship among variables. Capturing the
relationship(s) among variables while simultaneously accounting for error variance (ran-
dom and specific) relative to a theoretical factor structure is a primary goal of factor analysis
(and particularly true for confirmatory factor analysis). FA therefore uses a reduced cor-
relation matrix consisting of communalities along the diagonal (see Equation 9.11) of the
matrix and correlation coefficients on the off-diagonal of the matrix. The communalities
are the squared multiple correlations and represent the proportion of variance in that variable
that is accounted for by the remaining tests in the battery (Fabrigar & Wegner, 2012).
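The squared multiple correlation (SMC) for variable i can be obtained directly from the correlation matrix R through a standard identity (stated here as a supplement rather than as one of the chapter's equations):

SMC_i = 1 − 1 / r^ii,

where r^ii is the ith diagonal element of the inverse of R.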
In FA, the common variance is modeled as the covariation (see Chapter 2 for how the
covariance is derived) among the subtests (i.e., the covariance is based on the deviation
of each examinee score from the mean of the scores for a particular variable). In this way,
common or shared variance is accounted for among the variables by working in deviation
scores, thereby keeping the variables in their original units of measurement. Additionally,
the variance (i.e., the standard deviation squared) is included along the diagonal of the
matrix (see Table 9.9).
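In matrix terms, a standard way of summarizing the common factor model (offered here as a sketch, not as a quotation of the chapter's equations) is

R ≈ ΛΦΛ′ + Ψ,

where Λ contains the factor loadings, Φ the factor correlations (an identity matrix for an orthogonal solution), and Ψ is a diagonal matrix of unique variances. The reduced correlation matrix analyzed in FA is then R − Ψ, which places the communalities on the diagonal.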
(Figure 9.7 contrasts, for principal components analysis and for factor analysis, the portion of variance extracted with the portion of variance lost.)
Figure 9.7. Variance used in principal components analysis and factor analysis. Adapted from
Hair, Anderson, Tatham, and Black (1998, p. 102). Copyright 1998. Reprinted by permission of
Pearson Education, Inc. New York, New York.
Although statistical programs such as SPSS and SAS internally derive the variance–
covariance matrix within FA routines when requested, the following program creates a
correlation matrix for the set of 10 GfGc subtests used in our example and then transforms
the matrix into a variance–covariance matrix. You may find the program useful for calcu-
lating a variance–covariance matrix that can subsequently be used in a secondary analysis.
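A minimal SPSS sketch of this kind of conversion, using the subtest variable names that appear in the FACTOR syntax elsewhere in this chapter and relying on the standard CORRELATIONS and MCONVERT commands rather than on any program specific to this book, might look like the following (the output file name is hypothetical).

* Write a matrix dataset containing means, standard deviations, N, and correlations.
CORRELATIONS
/VARIABLES=stm1_tot stm3_tot stm2_tot cri1_tot cri2_tot cri3_tot
cri4_tot fi1_tot fi2_tot fi3_tot
/MATRIX=OUT(*).
* Convert the correlation matrix in the active matrix dataset to a covariance matrix.
MCONVERT
/MATRIX=IN(*) OUT('gfgc_cov.sav').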
Next, we turn to an illustration of how PCA and FA produce different results using the
10 subtests in the GfGc data. First, the 10 subtests in the GfGc dataset are used to conduct
a PCA. The correlation matrix derived from the subtests in Table 9.1 is used to conduct the
PCA (refer to Figure 9.7 to recall how the variance is used in PCA). A partial output displaying the eigenvalue solution and principal components for the PCA is given in Table 9.10. In Table
9.10, 10 eigenvalues are required to account for 100% of the variance in the 10 subtests. An
eigenvalue (reviewed in greater detail shortly) is a measure of variance accounted for by a
given dimension (i.e., factor). If an eigenvalue is greater than 1.0, the component is deemed
significant or practically important in terms of the variance it explains (Fabrigar & Wegner,
2012, p. 53). However, the eigenvalue greater than one rule has several weaknesses, and alter-
native approaches should also be used (e.g., see Fabrigar & Wegner, 2012, pp. 53–64). Spe-
cifically, parallel analysis, likelihood ratio tests of model fit, and minimum average partial
correlation techniques all offer improvements over the eigenvalue greater than one rule.
Returning to our interpretation of the PCA and inspecting Table 9.10, we see that
only two of the eigenvalues meet the 1.0 criterion for retaining or classifying
a component as significant. In Table 9.10, the first principal component consists of an
eigenvalue of 5.1 and accounts for or explains 51% of the variance in the 10 subtests.
The second principal component consists of an eigenvalue of 1.34 and accounts for or
explains an additional 14% of the variance in the 10 subtests; these principal components
are uncorrelated, so they can be summed to derive a total cumulative variance. Together,
components one and two account for 65% of the cumulative variance in the 10 subtests.
Following is the SPSS program that produced Table 9.10.
FACTOR
/VARIABLES stm1_tot stm3_tot stm2_tot cri1_tot cri2_tot cri3_
tot cri4_tot fi1_tot fi2_tot fi3_tot
/MISSING LISTWISE
/ANALYSIS stm1_tot stm3_tot stm2_tot cri1_tot cri2_tot cri3_
tot cri4_tot fi1_tot fi2_tot fi3_tot
/PRINT INITIAL EXTRACTION
/CRITERIA MINEIGEN(1) ITERATE(25)
/EXTRACTION PC
/ROTATION NOROTATE
/METHOD=CORRELATION.
To this point in the chapter, FA has been presented in an exploratory and descriptive
manner. Our goal has been to infer factor structure from patterns of correlations in
the GfGc data. For example, using the crystallized and fluid intelligence subtests, we
reviewed how FA works and how it is used to identify the factor(s) underlying the
crystallized and fluid intelligence subtests. To accomplish this review, we allowed every
subtest to load on every factor in the model and then used rotation to aid in interpret-
ing the factor solution. Ideally, the solution is one that approximates simple structure.
However, the choice of a final or best model could only be justified according to sub-
jective criteria. CFA makes possible evaluation of the overall fit of the factor model,
along with the ability to statistically test the adequacy of model fit to the empirical data.
In CFA, we begin with a strong a priori idea about the structure of the factor model.
CFA provides a statistical framework for testing a prespecified theory in a manner that
requires stronger statistical assumptions than the techniques presented thus far. For an
applied example of CFA with a cross validation using memory test data, see Price et al.
(2002).
(Figure 9.8 depicts the measurement model for crystallized intelligence: the factor F, Crystallized Intelligence, with factor loadings λ1–λ4 on four observed variables X1–X4, including lexical knowledge (X2), listening ability (X3), and communication ability (X4), each observed variable having a measurement error E; in equation form, X = λF + E.)
Factors are represented by ovals in a path diagram. Relationships in an SEM are repre-
sented by lines—either straight (signifying a direct relationship) or curved (representing
a covariance or correlation). Furthermore, the lines may have one or two arrows. For
example, a line with a single arrow represents a hypothesized direct relationship between
two variables; the variable with the arrow pointing to it is the dependent variable. A line
that includes arrows at both ends represents a covariance or correlation between two
variables with no implied direct effect.
In a latent variable SEM, two parts comprise the full SEM: a measurement model
and a structural model. In our example using crystallized and fluid intelligence, the
measurement model relates the subtest scores to the factor. For example, the measure-
ment model for crystallized intelligence is provided in Figure 9.8.
Figure 9.9 illustrates the common factor model introduced earlier.
Figure 9.10 illustrates an orthogonal common factor model based on the examples
in this chapter. Figure 9.11 illustrates an oblique or correlated factors model based on
crystallized and fluid intelligence.
(Figure 9.9 diagram: the observed variables, from lexical knowledge through induction/deduction, each with its unique factor or measurement error (Error 2 through Error 7), loading on the common factors, including crystallized intelligence.)
Figure 9.9. Common factor model represented as a path diagram. Exploratory common fac-
tor model is one where each factor is allowed to load on all tests. The dashed arrows represent loadings ("cross-loadings") that are not hypothesized by the theory but are nevertheless estimated as part of an exploratory analysis. Also, a
common factor is a factor that influences more than one observed variable. For example, language
development, lexical knowledge, listening ability, and communication ability are all influenced
by the crystallized intelligence factor. The common factor analysis model above is orthogonal
because the factors are not correlated (e.g., a double-headed arrow connecting the two factors is
not present).
SEM provides a thorough and rigorous framework for conducting factor analysis of
all types. However, conducting FA using an SEM approach requires a thorough under-
standing of covariance structure modeling/analysis in order to correctly use the tech-
nique. Additionally, interpretation of the results of a confirmatory (or exploratory) factor
analysis using SEM involves familiarity with model fit and testing strategies. Readers
interested in using SEM for applied factor analysis work or in factor-analytic research
studies are encouraged to see Schumacker and Lomax (2010), and Brown (2006).
(Figure 9.10 diagram: the observed variables, through induction/deduction, with their unique factors or measurement errors (e.g., Error 7) and the common factors; no correlation is drawn between the factors, making the model orthogonal.)
Given that the integrity of an FA hinges on the design of the study and the actual use of the
technique, there are many possible ways for researchers to commit errors. Comrey and
Lee (1992, pp. 226–228) and Fabrigar and Wegner (2012, pp. 143–151) offer the follow-
ing suggestions regarding errors to avoid when conducting an FA:
1. Collecting data before planning how the factor analysis will be used.
2. Using data variables with poor distributions and inappropriate regression forms:
a. Badly skewed distributions, for example, with ability tests that are too easy or
too hard for the subjects tested.
b. Truncated distributions.
(Figure 9.11 diagram: the observed variables, from lexical knowledge through induction/deduction, with their unique factors or measurement errors (error 2 through error 7) and the correlated common factors.)
Figure 9.11. Oblique factor model represented as a path diagram. A common factor is a fac-
tor that influences more than one observed variable. For example, language development, lexical
knowledge, listening ability, and communication ability are all influenced by the crystallized intel-
ligence factor. The common factor analysis model above is oblique because the factors are corre-
lated (e.g., a double-headed arrow connecting the two factors is present).
c. Bimodal distributions.
d. Distributions with few extreme cases.
e. Extreme splits in dichotomized variables.
f. Nonlinear regressions.
3. Using data variables that are not experimentally independent of one another:
a. Scoring the same item responses on more than one variable.
b. In a forced-choice item, scoring one response alternative on one variable and
the other on a second variable.
c. Having one variable as a linear combination of others, for example, in the
GfGc data used in this book, crystallized intelligence and fluid intelligence
comprise part of the construct of general intelligence, so the total score for
general intelligence should not be entered into the factor analysis along with the subtests that compose it.
Factor analysis is a technique for reducing multiple themes embedded in tests to a simpler
structure. This technique is used routinely in the psychometric evaluation of tests and
other measurement instruments. It is particularly useful in establishing statistical evi-
dence for the construct validity of scores obtained on tests. An overview of the concepts
and process of conducting an FA was provided as they relate to the conceptual definitions
underlying a set of measured variables. Core questions common to correctly conduct-
ing and interpreting a factor-analytic study were provided. Starting with the correlation
matrix comprising a set of tests, the process of how FA works relative to the common
factor model was introduced by way of applied examples. Exploratory and confirmatory
approaches to FA were described, with explanations of when their use is appropriate.
The distinction was made between principal components analysis and factor analysis—
conceptually and statistically.
Structural equation modeling was introduced as a technique that provides a flexible
and rigorous way to conduct CFA. The chapter concluded by presenting common errors
to avoid when conducting factor analysis.
Common factor model. Factor analytic model where variables are correlated in part due
to common unobserved influence.
Communality. Reflects the portion of the subtest’s variance associated with the common
factor. It is the sum of the squared loadings for a variable across factors.
Confirmatory factor analysis. A technique used to test (confirm) a prespecified relation-
ship (e.g., from theory) or model representing a posited theory about a construct or
multiple constructs; the opposite of exploratory factor analysis.
Eigenvalue. The amount of total variance explained by each factor, with the total amount
of variability in the analysis equal to the number of original variables (e.g., each
variable contributes one unit of variability to the total amount, due to the fact that the
variance has been standardized; Mertler & Vannatta, 2010, p. 234).
Eigenvector. An optimally weighted linear combination of variables used to derive an
eigenvalue.
Exploratory factor analysis. A technique used for identifying the underlying structure
of a set of variables that represent a minimum number of hypothetical factors. EFA
uses the variance–covariance matrix or the correlation matrix where variables (or test
items) are the elements in the matrix.
Factor. An unobserved or a latent variable representing a construct. Also called an inde-
pendent variable in ANOVA terminology.
Factor indeterminacy. The situation in estimating a factor solution where an infinite num-
ber of possible sets of factor loadings are plausible.
Factor loading. The Pearson correlation between each variable (e.g., a test item or total
test score) and the factor.
Factor rotation. The process of adjusting the factor axes after extraction to achieve a
clearer and more meaningful factor solution. Rotation aids in interpreting the factors
produced in a factor analysis.
Latent factor. Manifested as unobserved influences among variables.
In this chapter, an alternative to the classical test theory model is presented. Item response
theory (IRT) is a model-based approach to measurement that uses item response patterns
and ability characteristics of individual persons or examinees. In IRT, a person’s responses
to items on a test are explained or predicted based on his or her ability. The response
patterns for a person on a set of test items and the person’s ability are expressed by a
monotonically increasing function. This chapter introduces Rasch and IRT models and their
assumptions and describes four models used for tests composed of dichotomously scored
items. Throughout the chapter examples are provided, using data based on the general-
ized theory of intelligence.
10.1 Introduction
The classical test theory (CTT) model serves researchers and measurement special-
ists well in many test development situations. However, as with any method, there are
shortcomings that give rise to the need for more sophisticated approaches. Recall from
Chapter 7 that application of the CTT model involves using only the first and second
moments of a distribution of scores (i.e., the mean and variance or covariances) to index
a person’s performance on a test. In CTT, the total score for an examinee is derived by
summing the scores on individual test items. Using only the total score and first and sec-
ond moments of a score distribution (i.e., the mean and standard deviation) is somewhat
limiting because the procedure lacks a rigorous framework by which to test the efficacy
of the scores produced by the final scale. An alternative approach is to have a psychomet-
ric technique that provides a probabilistic framework for estimating how examinees will
perform on a set of items based on their ability and characteristics of the items (e.g., how
difficult an item is). Item response theory (IRT), also known as modern test theory, is a
system of modeling procedures that uses latent characteristics of persons or examinees
and test items as predictors of observed responses (Lord, 1980; Hambleton & Swaminathan,
1985; de Ayala, 2009). Similar to other statistical methods, IRT is a model-based theory
of statistical estimation that conveniently places persons and items on the same metric
based on the probability of response outcomes. IRT offers a powerful statistical frame-
work that is particularly useful for experts in disciplines such as cognitive, educational,
or social psychology when the goal is to construct explanatory models of behavior and/
or performance in relation to theory.
This chapter begins by describing the differences between IRT and CTT and provides
historical and philosophical perspectives on the evolution of IRT. The chapter proceeds by
describing the assumptions, application, and interpretation of the Rasch, one-parameter
(1-PL), two-parameter (2-PL), and three-parameter (3-PL) IRT models for dichotomous
test item responses. Throughout the chapter, applied examples are provided using the
generalized theory of intelligence test data introduced in Chapter 2.
IRT is a probabilistic, model-based test theory that originates from the pattern of exam-
inees’ responses to a set of test items. Fundamentally, it differs from CTT because in CTT
total test scores for examinees are based on the sum of the responses to individual items.
For example, each test item within a test can be conceptualized as a “micro” test (e.g.,
an item on one of the subtests on crystallized intelligence used throughout this book)
within the context of the total test score (e.g., the composite test score conceptualized in
a “macro” perspective). The sum score for an examinee in CTT is considered a random
variable. One shortcoming of the CTT approach is that the statistics used in evaluating
the performance of persons are sample dependent (i.e., they are deterministic compared
to probabilistic). The impact of a particular sample on item statistics and total test score
can be restrictive during the process of test development. For example, when a sample
of persons or examinees comes from a high-ability level on a particular trait (e.g., intelli-
gence), they are often unlike persons comprising the overall population. Also, the manner
in which persons at the extreme sections of a distribution (e.g., our high-ability example)
perform differs from the performance of samples composed of a broad range of ability.
Another restriction when using CTT is the need to adhere to the assumption of
parallel test forms (see Chapter 7 for a review). In CTT, the assumption of parallel forms
rests on the idea that, in theory, an identical set of test items meeting the assumption of
strictly parallel tests is plausible—an assumption rarely, if ever, met in practice. Further-
more, because CTT incorporates group-based information to derive estimates of reliabil-
ity, person or examinee-specific score precision (i.e., error of measurement) is lacking
across the score continuum. In fact, Lord (1980) noted that increased test score validity is
achieved by estimating the approximate ability level and the associated error of measure-
ment of each examinee with ability (θ).
A third restriction of CTT is that it includes no probabilistic mechanism for estimat-
ing how an examinee might perform on a given test item. For example, a probabilistic
framework for use in test development is highly desirable if the goals are (1) to predict
test score characteristics in one or more populations or (2) to design a test specifically
tailored to a certain population. Finally, other limitations of CTT include the inability
to develop examinee-tailored tests through a computer environment (e.g., in computer
adaptive testing [CAT]), less than desirable frameworks for identifying differential
item functioning (DIF), and equating test scores across different test forms (de Ayala,
2009; Lord, 1980; Hambleton, Swaminathan, & Rogers, 1991).
IRT posits, first, that an underlying latent trait (e.g., a proxy for a person’s ability) can
be explained by the responses to a set of test items designed to capture measurements
on some social, behavioral, or psychological attribute. The latent trait is represented as a
continuum (i.e., a continuous distribution) along a measurement scale. This idea closely
parallels the factor analysis model introduced in Chapter 9 where an underlying unob-
servable dimension or dimensions (e.g., construct(s)) are able to be explained by a set of
variables (e.g., test or survey items/questions) through an optimum mathematical func-
tion. Unidimensional IRT models incorporate the working assumption of unidimension-
ality, meaning that responses to a set of items are represented by a single underlying
latent trait or dimension (i.e., the items explain different parts of a single dimension).
A second assumption of standard IRT models is local independence, meaning that
there is no statistical relationship (i.e., no correlation) between persons’ or examinees’
responses to pairs of items on a test once the primary trait or attribute being measured is
held constant (or is accounted for).
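Stated slightly more formally in standard notation (offered as a sketch): if u_1, u_2, . . . , u_n are an examinee's scored responses to n items and θ is the latent trait, local independence means that

P(u_1, u_2, . . . , u_n | θ) = P(u_1 | θ) · P(u_2 | θ) · . . . · P(u_n | θ),

that is, once θ is taken into account, the responses to the items are statistically independent of one another.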
The advantages of using IRT as opposed to CTT in test development include (1) a
more rigorous model-based approach to test and instrument development, (2) a natural
framework for equating test forms, (3) an adaptive or tailored testing approach relative
to a person’s level of ability to reduce the time of testing (e.g., on the Graduate Record
Examination), and (4) innovative ways to develop and maintain item pools or banks for
use in computer adaptive testing. Moreover, when there is an accurate fit between an item
response model and an acquired set of data, (1) item parameter estimates acquired from
different groups of examinees will be the same (except for sampling errors); (2) exam-
inee ability estimates are not test dependent and item parameters are not group dependent;
and (3) the precision of ability estimates is known through the estimated standard errors
of individual ability estimates (Hambleton & Swaminathan, 1985, p. 8; Hambleton et al.,
1991; Baker & Kim, 2004; de Ayala, 2009). The last point illustrates that IRT provides
a natural framework for extending notions of score reliability. For example, IRT makes
it possible to estimate conditional standard errors of measurement and reliability at the
person ability level (Raju et al., 2007; Price, Raju, & Lurie, 2006; Kolen et al., 1992; Feldt
& Brennan, 1989; Lord, 1980). Estimating and reporting conditional standard errors of
measurement and score reliability is highly recommended by AERA, APA, and NCME
(1999) and is extremely useful in test development and score interpretation. Additionally,
using IRT to scale or calibrate a set of test items provides an estimate of the reliability
based on the test items.
IRT is formally classified as a strong true score theory. In a psychometric sense, this
theory implies that the assumptions involved in applying models correctly to real data
are substantial. For example, the degree to which item responses fit an ideal or proposed
model is crucial. In fact, strong true score models such as IRT can be statistically tested
for their adequacy of fit to an expected or ideal model. Alternatively, consider CTT
where item responses are summed to create a total score for a group of examinees. In
Chapter 7, it was noted that the properties of CTT are based on long-run probabilistic
sampling theory using a mainly deterministic perspective. In CTT, a person’s true score
is represented by a sum score that is based on the number of items answered correctly.
The number correct or the sum score for a person serves as an unbiased estimate of the
person’s true score. In CTT, the total score (X) is a person’s unbiased estimate of his or
her true score (T). True score is based on the expectation over a theoretically infinite
number of sampling trials (i.e., long-run probabilistic sampling). Classical test theory
is not a falsifiable model, meaning that a formal test of the fit of the CTT model to the
data is not available.
In IRT, the probability that a person with a particular true score (e.g., estimated by
an IRT model) will exhibit a specific observed score makes IRT a probabilistic approach
to how persons or examinees will likely respond to test items. In IRT, the relationship
between observed variables (i.e., item responses) and unobserved variables or latent
traits (i.e., person abilities) is specified by an item response function (IRF) graphed as
an item characteristic curve (ICC). An examinee’s true score is estimated or predicted
based on his or her observed score. Thus, IRT is the nonlinear regression of observed
score on true score across a range of person or examinee abilities. Establishing an esti-
mated true score for a person by this probabilistic relationship formally classifies IRT
as a strong true score theory. Conversely, CTT is classified as weak theory and therefore
involves few assumptions.
To summarize, IRT is based on the following two axioms. The first axiom is that the
probability of responding correctly to a test item is a mathematical function of a person’s
underlying ability formally known as his or her latent trait or ability. The second axiom
states that the relationship between persons’ or examinees’ performance on a given item
and the trait underlying their performance can be described by a monotonically increas-
ing IRF graphically depicted as an ICC. The ICC is nonlinear or S-shaped because the probability of a correct response to an item (displayed on the Y-axis) is expressed as a proportion (a range of 0.0 to 1.0), and this proportion is mapped, through the cumulative normal distribution function, onto the scale representing a person's ability or latent trait (the X-axis). The shape of the IRF/ICC is illustrated shortly using an
example from the intelligence data used throughout the book.
From a statistical perspective, an important difference between CTT and IRT is the
concept of falsifiability. Item response models are falsifiable: an item response model cannot be demonstrated to be correct in an absolute sense (or accepted simply by tautology, as though it were valid without question). Instead, the appropri-
ateness of a particular IRT model relative to a particular set of observed data is established
by conducting goodness-of-fit testing for persons and items. For example, the tenability
of a particular IRT model given a set of empirical data is possible after inspection of the
discrepancy between the observed versus predicted residuals (i.e., contained in an error
or residual covariance matrix) after model fitting. Readers may want to return to Chapter
2 and the Appendix to review the role of the covariance matrix in statistical operations in
general, and regression specifically.
Finally, because all mathematical models used to describe a set of data are based on
a set of assumptions, the process of model selection occurs relative to the item develop-
ment and proposed uses of the test (e.g., the target population of examinees for which
the scores will be used).
In this statistical sampling approach, the model describes the probability of each possible outcome on an item (Lord, 1980; Lord & Novick, 1968). In the sampling
approach to IRT, the process of fitting statistical models to a set of item responses focuses
initially on a set of examinees’ item scores rather than on person ability. This differs
from Rasch’s sample-free approach to measurement where person ability is the dominant
component in the probabilistic model. In the Rasch model, test items are constructed or
designed to “fit” the properties of Rasch measurement theory. For readers interested in
the details of Rasch measurement theory, see Wright and Stone (1979) and Bond and Fox
(2001). With regard to the philosophical stances between the Rasch and IRT approaches,
as Holland and Hoskins (2003) note, item parameters and person abilities are always
estimated in relation to a sample obtained from a population. In this sense, it is illusory to
believe that a sample-free measurement exists. In the end, both philosophical approaches
have merit and should be considered when deciding on an approach to address practical
testing problems.
Table 10.1 provides the taxonomy of Rasch and IRT models. From this table, we
see that many Rasch and IRT models are available to meet a variety of measurement and
testing scenarios. In this chapter, we focus on four models (highlighted in gray in Table
10.1) that are foundational to understanding and using IRT: the Rasch, one-parameter,
two-parameter, and three-parameter unidimensional models for dichotomous items.
Once the foundations of Rasch and IRT models are presented, readers are encouraged to
expand their knowledge by reading the suggested references that introduce variations of
the models in this chapter.
To illustrate how Rasch analysis and IRT work, an example is presented using our example intelligence test data. The example given in the next sections is based on the Rasch model. The Rasch model is formally introduced in Section 10.18 and is used in the sections that immediately follow because it is foundational to item response modeling. We
begin by illustrating how person ability and test items are related on a single continuum.
Next, the assumptions of Rasch and IRT models are reviewed, and applied examples of
how to evaluate the assumptions are provided.
Returning to the example intelligence test data used in this book, we see that a
person with a higher level of intelligence should be more likely to respond correctly to a
particular item in relation to a person with a lower level of intelligence. Recall that intel-
ligence is a latent trait or attribute that is not directly observable. Graphically, an example
of a continuum representing a latent attribute is provided in Figure 10.1. The values on
the horizontal line in the figure are called logits and are derived using the logistic equation (see Equations 10.5 and 10.6). Logit values are obtained by applying the logarithmic (log-odds) transform to nonlinear data (e.g., the probability of a correct response to a binary test item), which places item locations on a linear scale. Notice
in Figure 10.1 that both the location of items and the ability of a person are located on
the same scale.
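To make the logit idea concrete, the following short Python sketch (not from the book; the proportions are hypothetical) converts proportions of correct responses into logit-scale item locations with the log-odds transform, so that easy items receive negative values and hard items positive values, as in Figure 10.1.

import math

def log_odds(p):
    """Natural log of the odds p / (1 - p)."""
    return math.log(p / (1.0 - p))

# Hypothetical proportions of examinees answering each of six items correctly.
proportions = [0.95, 0.88, 0.50, 0.27, 0.12, 0.05]

for item, p in enumerate(proportions, start=1):
    # A common convention places item difficulty at the log-odds of an
    # incorrect response, so easy items receive negative logit values
    # (left side of the continuum) and hard items positive values.
    difficulty = -log_odds(p)          # = log((1 - p) / p)
    print(f"Item {item}: proportion correct = {p:.2f}, difficulty = {difficulty:+.2f} logits")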
Figure 10.1. Latent variable continuum based on six items mapped by item location/difficulty. Person ability (θ) and item difficulty (δ) are located on the same scale, ranging from –3 to +3 logits, with items 1 through 6 arrayed along the continuum.
To interpret Figure 10.1, consider a hypothetical person who exhibits an ability of 0.0 on the ability (θ) scale. Easier items (e.g., 1 and 2) are on the left side of the continuum; moderately difficult items (e.g., 3 and 4) are in the middle; and harder items (e.g., 5 and 6) are on the positive side of the continuum. From a probabilistic perspective, this person with ability 0.0 will be less likely to answer item 5 correctly than item 3 because item 5 has a difficulty of δ5 = 2.0 on the logit scale, compared to item 3 with a difficulty of 0.0; that is, the discrepancy between the difficulty of item 5 and the ability of the person (0.0) is larger than the discrepancy between item 3 and the ability of the person. Likewise, an item with a difficulty of 2.0 is more difficult than, say, an item with δ4 = 1.0. Conversely, the same person with ability θ = 0.0 responding to item 1 (δ1 = –3.0) will be very likely to respond correctly to the item, given that the item location is at the extreme lower end of the item location/difficulty and theta (ability) continuum.
The key idea in Figure 10.1 is that the greater the discrepancy between the person abil-
ity and item location, the greater the probability of correctly predicting how the person will
respond to an item or question (i.e., correct/incorrect or higher/lower on an ordinal-type
scale). In the Rasch model, the only item characteristic being measured is item difficulty, δ. Under these circumstances, as the discrepancy between the person ability and the item location nears zero, the probability of the person responding correctly to the item approaches .50, or 50%.
IRT is a statistical model, and like most statistical models it involves assumptions. The
first assumption to consider prior to the application of any Rasch or IRT model is the
dimensionality of the set of items comprising the test or instrument. The dimensionality
of a test specifies whether there are one or more underlying abilities, traits, or attributes
being measured by the set of items. The term dimension(s) is used synonymously with
person ability or latent trait in IRT. Abilities or traits modeled in IRT can reflect educa-
tional achievement, attitudes, interests, or skill proficiency—all of which may be mea-
sured and scaled on a dichotomous, polytomous (ordinal), or unordered categorical level.
The most widely used Rasch and IRT model is unidimensional and assumes that there
is a single underlying ability that represents differences between persons and items on a
test. Strictly speaking, the assumption of unidimensionality is rarely able to be perfectly
met in practice owing to the interplay of a variety of factors such as test-taking anxiety,
guessing, and the multidimensional nature of human cognitive skills and abilities. How-
ever, the performance of Rasch and IRT models has been shown to be robust to minor
violations of the dimensionality assumption, provided that a single overriding or dominant
factor influences test performance (Hambleton et al., 1991). Users of conventional unidi-
mensional IRT models assume that a single ability sufficiently explains the performance
of an examinee or examinees on a set of test items.
The dimensionality of a test is closely related to the idea of a single underlying factor (or
latent trait in IRT terminology) represented by a set of items or questions. Evaluating
the dimensionality of a test can proceed in a number of ways (Hattie, 1985). This sec-
tion begins with early approaches to dimensionality assessment related to IRT and then
transitions to more sophisticated approaches now commonly used. In early applications
of IRT, Lord (1980) recommended examining the eigenvalues produced from a linear
factor analysis in relation to the number of dominant factors present in a particular set of
items. For readers unfamiliar with factor analysis, the topic is presented in Chapter 9 and
should be reviewed to fully understand the key ideas presented here. In factor analysis, an
eigenvalue represents the amount of variance accounted for by a given factor or dimen-
sion. Figure 10.2 illustrates the situation where a single dominant factor (i.e., a distinct
eigenvalue between 4 and 5 on the Y-axis followed by a 90-degree elbow at the second fac-
tor) exists by way of a scree plot, a test attributed to Cattell (1966). A scree plot is a graph
of the number of factors depicted by the associated eigenvalues generated using principal
axis factor analysis. The eigenvalues that appear after the approximate 90-degree break
in the plot line (e.g., eigenvalue 2 and beyond) are termed “scree” synonymous with rem-
nants or rubble at the bottom of a mountain. In Figure 10.2, eigenvalues are plotted as a
function of the number of factors in a particular set of item responses or variables.
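As a minimal illustration of this eigenvalue-based check (not the LISREL/PRELIS analysis used in the book), the following Python sketch computes eigenvalues from an inter-item correlation matrix with NumPy and prints them in the order in which they would appear on a scree plot; the correlation matrix here is small and hypothetical.

import numpy as np

# Hypothetical 6 x 6 inter-item correlation matrix with one dominant factor.
R = np.array([
    [1.00, 0.45, 0.40, 0.42, 0.38, 0.41],
    [0.45, 1.00, 0.44, 0.39, 0.40, 0.37],
    [0.40, 0.44, 1.00, 0.43, 0.36, 0.39],
    [0.42, 0.39, 0.43, 1.00, 0.41, 0.40],
    [0.38, 0.40, 0.36, 0.41, 1.00, 0.42],
    [0.41, 0.37, 0.39, 0.40, 0.42, 1.00],
])

# Eigenvalues of the symmetric correlation matrix, largest to smallest.
eigenvalues = np.sort(np.linalg.eigvalsh(R))[::-1]

for k, value in enumerate(eigenvalues, start=1):
    print(f"Factor {k}: eigenvalue = {value:.3f}")

# A single dominant first eigenvalue followed by a sharp drop (the "elbow")
# is consistent with an essentially unidimensional set of items.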
Traditional factor analysis techniques such as those introduced in Chapter 9 are appropri-
ate for interval or continuous data. When the item-level response data are dichotomous
Figure 10.2. Scree plot generated from principal axis factor analysis (eigenvalues on the Y-axis plotted against the number of factors, 1 through 12, on the X-axis).
with an assumed underlying distribution on the latent trait being normal, the tetrachoric
correlation matrix is the appropriate matrix to use for analysis (Lord, 1980; Lord &
Novick, 1968). The tetrachoric correlation coefficient (introduced in the Appendix) is
a measure of the relationship between two dichotomous variables where the underly-
ing distribution of performance on each variable is assumed to be normal (McDonald
& Ahlawat, 1974). According to Lord and Novick (1968), a sufficient condition for the
existence of unidimensionality for a set of dichotomously scored test items is that the
result of factor analyzing a matrix of tetrachoric correlations results in a single common
factor. To illustrate factor analysis of our example item response data, the LISREL/PRELIS
8 program (Jöreskog & Sörbom, 1999a) is used. The following PRELIS program produces
a factor analysis using polychoric/tetrachoric correlation matrices. For a review of poly-
choric/tetrachoric correlation coefficients and how they differ from Pearson correlation
coefficients see the Appendix. The syntax language in the PRELIS program below can be
referenced in the PRELIS 2 User's Reference Guide (Jöreskog & Sörbom, 1999b, pp. 7–8).
Once the output files are created and saved (output files are saved using the shaded
line in the program syntax above), the following LISREL program can be used to run
the factor analysis on tetrachoric correlations to evaluate the dimensionality of the set of
items. The syntax in the LISREL program below can be referenced in the LISREL 8 User’s
Reference Guide (Jöreskog & Sörbom, 1996, pp. 248–249).
Next, an abbreviated output from the LISREL program factor analysis results is provided
that includes the fit of the item-level data to a one-factor model.
[Q-plot of standardized residuals (X-axis, –3.5 to 3.5) against normal quantiles (Y-axis). Note: asterisks represent multiple data points; X's represent single data points.]
Reviewing the output from the LISREL factor analysis, we see that the one-factor
model is supported as evidenced by the root mean square error of approximation (RMSEA)
being 0.06 (a value less than .08 is an established cutoff for adequate model-data fit in
factor analysis (FA) conducted in structural equation modeling). Also presented is the
Q-residual plot, which illustrates how well the one-factor model fits the data from the
view of the residuals (i.e., the observed versus predicted values). A residual is defined as
the discrepancy between the actual (sample) data and the fitted covariance matrix. For
example, if we see excessively large residuals (i.e., > 3.5 in absolute terms) or a severe
departure from linearity, or if the plotted points do not extend entirely along the diagonal
line, there is at least some degree of misfit (Jöreskog & Sörbom, 1996, p. 110). In the
Q-plot, the standardized residuals are defined as the fitted residual divided by the large
sample standard error of the residual. Although the data presented in the Q-plot do not
reflect a perfect fit, inspection of it along with the fit statistics reported provides sufficient
evidence for supporting the existence of the one-factor model (i.e., we can be confident
the set of item responses is unidimensional).
Estimate of examinee guessing on test: 0.0000

AT List: 5, 8, 9, 10, 11, 12, 14, 15
PT List: 1, 2, 3, 4, 6, 7, 13, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25

DIMTEST STATISTIC
  TL = 8.0022    TGbar = 7.0387    T = 0.9587    p-value = 0.1688
We see from the DIMTEST T-statistic (last line in the output) that the null hypothesis of essential unidimensionality is not rejected (p = .17), providing support for unidimensionality for crystallized intelligence test 2.
An alternative to DIMTEST is full information item factor analysis—a factor-analytic
technique that uses tetrachoric correlations to estimate an item response function (IRF)
based on the item responses. The full information item factor analysis technique is imple-
mented in the program TESTFACT (Bock, Gibbons, & Muraki, 1988, 1996). Another
program that is similar to TESTFACT and very useful for IRT-based dimensionality
assessment is the Normal Ogive Harmonic Analysis Robust Method (NOHARM; Fraser
& McDonald, 2003). Returning to the TESTFACT program, we list below the TESTFACT
syntax for conducting a test of dimensionality for crystallized intelligence test 2 using full
information item factor analysis.
>TITLE
CRYSTALLIZED INTELLIGENCE
TEST 2 - 25 ITEMS;
>PROBLEM NIT=25, RESPONSE=3;
>RESPONSE ' ','0','1';
>KEY 1111111111111111111111111;
>TETRACHORIC NDEC=3, LIST;
>RELIABILITY ALPHA;
>PLOT PBISERIAL, FACILITY;
>FACTOR NFAC=1, NROOT=3;
>FULL CYCLES=20;
>TECHNICAL NOADAPT PRECISION=0.005;
>INPUT WEIGHT=PATTERN, FILE='F:\CRI2.DAT';
(25A1)
>STOP
>END
The results of the TESTFACT analysis concur with our factor analysis conducted using LISREL and reveal one underlying dimension for the set of 25 items. For example, the largest eigenvalue (latent root) is 9.513384 (see below), with the next largest eigenvalue dropping substantially to 2.086334 and the values continuing to decline until the sixth eigenvalue reaches .336450.
Partial output from the TESTFACT program for full information item factor analysis (largest latent roots):

Root:    1          2          3          4          5          6
Value:   9.513384   2.086334   0.869951   0.510628   0.392143   0.336450
We can evaluate whether a one- or two-factor model best fits the data by conducting
two separate analyses with TESTFACT by simply changing the NFAC keyword in the
program (highlighted in gray), then comparing the results using a chi-square difference
test. Calculating the difference between the one-factor model chi-square and the two-
factor model yields a chi-square of 270.68; the results are provided below.
One-Factor Model
Chi-square = 4552.15 and degrees of freedom = 449.00
Two-Factor Model
Chi-square = 4281.47 and degrees of freedom = 425.00
The difference in degrees of freedom between the two models is 24. Consulting a chi-square table, we find that a chi-square difference of 270.68 on 24 degrees of freedom indicates that the two-factor model fits better from a purely statistical standpoint. However, the one-factor model accounts for a substantial amount of
explained variance (relative to the two-factor model) in the set of items. Additionally, the
pattern and size of the eigenvalues (latent roots) do not differ much between the models.
Therefore, we can be reasonably confident that there is a single underlying dimension
that is explained by the 25 items.
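The chi-square difference calculation itself can be verified with a few lines of Python using SciPy; the fit statistics below are the ones reported above, and the chi-square survival function gives the p-value for the difference test (this is an illustrative check, not part of the TESTFACT output).

from scipy.stats import chi2

# Fit statistics reported above for the one- and two-factor TESTFACT models.
chisq_one_factor, df_one_factor = 4552.15, 449
chisq_two_factor, df_two_factor = 4281.47, 425

# Chi-square difference test: the difference in fit statistics is evaluated
# against the difference in degrees of freedom.
diff_chisq = chisq_one_factor - chisq_two_factor   # 270.68
diff_df = df_one_factor - df_two_factor            # 24

p_value = chi2.sf(diff_chisq, diff_df)
print(f"Chi-square difference = {diff_chisq:.2f} on {diff_df} df, p = {p_value:.3g}")
# A very small p-value indicates the two-factor model fits better in a strictly
# statistical sense, as noted in the text.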
Finally, when more than one dimension is identified (i.e., no single dominant fac-
tor emerges) to account for examinee performance on a set of test items, researchers
must either revise or remove certain test items to meet the unidimensionality assump-
tion or use a multidimensional approach to IRT (McDonald, 1985a; Bock et al., 1988;
Reckase, 1985, 2009; McDonald, 1999; Kelderman, 1997; Adams, Wilson, & Wang,
1997). One relatively new approach to use when multidimensionality is present is
called mixture modeling and is based on identifying mixtures of distributions of per-
sons within a population of examinees. This approach to IRT is based on latent class
analysis (LCA) of homogeneous subpopulations of persons existing within a sample
(de Ayala, 2009).
A second assumption of IRT is local independence, also known as conditional item inde-
pendence. Recall that in IRT, a latent trait or dimension influences how a person or exam-
inee will respond to an item. Operationally, once examinees’ ability is accounted for (i.e.,
statistically controlled), no covariation (or correlation) remains between responses to dif-
ferent items. When local item independence holds, a particular test item in no way pro-
vides information that may be used to answer another test item. From classical probability
theory, when local item independence is present, the probability of a pattern of responses
to test items for an examinee is derived as the product of the individual probabilities of
correct and incorrect responses on each item (e.g., by applying the multiplicative rule of
probability). To formalize the local independence assumption within standard IRT ter-
minology, let θ represent the complete set of latent abilities influencing examinee performance on a set of test items, and let Uj represent the response to item j (across the vector of items j = 1, 2, 3, . . . , n). Using conditional probability theory, let P(Uj|θ) represent the probability of the response of a randomly chosen examinee from a population given ability θ, with P(Uj = 1|θ) for a correct response and P(Uj = 0|θ) for an incorrect response. Equation 10.1
illustrates the probability of conditionally independent responses to items by a randomly
chosen examinee with a given level of ability (Hambleton et al., 1991, p. 33).
$$P(U_1, U_2, U_3, \ldots, U_N \mid \theta) = P(U_1 \mid \theta)\,P(U_2 \mid \theta)\,P(U_3 \mid \theta)\cdots P(U_N \mid \theta) = \prod_{j=1}^{N} P(U_j \mid \theta) \qquad (10.1)$$
• P = probability of a response to an item.
• Uj = the response to an item, either 1 for correct or 0 for incorrect.
• P(Uj|θ) = the probability of a randomly chosen examinee's response to item j, given his or her ability.
• θ = person ability, or theta.
• P(U1|θ)P(U2|θ)P(U3|θ) · · · P(UN|θ) = the product of the probabilities of the responses to items 1 through N.
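To see how Equation 10.1 operates, the following Python sketch multiplies hypothetical conditional item probabilities to obtain the probability of a particular response pattern for one examinee at a fixed ability level (the probabilities and the response pattern are invented for illustration).

# Hypothetical probabilities of a correct response to five items for an
# examinee with a given ability (theta), i.e., P(U_j = 1 | theta).
p_correct = [0.90, 0.75, 0.55, 0.35, 0.15]

# Observed response pattern for that examinee (1 = correct, 0 = incorrect).
responses = [1, 1, 1, 0, 0]

# Under local independence, the probability of the whole pattern is the
# product of the item-level probabilities (Equation 10.1).
pattern_probability = 1.0
for p, u in zip(p_correct, responses):
    pattern_probability *= p if u == 1 else (1.0 - p)

print(f"P(response pattern | theta) = {pattern_probability:.5f}")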
Examinee:  1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25
Item 1:    1 1 1 1 0 0 0 0 0 0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0
Item 2:    1 1 1 1 0 0 0 0 0 1  0  0  1  0  0  0  0  0  0  0  0  0  0  0  0
Examining the results in Table 10.4, we reject the hypothesis that the items are independent of one another at an exact probability of p = .015. Fisher's exact test is appropriate when at least some cells in the analysis contain fewer than five observations, as is the case in the present analysis. In rejecting the hypothesis of independence, we conclude that the assumption of local item independence does not hold for these two items for the 25 examinees; however, this is a very simple example using only two items. In practice, local independence is evaluated based on the item response pat-
terns across all ability levels for all examinees in a sample. Computationally, this step is
challenging and is (1) performed in conjunction with testing the dimensionality of a set
of items as described earlier in the DIMTEST program explanation or (2) evaluated by
using a separate analysis, as presented next.
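As a sketch of how such a check could be carried out in Python, the code below cross-tabulates the two item-response vectors listed above and applies SciPy's Fisher's exact test. Because Table 10.4 itself is not reproduced here, the exact counts and p-value the book obtained may differ from what this illustrative calculation produces.

import numpy as np
from scipy.stats import fisher_exact

# Responses of 25 examinees to items 1 and 2 (from the listing above).
item1 = np.array([1,1,1,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0])
item2 = np.array([1,1,1,1,0,0,0,0,0,1,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0])

# 2 x 2 cross-tabulation: rows = item 1 (1/0), columns = item 2 (1/0).
table = np.array([
    [np.sum((item1 == 1) & (item2 == 1)), np.sum((item1 == 1) & (item2 == 0))],
    [np.sum((item1 == 0) & (item2 == 1)), np.sum((item1 == 0) & (item2 == 0))],
])

odds_ratio, p_value = fisher_exact(table, alternative="two-sided")
print(table)
print(f"Fisher's exact test: p = {p_value:.4f}")
# A small p-value leads to rejecting independence, i.e., evidence against
# local item independence for this item pair.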
Another approach for evaluating the assumption of local item independence is using
Yen’s (1984, 1993) Q3 statistic. The advantage of using the Q3 statistic approach is that
(1) it is relatively easy to implement, (2) it requires no specialized software, and (3) it
yields reliable performance across a wide range of sample size conditions (Kim, de Ayala,
Ferdous, & Nering, 2007). The Q3 technique works by examining the correlation of the
residuals between pairs of items. A residual is defined as the difference between an exam-
inee’s observed response to an item and his or her expected response to the item. Two
residuals are necessary in order to implement the Q3: an item residual and a person-level
residual. The person-level residual for an item (j) is given in Equation 10.2 (de Ayala,
2009, p. 132).
The person-level residual for an item (k) is given in Equation 10.3 (de Ayala, 2009,
p. 133).
With the two residual components known, one can calculate the Q3 statistics as pro-
vided in Equation 10.4 (de Ayala, 2009, p. 133).
Applying the Q3 technique involves evaluating the magnitude and sign of the
pairwise correlations in Equation 10.4. The main point of the technique is to evaluate
the dependence between item pairs across all examinees in a sample. For example, a
$$d_{ij} = x_{ij} - P_j(\hat{\theta}_i) \qquad (10.2)$$

$$d_{ik} = x_{ik} - P_k(\hat{\theta}_i) \qquad (10.3)$$

$$Q_{3jk} = r_{d_j d_k} \qquad (10.4)$$
correlation of 0.0 between item pairwise residuals means that the primary condition for the assumption of local item independence is tenable. However, a 0.0 correlation may also
result from a nonlinear relationship. For this reason, a Q3 value of 0.0 is a necessary but
not sufficient condition that local item independence is evident in the test items. For
comprehensive details of implementing the Q3 technique, refer to Yen (1984, 1993) and
de Ayala (2009).
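A minimal Python sketch of the Q3 computation follows; the ability estimates, item difficulties, and responses are hypothetical, and the Rasch model is used to supply the expected responses.

import numpy as np

def rasch_p(theta, delta):
    """Rasch probability of a correct response given ability and difficulty."""
    return 1.0 / (1.0 + np.exp(-(theta - delta)))

# Hypothetical ability estimates for 8 examinees and difficulties for items j and k.
theta_hat = np.array([-1.5, -1.0, -0.5, 0.0, 0.5, 1.0, 1.5, 2.0])
delta = np.array([-0.3, 0.4])

# Hypothetical observed responses (rows = examinees, columns = items j, k).
x = np.array([[0, 0],
              [1, 0],
              [0, 0],
              [1, 1],
              [1, 0],
              [1, 1],
              [1, 1],
              [1, 1]])

# Person-level residuals for each item (Equations 10.2 and 10.3).
d = x - rasch_p(theta_hat[:, None], delta[None, :])

# Q3 for the item pair is the correlation between the two residual columns
# (Equation 10.4). Values near zero are consistent with local independence.
q3_jk = np.corrcoef(d[:, 0], d[:, 1])[0, 1]
print(f"Q3(j, k) = {q3_jk:.3f}")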
In Section 10.2 comparisons were presented between CTT and IRT. Arguably, the most
important difference between the two theories and the results they produce is the prop-
erty of invariance. In IRT, invariance means that the characteristics of item parameters
(e.g., difficulty and discrimination) do not depend on the ability distribution of exam-
inees, and conversely, the ability distribution of examinees does not depend on the item
parameters. In Chapter 7, the CTT item indexes introduced included the proportion of examinees responding correctly to items (i.e., proportion-correct) and the discrimination
of an item (i.e., the degree to which an item separates low- and high-ability examinees).
In CTT, these indexes change in relation to the group of examinees taking the test (e.g.,
they are sample dependent). However, when the assumptions of IRT hold and the model
adequately fits a set of item responses (i.e., either exactly or as a close approximation),
the same IRF/ICC for the test items is observed regardless of the distribution of ability of the
groups used to estimate the item parameters. For this reason the IRF is invariant across
populations of examinees. This situation is illustrated in Figure 10.3.
The property of invariance is also a property of the linear regression model. We
can make connections between the linear regression model and IRT models because IRT
models are nonlinear regression models. Recall from Chapter 2 and the Appendix that
the regression line for predicting Y from X is displayed as a straight line connecting the
conditional means of the Y values with each value or level of the X variable (Lomax,
2001, p. 26; Pedhazur, 1982). If the assumptions of the linear regression model hold,
the regression line (i.e., the slope and intercept) will be the same for each subgroup of
persons within each level of the X variable. In IRT, we are conducting a nonlinear regres-
sion of the probability of a correct response (Y) on the observed item responses (X). To
illustrate the property of invariance and how the assumption can be evaluated, we return
to the crystallized intelligence test 2 data, made up of 25 dichotomously scored items,
and focus on item number 11. First, two random subsamples of size 500 were created
from the total sample of 1,000 examinees. SPSS was used to create the random subsam-
ples, but any statistical package can be used to do this. Next, the classical item statistics
proportion-correct and point–biserial are calculated for each sample. To compare our
random subsamples derived from CTT item statistics with the results produced by IRT,
a two-parameter IRT model is fit to each subsample (each with N = 500) and the total
sample (N = 1,000). Although the two-parameter IRT model is yet to be introduced, it is
used here because item difficulty and discrimination are both estimated, making comparisons between IRT and CTT possible. A summary of the CTT item statistics and IRT parameter estimates is presented in Table 10.5, and Figure 10.4 illustrates the two item characteristic curves for random subsamples 1 and 2.
Inspection of the classical item statistics (CTT) in Table 10.5 (top half of the table)
for item 11 reveals that the two samples are not invariant with respect to the ability of
the two groups (i.e., the proportion-correct and point–biserial coefficients are unequal).
Next, comparing the parameters estimated for item 11 using the two-parameter IRT
model, in Table 10.5 we see that the item difficulty or location estimates for the two samples are very close (.04 vs. –.07) and that the discrimination parameters (labeled as "slope") are the same (.92 vs. .92). Finally, conducting a chi-square difference test between the groups on item 11 yields no statistically significant difference, indicating that the item locations and discrimination parameters are approximately equal. To summarize, (1) invariance holds regardless of differences in person or examinee ability in the IRT model, and (2) invariance does not hold for the two random subsamples when using the CTT model (i.e., the proportion-correct and point–biserial correlations are unequal).

Figure 10.3. Invariance of the item response function across different ability distributions (ability scale –3 to +3, corresponding IQ scale 55 to 145; probability of a yes response from 0 to 1.00). A test item has the same IRF/ICC regardless of the ability distribution of the group. For an item location/difficulty of 0.0, the low-ability group will be less likely to respond correctly because a person in the low-ability group is located at –1.16 on the ability scale, whereas a person in the high-ability group is located at 0.0 on the ability scale.
Table 10.5. Classical Item Statistics and 2-PL IRT Parameter Estimates for Two Random Samples

Statistic                 Sample 1    Sample 2    Total sample
CTT
  Proportion correct        0.49        0.52        0.52
  Point–biserial            0.60        0.53        0.61
  Biserial correlation      0.67        0.67        0.70
IRT
  Logit                     0.02       –0.06       –0.06
  Intercept (γ)            –0.04        0.07        0.07
  Slope (α or a)            0.92        0.92        0.99
  Threshold (δ or b)        0.04       –0.07       –0.07

Note. Correlation between all 25 item thresholds (locations) for samples 1 and 2 = .97. The logit is derived from the slope–intercept parameterization of the exponent in the 2-PL IRT model: α(θ) + γ. The relationship between an item's location or difficulty, intercept, and slope is δ = –γ/α, and for the total sample the item difficulty/location is derived as –1.52/1.48 = 1.03. The relationship between IRT discrimination and the CTT biserial correlation for the total sample is

$$\rho_{\mathrm{bis}} = \frac{\alpha_j}{\sqrt{1 + \alpha_j^{2}}} = \frac{.99}{\sqrt{1 + .99^{2}}} = \frac{.99}{\sqrt{1.98}} = \frac{.99}{1.41} = .70.$$
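The classical (CTT) side of this comparison is easy to reproduce in outline. The following Python sketch uses hypothetical simulated responses and a hypothetical random split (not the SPSS procedure or the actual test data): it divides a response matrix into two random halves and computes the proportion correct and point–biserial correlation for one item in each half.

import numpy as np

rng = np.random.default_rng(seed=1)

# Hypothetical 1,000 x 25 matrix of dichotomous item responses.
n_persons, n_items = 1000, 25
ability = rng.normal(size=n_persons)
difficulty = rng.normal(size=n_items)
responses = (rng.random((n_persons, n_items)) <
             1.0 / (1.0 + np.exp(-(ability[:, None] - difficulty[None, :])))).astype(int)

def point_biserial(item_scores, total_scores):
    """Pearson correlation between a dichotomous item and the total score."""
    return np.corrcoef(item_scores, total_scores)[0, 1]

# Random split into two subsamples of 500 examinees each.
order = rng.permutation(n_persons)
samples = {"Sample 1": order[:500], "Sample 2": order[500:]}

item = 10                                  # "item 11" with zero-based indexing
for label, idx in samples.items():
    item_scores = responses[idx, item]
    total_scores = responses[idx].sum(axis=1)
    print(f"{label}: proportion correct = {item_scores.mean():.2f}, "
          f"point-biserial = {point_biserial(item_scores, total_scores):.2f}")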
In the next section, the process of simultaneously estimating the probability of item responses and person ability is introduced.
Using actual item responses from a set of examinees (e.g., in Table 10.2) and applying
Equation 10.1, we can estimate the joint probability of correct and incorrect responses to
each item in a test by a set of examinees. To do this, we can use the likelihood function in
Equation 10.5. In the Appendix, the symbol ∏ is defined as the multiplicative (product) operator.
Applying the multiplicative operator to the likelihood values for individual item response
scores at a specific examinee ability level yields the total likelihood, in terms of probabili-
ties, for the response pattern of scores for a sample of examinees (see Equation 10.5 on
page 353 from Hambleton et al., 1991, p. 34).
The product resulting from the multiplicative operation yields very small values,
making them difficult to work with. To avoid this issue, the logarithms of the likelihood
functions (i.e., the log likelihoods) are used instead as given in Equation 10.6. Using
logarithms rescales the probabilities so that the log-likelihood values are easier to work with numerically. Furthermore, taking logarithms converts the product into a sum, so the summation operator replaces the multiplicative operator. Equation 10.6 (Hambleton et al., 1991,
p. 35) illustrates these points about the use of log likelihoods.
Figure 10.4. Item response functions for item 11 for subsamples 1 and 2 (two panels; probability of a correct response on the Y-axis, ability from –3 to +3 on the X-axis). Vertical bars around the solid dots indicate the 95% level of confidence around the fit of the observed data relative to the predicted IRF based on the two-parameter IRT model. The nine dots represent the fit at different distribution points along the ability continuum.
L = (.02)(.05)(.11)(.16)(.25) = .00000345

In logarithms of likelihoods: log L = –5.456
Equation 10.7 (Hambleton et al., 1991, p. 34) illustrates the process of applying the
likelihoods for estimating the probability of the observed response pattern for examinee
number 4 based on the data in Table 10.6.
During an IRT analysis, the steps above are conducted for all examinees and all items
in a sample. As presented in the Appendix, to locate the value of ability at which the likelihood is at its maximum, the logarithm of the likelihood function is used, as illustrated in Equations 10.6 and 10.7. Using logarithms, we define the value of ability (θ) that maximizes the log likelihood for an examinee as the maximum likelihood estimate of ability, MLE (θ̂); the "hat" on top of θ signifies that it is an estimate of the population parameter. The process of obtaining the MLE is iterative, meaning, for example, that ability for a sample of examinees is estimated based on initial item parameters from the observed data, and the maximum of the likelihood function for person ability is located by using calculus-based numerical methods. The
numerical integration algorithms are included in IRT programs such as IRTPRO (2011),
BILOG-MG (2003), PARSCALE (1997), MULTILOG (2003), WINSTEPS (2006), and
CONQUEST (1998). The process of estimating ability and item parameters involves iter-
ative techniques because locating the maximum likelihood of ability necessitates com-
puter searching for the location where the slope of the likelihood function is zero; this
must be performed for all persons in a sample. Further explanation of the process of
estimating item parameters and person ability is provided in Section 10.16.
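A minimal sketch of what this iterative search is doing conceptually follows (Python; the item difficulties and responses are hypothetical, the Rasch model supplies the item response function, and a simple grid search stands in for the calculus-based routines implemented in programs such as BILOG-MG).

import numpy as np

def rasch_p(theta, delta):
    """Rasch probability of a correct response."""
    return 1.0 / (1.0 + np.exp(-(theta - delta)))

# Hypothetical (fixed) item difficulties and one examinee's responses.
delta = np.array([-1.5, -0.5, 0.0, 0.5, 1.5])
responses = np.array([1, 1, 1, 0, 0])

def log_likelihood(theta):
    """Log likelihood of the response pattern at a given ability (cf. Eq. 10.6)."""
    p = rasch_p(theta, delta)
    return np.sum(responses * np.log(p) + (1 - responses) * np.log(1.0 - p))

# Evaluate the log likelihood over a grid of ability values from -3 to +3
# and take the value of theta where it is largest (the MLE of ability).
grid = np.linspace(-3.0, 3.0, 601)
log_likes = np.array([log_likelihood(t) for t in grid])
theta_mle = grid[np.argmax(log_likes)]

print(f"MLE of ability: theta_hat = {theta_mle:.2f}")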
To make Equations 10.1 through 10.4 more concrete, we use the Rasch model as
an illustrative framework. The Rasch model receives a formal introduction later, but for
now it is used to illustrate the estimation of the probability of responses to items for a
sample of six examinees. In the Rasch model, the probability of a response depends on
two factors, the examinee’s ability to answer the item correctly and the difficulty of the
item. To account for both examinee ability and item difficulty in a single step, we can use
Equation 10.3. To illustrate the combined role of Equations 10.1 through 10.4, we use a
small portion of the data from our sample of 1,000 persons on crystallized intelligence
test 2 as shown in Table 10.7.
In the Rasch model (Equation 10.9 on page 356), the two parameters involved are the
difficulty of an item and the ability of the examinee. Note that in Table 10.7, we assume
the six examinees all possess an ability of 2.0. The probability of responding correctly
to an item given person ability is expressed on a 0 to 1 metric; ability is expressed on a
standard or z-score scale. The z-score metric is useful because z-scores can be mapped
onto the normal distribution as an area or proportion under the normal curve. Because
of the nonlinear metric of (1) the item responses being dichotomous and (2) the prob-
ability of a response to an item (i.e., a 0 to 1 range), a logistic function is used, with the constant e ≈ 2.7183 as its base, as shown in Equation 10.8:

$$P(X) = \frac{e^{z}}{1 + e^{z}} \qquad (10.8)$$

A convenient result of using the logistic equation is that taking the exponent of the combination of predictor variables (in this case, θ – δ) yields a model that is linear in the logit and therefore much easier to work with. In fact, logistic regression is a widely used alternative among statistical methods when the outcome variable is on a 0.0 to 1.0 metric (e.g., a binomially distributed outcome variable rather than a continuous one).
Next, inserting θ – δj into the logistic equation yields the Rasch model, as illustrated in Equation 10.9:

$$P(X_j = 1 \mid \theta, \delta_j) = \frac{e^{(\theta - \delta_j)}}{1 + e^{(\theta - \delta_j)}} \qquad (10.9)$$

The key to understanding Equation 10.9 is to look closely at the exponent in the numerator and denominator: item difficulty is subtracted from person ability, and it is this difference that is plotted against the probability of an examinee responding correctly to an item. The importance of this point cannot be overstated because other, more advanced types of IRT models build on this concept. Continuing with our example, the probability of a correct response is mapped onto the cumulative normal distribution (i.e., a z-score metric; see Chapter 2 and the Appendix for a review of the cumulative normal distribution function). The item difficulty and person ability are also represented on the z-score metric and are therefore linked to the cumulative normal distribution.

[Figure 10.5 appears here: a Rasch ICC with a = 1.000 and b = 0.000, plotting the probability of a correct response (0 to 1.0) against ability (–3 to +3).]
The Rasch model, like other IRT models, incorporates the logistic function because
the relationship between the probability of an item response and the difference between person ability and item difficulty is nonlinear (i.e., it follows an S-shaped curve). Figure 10.5 illustrates a Rasch
ICC where person ability is 0.0, item location or difficulty is 0.0, and the probability of
a response is .50 or 50%. In the figure, the ICC is based on the 1,000 item responses to
item 3 on the test of crystallized intelligence 2.
Continuing with our example, we can apply Equations 10.1, 10.5, and 10.6 to obtain
the probability of a correct response for examinee 2 regarding their response to item
number 4. For example, if we insert a value of 2.0 for the examinee's ability (θ), 1.0 for
the item 4 difficulty, and 1 for a correct response into Equation 10.9, we obtain the result
in Equation 10.10. To interpret, the probability is .73 that a person with ability 2.0 and
item difficulty 1.0 will answer the item correctly. In practice, a complete Rasch (or IRT)
analysis involves repeating this step for all examinees and all items on the test.
Finally, the goal in IRT is to estimate the probability of an observed item response
pattern for the entire set of examinees in a sample. To accomplish this, we estimate the
likelihood of observing an item response pattern using all 25 items on crystallized intel-
ligence test 2 for 1,000 examinees over a range of ability (a range of z = –3.0 to 3.0). We
return to the step of estimating the likelihood of unique response patterns for a sample
of examinees shortly.
$$P(X_j = 1 \mid \theta, \delta_j) = \frac{2.7183^{(2.0 - 1.0)}}{1 + 2.7183^{(2.0 - 1.0)}} = \frac{2.7183}{1 + 2.7183} = .73 \qquad (10.10)$$
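This arithmetic can be checked directly; a short Python verification of Equation 10.10 follows (the values of θ and δ are those used in the worked example above).

import math

theta, delta = 2.0, 1.0
p = math.exp(theta - delta) / (1.0 + math.exp(theta - delta))
print(f"P(correct | theta = 2.0, delta = 1.0) = {p:.2f}")   # 0.73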
The Appendix introduces maximum likelihood estimation, noting that its use is par-
ticularly important for challenging parameter estimation problems. The challenges of
estimating person ability and item parameters in IRT make maximum likelihood estima-
tion an ideal technique to use. The Appendix provides an example to illustrate how MLE
works. The total likelihood is approximately normally distributed, and the estimate of its standard deviation serves as the standard error of the MLE. Once the item parameters are estimated, they are fixed (i.e., they are a known entity), and the sampling distribution of ability (θ) and its standard deviation can be estimated. The standard deviation of the sampling distribution of ability (θ) is the standard error of the MLE of ability (θ). The dispersion of the likelihoods resulting from the estimation process may be narrow or broad, depending on the location of the value of θ relative to the item parameters.
Closely related to the item response function (IRF/ICC) is the item information
function (IIF). The IIF plays an important role in IRT because (a) it provides a way to
identify where a test item is providing the most information relative to examinee ability
and (b) a standard error of the MLE is provided, making it possible to identify the preci-
sion of ability along the score scale or continuum. Additionally, IIFs can be summed to
create an index of total test information. The IIF is presented in Equation 10.11.
Because the slope is set to 1.0 in the Rasch model, the information function simpli-
fies, as illustrated in Equation 10.11a.
Equation 10.11a is also applicable to the one-parameter IRT model because, although
the slope is not required to be set to 1.0, it is required to be set to a constant value; this
constant is dictated by the empirical data. To illustrate Equation 10.11a with our intel-
ligence test data, let’s assume that we are interested in looking at the information for item
11 in relation to an examinee with ability of 0.0. Using the item location of –.358 and abil-
ity of 0.0, we insert these values into Equation 10.11a as illustrated in Equation 10.11b
for the Rasch model, where the slope or discrimination is set to 1.0. For example, given the item 11 location of –.358 (see Table 10.8 to verify this) and a person ability of 0.0, the information for the item is approximately .23, as illustrated in Figure 10.6. Finally, in the Rasch model, item information reaches its maximum value of .25 at the location where ability equals the item difficulty of –.358, that is, where the probability of a correct response is .50. You should verify that item information is .25 in the Rasch model by inserting a probability of a correct response of .50 into Equation 10.11a. The general item information function is given in Equation 10.11:

$$I_j(\theta) = \frac{[P_j'(\theta)]^2}{P_j(\theta)\,[1 - P_j(\theta)]} \qquad (10.11)$$

Figure 10.6. Item information function based on item 11 (b = –.358). I(θ) on the Y-axis is the information function, with a value of ~.23; the ability (proficiency) scale, θ, is on the X-axis (labeled "Scale Score"). The information provided by the item reaches its maximum at the location parameter –.358; the information for the item is .25 (the maximum) when the probability of a correct response is .50.
Because the slope is set to 1.0 in the Rasch model, the information function simplifies to $I_j(\theta) = P_j(\theta)\,[1 - P_j(\theta)]$, as illustrated in Equation 10.11b.
Similarly, information can be estimated for the estimate of examinee ability (θ). The ability estimated by an IRT model is θ̂ (theta is now displayed as "theta hat" to signify that it is an estimate). The standard error of the MLE of ability is
$$SE(\hat{\theta} \mid \theta) = \frac{1}{\sqrt{I(\theta)}} = \frac{1}{\sqrt{\displaystyle\sum_{j=1}^{L} \frac{[P_j'(\theta)]^2}{P_j(\theta)\,[1 - P_j(\theta)]}}}$$
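In the Rasch case these computations are simple enough to verify with a few lines of Python. The sketch below uses the item 11 location of –.358 from Table 10.8; the remaining item locations are hypothetical, and small differences from the values in the text can arise from rounding or scaling conventions.

import numpy as np

def rasch_p(theta, delta):
    """Rasch probability of a correct response."""
    return 1.0 / (1.0 + np.exp(-(theta - delta)))

def rasch_information(theta, delta):
    """Rasch item information: P(theta) * [1 - P(theta)]."""
    p = rasch_p(theta, delta)
    return p * (1.0 - p)

theta = 0.0
item11_delta = -0.358

info_11 = rasch_information(theta, item11_delta)
print(f"Item 11 information at theta = 0.0: {info_11:.3f}")      # about .24
print(f"Maximum item information (P = .50): {0.5 * 0.5:.2f}")    # .25

# Test information is the sum of the item informations; its inverse square
# root is the standard error of the ability estimate at that theta.
other_deltas = np.array([-1.2, -0.7, 0.0, 0.6, 1.3])             # hypothetical
test_info = rasch_information(theta, np.append(other_deltas, item11_delta)).sum()
se_theta = 1.0 / np.sqrt(test_info)
print(f"Test information at theta = 0.0: {test_info:.2f}, SE(theta) = {se_theta:.2f}")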
At the outset of an IRT analysis, both item parameters and examinee ability are unknown
quantities and must be estimated. The estimation challenge is to find the ability of each
examinee and the item parameters using the responses to items on the test. It is beyond
the scope of this chapter to present a full exposition of the various estimation techniques
and the associated mechanics of how they are implemented. This section presents a con-
ceptual overview of how the estimation of ability and item parameters works. Readers
are referred to Baker and Kim (2004) and de Ayala (2009) for excellent treatments and
mathematical details of estimation techniques and their implementation in computer
programs currently employed in IRT.
Simultaneously estimating the item parameters and examinee abilities is computa-
tionally challenging. The original approach to estimating these parameters is joint maxi-
mum likelihood estimation (JMLE), and it involved simultaneously estimating both
examinee ability and item parameters (Baker & Kim, 2004, pp. 83–108). However, the
JMLE approach produces inconsistent and biased estimates of person abilities and item
parameters under circumstances such as small sample sizes and tests composed of fewer
than 15 items. Another problem associated with JMLE includes inflated chi-square tests
of global fit of the IRT model to the data (Lord, 1980). For these reasons, the marginal
maximum likelihood estimation (MMLE) approach (Bock & Aitkin, 1982) is the tech-
nique of choice and is incorporated into most, if not all, IRT programs (e.g., IRTPRO,
BILOG-MG, PARSCALE, MULTILOG, and CONQUEST). In the MMLE technique, the
test items are estimated first and subsequently considered fixed (i.e., nonrandom). Next,
the person abilities are estimated and are viewed as a random component sampled from
a population. Treating person ability as a random component of the population provides a way to introduce population information without having to estimate each examinee's ability directly and simultaneously with the item parameters.
In practice, the item parameters are estimated first using MMLE. This step occurs
by first integrating out the ability parameters based on their known approximation
to the normal distribution. Specifically, in MMLE it is the unconditional (marginal-
ized) probability of a randomly selected person from a population with a continuous latent
distribution that is linked to the observed item response vector (de Ayala, 2009; Baker &
Kim, 2004; Bock & Aitkin, 1982). With person ability eliminated from the estimation process through integration, the unconditional or marginal likelihood for item parameter estimation becomes tractable despite the large number of unique person ability parameters. Once the item parameters are estimated and model–data fit is
acceptable, the estimation of person ability is performed. The result of this estimation
process is a set of person abilities and item parameter estimates that have asymptotic
properties (i.e., item parameter estimates are consistent as the number of examinees
increases). When conducting an IRT analysis using programs such as BILOG-MG,
IRTPRO, PARSCALE, MULTILOG, and CONQUEST, the process of ability and item
parameter estimation is iterative (i.e., the program updates ability and item parameter
estimates until an acceptable limit or solution is reached). The process results in abil-
ity and item parameter estimates that have been refined in light of one another based
on numerical optimization.
IRT is a large-sample technique that capitalizes on the known properties of the cen-
tral limit theorem. For this reason, sample size is an important factor when estimating
ability and item parameters in any IRT analysis. Research has demonstrated (e.g., de
Ayala, 2009; Baker & Kim, 2004) that in general, for Rasch and one-parameter IRT model
estimation (also called Rasch or IRT calibration), a sample size of at least 500 examinees
is recommended. For the two- and three-parameter IRT models, a sample size of at least
1,000 is recommended. However, in some research and analysis situations these numbers
may be relaxed. For example, if the assumptions of the Rasch or IRT analysis are met
and inspection of the model-data fit diagnostics reveals excellent results, then the sample
size recommendations provided here may be modified. As an example, some simulation
research has demonstrated that sample sizes as low as 100 yield adequate model-data fit
and produce acceptable parameter estimates (de Ayala, 2009).
Now we return to the task of estimating the unobserved (latent) ability for per-
sons after item parameters are known (are a fixed entity). In MMLE, the population
distribution of ability for examinees or persons is assumed to have a specific form (usu-
ally normal). For explanation purposes, let’s assume that our population of interest is in
fact normally distributed. Knowing the statistical characteristics of the population, the
mechanics of ability estimation employs an empirical Bayesian statistical approach to
estimating all of the parameters of person ability within a range (usually within a standard score range of –3.0 to +3.0). The Bayesian approach to probability and parameter estima-
tion is introduced in the Appendix. Readers should briefly review this information now.
Recall that in the MMLE approach the item parameters are estimated first and then treated as fixed. With the item parameters effectively out of the picture, the ability parameters (θ) can be estimated more efficiently. Two Bayesian approaches are used
to estimate person ability: expected a posteriori (EAP) and the maximum a posteriori
(MAP). One of the two is selected based on the requirements of the analysis at hand.
For example, one of the techniques is chosen based on characteristics of the sample, such as sample size and distributional form in relation to the target population; the type of items that make up the test (i.e., dichotomous, partial credit, or polytomous formats) is also a consideration. In the Bayesian context, the population distribution of ability (θ) is called the prior, and the product of the likelihood of θ and the prior density gives the posterior distribution of θ, given the empirical item response pattern (Du Toit, 2003, p. 837). As a Bayesian point estimate (e.g., the mean or mode) of θ, it is typical to use the value of θ at the mode of the posterior distribution (MAP) or the mean of the posterior distribution (EAP). The choice depends on the context of the testing scenario
(e.g., the type and size of sample and the type and length of test). The equation illustrat-
ing the estimation of the likelihood of an item response vector, given person ability and
item parameters a, b, and c, is provided in Equation 10.14.
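Before turning to Equation 10.14 (shown next), the following Python sketch illustrates, in miniature, how an EAP ability estimate is formed: the likelihood of an observed response pattern is evaluated over a grid of quadrature points, weighted by a normal prior, and the posterior mean is taken as the ability estimate. The item parameters, responses, and quadrature grid here are all hypothetical, and the Rasch model stands in for whatever IRT model is being used.

import numpy as np

def rasch_p(theta, delta):
    """Rasch probability of a correct response."""
    return 1.0 / (1.0 + np.exp(-(theta - delta)))

# Hypothetical fixed item difficulties and one examinee's responses.
delta = np.array([-1.0, -0.5, 0.0, 0.5, 1.0])
responses = np.array([1, 1, 0, 1, 0])

# Quadrature points spanning the ability scale and a standard normal prior.
points = np.linspace(-4.0, 4.0, 81)
prior = np.exp(-0.5 * points**2)
prior /= prior.sum()

# Likelihood of the response pattern at each quadrature point.
p = rasch_p(points[:, None], delta[None, :])            # shape (81, 5)
likelihood = np.prod(p**responses * (1 - p)**(1 - responses), axis=1)

# Posterior weights and the EAP (posterior mean) estimate of ability.
posterior = likelihood * prior
posterior /= posterior.sum()
theta_eap = np.sum(points * posterior)

print(f"EAP estimate of ability: {theta_eap:.2f}")
# The MAP estimate would instead be the quadrature point with the largest
# posterior weight: points[np.argmax(posterior)].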
$$L(u_1, u_2, u_3, \ldots, u_n \mid \hat{\theta}, a, b, c) = \prod_{i=1}^{N}\prod_{j=1}^{n} P_{ij}^{\,u_{ij}}\, Q_{ij}^{\,1-u_{ij}} \qquad (10.14)$$

• u = the response to an item, either 1 for correct or 0 for incorrect (with Q = 1 – P).
• L = the likelihood of the item responses, given ability and the item parameters.
• a = the item discrimination parameter.
• b = the item difficulty parameter.
• c = the pseudoguessing parameter.
• θ̂ = the examinee's estimate of ability.
• ∏ (i = 1, . . . , N) = multiplication over person abilities.
• ∏ (j = 1, . . . , n) = multiplication over item responses.

There are two instances when the assumptions of IRT are violated in ways that prevent the use of standard IRT models. First, local independence is violated when examinees respond to test items composed of testlets (Wainer & Kiely, 1987; Wainer, Bradlow, & Du, 2000). Testlets
are a collection of items designed to elicit responses from a complex scenario (e.g., a mul-
tistep problem in mathematics or laboratory problems or a sequence in science) expressed
in a short paragraph. Such clusters of items are correlated by the structure of the item
format, thereby violating local item independence. Wainer et al. (2007) and Jannarone
(1997) provide rudimentary details and present a framework for developing IRT models
for items and tests that violate the conventional assumption of local independence.
Unidimensional IRT models are also inappropriate when a test is given under the
constraint of time (i.e., a speeded testing situation). For example, under a speeded test-
ing scenario two underlying abilities are being measured: cognitive processing speed and
achievement. Researchers interested in using IRT for timed or speeded tests are encour-
aged to read Verhelst, Verstralen, and Jansen (1997) and Roskam (1997), both of whom
provide comprehensive details regarding using IRT in these situations.
The next section presents Rasch and IRT models used in educational and psycho-
logical measurement. Specifically, the Rasch, one-, two-, and three-parameter logistic IRT
models for dichotomous data are presented. These models were the first to be developed
and are foundational to understanding more advanced types of Rasch and IRT mod-
els (e.g., tests and instruments that consist of polytomous, partial credit, or Likert-type
items, and multidimensional Rasch and IRT models).
Perhaps no other model has received more attention than Rasch’s model (1960). Georg
Rasch (1901–1980), a Danish mathematician, proposed that the development of the items comprising a test should follow a probabilistic framework directly related to a person's ability. Using a strict mathematical approach, Rasch proposed that a certain set of requirements must be met before obtaining objective measurement similar to that in the physical sciences. Rasch's epistemological stance was that in order for measurement to
be objective, the property of invariant comparison must exist. Invariant comparison is
a characteristic of interval or ratio-level measurement often used for analysis in applied
physics. According to Rasch (1960), invariant comparison (1) is a comparison between
two stimuli that should be independent of the persons who were used for the compari-
son, and (2) should be independent of any other related stimuli that might have been
compared. Thus, the process of Rasch measurement and modeling is different from
classic statistical modeling—and the other IRT modeling approaches presented in this
chapter. In the Rasch approach to measurement, the model serves as a standard or crite-
rion by which data can be judged to exhibit the degree of fit relative to the measurement
and statistical requirements of the model (Andrich, 2004). Also important to the Rasch
approach is the process of using the mathematical properties of the model to inform the
construction of items and tests (Wright & Masters, 1982; Andrich, 1988; Wilson, 2005;
Bond & Fox, 2001). Conversely, in general statistical approaches, models are used to
describe a given set of data, and parameters are accepted, rejected, or modified depend-
ing on the outcome. This latter approach is the one adopted and currently used by a large
proportion of the psychometric community regarding IRT.
In the Rasch and other IRT models, the probability of a correct response on a dichot-
omous test item is modeled as a logistic function (Equation 10.8) of the difference
between a person’s ability and an item’s difficulty parameter (Equation 10.9). The logistic
function is used extensively in statistics to extend the linear regression model to outcome variables that are dichotomous. Although many distributions are possible for use with dichotomous variables, the logistic has the following desirable properties. First, it is easy to use and is highly flexible. Second, interpretation of the results is straightforward because application of the logistic function results in a model that is linear in the logit (i.e., linear after the logarithmic transform), making interpretation similar to
a linear regression analysis. In linear regression, the key quantity of interest is the mean
of the outcome variable at various levels of the predictor variable.
There are two critical differences between the linear and logistic regression models.
First is the relationship between the predictor (independent) variables and the criterion (dependent) variable. In linear regression, the outcome variable is continuous, but in logistic regression (and in IRT), the outcome variable is dichotomous. Therefore, the outcome is based on the probability of a correct response (Y) conditional on the ability of a person (i.e., the x variable). In the linear regression model, the outcome is expressed as the conditional mean E(Y|x), the expected value of Y given x, and we assume that this mean can be expressed as a linear equation.
The second major difference between the linear and logistic regression models
involves the conditional distribution of the outcome variable (probability of a correct
response). In the logistic regression model, the outcome variable is expressed as y = p(x) + e.
The symbol e is an error term and represents an observation’s deviation from the con-
ditional mean. The quantity p(x) is the conditional probability of a correct response based on the binomial distribution (i.e., a value on the 0 to 1 metric). In linear regression, a common assumption about e is that it follows a normal distribution with mean 0.0 and constant variance across the levels of the independent variable; based on the assumption that errors are normally distributed, the conditional distribution of the outcome variable given x will also be normally distributed. However, this is not true for dichotomous outcome variables modeled on the 0 to 1 range. In the dichotomous case, the outcome y may take only the values 0 or 1: if y = 1, then e = 1 – p(x) with probability p(x), and if y = 0, then e = –p(x) with probability 1 – p(x).
Inspection of Figure 10.3 reveals that as ability increases, the probability of a cor-
rect response increases. Also, as Figure 10.5 illustrates, the relationship E(Y|x) is now expressed as p(x) in the logistic model, or simply as P(θ) in the Rasch or any other IRT model.
Notice that because the conditional mean of Y (the probability) gradually approaches 0
or 1 (rather than directly in a linear sense), the IRF is depicted as an S-shaped curve. In
fact, the curve in Figure 10.5 resembles one-half of the cumulative normal distribution
(see the Appendix). The following logistic Equation 10.15 (and Equation 10.8 presented
earlier) and the Rasch model Equation 10.16 (Equation 10.9 presented earlier) yield
parameters that are linear in the logistic transformation.
$$P(X) = \frac{e^{z}}{1 + e^{z}} \qquad (10.15)$$

$$P(X_j = 1 \mid \theta, \delta_j) = \frac{e^{(\theta - \delta_j)}}{1 + e^{(\theta - \delta_j)}} \qquad (10.16)$$

To illustrate, in Figure 10.7 the probability of a person responding correctly to item 3 (from Figure 10.1 at the beginning of the chapter) is provided on the Y-axis, and person ability is given on the X-axis. Notice that the item location or difficulty is 0.0 and is marked on the X-axis by the letter b (this is denoted as δ in the Rasch model). The b-parameter (or δ) is a location parameter, meaning that it "locates" the item response function on the ability scale.

Figure 10.7. IRF for a person with an ability of 0.0 and an item difficulty or location of 0.0 (a = 1.000, b = 0.000; probability of a correct response on the Y-axis, ability from –3 to +3 on the X-axis).
Using Equation 10.15 and inserting the values of 0.0 for person location and 0.0 for
item location into Equation 10.16, we see that the probability of a person responding cor-
rectly to item 3 is .50 (see Equation 10.17).
$$P(X_j = 1 \mid \theta, \delta_j) = \frac{2.7183^{(0.0 - 0.0)}}{1 + 2.7183^{(0.0 - 0.0)}} = \frac{1}{1 + 1} = .50 \qquad (10.17)$$
In words, Equation 10.17 means that a person with ability 0.0 answering an item with a
location (difficulty) of 0.0 has a 50% probability of a correct response. Next, we calibrate
the item response data for intelligence test 2 with the Rasch model using BILOG-MG
(Mislevy & Bock, 2003). In Figure 10.8, we see the result of Rasch calibration using
BILOG-MG for item 11 on the crystallized intelligence test 2. In the BILOG-MG phase 2
output, the chi-square test of fit for this item under the Rasch model was observed to be 0.62—indicating a good fit (i.e., the chi-square test of item fit was not significant for this item). However, by inspecting the 95% confidence bars in Figure 10.8, we see that at the extremes of the ability distribution the observed versus predicted model–data fit is not within the range we would like (i.e., the solid dots are not within the 95% level of confidence). Later, we fit the 1-PL IRT model to these data and compare the results with the Rasch analysis for item 11.

Figure 10.8. Rasch logistic ICC for item 11 on crystallized intelligence test 2 (probability of a correct response on the Y-axis, ability from –3 to +3 on the X-axis). The graph provides the fit of the observed response patterns versus the predicted pattern; the slope is constrained to 1.0. The solid dots indicate the segments into which the score distribution is divided. In the graph, notice that as person ability (X-axis) and item difficulty (the b-value in the graph) become closer together in the center area of the score distribution, the probability of responding correctly to the item approaches .5. The dots also indicate that as the discrepancy between ability and item difficulty becomes larger, the model does not fit the data within the 95% level of confidence (e.g., the 95% error bars do not include the dot).
The BILOG-MG syntax below provided the ICC presented in Figure 10.8 (Du Toit,
2003).
Figure 10.9 illustrates the results of the BILOG-MG analysis in relation to item
parameter and person ability estimates for all 25 items from the Rasch analysis. To aid
interpretation, the IQ metric is included to illustrate its direct relationship to the ability scale (θ), which is typically scaled on a z-score metric, and to the item difficulty scale. In Rasch and IRT analyses, scale transformation from the ability metric to other metrics (such as IQ) is
possible owing to the property of scale indeterminacy. Scale indeterminacy exists because multiple values of θ and δ lead to the same probability of a correct response. Therefore, the metric is unique up to a linear transformation of scale.

Figure 10.9. Item–ability graph for crystallized intelligence test 2 based on Rasch analysis. Person ability and item location (difficulty δ) are displayed on a common –4.0 to 4.0 metric, with the corresponding IQ metric shown beneath. Item 1 was not scaled due to a perfect score; item 2 = –5.3.
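Because the ability metric is unique only up to a linear transformation, moving from the z-score (θ) metric to the IQ metric shown in Figure 10.9 is a simple linear rescaling. A minimal sketch, assuming the conventional IQ mean of 100 and standard deviation of 15:

```python
def theta_to_iq(theta, mean=100.0, sd=15.0):
    """Linearly rescale the z-score ability (theta) metric to an IQ-style metric."""
    return mean + sd * theta

# Ability values on the theta metric mapped to the IQ metric
for theta in (-2.0, -1.0, 0.0, 1.0, 2.0):
    print(theta, theta_to_iq(theta))   # 70, 85, 100, 115, 130
```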
In the Rasch model (and all IRT models for that matter), a metric for the latent trait
continuum is derived as the nonlinear regression of observed score on true score, with
the person ability and item locations established on the same metric (i.e., a z-score met-
ric). To explain how examinees and items function under the Rasch model, if examinee
1 exhibits ability twice that of examinee 2, then this discrepancy is mathematically
expressed by applying a multiplicative constant of 2 (i.e., h1 = 2h2 or equivalently q1 =
2q2). Also, if item 1 is twice as difficult as item 2, then d1 = 2d2. Providing that the prop-
erties of person ability and item difficulty hold, a ratio level of measurement is attained,
with the only changes being due to the value of the constant involved. Theoretically,
such a ratio level of measurement is applicable to any sample of persons and items as
long as the same constants are used. This allows for direct comparisons across differ-
ent samples of persons and items, a property known in the Rasch literature as specific
objectivity, or sample-free measurement. With regard to the question of minimum
sample size for a Rasch analysis, simulation research supports the recommendation of
a minimum of 100 examinees and test length of at least 15 items for accurate item and
ability parameter estimates (Baker & Kim, 2004; Hambleton et al., 1991); however, this
is only a recommendation because of the complexity of the characteristics of the sample, test
items, test length, and amount of missing data which have implications for the performance
of the model given the data.
Importantly, as in any statistical modeling scenario, evaluating the fit of the model
to the data is crucial regardless of sample recommendations. Table 10.8 provides the item
parameter results from a Rasch analysis of the 25-item crystallized test of intelligence 2
for the total sample of 1,000 examinees using BILOG-MG. The item parameter estimates
for item 11 are highlighted in gray.
The BILOG-MG syntax below provided the output for Table 10.8 (Du Toit, 2003).
Table 10.9 provides the proportion correct, person ability, and standard error esti-
mates for the crystallized intelligence test 2 data. The values in this table are provided in
the BILOG-MG phase 3 output file. Table 10.9 provides only a partial listing of the actual
output.
As shown in Table 10.9, an examinee or person with ability of approximately 0.0
answered 12 out of 24 items correctly (there are only 24 items in this analysis because the likelihood for item 1 did not reach a maximum, so no item statistics were produced for it). Comparing Table 10.9 with
CTT proportion correct, we see that a person answering 50% of the items correctly cor-
responds to an ability of 0.0. Finally, notice that the standard error of ability is smallest at
ability of 0.0 (i.e., where information is highest in the Rasch model).
Graphically, Figure 10.8 illustrated an item characteristic curve for item 11 based on
a sample of 1,000 examinees. The item location parameter (i.e., difficulty) δ or b = –.358
(see Table 10.8). Notice in Figure 10.8 that as a person’s ability parameter increases on
the X-axis, his or her probability of correctly responding to the item also increases on the
Y-axis. In Figure 10.8, the only item parameter presented is the item location or difficulty
because in the Rasch model the slope of the curve for all items is set to a value of 1.0
(verify this in Table 10.8). Also, in Table 10.8, we see that the discrimination parameters
(labeled column a) are all 1.0 and the c-parameter for pseudoguessing is set to 0.0 (this
parameter is introduced in the section on the three-parameter IRT model).
Earlier in this chapter, Equation 10.11b and Figure 10.6 illustrated the information
function for item 11 under the Rasch model. Reviewing briefly, item information I_j(θ) is defined as the information provided by a test item at a specific level of person or examinee ability (θ). The IIF quantifies the amount of information available for estimating person ability (θ). The information function capitalizes on the fact that
the items comprising a test are conditionally independent. Because of the independence
assumption, individual items can be evaluated for the unique amount of information they
contribute to a test. Also, individual items can be summed to create the total information
for a test. The test information function provides an overall measure of how well the test
is working specific to the information provided. In test development, item information
plays a critical role in evaluating the contribution an item makes relative to the underly-
ing latent trait (ability). For this reason item information is a key piece of information in
test development. Examining Equations 10.11a and 10.11b, we see that item information
is higher when an item's b-value is closer to person ability (θ) than when it is further away. In fact, in the Rasch model, information is at its maximum at the location value of δ (or b in IRT). Item information can also be extended to the level of the total test, yielding a test information function (TIF) by summing the item information functions. Summation of the IIFs is possible because of the assumption of local item independence
(i.e., responses to items by examinees are statistically independent of one another, allow-
ing for a linear summative model).
An example of how item responses and persons are included in the data matrix used for
a Rasch analysis is presented next. The item-level responses are represented as a two-
dimensional data matrix composed of N persons or examinees responding to a set of n
test items. The raw-data matrix is composed of a column vector (u_ij) of item responses of length n. In Rasch's original work, items were scored dichotomously (i.e., 0 or 1), so in (u_ij) the subscript i represents items i = 1, 2, . . . , n, and subscript j represents persons
j = 1, 2, . . . , N. Given this two-dimensional data layout, each person or examinee is
represented by a unique column vector based on his or her responses to items of length
n. Because there are N vectors, the resulting item response matrix is n (items) X N (per-
sons). Figure 10.10 illustrates a two-dimensional matrix based on Rasch’s original item
response framework for a sample of persons.
                    Person
             1        2       …       N       (item total)
Item 1      u11      u12      …      u1N         u1.
Item 2      u21      u22      …      u2N         u2.
  .           .        .                .           .
  .           .        .                .           .
Item n      un1      un2      …      unN         un.
            u.1      u.2      …      u.N      (person total)
Figure 10.10. Two-dimensional data matrix consisting of items (rows) and persons (col-
umns) in the original data layout for Rasch analysis. In IRT, the data layout is structured as items
being columns and persons or examinees as rows.
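As a concrete, hypothetical illustration of this layout, the sketch below builds a small items-by-persons matrix of dichotomous responses and computes the item-total and person-total margins indicated in Figure 10.10 (the response values are invented for illustration):

```python
import numpy as np

# Hypothetical dichotomous responses: n = 4 items (rows) by N = 6 persons (columns),
# matching the items-by-persons layout used by Rasch (IRT software typically transposes this).
u = np.array([
    [1, 1, 0, 1, 0, 1],   # item 1
    [0, 1, 1, 1, 0, 1],   # item 2
    [0, 0, 1, 1, 0, 1],   # item 3
    [0, 0, 0, 1, 0, 0],   # item 4
])

item_totals = u.sum(axis=1)    # row margins: number of correct responses per item
person_totals = u.sum(axis=0)  # column margins: raw score per person

print(item_totals)    # [4 4 3 1]
print(person_totals)  # [1 2 2 4 0 3]
```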
Referring to the data matrix in Figure 10.10, we find that the two parameters of
interest to be estimated are (1) a person’s ability and (2) the difficulty of an item. Origi-
nally, Rasch used the symbols η_j for person ability and δ_i for the difficulty of an item. In the Rasch model, these symbols represent properties of items and persons, although the symbol η is now presented as θ in both the Rasch model and the IRT models. The next section transitions from the Rasch model to the one-parameter IRT model.
The one-parameter (1-PL) logistic IRT model extends the Rasch model by including a
variable scaling parameter α (signified as a in IRT). Understanding the role of α in relation
to the Rasch model is perhaps best explained by thinking of it as a scaling factor in the
regression of observed score on true score. For example, the α-parameter (a-parameter in IRT language) scales or adjusts the slope of the IRF in relation to how examinees of different ability respond to an item or items. To this end, the scaling factor or slope of the ICC of the items is not constrained to a value of 1.0, but may take on other values for test items. The addition of the scaling parameter α to the Rasch model is illustrated in Equation 10.18.

$$P(X_i = 1 \mid \theta, \alpha_i, \delta_i) = \frac{e^{\alpha_i(\theta - \delta_i)}}{1 + e^{\alpha_i(\theta - \delta_i)}} \qquad \text{(Equation 10.18)}$$

In the Rasch model, the scaling parameter α is set to a value of 1.0. However, in the
one-parameter IRT model the restriction of 1.0 is relaxed in a way that allows the slope
of the IRF to conform to the empirical data (e.g., in a way that provides the best fit of the
nonlinear regression line). Another way of thinking about this is that in the one-parameter
IRT model, the slope of the IRF is now scaled or adjusted according to the discrimina-
tion parameter a, and the discrimination parameter is estimated based on the empirical
item response patterns. Introducing the scaling factor a allows us to conceptualize the
IRT model in slope–intercept form (as in standard linear regression modeling). Equation
10.19 (de Ayala, 2009, p. 17) illustrates the slope–intercept equation using the symbolic
notation introduced so far.
The inclusion of the scaling parameter provides a way to express the Rasch or one-
parameter model in terms of a linear equation. Remember that IRT models are regression
models, so taking the approach of a linear equation allows us to think about IRT as a
linear regression model. For example, the effect of multiplying the scaling factor or item
discrimination parameter (i.e., a) with the exponent in the one-parameter IRT model
provides a way to rewrite the exponent as Equation 10.19 (de Ayala, 2009, p. 17). Obtain-
ing the item location or difficulty using the elements from Equation 10.19 involves rear-
ranging the intercept term γ and solving for the item location, thereby yielding Equation 10.20a. Recall from earlier in the chapter that the linear equation αθ + γ yields the logit. Graphically, the
slope–intercept equation (expressed in logits) is depicted in Figure 10.11 for item 11 on
the crystallized intelligence test 2.
$$\alpha(\theta - \delta) = \alpha\theta - \alpha\delta = \alpha\theta + \gamma \qquad \text{(Equation 10.19)}$$

Figure 10.11. The linear parameterization (logit) for item 11 using values from Table 10.5. Applying the linear equation to obtain the logit for the item 11 IRT parameters for the total sample of 1,000 examinees in Table 10.5: αθ + γ = .99(–.07) + .07 ≈ 0.0.

$$\delta = -\frac{\gamma}{\alpha} \qquad \text{(Equation 10.20a)}$$
Equation 10.20a illustrates how, by rearranging terms, one can derive the item loca-
tion if the intercept and discrimination are known.
Next, using the item location or difficulty from our example in Figure 10.5 and
inserting it into Equation 10.20a we have the result in Equation 10.20b for the item loca-
tion (i.e., difficulty).
Practically speaking, the slope–intercept equation tells us that as a changes, the slope
of the IRF changes across the continuum of person ability. This becomes more relevant
later when we introduce the two-parameter IRT model. In the two-parameter model, the
item discrimination parameter (a-parameter) is allowed to be freely estimated and there-
fore varies for each item. Figure 10.12 illustrates the IRF for item 11 based on the one-
parameter IRT model. Notice that the slope is 1.66 and the location or difficulty is –.303
as opposed to 1.0 and –.358 in the Rasch model. These new values are a direct result of
relaxing the constraints of the Rasch model in regard to the fit of the empirical data to the
one-parameter IRT model.
Next we have the result of a one-parameter IRT model estimated using the same
data as was previously done with the Rasch model with item 11 as the focal point. Figure
10.12 illustrates the ICC for item 11 based on the one-parameter IRT analysis.
The BILOG-MG syntax on p. 378 provided the graph presented in Figure 10.12 (Du
Toit, 2003).
Notice that the ability metric (0,1) has been rescaled to (100,15) in the SCORE
command.
$$\delta = -\frac{\gamma}{\alpha} = \frac{-.358}{1} = -.358 \qquad \text{(Equation 10.20b)}$$
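The slope–intercept relationships in Equations 10.19–10.20b are easy to check numerically. A minimal Python sketch (the function names are ours, for illustration only; the intercept value γ = .358 follows from γ = –αδ with α = 1 and δ = –.358):

```python
def location_from_intercept(gamma, alpha):
    """Equation 10.20a: item location (difficulty) delta = -gamma / alpha."""
    return -gamma / alpha

def logit(theta, alpha, gamma):
    """Equation 10.19: the linear (slope-intercept) parameterization alpha * theta + gamma."""
    return alpha * theta + gamma

# Item 11, Rasch calibration: alpha = 1.0 and gamma = .358 recover delta = -.358
print(location_from_intercept(0.358, 1.0))   # -0.358

# Item 11, values used in Figure 10.11 (a = .99, intercept = .07): the logit evaluated
# at an ability equal to the item's location (theta = -.07) is approximately zero,
# which corresponds to a .50 probability of a correct response.
print(round(logit(-0.07, 0.99, 0.07), 4))    # ~0.0
```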
Figure 10.12. One-parameter logistic IRF for item 11 with the slope, location, and intercept
freely estimated based on the characteristics of the item responses of 1,000 examinees. The graph
provides the fit of the observed response patterns versus the predicted pattern. The solid dots indi-
cate the number of segments by which the score distribution is divided. In the graph notice that
as person ability (X-axis) and item difficulty (the b-value in the graph) become closer together in
the center area of the score distribution, the probability of responding correctly to the item is .5.
The dots also indicate that even as the discrepancy between ability and item difficulty becomes larger, the model still fits the data within the 95% level of confidence (e.g., the 95% error bars include the dots). The slope is now estimated at 1.664 rather than 1.0 in the Rasch model.
Inspecting Figure 10.12, we see that the one-parameter IRT model fits the data for
item 11 better than did the Rasch model, where the slope was constrained to 1.0 (e.g.,
all of the solid dots are now within the 95% error bars). Table 10.10 provides the item
parameter estimates for all 25 items on the test.
At this point, you may be wondering how to decide which model to use in a test
development situation. Recall that the philosophical tradition when using the Rasch
model is to construct a test composed of items that conform to the theoretical require-
ments or characteristics of the model. This differs from the IRT approach where the goal
is to fit a model that best represents the empirical item responses, after the item construc-
tion process and once data are acquired. In the current example, you will either have to
(1) remove or revise the items so that the requirements of the Rasch model are met or
(2) work within the data-driven approach of the IRT paradigm. Of course, in the IRT or
data-driven approach, individual items are still reviewed for their adequacy relative to
the model based on early activities within the test development process (e.g., theoretical
adequacy of items in terms of their validity). Returning to the current example using item
11, we observe the chi-square test of fit for this item using the 1-PL model to be 0.62—
indicating a good fit (i.e., the chi-square test of independence was not rejected). The item
fit chi-square statistics are provided in the phase 2 output of BILOG-MG, PARSCALE,
MULTILOG, and IRTPRO. An important point regarding evaluating item fit is that the chi-
square fit statistics are only accurate for tests of 20 items or longer (e.g., the accuracy of
the item parameter estimates is directly related to the number of items on the test). The
item difficulty parameter estimated in the one-parameter model is now labeled as b (as
opposed to d in the Rasch model). In the one-parameter IRT model, the b-parameter for
an item represents the point or location on the ability scale where the probability of an
examinee correctly responding to an item is .50 (i.e., 50%). The greater the value of the
b-parameter, the greater the level of ability (θ) required for an examinee to exhibit a prob-
ability of .50 of answering a test item correctly.
As in the Rasch model, the item b- or difficulty parameter is scaled on a metric with
a mean of 0.0 and standard deviation of 1.0 (on a standard or z-score metric). In the one-
parameter IRT model, the point at which the slope of the ICC is steepest represents the value
of the b-parameter. The ability (θ) estimate of an examinee or person is presented as θ̂ and is also scaled on the metric of a normal distribution (i.e., mean of 0 and standard deviation of 1).
Finally, we see that in the one-parameter IRT model, the test items provide the max-
imum amount of information for persons with ability (θ) nearest to the value of the
b-parameter (in this case, a value of b = –.303). Derivation of the information is the same
as presented earlier in Equations 10.11a–10.11b. However, the maximum information
possible in the one-parameter model is not .25 because the slope of the IRF now may
take on values greater or less than 1.0. This result can be seen in Figure 10.13 where the
maximum information for item 11 is .69.
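Equations 10.11a–10.11b are not reproduced here, but the standard logistic item information form, I(θ) = a²P(θ)[1 − P(θ)], is consistent with the values reported in this section. A brief Python sketch (illustrative only) reproduces the maximum of .25 for the Rasch item and approximately .69 for item 11 under the 1-PL:

```python
import math

def logistic_probability(theta, a, b):
    """Logistic IRF: P = exp(a * (theta - b)) / (1 + exp(a * (theta - b)))."""
    z = math.exp(a * (theta - b))
    return z / (1.0 + z)

def item_information(theta, a, b):
    """Item information: I(theta) = a^2 * P * (1 - P); maximized at theta = b."""
    p = logistic_probability(theta, a, b)
    return a ** 2 * p * (1.0 - p)

# Item 11 under the 1-PL (a = 1.664, b = -.303): information peaks at theta = b
print(round(item_information(-0.303, 1.664, -0.303), 2))  # ~0.69 (= a^2 / 4)

# Rasch version (a = 1.0): the maximum possible information is .25
print(round(item_information(0.0, 1.0, 0.0), 2))          # 0.25
```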
The next section introduces the two-parameter (2-PL) IRT model. In the two-
parameter model, the slope parameter is freely estimated based on the empirical charac-
teristics of the item responses.
Figure 10.13. Item information based on the 1-PL in Figure 10.12. I(θ) on the Y-axis is the information function of ~.45. Proficiency is the ability scale (θ) on the X-axis. The information
provided by the items reaches maximum (.69) when the b-parameter = –.303. Note the difference
in the maximum item information possible in the Rasch model for item 11 being .25 versus .69 in
the 1-PL IRT model. This change in maximum information is due to relaxing the assumptions of
the Rasch model during the estimation process.
The two-parameter (2-PL) IRT model marks a clear shift from the Rasch model in
that a second parameter, the item discrimination, is included in the estimation of the
item parameters. The assumptions of local item independence, unidimensionality, and
invariance presented earlier in this chapter are the same for the two-parameter model. In
this model, one works from a data-driven perspective by fitting the model to a set of item
responses designed to measure, for example, ability or achievement. However, the two-
parameter model estimates (1) the difficulty of the items and (2) how well the items dis-
criminate among examinees along the ability scale. Specifically, the two-parameter model
provides a framework for estimating two parameters: a, representing item discrimination
(previously defined as α in the Rasch or a- in the 1-PL IRT model), and b, representing
item difficulty expressed as the location of the ICC on the person ability metric (X-axis).
Increasing the number of item parameters to be estimated means that the sample size
must also increase in order to obtain reliable parameter estimates. The sample size recom-
mended for accurate and reliable item parameter and person ability estimates in the 2-PL
model is a minimum of 500 examinees on tests composed of at least 20 items; however, this is
only a recommendation because sample size requirements will vary in direct response to the
characteristics of the sample, test items, test length, and amount of missing data. The N = 500
general recommendation is based on simulation studies (de Ayala, 2009, p. 105; Baker
& Kim, 2004). Alternatively, some simulation research has demonstrated that one can
use as few as 200 examinees to calibrate item responses using the two-parameter model
depending on (1) the length of the test, (2) the quality of the psychometric properties of
the test items, and (3) the shape of the latent distribution of ability. However, as in any
statistical modeling scenario, evaluating the fit of the model to the data is essential, rather
than relying solely on recommendations from the literature.
In the two-parameter model, the varying levels of an item’s discrimination are
expressed as the steepness of the slope of the ICC. Allowing discrimination parameters
to vary provides a way to identify the degree to which test items discriminate along the
ability scale for a sample of examinees. Specifically, the ICC slope varies across test items,
with higher values of the a-parameter manifested by steeper slopes for an ICC. Items
with high a-parameter values optimally discriminate in the middle of the person ability (θ) range (e.g., θ values ± 1.0). Conversely, items with lower values of the a-parameter discriminate better at the extremes of the person ability (θ) range (i.e., outside the range of θ ± 1.0). As is the case in the 1-PL IRT model, an examinee whose ability equals an item's b-parameter has a 0.50 probability (i.e., a 50% chance) of answering the item correctly. Once person ability (θ), the a-parameter, and the b-parameter of an item are known, the
probability of a person correctly responding to an item is estimated. The two-parameter
IRT model is given in Equation 10.21a.
To illustrate the two-parameter equation for estimating the probability of a correct
response for an examinee on item 11 on the intelligence test data with ability of 0.0, we
insert person ability of 0.0, location or difficulty of –0.07, and discrimination of .99 into Equation 10.21a, yielding Equation 10.21b. Therefore, the probability of a correct
response for an examinee at ability 0.0 on item 11 is .53.
$$P(X_i = 1 \mid \theta, a_i, b_i) = \frac{e^{D a_i(\theta - b_i)}}{1 + e^{D a_i(\theta - b_i)}} \qquad \text{(Equation 10.21a)}$$

$$P(X_{11} = 1 \mid \theta = 0.0,\ a = .99,\ b = -0.07) = \frac{e^{1.7 \times .99\,(0.0 - (-0.07))}}{1 + e^{1.7 \times .99\,(0.0 - (-0.07))}} = .53 \qquad \text{(Equation 10.21b)}$$

Notice in Equations 10.21a and 10.21b that the element D is introduced. This element serves as a scaling factor for the exponent in the equation, as a result of which the logistic equation and normal ogive equation differ by less than .01 over the theta range (Camilli, 1994). The normal ogive IRT model is the logistic model rescaled to the original metric of the cumulative normal distribution. Next, we calibrate the 25-item crystallized intelligence
test 2 item response data (N = 1,000) with the two-parameter IRT model, using the fol-
lowing BILOG-MG program. Again, for comparison purposes with the one-parameter and
Rasch models, we focus on item 11. The ICC for item 11 is provided in Figure 10.14.
The BILOG-MG syntax below provided the output for Figure 10.14 and Tables 10.11
and 10.12 (Du Toit, 2003).
Figure 10.14. Two-parameter logistic ICC for item 11, with probability on the Y-axis and ability on the X-axis.
We see that item 11 fits the 2-PL model well as evidenced by (1) a nonsignificant
chi-square statistic (i.e., p = .395), and (2) the solid dots representing different levels
of ability falling within the 95% level of confidence (i.e., within the confidence level
error bars). Table 10.11 provides the item parameter estimates for the 2-PL model, and
Table 10.12 provides a partial listing of the phase 3 output from BILOG-MG. BILOG-MG
produces phases 1–3 for the one-, two-, or three-parameter models. In Table 10.11 we
see that the item parameters are now estimated for item 1 (recall that in the Rasch and 1-PL analyses no results were produced for item 1 because the likelihood did not reach a maximum). Notice that
in addition to the ability estimates for all 1,000 examinees, phase 3 produces a reliability
estimate for the total test (see the bottom portion of Table 10.12). The reliability pro-
vided in phase 3 is defined as the reliability of the test independent of the sample of persons
(based on the idea of invariance introduced earlier in this chapter). The way reliability
is conceptualized here (i.e., as a property of how persons' item-level scores relate to a set of test items) is a major difference from CTT-based reliability introduced
in Chapter 7.
Table 10.11. Item Parameter Estimates for 2-PL Model of Crystallized Intelligence
Test 2
Intercept a-parameter b-parameter c-parameter Chi-square
Item (S.E.) (S.E.) (S.E.) (S.E.) (PROB)
ITEM0001 3.4 0.562 –6.049 0 0.5
0.277* 0.155* 1.597* 0.000* –0.7764
ITEM0002 2.956 0.639 –4.622 0 1.4
0.228* 0.152* 0.908* 0.000* –0.7157
ITEM0003 1.372 0.63 –2.179 0 2.5
0.080* 0.073* 0.195* 0.000* –0.8654
ITEM0004 1.094 0.692 –1.582 0 6.4
0.067* 0.067* 0.123* 0.000* –0.6078
ITEM0005 0.922 1.105 –0.835 0 2.6
0.071* 0.085* 0.055* 0.000* –0.9568
ITEM0006 0.959 1.239 –0.774 0 7.3
0.077* 0.098* 0.050* 0.000* –0.4023
ITEM0007 1.086 0.564 –1.927 0 12.1
0.061* 0.057* 0.172* 0.000* –0.1468
ITEM0008 0.556 0.788 –0.705 0 26.9
0.053* 0.067* 0.065* 0.000* –0.0007
ITEM0009 0.376 0.897 –0.42 0 29.9
0.049* 0.070* 0.056* 0.000* –0.0002
ITEM0010 0.262 0.846 –0.31 0 13.5
0.046* 0.065* 0.056* 0.000* –0.0952
ITEM0011 0.07 0.989 –0.071 0 8.4
0.047* 0.073* 0.048* 0.000* –0.3965
ITEM0012 0.064 1.488 –0.043 0 20.4
0.056* 0.111* 0.038* 0.000* –0.0047
ITEM0013 0.039 1.21 –0.032 0 10.7
0.051* 0.092* 0.043* 0.000* –0.15
ITEM0014 0.069 0.889 –0.077 0 32.3
0.046* 0.069* 0.053* 0.000* 0
ITEM0015 –0.083 0.992 0.084 0 31.5
0.049* 0.077* 0.048* 0.000* 0
ITEM0016 –0.254 1.185 0.214 0 7.2
0.054* 0.092* 0.042* 0.000* –0.3047
ITEM0017 –0.756 1.219 0.62 0 5.1
0.066* 0.095* 0.046* 0.000* –0.6493
ITEM0018 –0.924 0.968 0.954 0 8.5
0.069* 0.087* 0.064* 0.000* –0.2019
ITEM0019 –0.991 0.932 1.064 0 15.3
0.076* 0.089* 0.068* 0.000* –0.0093
ITEM0020 –1.434 1.388 1.033 0 7.4
0.107* 0.121* 0.052* 0.000* –0.1925
ITEM0021 –1.333 1.099 1.214 0 10.9
0.102* 0.109* 0.068* 0.000* –0.0539
(continued)
Table 10.12 provides a partial output from BILOG-MG phase 3 that includes propor-
tion correct, person or examinee ability, standard errors of ability, and reliability of ability
estimates.
======================================
CRIT2
CRIT2 1.0000
TEST: CRIT2
MEAN: -0.0013
S.D.: 0.9805
VARIANCE: 0.9614
TEST: CRIT2
RMS: 0.3299
VARIANCE: 0.1088
EMPIRICAL
RELIABILITY: 0.8983
Note. Reliability here relates to the reliability of the test independent of the sample of persons, based
on the idea of invariance introduced earlier in this chapter. Because of the IRT property of invariance,
the reliability estimate above represents a major difference between CTT reliability in Chapter 5 and
IRT-based reliability.
Table 10.12. Person Ability Estimates, Standard Errors, and Marginal Probability
Tried Right Percent Ability S.E. Marginal prob
25 2 8 –2.2906 0.5648 0.0052
25 2 8 –2.2906 0.5648 0.0052
25 2 8 –2.2906 0.5648 0.0052
25 2 8 –2.2898 0.5647 0.0001
25 2 8 –2.2906 0.5648 0.0052
25 2 8 –2.2906 0.5648 0.0052
25 2 8 –2.2906 0.5648 0.0052
25 2 8 –2.2906 0.5648 0.0052
25 2 8 –2.2906 0.5648 0.0052
25 3 12 –1.9726 0.5364 0.0057
25 3 12 –1.9478 0.5336 0.0002
25 3 12 –1.7086 0.4837 0.0000
25 3 12 –1.7583 0.4983 0.0006
25 3 12 –1.9431 0.533 0.0029
25 3 12 –1.9431 0.533 0.0029
● ● ● ● ● ●
● ● ● ● ● ●
● ● ● ● ● ●
25 11 44 –0.2461 0.3813 0.0000
25 11 44 –0.3461 0.2961 0.0000
25 11 44 –0.3033 0.3384 0.0000
● ● ● ● ● ●
● ● ● ● ● ●
● ● ● ● ● ●
25 24 96 2.2445 0.3206 0.0032
25 24 96 2.3032 0.3251 0.0018
25 24 96 2.3032 0.3251 0.0018
(continued)
Reviewing the item statistics in Table 10.11, we see that item 11 has a slope (dis-
crimination) of .99 and a location (difficulty) of –0.07. These are different from the Rasch
model where the discrimination was constrained to 1.0 and difficulty was observed as
–.358. Also, in comparison to the 1-PL model the differences are substantial with the
discrimination for item 11 being 1.66 and location or difficulty being –.303.
Item information in the two-parameter model is more complex than in the Rasch or one-
parameter IRT model because each item has a unique discrimination estimate. As the ICC
slope becomes steeper, the capacity of the item to discriminate among persons or examin-
ees increases. Also, the higher the discrimination of an item, the lower the standard error
of an examinee’s location on the ability scale. For items having varying discrimination
parameters, their ICCs will cross one another at some point along the ability continuum.
Item discrimination parameter values theoretically range from negative to positive infinity (–∞ to +∞), although for purposes of item analysis, items with discrimination values of
0.8 to 2.5 are desirable. Negative item discrimination values in IRT are interpreted in a
similar way as in classic item analysis using the point–biserial correlation. For example,
negative point–biserial values indicate that the item should be discarded or the scoring
protocol reviewed for errors. Equation 10.22 illustrates the item information function for the two-parameter model:

$$I_j(\theta) = a_j^2\,P_j(\theta)\,\big[1 - P_j(\theta)\big] \qquad \text{(Equation 10.22)}$$

Figure 10.15 illustrates the item information function for the two-parameter model (item 11, Figure 10.14).

Figure 10.15. Item information function based on the 2-PL IRF in Figure 10.14. Information is plotted against the ability scale (θ); the function reaches its maximum (approximately .70) near the item location (b ≈ –.07).
In the next section, the three-parameter logistic IRT model is introduced with an
example that allows for comparison with the Rasch, one-, and two-parameter IRT mod-
els. As previously, we focus on item 11 in the crystallized intelligence test 2 data.
The three-parameter logistic (3-PL) IRT model is based on the assumptions presented
earlier in this chapter and is the most general of the IRT models (i.e., imposes the fewest
restrictions during item parameter estimation). In the three-parameter model the item
parameters a-, b-, and c- are simultaneously estimated along with examinee ability. The
c-parameter in the three-parameter model is known as the guessing or pseudoguessing
parameter and allows one to model the probability of an examinee guessing a correct
answer. The c-parameter is labeled pseudoguessing because it provides a mechanism for
accounting for the situation where an examinee correctly responds to an item when the
IRT model predicts that examinee should not. However, this contradictory circumstance
may occur for reasons other than an examinee simply guessing. For example, a person
with very low ability may respond correctly to an item of moderate difficulty because of
cheating or because of other test-taking behavior, such as a well-developed ability to answer correctly based on a keen knowledge of how to take multiple-choice tests.
Recall that in the one-parameter model (and Rasch model) only the b-parameter is
estimated, with no provision for modeling differential item discrimination or the possibility
of correctly answering a test item owing to chance guessing (or another ability altogether).
In the two-parameter model, provision is allowed for the estimation of discrimination and
difficulty parameters (a- and b-) but no possibility for guessing a correct response. In the
one- and two-parameter models, the lower asymptote of the ICC is zero and the upper
asymptote is 1.0 (e.g., refer back to the top half of Figure 10.3). In the one- and two-parameter models, because the lower asymptote is always zero and the upper asymptote is 1.0, the probability of a correct response at an item's location (i.e., the difficulty δ or b-value) is given
as (1 + 0.0)/2 or .50. In the 3-PL model, the lower asymptote, called the c-parameter, is esti-
mated along with a- and b-parameters. When the probability of guessing a correct response
(or pseudoguessing parameter) is above zero, this is represented by the lower asymptote of
the ICC (i.e., the c-parameter) being greater than zero. The result of the c-parameter being
greater than zero is that the location of the item’s difficulty or b-parameter shifts such that
the probability of a correct response is greater than .50.
The advantage of using the three-parameter model is its usefulness for test items
or testing situations where guessing is theoretically and practically plausible (e.g., in
multiple-choice item formats). More precisely, the three-parameter model provides a way
to account for a chance response to an item by examinees. Because the c-parameter has
implications for examinees and items in a unique way, the role of the c-parameter merits
discussion. Consider a multiple-choice test item that includes five response options.
To account for random guessing, the c-parameter (i.e., lower asymptote of the ICC) for
such an item is set to 1/5 or .20 in the item parameter estimation process. However, the
random guessing approach assumes that all multiple-choice item alternatives are equally
appealing to an examinee, which is not the case in most testing conditions. For exam-
ple, an examinee who does not know the answer to a multiple-choice test item may
always answer the item based on the longest alternative (e.g., a test-taking strategy). In
the three-parameter model, parameters are estimated for persons of varying ability, but
their inclination to guess remains constant (i.e., the c-parameter remains constant for
all examinees), which is not likely to be the case. So, in this sense, the three-parameter
model may or may not accurately account for guessing.
Another artifact of using the three-parameter model is that nonzero c-parameter val-
ues reduce the information available for the item. If you compare the two-parameter model calibration results with the three-parameter results, you will see that the two-parameter model has no mechanism for modeling or accounting for the probability that a person of very low ability responds correctly to items of medium to high difficulty (or even to easy items).
This is the case because in the 2-PL model the lower asymptote is constrained to zero
(i.e., there is no chance guessing when the model does not mathematically provide for it).
The previous scenario regarding the probability of guessing, or of a person with very low ability answering an item correctly, can be explained by the following ideas.
$$P(X_j = 1 \mid \theta, a_j, b_j, c_j) = c_j + (1 - c_j)\,\frac{e^{D a_j(\theta - b_j)}}{1 + e^{D a_j(\theta - b_j)}}$$

$$P(X_{11} = 1 \mid \theta, a_{11}, b_{11}, c_{11}) = .08 + (1 - .08)\,\frac{e^{1.7 \times 1.15\,(0.0 - .08)}}{1 + e^{1.7 \times 1.15\,(0.0 - .08)}} = .08 + .92\left(\frac{1}{1 + e^{.156}}\right) \approx .50$$
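To make the three-parameter calculation concrete, here is a small Python sketch (the function name is illustrative, not part of BILOG-MG). Setting c = 0 in the same function recovers the two-parameter probability of .53 for item 11 reported earlier.

```python
import math

def three_pl_probability(theta, a, b, c, D=1.7):
    """3-PL logistic model: P = c + (1 - c) * exp(D*a*(theta - b)) / (1 + exp(D*a*(theta - b)))."""
    z = math.exp(D * a * (theta - b))
    return c + (1.0 - c) * z / (1.0 + z)

# Item 11 under the 3-PL (a = 1.15, b = .08, c = .08), person at theta = 0.0
print(round(three_pl_probability(0.0, 1.15, 0.08, 0.08), 2))   # ~0.50

# With c = 0 the model reduces to the 2-PL; item 11 (a = .99, b = -.07) gives .53
print(round(three_pl_probability(0.0, 0.99, -0.07, 0.0), 2))   # ~0.53
```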
Next, we calibrate the data with the three-parameter model using the BILOG-MG program provided below. We focus on item 11, and the IRF is provided in
Figure 10.16. Notice that the only change in the BILOG-MG program syntax from the
two-parameter model to a model with three-parameters is changing the NPARM=2 option
to NPARM=3 (highlighted in gray). Table 10.14 provides the item parameter estimates,
standard errors, and marginal probability fit statistics for the three-parameter analysis.
Three-Parameter Logistic Model.BLM - CRYSTALLIZED INT.TEST 2
ITEMS 1–25
Table 10.13 provides the classical item statistics and the logit scale values for the 25-item
crystallized intelligence test 2. Notice that item 1 is now reported because the maximum
of the log likelihood was obtained, making the estimation of item 1 parameters possible.
However, you see that this item (and item 2 as well) contributes very little to the test
because its point–biserial coefficient is .02 and 99.5% of the examinees answered the item
correctly. Shortly, we provide a way to decide which model—the two- or three-parameter—
is best to use based on the item response data from crystallized intelligence test 2. Table
10.13 is routinely provided in phase 1 of the BILOG-MG output.
Table 10.14 provides the parameter estimates from the BILOG-MG phase 2 output.
Notice that an additional column labeled “loading” is included. The loading values are
Threshold Loading (item*total test Asymptote
Intercept Slope (a-parameter) (b-parameter) correlation) (c-parameter) Chi-square
Item (S.E.) (S.E.) (S.E.) (S.E.) (S.E.) (PROB)
ITEM0001 3.26 0.55 –5.94 0.48 0.20 0.00
0.283* 0.155* 1.627* 0.136* 0.090* –0.98
ITEM0002 2.86 0.68 –4.24 0.56 0.20 1.70
0.245* 0.151* 0.772* 0.125* 0.090* –0.44
ITEM0003 1.22 0.70 –1.74 0.57 0.22 3.20
0.118* 0.087* 0.249* 0.071* 0.093* –0.78
ITEM0004 0.92 0.79 –1.17 0.62 0.21 7.50
0.112* 0.095* 0.211* 0.075* 0.087* –0.48
ITEM0005 0.74 1.39 –0.53 0.81 0.18 3.80
0.100* 0.167* 0.103* 0.097* 0.052* –0.80
ITEM0006 0.83 1.54 –0.54 0.84 0.15 8.70
0.097* 0.183* 0.087* 0.100* 0.047* –0.28
ITEM0007 0.94 0.61 –1.54 0.52 0.18 13.70
0.100* 0.068* 0.242* 0.058* 0.082* –0.09
ITEM0008 0.12 1.26 –0.09 0.78 0.28 5.70
0.136* 0.214* 0.120* 0.133* 0.052* –0.58
ITEM0009 0.26 0.99 –0.26 0.70 0.09 37.10
0.077* 0.101* 0.091* 0.072* 0.038* 0.00
ITEM0010 0.12 0.96 –0.12 0.69 0.10 12.00
0.085* 0.103* 0.096* 0.074* 0.040* –0.15
ITEM0011 –0.09 1.15 0.08 0.75 0.08 12.00
0.089* 0.123* 0.073* 0.081* 0.031* –0.15
ITEM0012 –0.01 1.57 0.01 0.84 0.03 25.90
0.067* 0.131* 0.043* 0.071* 0.014* 0.00
ITEM0013 –0.04 1.28 0.03 0.79 0.04 11.60
0.066* 0.111* 0.051* 0.068* 0.018* –0.12
ITEM0014 0.00 0.93 0.00 0.68 0.04 36.10
0.060* 0.079* 0.065* 0.058* 0.020* 0.00
ITEM0015 –0.14 1.02 0.14 0.71 0.03 38.90
0.060* 0.085* 0.055* 0.059* 0.015* 0.00
(continued)
Table 10.14. (continued)
Threshold Loading (item*total test Asymptote
Intercept Slope (a-parameter) (b-parameter) correlation) (c-parameter) Chi-square
Item (S.E.) (S.E.) (S.E.) (S.E.) (S.E.) (PROB)
0.095* 0.101* 0.070* 0.073* 0.009* 0.00
ITEM0020 –1.58 1.51 1.05 0.83 0.02 7.80
0.150* 0.154* 0.051* 0.085* 0.008* –0.16
ITEM0021 –1.40 1.14 1.23 0.75 0.01 17.60
0.120* 0.121* 0.068* 0.080* 0.007* 0.00
ITEM0022 –1.67 1.26 1.33 0.78 0.01 14.80
0.147* 0.134* 0.066* 0.083* 0.006* –0.01
ITEM0023 –1.80 1.04 1.74 0.72 0.02 8.50
0.172* 0.147* 0.124* 0.102* 0.008* –0.07
ITEM0024 –2.77 1.44 1.92 0.82 0.01 7.90
0.327* 0.234* 0.121* 0.133* 0.005* –0.05
ITEM0025 –3.99 1.84 2.18 0.88 0.01 10.00
0.669* 0.402* 0.149* 0.192* 0.003* –0.01
Note. Table values are from the phase 2 output of BILOG-MG. The intercept is based on the linear parameterization of the logistic model. The loading column is synonymous
with the results obtained from a factor analysis and reflects the impact or contribution of each item on the latent trait or ability.
Figure 10.16. Three-parameter model item response function for item 11.
synonymous with the results from a factor analysis and reflect the strength of association
between an item and the underlying latent trait or attribute.
Figure 10.16 provides the ICC for item 11 based on the three-parameter model. We
see in the figure that the c-parameter is estimated at a constant value of .079 for item 11
for the sample of 1,000 examinees. Notice that at the lower end of the ability scale the
item does not fit so well (e.g., at ability of –1.5, the solid dot is outside the 95% level of
confidence for the predicted ICC).
The estimation of item information in the three-parameter model is slightly more com-
plex compared to the one- and two-parameter models. The introduction of the c-param-
eter affects the accuracy of locating examinees along the ability continuum. Specifically,
the c-parameter is manifested as uncertainty and therefore an inestimable source of error.
A test item provides more information when the c-parameter is zero given an item’s dis-
crimination and difficulty. Thus, the two-parameter model offers an advantage. Equation
10.24 illustrates this situation. Figure 10.17 presents the item information for item 11 on
crystallized intelligence test 2.

$$I_j(\theta) = \frac{D^2 a_j^2\,(1 - c_j)}{\left[c_j + e^{D a_j(\theta - b_j)}\right]\left[1 + e^{-D a_j(\theta - b_j)}\right]^2} \qquad \text{(Equation 10.24)}$$

The maximum level of item information in the three-parameter model differs from the one- and two-parameter models in that the highest point occurs slightly above an item's location or difficulty. The slight shift in maximum information is given by Equation 10.25 (de Ayala, 2009, p. 144). For example, the item 11 location (b) is .08, but the
information function shifts the location to .085 in Figure 10.17. Birnbaum, as described
in Lord and Novick (1968), demonstrated that an item provides its maximum informa-
tion according to Equation 10.25.
Figure 10.17. Three-parameter model item information function for item 11. Information (Y-axis) is plotted against the ability scale score (X-axis); the function reaches its maximum of approximately .81 near θ = .085.
$$\theta_{\text{maximum}} = b_j + \frac{1}{D a_j}\,\ln\!\left[0.5\left(1 + \sqrt{1 + 8 c_j}\right)\right] \qquad \text{(Equation 10.25)}$$
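A brief numerical illustration of Equation 10.24 (a hypothetical sketch, not BILOG-MG output): evaluating the information for item 11 at its location with c = .08, and again with c = 0, shows how a nonzero pseudoguessing parameter reduces the information an item provides.

```python
import math

def three_pl_information(theta, a, b, c, D=1.7):
    """3-PL item information function (Equation 10.24)."""
    num = (D * a) ** 2 * (1.0 - c)
    den = (c + math.exp(D * a * (theta - b))) * (1.0 + math.exp(-D * a * (theta - b))) ** 2
    return num / den

# Item 11 under the 3-PL (a = 1.15, b = .08), evaluated at the item's location:
print(round(three_pl_information(0.08, 1.15, 0.08, 0.08), 2))  # ~0.81 with c = .08
print(round(three_pl_information(0.08, 1.15, 0.08, 0.00), 2))  # ~0.96 with c = 0
```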
Now that the Rasch, one-, two-, and three-parameter models have been introduced, we turn
to the question of which model to use. As you may now realize, this is a complex question
involving the entire test development process; not simply the mathematical aspects of fit-
ting a model to a set of item response data. From a statistical perspective, I present a way to
select among the possible models using a model comparison approach. Recall that the three-
parameter model is the most general of those presented in this chapter. Working from the most
general three-parameter model (i.e., least restrictive in terms of assumptions or constraints
placed on the item parameters), we can statistically compare the two-, one-parameter, and
Rasch models to it because they are variations on the three-parameter model. For example,
by imposing the restriction that the c-parameter is zero (i.e., there is no possibility of guess-
ing or adverse test-taking behavior), we have the two-parameter model. Likewise, imposing
the restriction that the c-parameter is zero and that the a-parameter is set to a constant, we
have the one-parameter model. Finally, imposing the restriction that the c-parameter is zero
and that the a-parameter is set to a value of 1.0, we have the Rasch model. The adequacy of
each model can be tested against one another by taking the difference between the –2 log
likelihood values available in phase 2 output of BILOG-MG. For the three-parameter model,
the final convergence estimate for the –2 log likelihood is 20250.0185 (highlighted in gray
in the display on page 401). Below is a partial display of the three-parameter model output
from phase 2 BILOG-MG that illustrates the expectation–maximization (E-M) cycles from
the calibration process for our crystallized intelligence test 2 data.
Phase 2 output for 3-PL model illustrating the –2 log likelihood values, interval counts for item
chi-square fit statistics, and average ability (theta) values across eight intervals based on the
empirical item response data
[E-M CYCLES]
[NEWTON CYCLES]
-2 LOG LIKELIHOOD: 20250.0185
Next, I provide the same section of the BILOG-MG output based on calibration of the
data using the two-parameter model.
Phase 2 output for 2-PL model illustrating the –2 log likelihood values, interval counts for item
chi-square fit statistics, and average ability (theta) values across eight intervals based on the
empirical item response data
[E-M CYCLES]
[NEWTON CYCLES]
The –2 log likelihood values are formally called deviance statistics because they are
derived from the fitted (predicted) versus observed item responses to the IRF. Because the
two-parameter model is completely nested within the three-parameter model, and the final –2 log likelihood values are known for both models, we
can conduct a test of the difference between the final deviance (i.e., –2 log likelihoods)
values using the likelihood ratio (LRT) test (Kleinbaum & Klein, 2004, p. 132), as illus-
trated in Equation 10.26a. The likelihood ratio statistic is distributed as a chi-square
when the sample size is large (e.g., the sample sizes normally used in IRT qualify).
Inserting the deviance values for the two-parameter and three-parameter models
into Equation 10.26a yields the result in Equation 10.26b.
To evaluate the difference between the two models, we need to know the degrees
of freedom for the two models. The degrees of freedom for each model are derived as the number of parameters in the model (e.g., for the three-parameter model there are three) times the number of items in the test (in our crystallized intelligence test 2 there are 25 items). Therefore, the degrees of freedom for the two-parameter model are 2 × 25 = 50, and for the three-parameter model, 3 × 25 = 75. Next, we subtract the two-parameter model's degrees of freedom (50) from the three-parameter model's degrees of freedom (75), yielding a difference of 25. Next, we use the chi-square distribution to test whether
the change between the two models is significant; by consulting a chi-square table of
critical values with 25 degrees of freedom (testing at a = .05) we find a critical value of
37.65. Recall that our value of the difference between the two model deviance statistics
is –83.72. The absolute value of this difference, 83.72, exceeds the chi-square critical value, so we reject the hypothesis that the two models fit the data equally well.
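A minimal sketch of this likelihood ratio comparison in Python (scipy is assumed to be available; the deviance values and degrees of freedom are those reported above):

```python
from scipy.stats import chi2

def likelihood_ratio_test(deviance_restricted, deviance_general, df_diff, alpha=0.05):
    """Compare two nested IRT models via the difference in their -2 log likelihood (deviance) values."""
    lrt = deviance_restricted - deviance_general
    critical = chi2.ppf(1 - alpha, df_diff)
    return lrt, critical, abs(lrt) > critical

# 2-PL (restricted, df = 50) versus 3-PL (general, df = 75): df difference = 25
lrt, critical, reject = likelihood_ratio_test(20166.30, 20250.0185, df_diff=25)
print(round(lrt, 2), round(critical, 2), reject)   # -83.72, 37.65, True
```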
The deviance value for the two-parameter model is smaller (20166.30) than the devi-
ance for the three-parameter model (20250.02). And since the difference between the two
values is statistically significant, the two-parameter model appears to be the best choice
given our data—unless there is an overwhelming need to employ a three-parameter model
for reasons previously discussed. Similarly, one can conduct a model comparison between
the two-parameter and one-parameter model to examine the statistical difference between
the two models (e.g., as in Table 10.15). However, the decision between using the one- or
two-parameter model may require more than a statistical test because a goal of the test
may be to estimate how the items discriminate differently for examinees of different ability
levels. Table 10.15 provides a summary of the three IRT models using the 25-item crystal-
lized intelligence test data with N = 1,000 examinees. The relative change values are derived
using Equation 10.27. In Equation 10.27 the deviance statistics are inserted into the equa-
tion to illustrate the relative change between the two- and three-parameter IRT models.
The column labeled “relative change” in Table 10.15 provides a comparison strategy
similar to that used in comparing multiple linear regression models. For example, in regres-
sion analysis a key issue is identifying the proportion of variance (R2) that a model accounts
for (i.e., how well a regression model explains the empirical data). The larger the R2, the bet-
ter the model explains the empirical data. Using this idea, we can compare our models by
examining the relative change in terms of proportion or percent change (or improvement)
across our competing models. Inspection of Table 10.15 shows that the relative change
from the one- to two-parameter model is very large (i.e., 97%). Next, we see that the change
between the three-parameter and two-parameter models is less than 1% (although the LRT
detected a statistically significant difference). Evaluating our three IRT models this way tells
us that the one-parameter model is the most parsimonious and that the difference between
the two- and three-parameter models, though statistically significant, is of little practical
importance from the perspective of how much variance each model explains. Based on the
model comparison results, it appears that the two-parameter model is the best to use if item
discrimination is an important parameter to be estimated for testing purposes. If item dif-
ficulty is the only parameter deemed important with regard to the goals of the test, the one-parameter model provides an acceptable alternative.
This chapter highlighted some key theoretical differences between CTT and Rasch and IRT
modeling. The ideas of weak and strong true score test theory were introduced. Next, the
assumptions of unidimensionality, local item independence, and invariance were covered,
and examples were provided regarding how these assumptions are evaluated with empiri-
cal data. We then turned to applied examples of the Rasch, one-, two-, and three-parameter
models for dichotomous item responses. Importantly, this chapter serves as a primer to
understanding other types of IRT models. For example, other types of Rasch and IRT mod-
els that are extensively used include (1) Rasch and IRT models for test items that yield item
responses scored on a partial credit basis (e.g., on problems in mathematics that require
steps in arriving at an answer); (2) Rasch and IRT models for attitude or rating scales that
yield item responses that are scored using Likert-type items (i.e., polytomous); (3) models
for exclusively nominal response data (e.g., in personality assessment); and (4) tests that are
multidimensional in structure (e.g., tests that measure two or more attributes or constructs
simultaneously). Excellent resources are available for learning about and implementing these
models including de Ayala (2009), Baker and Kim (2004), and Ostini and Nering (2006).
The foundational material in this chapter will serve you well in preparation for the transition
to using other Rasch and IRT models for addressing practical measurement problems.
Item information function (IIF). The contribution that an item makes to estimation of
person ability.
Item response function. The mathematical function that produces a trace line or ICC.
Item response theory. A system of modeling procedures that uses latent characteristics
of persons and test items as predictors of observed responses.
Joint maximum likelihood estimation. An early approach to the simultaneous estima-
tion of item parameters and person ability.
Latent class analysis. An analysis technique used when there are homogeneous sub-
populations of examinees within a sample. LCA can be used within the IRT framework
when multidimensionality is present in a measurement model.
Latent trait. A person’s underlying ability that is only observed indirectly.
Local independence. An axiom of IRT that states there is no statistical relationship (i.e.,
no correlation) between persons’ item responses to pairs of items once the primary
trait or attribute being measured is held constant or is accounted for.
Logistic function. An S-shaped curve based on the exponential constant e (2.718) in which growth is initially near-exponential, slows through the middle section, and finally reaches a plateau.
Marginal maximum likelihood estimation. An optimal IRT parameter estimation tech-
nique in terms of consistent large-sample asymptotic properties of item parameter
estimates. This asymptotic consistency property exists for short and long tests.
Maximum likelihood estimate. The value of person ability at which the likelihood function (equivalently, the log likelihood function) reaches its maximum point (the slope of the tangent line = 0), located by an iterative numerical method to a desired degree of precision.
Mixture model. A measurement or psychometric model that includes a heterogeneous
subpopulation of persons or examinees. Mixture IRT models also include item types
with heterogeneous response formats.
MLE[θ̂]. The maximum of the likelihood estimate of person ability.
Strong true score theory. A theory that involves applying mathematical models to data
obtained on tests or other social and behavioral measuring instruments. The assump-
tions involved in applying the model correctly to real data are substantial as com-
pared to classical test theory. The true relationship between observed variables (i.e.,
item responses) and unobserved variables or latent traits formally classifies IRT as a
strong true score theory.
Test information function. The sum of the item information functions; represents the con-
tribution that a set of items comprising a test makes to estimation of ability.
Testlets. Purposively designed sets of related test items; because the items within a testlet are related, their responses are correlated. In IRT, this type of item set violates the assumption of local independence.
Unidimensionality. An assumption of unidimensional IRT models whereby responses
to a set of items are represented by a single underlying latent trait, dimension, or
continuum.
11 • Norms and Test Equating
This chapter introduces norms and test score equating and the role each plays in psycho-
metrics. First, standard scores and the role they play in testing are introduced. Next, the development and use of standard scores are described, along with techniques for creating normative scores. Examples of linear, equipercentile, and item response theory–based
methods of equating observed and true scores are provided with intelligence test data.
11.1 Introduction
A primary difficulty in interpreting test scores stems from the variety of scales that exist
in psychological measurement. Furthermore, there are a variety of examinee groups on
which the scales (and test items) are defined during the process of test development.
These circumstances make it nearly impossible for users of psychological tests to develop practical familiarity with every scale and/or test they may use. In
Chapter 2, foundational principles and concepts of psychological measurement were
introduced, and comparisons were made with familiar physical measurement scales (e.g.,
temperature, weight, and length). Interpreting numerical values acquired from using these
well-established standard physical measurement scales, which are so common to our daily
lives, requires no reference manual to describe their characteristics (e.g., the precision,
accuracy, and reliability of the numbers the scales provide). Finally, no normative information about measurements of temperature, weight, or length is needed to ensure their correct use and interpretation, since direct experience with these standard scales provides sufficient guidance in most situations in which they are used.
In psychological (and educational) measurement and testing, we face substan-
tially more challenges than those encountered in standard physical measurements.
Most standardized tests of achievement and ability in psychology and education use
norms such as percentiles, age, or grade equivalents and standard scores. A standard
score is a raw score converted from one scale to another where the latter employs an
arbitrary mean and standard deviation. Standard scores are more easily interpreted than
raw scores, and the position of an examinee’s performance relative to other examinees
is clearly indexed. The term norm is used in the scholarly literature to refer to a behav-
ior that is usual, average, normal, standard, expected, or typical (Cohen & Swerdlik,
2010, p. 111). Norms are defined as test performance data on a group of examinees used
as a reference for evaluating, interpreting, or placing in context individual persons' test scores
(Cohen & Swerdlik, 2010, p. 111). Norming is the process of creating norms based on
a normative sample—a sample of examinees whose performance is analyzed and then
used as a reference for other individual persons taking the test. Norm-referenced testing
and assessment is defined as a method for evaluating and interpreting an examinee’s score
by comparing it to the scores of other examinees on the same test.
The following points provide guidelines for planning and conducting a norming study or
project. The information is general enough to be adapted to the specific needs and goals
of a particular study. We use the characteristics of subjects in the GfGc dataset to make
the examples concrete.
1. Decide on the (target) population to be used to derive the norms. Example: A
large sample is obtained from the U.S. population for the purpose of calculating norma-
tive values on the crystallized intelligence, fluid intelligence, and short-term memory
subtests. The sample is stratified by age (ages 15–90 years, grouped into eight age bands),
sex, and region, and the total sample size is at least 1,600 (e.g., 200 subjects per age
band). The sample includes a minimum of 25–100 individuals in each targeted demo-
graphic and language subgroup.
2. Select the sampling strategy. Example: A probability sampling strategy will
be employed. Probability (random) sampling ensures that every person in a defined
target population has a known probability of being selected into the sample. Knowing
this probability of selection, we can compute estimates of sampling error, leading to
information about the precision of the statistics computed from the raw scores. Fur-
thermore, in selecting a probability sample, stratification (the partitioning of a popula-
tion into homogeneous subgroups or strata and sampling independently from each)
will increase the precision (via representativeness) of the population estimates based
on the sample data. Based on the information in point 1 above, our sampling strategy
will be stratified random. Other random sampling strategies include simple random
sampling, systematic sampling, and cluster random sampling. See Groves et al. (2009)
for details of various sampling strategies that are applicable to a variety of norming
study designs.
Note on nonrandom sampling: Sometimes norms are developed using samples of
convenience or samples acquired for a specific purpose (i.e., convenience or purpose-
ful sampling). For example, consider the increased use of the Internet for web-based
survey sampling or test administration. Although convenience or purposive samples are
sometimes used in situations where normative information is calculated and reported,
the drawback to developing and using norms based on these types of samples is the pos-
sibility of systematic bias that influences respondents’ or examinees’ data (responses).
Additionally, the composition of the sample is unlikely to represent the target population
of interest accurately. In this situation, poststratification and weighting adjustments are
possible to better align the sample with the population of interest (e.g., see Groves et al.,
2009).
3. Select the statistics that will be calculated in preparation for the standard or
scale score creation using the norming sample. Example: The mean, standard devia-
tion, variance, skewness, kurtosis, and percentile ranks are based on raw scores from the
sample. Assuming we are using a random or probability-based sampling protocol, the
mean (or any other sample statistic) computed for the norming sample is an estimate of
the population parameter. For example, classical (long-run) probability theory tells us
that, under random sampling, we can construct a frequency distribution of sample means
around the single population mean (i.e., if we were to take a large number of repeated
independent random samples of a particular size, we could construct a sampling distribution
of the means, and there is an estimate of error associated with that sampling
distribution). Because we are using a random (probability-based) sample, the
error distribution is distributed normally (approximates the normal curve). The afore-
mentioned points provide us with a way to quantify the degree of sampling error in our
norming sample. This step in turn allows us to report the amount of error attributable to
our sampling protocol, which affects the accuracy of any interpretation made from using
the normative data.
4. Decide on the level of sampling error that is acceptable. Example: Previously,
sampling error was introduced in the context of probability theory. Sampling error is the
discrepancy between the sample estimate and the population parameter. The acceptable
margin of sampling error depends on the goals of the norming study and how the scores
are to be used.
5. Acquire the sample and review any anomalies that will influence the devel-
opment of the norms. Example: Conduct thorough data screening to identify outliers and
missing or out-of-range values. Carefully inspect the sample data for numerical errors
and make corrections as necessary. Decide on and then apply decision
rules for replacing or imputing missing data points.
6. Compute the values of group-level statistics such as the mean, variance,
skewness, kurtosis, and standard error of the mean (see the Appendix for a
review).
7. Select the type of normative scores that are most useful and develop raw
score–to–standard score conversion tables. Example: Percentiles or linearly transformed
z-scores placed on a standard score metric (e.g., mean = 10, SD = 3), or normalized scale
scores placed on an IQ scale score metric (e.g., mean = 100, SD = 15) for composite
scores such as verbal IQ. A brief computational sketch of these conversions follows this list.
8. Develop detailed written documentation of the norming procedures.
9. Draft guidelines or a technical manual for the purpose, development, use,
and interpretation of the norms.
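To make points 3, 6, and 7 more concrete, the following sketch (in Python rather than the SPSS syntax used elsewhere in this chapter) computes the group-level statistics and linearly transformed standard scores described above. The function names and the simulated raw scores are hypothetical illustrations, not the GfGc data or the chapter's procedures.

import numpy as np
from scipy import stats

def norming_statistics(raw):
    """Group-level statistics for a norming sample (points 3 and 6)."""
    x = np.asarray(raw, dtype=float)
    n = x.size
    sd = x.std(ddof=1)
    return {
        "n": n,
        "mean": x.mean(),
        "sd": sd,
        "variance": x.var(ddof=1),
        "skewness": stats.skew(x, bias=False),
        "kurtosis": stats.kurtosis(x, bias=False),   # excess kurtosis
        "sem": sd / np.sqrt(n),                       # standard error of the mean
    }

def linear_standard_scores(raw, target_mean=10.0, target_sd=3.0):
    """Linearly transformed z-scores placed on an arbitrary standard score metric (point 7)."""
    x = np.asarray(raw, dtype=float)
    z = (x - x.mean()) / x.std(ddof=1)
    return target_mean + target_sd * z

# Illustration with simulated raw scores (purely hypothetical)
rng = np.random.default_rng(seed=1)
raw = rng.normal(loc=33.5, scale=8.4, size=200).round()
print(norming_statistics(raw))
print(linear_standard_scores(raw)[:5])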
The unadjusted linear transformation is the most basic of the formal scaling methods.
To apply this method, a standard reference or normative sample is selected. Recall that the
method of sample selection depends on the goals of the norming study. For example, the
sample may be acquired randomly from a defined population with specific characteristics of
the sample or may be purposive (i.e., selected for the specific purpose of developing standard
scores for a particular group). Once the sample is defined and raw score data are acquired,
application of the unadjusted linear transformation method involves (1) relocating the raw
score mean at the desired scale score location and (2) ensuring a uniform change in the size of
the score units to yield the desired scale score standard deviation. Under the unadjusted linear
transformation method, only the mean and standard deviation of the raw score distribution
are changed (i.e., the first two moments of the distribution of scores). Therefore, the skewness
and kurtosis (i.e., the third and fourth moments of the distribution of scores) of the original
raw distribution are unaffected by the transformation to the standard scale score metric. Exces-
sive kurtosis or leptokurtosis in the new scale score distribution will mirror the characteristics
of the original distribution to the same degree. Finally, the linear transformation method does
not transform the raw score units of measurement to a scale where equal units of measure-
ment are obtained (i.e., ensuring an interval level of measurement).
Once the data are acquired from the reference or normative sample, the goal in
the unadjusted linear transformation method is to create standard score deviates (i.e.,
z-scores) for each corresponding raw score in the original distribution. Conceptually, this
is illustrated in Equation 11.1, where each raw score X is expressed as z = (X − M)/s, with M
and s denoting the mean and standard deviation of the raw score distribution.
Creating the new standard or scale scores involves a transformation using Equation
11.2, in which we calculate the slope and intercept of a straight line, T = AX + B. Here A is
the slope (the desired scale score standard deviation divided by the raw score standard
deviation), B is the intercept (the desired scale score mean minus A times the raw score
mean), and T is the new scale score. Using these constants, we can derive the new scale
score for each raw score in the original distribution.
To provide a working example of Equations 11.1 and 11.2, we use a subset of the
GfGc data consisting of examinees 15 to 20 years of age. The total sample size for this
group is N = 231. Our goal is to create a scale score with a mean of 10.0 and a standard
deviation of 3.0 from the distribution of the raw number correct scores for the crystallized
intelligence test 1 (vocabulary subtest). We will use the unadjusted linear transformation
method to accomplish our goal. The transformation of the original raw scores to the scale
scores can be accomplished using the COMPUTE command in SPSS. However, first we
need to calculate the constants for the slope (A) and intercept (B) in T = AX + B so that we
can use them in the SPSS COMPUTE command. Next, we need to obtain the mean and
standard deviation of the original raw score distribution for crystallized intelligence test 1.
Table 11.1 provides the descriptive statistics for the original raw score distribution for
crystallized intelligence test 1.
Next, using the information provided in Table 11.1, we can calculate the slope (A)
and intercept (B) constants required for creating our new scale scores as follows.
A = 3.0/8.386 = .357
B = 10.0 − .357(33.528) = 10.0 − 11.969 = −1.969
Finally, using the constants in Equation 11.3 we can calculate the new scale scores
using the following SPSS COMPUTE command syntax.
COMPUTE cri1_tot_SS=cri1_tot*.357-1.969.
EXECUTE.
To verify that our linear transformation was successful according to theory, we can
inspect the descriptive statistics of the raw and scale score distributions. For example, the
mean of the scale score distribution should be 10.0, and the standard deviation should
be 3.0. Recall that in the unadjusted linear transformation method only the mean and
standard deviation are used in deriving the slope and intercept constants. Therefore, the
skewness and kurtosis for the new scale score distribution should be unchanged from
the original raw score distribution. We can check to see if this is true by inspecting
the descriptive statistics (Table 11.2) of both distributions. Reviewing the statistics in
Table 11.2, we see that the unadjusted linear transformation worked as anticipated. By
using the COMPUTE command in SPSS, we have created a new variable in our dataset
labeled “cri1_tot_SS,” where each original raw score now has an associated scale score on
the mean = 10.0, standard deviation = 3.0 metric (within rounding error).
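The SPSS COMPUTE step above can be mirrored in other software. The following Python sketch (the function name and simulated scores are hypothetical, not the chapter's dataset) derives the slope and intercept constants from the raw score mean and standard deviation and verifies that only the first two moments change.

import numpy as np
from scipy import stats

def unadjusted_linear_transformation(raw, target_mean=10.0, target_sd=3.0):
    """Derive slope A and intercept B from the raw score moments and return T = A*X + B."""
    x = np.asarray(raw, dtype=float)
    a = target_sd / x.std(ddof=1)        # in the chapter: 3.0 / 8.386 = .357
    b = target_mean - a * x.mean()       # in the chapter: 10.0 - .357 * 33.528 = -1.969
    return a * x + b, a, b

# Simulated stand-in for the cri1_tot raw scores (illustrative only)
rng = np.random.default_rng(11)
cri1_tot = rng.normal(33.528, 8.386, size=231)

scaled, a, b = unadjusted_linear_transformation(cri1_tot)
print(round(a, 3), round(b, 3))
print(round(scaled.mean(), 2), round(scaled.std(ddof=1), 2))   # approximately 10.0 and 3.0
# Shape is unchanged: skewness and excess kurtosis match the raw distribution
print(np.isclose(stats.skew(cri1_tot), stats.skew(scaled)))
print(np.isclose(stats.kurtosis(cri1_tot), stats.kurtosis(scaled)))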
A percentile rank corresponds to a specific raw score and gives the percentage of examinees in
the norm group who scored below the score of interest (Crocker & Algina, 1986, p. 439). Percen-
tile ranks are useful for making relative or normative evaluations of an examinee's performance
within a specific group. The percentile rank scale is a type of normative scale that provides the
percentage of examinees in a specific group scoring below the midpoint of each score or score
interval. The percentile rank is defined in Equation 11.4 (Crocker & Algina, 1986, p. 439) as
PR = 100(cf below + .5fX)/N, where cf below is the cumulative frequency of examinees scoring
below the score interval of interest, fX is the frequency of examinees within the interval, and N
is the total number of examinees.
In calculating percentile ranks, we assume that the underlying construct of ability
or achievement is continuous (even though raw scores are discrete variables). Given this
assumption, each raw score point represents a score interval on the ability or achievement
continuum. To properly account for raw score intervals on the ability or achievement con-
tinuum, theoretically one-half of the examinees are expected to score below the midpoint
and one-half are expected to score above the midpoint. Therefore, in the numerator of
Equation 11.4, the value of .5 is used to index the midpoint of a particular class interval.
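A minimal sketch of the midpoint percentile rank computation just described, assuming the convention of counting all examinees below the score plus half of those at the score; the function name and the small sample are hypothetical.

import numpy as np

def percentile_rank(raw_scores, score):
    """Midpoint percentile rank: 100 * (cf_below + .5 * f_at) / N."""
    x = np.asarray(raw_scores)
    n = x.size
    cf_below = np.sum(x < score)   # cumulative frequency below the score interval
    f_at = np.sum(x == score)      # frequency within the score interval
    return 100.0 * (cf_below + 0.5 * f_at) / n

# Example: percentile rank of a raw score of 44 in a small illustrative sample
sample = np.array([30, 35, 38, 40, 41, 42, 43, 44, 44, 45, 47, 50])
print(round(percentile_rank(sample, 44), 1))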
Percentile ranks are easy to compute and appealing to laypersons and professionals alike.
They are used extensively in reporting or communicating the results of standardized or
norm-referenced tests. For example, provided that characteristics of the testing scenario
such as time of year and administration procedures were the same as those experienced
by the group originally used to develop the norms, certain statements can be made about
a person’s performance. For example, a person with a raw score of 44 scored higher
than 90% of the examinees in the norm group. Percentile ranks have one primary short-
coming: they distort the measurement scale, and this is particularly true at the extreme
regions of the scale (Gregory, 2000, p. 64).
Consider the case where four persons take an exam and person 1 scores at the 50th
percentile, person 2 at the 60th percentile, person 3 at the 90th percentile, and person 4 at the
99th percentile. Is the difference in raw score points between person 1 and 2 the same as
the difference in raw score points for persons 3 and 4? At first glance, it may seem that the
differences are the same. However, inspection of Figure 11.1 reveals that although the distance
between the 50th and 60th percentile ranks (10 percentile points—not raw score points) is
approximately the same as the distance between the 90th and 99th percentile ranks, the
difference in raw score points between the 50th and 60th percentiles is much smaller than
the difference in raw score points between the 90th and 99th percentiles.
Figure 11.1. Percentile ranks in a normal distribution. Adapted from Gregory (2000, p. 64).
Copyright 2000. Reprinted by permission of Pearson Education, Inc., New York, New York.
raw scores throughout the score continuum. This is because the conversion of raw scores to
percentile ranks is nonlinear rather than linear.
The decision to create normalized standard or scale scores during the process of creating
norms varies according to the goal(s) of how scores on the test will be used. For exam-
ple, for tests of ability (i.e., intelligence) or achievement (i.e., education or scholastically
based), the practice of using normalized scale scores is defensible for at least two reasons.
First, the items on such tests are developed or written in such a way that the distribution of
responses provided by an examinee group will approximate the normal distribution. Second,
the development of normalized scale scores usually involves large samples during the
norms development process. Creating norms using the normalized scale score approach on
large-scale tests of ability and achievement involves meticulous steps to ensure that (1) the
items are written in such a way as to yield a distribution of scores that is approximately
normally distributed and (2) large, representative samples are acquired for the norming
process. Under this scenario, creating and using the normalized scale scores makes sense
and is appropriate. However, in the situation where local norms (i.e., norms for a specific
group of examinees where generalization to outside groups is not conducted) are being
developed and the distribution of scores is not normally distributed (and is not expected
to be), creating normalized scale scores is arguably the incorrect decision. Instead, alterna-
tive norms (e.g., linear z-scores transformed to a useful metric or percentile scores) can be
developed that are more appropriate for the intended use of the test scores.
Normalized scale scores are created by applying a nonlinear transformation of the
original raw score distribution. Specifically, the inverse of the standard normal cumula-
tive distribution relative to the proportion of each raw score estimate in a distribution of
scores is applied to create the normalized scale scores. Although most raw score distribu-
tions do not meet the criteria for being classified as “normally distributed,” at times it
is reasonable to create normalized scale scores using the inverse of the standard normal
cumulative distribution. Normalized scale scores are typically created in a manner that
provides representation of (1) an examinee’s score relative to the norm group and (2) the
location of that norm group’s distribution in relation to that of other group distributions
(Crocker & Algina, 1986, p. 453). When applying this approach, the raw score distribu-
tion is changed or transformed into a metric that meets the normal distribution criteria.
The advantage of using normalized scale scores based on a z-score scale metric is that
regardless of the sample respondents involved, for each score point on the scale, a fixed
percentage of cases (i.e., persons) fall above and below that point. Normalized scale scores
differ from linear z-scores (transformed from raw scores) in that the normalized scores adhere
to the properties of the normal distribution (i.e., the mean, standard deviation, skewness,
and kurtosis values match the characteristics of the normal distribution). For this rea-
son, linear z-scores and normalized scale scores (or z-scores) will differ depending on how
the raw score distribution originally departed from the normal distribution.
The following SPSS syntax produces the normalized scale score values in the last col-
umn of Table 11.4. Normalized scale scores are obtained using the inverse of the standard
normal cumulative distribution of the proportion estimate in the sample.
Next, we can compare the observed (original raw score) distribution with the
expected (normal) shape for the variable cri1_tot using the syntax below (Figures 11.2
and 11.3). We do this for the original raw score and then for the normalized scale score
variable to evaluate whether the normalized scale score in fact fits the normal distribution.
Inspection of Figure 11.3 reveals this to indeed be the case.
SPSS syntax for producing quantile–quantile plots of raw and normalized scale scores
Note. The "rankit" option in SPSS applies the formula (r-1/2)/w, where w is the number of observa-
tions and r is the rank, ranging from 1 to w. Other options are available for deriving percentile ranks
depending on the goal of the norming study.
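The rankit-based normalization described in the note above can be sketched in Python (this is not the chapter's SPSS RANK syntax; the variable names and simulated raw scores are hypothetical). Ranks are converted to proportions with (r − 1/2)/w and then passed through the inverse of the standard normal cumulative distribution.

import numpy as np
from scipy import stats

def normalized_z_scores(raw):
    """Normalized z-scores: rankit proportions (r - 1/2)/w passed through the inverse normal CDF."""
    x = np.asarray(raw, dtype=float)
    w = x.size
    r = stats.rankdata(x, method="average")   # ranks 1..w, ties averaged
    p = (r - 0.5) / w                          # rankit proportion for each score
    return stats.norm.ppf(p)                   # inverse standard normal CDF

# Illustrative only: rescale the normalized z-scores to a mean = 10, SD = 3 metric
rng = np.random.default_rng(3)
raw = rng.poisson(lam=12, size=231)            # a deliberately non-normal raw distribution
z_norm = normalized_z_scores(raw)
nss = 10 + 3 * z_norm
print(round(nss.mean(), 2), round(nss.std(ddof=1), 2))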
Figure 11.2. Normal Q–Q plot of cri1_tot raw score variable. A departure from normality
occurs at the upper end of the raw score distribution. This is depicted by the dots moving
away from the diagonal line.
Figure 11.3. Normal Q–Q plot of cri1_tot normalized scale score variable. No departure from
normality occurs in the normalized scale score distribution. This is depicted by the solid dots fall-
ing on the diagonal line through the range of z = −3.0 to +3.0.
One problem with normalized scale scores, such as those displayed in Table 11.4, is that
negative scores exist in the distribution, making interpretation and reporting of results to
nontechnical audiences a challenge. We present two techniques for converting the scale to
a more useful metric.
The first example involves transforming the linear z-scores created and displayed in
Table 11.4 to a derived score. We can linearly transform the linear scale scores to another
metric. For example, the metric for subtests on the Wechsler test of intelligence and mem-
ory is mean = 10 and standard deviation = 3. To transform the linear scale scores previously
created to this metric, we can use the SPSS syntax below. Using the syntax, a new variable
named "cri1_tot_subtest_metric1" is created in the dataset from the linear scale score
variable "cri1_tot_LSS." Note that although this transformation moves the scores to a more
useful metric by changing the (1) location (mean) and (2) scale (standard deviation), it leaves
the (3) skewness and (4) kurtosis of the original raw score distribution unchanged. The SPSS
dataset "GfGc_Ageband_01.sav" includes the results of applying the syntax below.
*Derived score conversion syntax:
COMPUTE cri1_tot_subtest_metric1=10+3*cri1_tot_LSS.
EXECUTE.
The second example involves transforming the normalized z-scores created and dis-
played in Table 11.4 to normalized scale scores. Again for our example, we use the metric
for subtests on the Wechsler test of intelligence and memory where the mean = 10 and the
standard deviation = 3. To transform the normalized z-scores previously created to this
metric, we can use the SPSS syntax below. Using the syntax, a new variable in the dataset
is created named “cri1_tot_NSS.” Note that although the metric changes, this conver-
sion retains the properties of the normal distribution created by applying the inverse of
the standard normal cumulative distribution. The SPSS dataset “GfGc_Ageband_01.sav”
includes the results of applying the syntax below.
To create normalized composite scores (e.g., an IQ-metric composite based on several subtests), the general steps are as follows.
1. Create the linearly transformed z-scores or normalized scale scores that align
with the goal(s) of the test. In large-scale testing programs, these are typi-
cally normalized scale scores derived from the inverse cumulative normal density
function.
2. Smooth the newly created standard scores with the help of a curve-fitting
program or function. Programs such as SPSS, SAS, or Origin can be used
to facilitate the smoothing process, although some manual adjustments may be
required for the final norms.
3. Sum the smoothed standard scores for each subtest to create the composite score
for all examinees.
4. Modify the “normalized scale score conversion syntax” previously presented as:
COMPUTE new_IQ_composite_variable_name=100+15*variable name
for the sum of the four subtests that were normalized.
EXECUTE.
Following the steps above results in normalized composite scores for all examinees.
Typically, these scores will not need additional smoothing since the smoothing process
was conducted at the level of each subtest.
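A minimal sketch of steps 3 and 4, written in Python rather than SPSS and using hypothetical subtest scores. It assumes—something the syntax fragment above leaves implicit—that the summed subtest standard scores are restandardized to a z metric before being placed on the composite IQ metric of mean = 100 and SD = 15.

import numpy as np

def composite_iq(subtest_standard_scores, mean=100.0, sd=15.0):
    """Sum subtest standard scores, restandardize the sum, and rescale to the IQ metric.

    subtest_standard_scores: 2-D array, rows = examinees, columns = subtests.
    """
    scores = np.asarray(subtest_standard_scores, dtype=float)
    total = scores.sum(axis=1)                          # step 3: composite raw sum
    z = (total - total.mean()) / total.std(ddof=1)      # assumption: restandardize the sum
    return mean + sd * z                                # step 4: place on the 100/15 metric

# Four hypothetical normalized subtests (mean = 10, SD = 3) for 1,000 examinees
rng = np.random.default_rng(7)
subtests = rng.normal(loc=10, scale=3, size=(1000, 4))
iq = composite_iq(subtests)
print(round(iq.mean(), 1), round(iq.std(ddof=1), 1))    # 100.0 and 15.0 by construction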
The goal of creating an age- (or grade-) equivalent scale is to communicate the meaning
of a child’s or person’s test performance in terms of what is typical of a child or person at a
particular age or grade. Such scores are used primarily at ages (or grades) where ability or
achievement increases rapidly with age (e.g., in developmental studies of reading ability
or growth with young children). The following steps adapted from Angoff (1984, p. 20)
provide a general framework for constructing age (or grade) equivalent scores within the
context of a norming study. Children (i.e., very young through adolescence) serve as our
example since age-equivalent scores are most often applied to this group.
The first problem with age-equivalent scores is that there is not a perfect relationship
between ability (or achievement measured through mental testing) and chronological age. For example, a
perfect correlation or association does not exist between age and ability (represented by
score performance). This fact is easily demonstrated by regressing (1) age on test score
or regressing (2) test score (performance) on age. These two regression lines will be dif-
ferent; therefore, mental age (as linked to ability) will be different as well. The result is
that the same mental age may be assigned to a child with different test scores. Finally,
the lower the correlation (and this may occur for several reasons) between age and test
performance, the greater the challenge in interpreting age and test performance.
The second problem with age-equivalent scores is that the curve estimated in step 3
above does not capture the variability along different score values on the curve. The prob-
lem is that the age-equivalent score can yield a distorted view of a child’s advancement
or lack thereof. Depending on the level of the test’s reliability, this issue may be highly
problematic (e.g., if the score reliability is low). Consider the situation where the correla-
tion between age and test score is low and the variation about the regression line is high.
A child’s score under these circumstances will place him or her more than two years above
his or her actual age. Alternatively, if the correlation between age and test score is high
and the variation about the regression line is low, a child located at the 95th percentile
will be classified as precisely two years advanced beyond his or her age. The point is that
the variation about the regression line for age and test score is not constant across the score
range, and this distorts interpretation (and therefore the utility of the scores for users and the public).
The final reason that age-equivalent scores are problematic is that the concept of
mental score (defined as the same intellectual performance regardless of chronological
age) oversimplifies the study of individual differences. For example, although a 5-year-
old may have the same intelligence score as a 10-year-old, important differences in these
children exist. To this end, all that can be stated is that the 5-year-old is bright or that
her score locates her above the 99th percentile in comparison with other children her age.
The literature on test score scaling, linking, and equating is extensive, and this section of
the chapter provides an overview of these techniques. The primary focus is on test score
equating, supplemented by information on score linking. Many techniques and procedures
have been developed for equating test scores. Holland and Dorans (2006) classify the
procedures and techniques according to (1) common-population versus common-item
data collection designs, (2) observed-score versus true-score equating procedures, and
(3) linear versus nonlinear techniques. In this chapter, we focus on linear and nonlinear
observed score techniques and a true score equating technique based on item response
theory.
Test score linking describes the transformation from a score on one test to a score
on another test; test score equating is a special type of score linking (Dorans, Moses,
& Eignor, 2011). In score linking, techniques exist that allow for
(1) predicting scores on one test from other information about examinees, (2) aligning
scales, and (3) equating scores (Holland & Dorans, 2006). Scale aligning and score
equating are often confused because equating is a type of scale aligning that imposes
exceptionally strong requirements on the test forms (i.e., scores) being linked. In scale
aligning, the goal is to transform the scores from two different tests onto a common
scale. Figure 11.4 illustrates the different uses of scale linking and the scores produced
from each technique.
This section of the chapter focuses on the strongest form of linking between test
scores—score equating. For a thorough exposition on score linking, equating, and cali-
bration, readers are encouraged to see Kolen and Brennan (2004) and von Davier (2011).
Many testing and assessment programs have different versions of the same test that
produce scores useful in an interchangeable manner, even though the exact items on each
test version differ. The goal of equating is to “produce a linkage between two test forms
such that the scores from each test form can be used as if they had come from the same
test” (Dorans et al., 2011). At the heart of score equating is the idea that in order for scores
to be equivalent, they must be exchangeable. In order for scores on two tests to be truly
exchangeable, the following conditions are required (Dorans et al., 2011).
1. The two tests should measure the same construct, latent trait, or ability.
2. The two tests should exhibit equal estimates of score reliability.
3. The equating transformation function used for mapping the scores of test Y onto
those of test X should be the inverse of the equating transformation for mapping
scores of X to those of Y.
4. It should make no difference to the examinee regarding which of the two tests
he or she takes.
5. The equating function used to link the scores of X and Y should be the same
regardless of choice of population or subpopulation from which it was derived.
Figure 11.4. Three categories of test score linking methods and associated goals.
Figure 11.5. Types of test score linking methods used in test equating: common group/population designs (classical test theory) and anchor/reference test designs, each supporting observed score and true score equating (e.g., chain equating and poststratification equating).
Table 11.5 (Crocker & Algina, 1986, p. 458) summarizes these three designs.
Test score equating techniques are classified according to three primary categories. For
example, equating is conducted using (1) linear, (2) equipercentile, or (3) item response
theory-based methods. For each technique, specific assumptions are required in order
to produce accurate results from the score equating exercise. An equating function is a
transformation of raw scores on test X to the scale of raw scores on test Y. When equating
is successful, the equating function estimated in one population is very similar to the function
estimated in any other population, even though each function is estimated from a random sample of
examinees from that population. The role of random assignment is critical in equating
studies. To understand why, recall that our goal in equating is to compare the perfor-
mance of examinees (or groups) who have taken different tests. To accomplish this goal,
we must make adjustments to the test scores (or group statistics) so that the resulting
differences in scores reflect differences in the examinees or groups. So, the adjustment we
seek in the test scores must only be a function of differences in the tests—unaffected by
the attributes of the group of examinees used to make the adjustment.
In practice, equating proceeds according to two steps. In the first step (known as raw
score–to–raw score equating), the equating function is derived that links raw scores on
the “new” test (X) to those of an “old” test (Y). Step 2 involves conversion of the newly
equated X-scores to the scale to be used for reporting.
The linear equating technique assumes that the only differences between the distribution
of scores on test X and test Y are the means and standard deviations. Linear equating
involves identifying equivalent scores by identifying pairs of scores on one form X and
one form Y that have identical z-scores. If the z-scores are identical, the percentile ranks
will also be the same for scores on tests X and Y. In the example that follows, we proceed
according to Design I in Table 11.5, where 200 examinees are randomly assigned to take
test forms X and Y. We use Equation 11.5 to transform X to Y*.
Next we consider an example application of Equation 11.5. The mean of test form
X for the test of crystallized intelligence (for group 1) is M = 34, and the standard devia-
tion is s = 8. Group 2 takes form Y of the crystallized intelligence test, and the summary
statistics are M = 36 and s = 8.5. Now we can apply Equation 11.5 to estimate how a
score of 33 on test X will equate to a score on test Y*.
Equation 11.5 is

Y* = (sY/sX)(X − MX) + MY

where

• Y* = equated Y-score.
• sY/sX = slope of the conversion line (i.e., the standard deviation of Y divided by the standard deviation of X).
• X = score on test X.
• MY = mean of test Y.
• MX = mean of test X.

Equation 11.6 applies these quantities (MX = 34, sX = 8, MY = 36, sY = 8.5) to an X-score of 33.
We see from application of Equation 11.6 that the equated Y*-score for an X-score of
33 is 34.98. Equating procedures are affected by random error, so it is important to know
what the size of this error is to evaluate the accuracy of our equated scores. The standard
error of equating (Lord, 1980) is defined as the standard deviation of converted scores
on the scale of Y corresponding to a fixed value of X, in which each converted Y-score is
taken from a conversion line that results from an independent sampling of groups A and
B from a population that is normally distributed in X and Y. Equation 11.7 illustrates the
standard error of equating for a fixed X-score of 33 based on Design I where there are 200
examinees taking both test forms A and B. Equation 11.8 illustrates an application of the
standard error of equating with example data.
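A small sketch of the Equation 11.5 conversion using the rounded summary statistics from this example (the function name is hypothetical). With means of 34 and 36 and standard deviations of 8 and 8.5, the computed value is approximately 34.94; the chapter's reported 34.98 presumably reflects unrounded statistics.

def linear_equate(x, mean_x, sd_x, mean_y, sd_y):
    """Equation 11.5: Y* = (s_Y / s_X) * (X - M_X) + M_Y."""
    return (sd_y / sd_x) * (x - mean_x) + mean_y

# Design I example: form X (M = 34, s = 8), form Y (M = 36, s = 8.5)
y_star = linear_equate(33, mean_x=34, sd_x=8, mean_y=36, sd_y=8.5)
print(round(y_star, 2))   # approximately 34.94 with these rounded statistics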
Finally, in addition to differences in means and standard deviations, if the two distri-
butions for test X and Y differ on their degree of either skewness or kurtosis (or both), the
linear method of equating is not appropriate to use. In this case, equipercentile equating
is more suitable because the technique makes no assumptions regarding the shape of the
score distributions for X and Y (i.e., the equipercentile method makes no assumptions
about the equality of the first four moments of each distribution; mean, standard devia-
tion, skewness, kurtosis).
Under Design II, application of the linear equating technique assumes that the only dif-
ferences between the distribution of scores on test X and test Y for the groups are the
means and standard deviations. Additionally, scores on both tests are assumed to be equally reli-
able. In Design II (see Table 11.6), different groups of examinees take different forms
of the test in different randomly assigned orders. The goal of a Design II equating study
involves identifying equivalent scores by identifying pairs of scores on one form X and
one form Y that have identical z-scores (and percentile ranks). We use Equation 11.9a
to transform X to Y*:

Y* = A(X − MX) + MY

where

• Y* = equated Y-score.
• A = slope of the conversion line, equal to the square root of the ratio of the pooled form Y variance to the pooled form X variance across the two testing occasions.
• X = score on test X.
• MX, MY = means of forms X and Y.

Note that the slope of the conversion line is different in Design II
than was the case in Design I. Specifically, in Design II, the pooled variance is calculated
for each test form for the two testing occasions. The square root of the ratio of these two
variances is used for the slope estimate.
To illustrate Equation 11.9a in an equating study under Design II, consider the case
where the following score distributions result from the test administrations. The sum-
mary statistics are provided in Table 11.6.
In this example, our goal is to calculate an equated score (Y*) for an X-score of 52.
Equation 11.9a applied in Equation 11.9b using an X-score of 52 accomplishes our goal.
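A sketch of the Design II conversion as described above: the slope is the square root of the ratio of the pooled form variances across the two testing occasions, and (as an additional assumption here) the means are likewise pooled across occasions. The summary statistics below are hypothetical stand-ins, not the values in Table 11.6.

import numpy as np

def design2_linear_equate(x, stats_x, stats_y):
    """Design II linear equating: pooled means and a slope equal to the square root
    of the ratio of pooled form variances across the two testing occasions (a, b)."""
    mean_x = (stats_x["mean_a"] + stats_x["mean_b"]) / 2.0
    mean_y = (stats_y["mean_a"] + stats_y["mean_b"]) / 2.0
    slope = np.sqrt((stats_y["var_a"] + stats_y["var_b"]) /
                    (stats_x["var_a"] + stats_x["var_b"]))
    return slope * (x - mean_x) + mean_y

# Hypothetical occasion-level summary statistics for forms X and Y
form_x = {"mean_a": 50.0, "mean_b": 51.0, "var_a": 64.0, "var_b": 60.0}
form_y = {"mean_a": 53.0, "mean_b": 54.0, "var_a": 70.0, "var_b": 66.0}
print(round(design2_linear_equate(52, form_x, form_y), 2))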
The standard error of equating for Design II (Lord, 1980) is, as in Design I, defined
as the standard deviation of converted scores on the scale of Y corresponding to a fixed
value of X, in which each converted Y-score is taken from a conversion line that results
from an independent sampling of groups A and B from a population that is normally dis-
tributed in X and Y. Equation 11.10 illustrates the standard error of equating for a fixed
X-score of 52 based on Design II, where there are 200 examinees taking both test forms
A and B.
Application of Equation 11.10 is illustrated in Equation 11.11 using a fixed X-score
of 52.
In Equation 11.11, we see that the standard error of equating for Design II is much
smaller than was the case in Design I (although the distributions of X- and Y-scores were
not exactly the same). However, by incorporating counterbalancing as in Design II, a
favorable reduction in error variance is usually achieved for equating X- and Y-scores as
compared to using Design I. In fact, if the sample size is the same for Designs I and II,
the standard error of equating will always be smaller. Additionally, the standard error of
equating will be substantially smaller when the two test forms are highly correlated (e.g.,
.80 or higher). From a practical perspective, this means that if you are using Design I,
you will need more examinees than when using Design II. Equation 11.12 (Angoff, 1984;
Crocker & Algina, 1986) illustrates the ratio of Design I to Design II sample sizes required
to achieve equally precise equating of X- and Y-scores when (a) the correlation between
test forms is .80 and (b) score X is a z-score of zero (i.e., 0).
NA/NB = 2(zX² + 2) / {(1 − rXY)[zX²(1 + rXY) + 2]}
      = 2(0 + 2) / {(1 − .80)[0(1 + .8) + 2]}
      = 10
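Equation 11.12 can be checked directly with a one-line function (the function name is hypothetical):

def design_sample_size_ratio(z_x, r_xy):
    """Equation 11.12: ratio of Design I to Design II sample sizes for equal equating error."""
    return 2 * (z_x**2 + 2) / ((1 - r_xy) * (z_x**2 * (1 + r_xy) + 2))

print(round(design_sample_size_ratio(z_x=0.0, r_xy=0.80), 2))   # 10.0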
In Design III, random assignment of examinees to forms is not required; instead, each group takes its own test form (X for group 1, Y for group 2) along with a common anchor test (U), and two assumptions are required:
1. The slope, intercept, and standard error of estimate for the regression of X on U
in subgroup 1 are equal to the slope, intercept, and standard error of estimate for
the regression of X on U in the population.
2. The slope, intercept, and standard error of estimate for the regression of Y on U
in subgroup 1 are equal to the slope, intercept, and standard error of estimate
for the regression of Y on U in the population (Crocker & Algina, 1986, p. 460).
If random assignment of groups to test forms A and B is not possible, Design III
may still be used. However, the results obtained from applying the regression equations
under Assumptions 1 and 2 above must be evaluated prior to applying the method. For
example, the larger the discrepancy between the groups on the anchor test score (U), the
less likely the assumptions will hold. The results from such discrepancy will be inaccu-
rate score equating. Next, an example for Design III is provided.
Table 11.7. Summary statistics for two tests of short-term memory (equating Design III)

Group 1 (form X): M = 50.50, s = 10.00; anchor test U: M = 51.00, s = 9.00; bXU = .85
Group 2 (form Y): M = 48.50, s = 10.50; anchor test U: M = 49.00, s = 9.50; bYU = 1.25
Total: M = 50.00, s = 9.25

Note. M = mean; s = standard deviation; bXU = regression slope of X on U; bYU =
regression slope of Y on U. Total represents the mean and standard deviation on the
anchor test for both groups.
To illustrate equating Design III, consider the summary statistics for two tests of
short-term memory (Table 11.7).
Next we use Equation 11.13a (modified from Crocker & Algina, 1986, pp. 460–461)
to estimate a Y*-score for an X-score of 55.
Now we use the sample statistics in Table 11.7 and Equation 11.13a to solve for Y*
for an X-score of 55 in Equation 11.13b.
Y* = a(X − c) + d

where

• a = [s²Y2 + b²YU(s²U − s²U2)] / [s²X1 + b²XU(s²U − s²U1)], the anchor-adjusted variance of Y (group 2) relative to the anchor-adjusted variance of X (group 1).
• c = MX1 + bXU(MU − MU1), the anchor-adjusted mean of X.
• d = MY2 + bYU(MU − MU2), the anchor-adjusted mean of Y.

Here the subscript 1 denotes group 1 (form X), the subscript 2 denotes group 2 (form Y), and unsubscripted U values refer to the total group on the anchor test.
Y* = a(X − c) + d

a = [110.25 + 1.56(85.56 − 90.25)] / [100 + .722(85.56 − 81)]
  = (110.25 − 7.31) / (100 + 3.28)
  = 102.94 / 103.28
  = .996

c = 50.5 + .85(50 − 51) = 49.65

d = 48.5 + 1.25(50 − 49) = 49.75

and

Y* = a(X − c) + d = .996(55 − 49.65) + 49.75 = 55.07
The equipercentile equating method is highly flexible. Since the technique does not require the assumptions for linear equating, the
equipercentile method is classified as a more general, nonlinear technique. For example, the
equipercentile method makes no assumptions about the equality of the means and standard
deviations of the two test score distributions. The equipercentile method is “more general”
and can accommodate score distributions that are nonlinear. When the assumptions of the
linear method are met, linear equating is a special case of the equipercentile method. The
primary shortcoming of the equipercentile method is that the standard error of equating is
larger than the standard errors based on the linear equating techniques previously presented.
Nevertheless, in some situations linear equating methods are inappropriate when the assump-
tions are untenable. In such cases, the equipercentile method provides a useful alternative.
To illustrate equipercentile equating, we use two test forms (Y and X) for crystallized
intelligence test 1. Each group of examinees takes only one form—Y or X. The sample
size is 500 examinees in each study group. The score distributions for each form are
normally distributed but have different means, though approximately the same standard
deviations. Specifically, the mean of test form X is 12, and the standard deviation is 5.4.
The mean of test form Y is 13 and the standard deviation is 5.5. Although the means and
standard deviations are similar, Table 11.8 reveals that the percentile ranks for the raw
scores are quite different at certain locations along the score scale, so we use this data to
illustrate the logic and steps in conducting equipercentile equating.
The first step in equipercentile equating is to determine the percentile ranks for the
score distributions for each of the two forms. Table 11.8 provides the distributions and
midpercentile ranks for forms X and Y.
Percentile rank raw score curves are illustrated for each test form in Figure 11.6.
Figure 11.7 illustrates the smoothed Y- and X-scores resulting from the equipercen-
tile equating method. Results were obtained using The RAGE_RGEQUATE program
(Kolen & Brennan, 2004). Equipercentile equating can also be conducted using the program
Table 11.8. Midpercentile ranks for raw scores on test forms Y and X

Raw score   Form Y   Form X
2           1        1
3           3        3
4           8        8
5           15       16
6           18       21
7           23       26
8           27       31
9           32       36
10          38       43
11          44       49
12          49       54
13          55       60
14          63       66
15          67       72
16          71       76
17          78       81
18          84       86
19          88       90
20          92       92
21          95       96
22          98       98
23          99       99
24          99       99
25          99       99
Figure 11.6. Plot of percentile ranks for two 25-item tests of crystallized intelligence.
Figure 11.7. Equated and smoothed Y- and X-scores. Pre- (loglinear) and postsmoothing (cubic-
spline) techniques were applied. The RAGE_RGEQUATE program is available from Dr. Michael
Kolen (www.uiowa.edu/˜c07p358).
EQUIPERCENT (Price, Lurie, & Wilkins, 2001). The program is available on the com-
panion website for this book (www.guilford.com/price2-materials) and is provided in the
SAS and SPSS language. The program uses the SAS and SPSS MACRO language and can
process multiple versions of tests simultaneously. The program does not incorporate any
type of smoothing (preequating or postequating). Smoothing algorithms are often used
to refine the equipercentile technique. For example, in practice when using equipercen-
tile equating, either the raw score distributions are presmoothed before equating or the
equated scores are postsmoothed after equating has been completed. In raw score presmoothing, the goal is
to reduce some of the sampling variability that raw score frequency distributions display.
These techniques include loglinear presmoothing (von Davier, 2011) or cubic-spline
postsmoothing (Kolen & Brennan, 2004).
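A simplified, unsmoothed sketch of the equipercentile idea in Python: compute midpercentile ranks for each score point on both forms and invert the form Y percentile-rank function by interpolation. This is only a stand-in to illustrate the logic; it is not a reimplementation of RAGE_RGEQUATE or EQUIPERCENT, and the simulated forms are hypothetical.

import numpy as np

def mid_percentile_ranks(scores, points):
    """Midpoint percentile rank of each value in `points` within `scores`."""
    s = np.asarray(scores)
    pts = np.asarray(points)
    below = np.array([np.sum(s < p) for p in pts])
    at = np.array([np.sum(s == p) for p in pts])
    return 100.0 * (below + 0.5 * at) / s.size

def equipercentile_equate(x_scores, y_scores, score_points):
    """Map form X raw scores to the form Y scale via matching percentile ranks."""
    points = np.asarray(score_points, dtype=float)
    pr_x = mid_percentile_ranks(x_scores, points)   # percentile ranks on form X
    pr_y = mid_percentile_ranks(y_scores, points)   # percentile ranks on form Y
    # Invert the form Y percentile-rank function by linear interpolation
    return np.interp(pr_x, pr_y, points)

# Hypothetical 25-item forms: form Y is slightly easier (higher mean) than form X
rng = np.random.default_rng(5)
form_x = np.clip(rng.normal(12, 5.4, 500).round(), 0, 25)
form_y = np.clip(rng.normal(13, 5.5, 500).round(), 0, 25)
points = np.arange(0, 26)
print(np.round(equipercentile_equate(form_x, form_y, points)[10:15], 2))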
This section introduces you to test score equating using IRT. The information is intended
as an introduction to the IRT approach to equating and focuses on enabling you to
understand the concepts and advantages of using IRT for test score equating. For those
interested in a comprehensive treatment of the topics of IRT-based scaling, linking, and
equating of test scores, see Kolen and Brennan (2004) and von Davier (2011).
IRT posits that an underlying latent trait (e.g., a proxy for a person’s ability) can
be explained by the responses to a set of test items used to capture measurements on
some social, behavioral, or psychological attribute. The latent trait is represented as a
continuum (i.e., a continuous distribution) along a measurement scale. The Rasch and
one-, two-, and three-parameter logistic IRT models are frequently in use today. Equat-
ing can be conducted within any of these scaling models. Unidimensional IRT models
You can see from the points above that the linear and equipercentile methods often fall
short on several of these points. IRT offers a framework for improving the equating exer-
cise by addressing many of the above issues.
In Chapter 10 comparisons of CTT and IRT were presented. Arguably, the most
important difference between the two theories and the results they produce is the prop-
erty of invariance. In IRT, invariance means that the characteristics of item parameters
(e.g., difficulty and discrimination) do not depend on the ability distribution of exam-
inees, and conversely, the ability distribution of examinees does not depend on the item
parameters. In Chapter 7, CTT item indexes introduced included the proportion of
examinees responding correctly to items (i.e., proportion-correct) and the discrimination
of an item (i.e., the degree to which an item separates low- and high-ability examinees).
In CTT, these indexes change in relation to the group of examinees taking the test (e.g.,
they are sample dependent). However, when the assumptions of IRT hold and the model
adequately fits a set of item responses (i.e., either exactly or as a close approximation), the
same IRF/ICC (item response function/item characteristic curve) for the test items is observed
regardless of the distribution of ability of the groups used to estimate the item parameters.
For this reason, the IRF is invariant across populations of examinees. This situation is
illustrated in Figure 11.8.
Practically speaking, the invariance property ensures that examinees who respond to
different items on different test forms for which the item parameters are known will have abil-
ity estimates on the same scale (i.e., person ability estimates are linked). As presented in
Chapter 10, person ability and item parameter estimates are unknown and must be esti-
mated when using IRT. The property of invariance of the item response function relative
to linear transformations introduces an indeterminacy of the scale of ability. Indeterminacy
of the scale of ability occurs because estimation of the person ability parameter involves
an equation with two unknowns (i.e., the slope and intercept of a line, a and b). The result
is that during the process of parameter estimation using a set of item response data, the
item and ability parameter estimates are not able to be uniquely determined using maxi-
mum likelihood estimation (see Chapter 10 and the Appendix for more detail). A solu-
tion to this challenge is to set (i.e., standardize) either (a) the person ability estimates
(θ̂), for example, to a mean of 0 and a standard deviation of 1, or (b) the item difficulty
parameters to a mean of 0 and a standard deviation of 1. As previously stated, item and
ability parameters are invariant only up to a linear transformation (i.e., item and ability
parameter estimates of the same items and the same examinees will be linearly related in
two groups; Hambleton et al., 1991).
Next, we turn to an example application of IRT equating using a linear transfor-
mation for relating person ability estimates from two groups of examinees taking two
test forms (X and Y). The sample size for the two examinee groups is N = 500. For
simplicity of explanation, the following illustration assumes that item parameters and
person ability estimates are based on the one-parameter logistic model. For details
about placing the item parameters and person ability estimates on the same scale
for the two-parameter IRT model, see de Ayala (2009) and Hambleton et al. (1991,
p. 127).
Figure 11.8. Invariance of item response function across different ability distributions. A test
item has the same IRF/ICC regardless of the ability distribution of the group. For an item location/
difficulty of 0.0, the low-ability group will be less likely to respond correctly because a person in
the low-ability group is located at −1.16 on the ability scale whereas a person in the high-ability
group is located at 0.0 on the ability scale.
difficulty parameter in IRT was expressed as b). In this case, two examinees of the same
ability have responded to the same item, but the scale of measurement is different for
each examinee group (i.e., because we chose to standardize ability at the outset of our
IRT analysis rather than standardize on item difficulty). We can compute the difference
between the item difficulty parameter estimates for group 1 and group 2 (i.e., the difference
between the two groups' difficulty estimates for each common item)
to obtain a scaling or adjustment factor for transforming scores from group 1 to scores
for group 2 (Hambleton et al., 1991; Crocker & Algina, 1986). The adjustment or scaling
factor m is derived based on averaging over all values of m for all of the anchor items (or
all of the common items on the two tests). Table 11.9 illustrates this scenario for a test
comprised of 15 items (with 5 anchor items) taken by two groups of examinees.
Next, Table 11.10 illustrates the estimated ability scores (standardized to a m = 0;
s = 1 metric) for examinees taking test forms X and Y.
Finally, Table 11.11 provides the equated ability scores for forms X and Y after apply-
ing the adjustment calculated in Table 11.9, based on the average of the anchor-item
difficulty differences, m = 0.112. For example, consider an ability score on form X of 1.48
(for a raw score of 14). The equated ability estimate is 1.41 for the same raw score of 14
after applying the adjustment to the form Y ability score (i.e., 1.30 + .112 = 1.41). A brief
computational sketch of this adjustment follows.
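A sketch of the mean-difference adjustment under the Rasch/1PL model: the linking constant is the average difference between the two groups' difficulty estimates for the anchor items, and it is added to one group's ability estimates to place them on the other group's scale. The anchor-item values below are hypothetical but chosen so the average difference equals the chapter's value of 0.112.

import numpy as np

def mean_difference_linking_constant(b_group1_anchor, b_group2_anchor):
    """Average difference in anchor-item difficulty estimates between the two groups."""
    return float(np.mean(np.asarray(b_group1_anchor) - np.asarray(b_group2_anchor)))

# Hypothetical anchor-item difficulties estimated separately in the two groups
b_group1 = np.array([-0.45, 0.10, 0.62, -1.05, 0.95])
b_group2 = np.array([-0.58, 0.01, 0.50, -1.15, 0.83])
m = mean_difference_linking_constant(b_group1, b_group2)

# Apply the constant to place form Y ability estimates on the common scale
# (as in the chapter's example, 1.30 + .112 = 1.41)
theta_form_y = np.array([1.30, 0.42, -0.15])
theta_equated = theta_form_y + m
print(round(m, 3), np.round(theta_equated, 2))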
Recall that our goal in equating test scores using IRT is to equate scores on two uni-
dimensional tests that measure the same person ability (q). In IRT, an examinee's true
score on a test is the sum, across items, of the probabilities of a correct response at the
examinee's ability level (i.e., the sum of the item response functions).
Two true scores are considered equivalent if they correspond to the same ability score (q).
For example, a number-correct or raw score of 7 on test form X may be equivalent to a score
of 8 on test form Y. The observed score (or number-correct score) is an unbiased estimator of
true score (i.e., e(X) = t). Because IRT is the nonlinear regression of observed score on true
score, ability and true score are related by a monotonically increasing function (i.e., an item
response function or item characteristic curve). For this reason, true score can be mapped
onto the number-correct scale. Also, transformation of the ability scale from a m = 0/s = 1
metric to a number-correct score facilitates the interpretability of results. If so desired, the
number-correct score can be divided by the number of items on a test to yield a proportion-
correct score (e.g., sometimes used on criterion-referenced-type tests). Alternatively, for
normative scores, other score transformation metrics are often employed (e.g., an IQ metric
of m = 100/s = 15 or GRE score metric m = 500/s = 100).
In IRT, the probability of a correct response is given by the item response function;
using this information, we can insert estimates of these probabilities based on fitting an
IRT model to a set of examinee response data. The implications are that we can transform
person ability (q) and item parameters b and c without changing the probability of a cor-
rect response to an item. Transformation of the ability scale in IRT to the true score scale
is based on the sum of the item characteristic curves (Equation 11.14). For a stepwise
presentation of transformation of the ability scale to true score scale beginning with the
number-correct raw score, see Hambleton et al. (1991, p. 84).
Continuing with our true score equating example, we can use the item difficulty
parameter estimates in Table 11.9 in Equation 11.14 to calculate the number-correct score
on tests Y and X—for a person ability (q) of 1.0. Equation 11.15 illustrates this step.
Equation 11.14 expresses the true score as the sum of the item response functions (item
characteristic curves) across the items on a test, T = ΣPj(q), where

• T = true score (the expected number-correct score).
• ΣPj(q) = the sum of the item response functions/item characteristic curves across the items.
• Pj(q) = the probability of a correct response to item j given person ability.
• q = person ability.
Applying Equation 11.15 to form Y at an ability of q = 1.0, with item difficulties b1, . . . , b15
from Table 11.9,

TY = e^(1.0 − b1)/(1 + e^(1.0 − b1)) + . . . + e^(1.0 − b15)/(1 + e^(1.0 − b15)) = 8.32
Figure 11.9. Relationship between ability and the true scores on two tests.
Figure 11.9 depicts the equated true scores of (X = 6.14) and (Y = 8.32) based on a
person ability (q) of 1.0 using the TCC method of score equating.
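A sketch of the TCC-based true-score step in Equations 11.14 and 11.15 under the Rasch model, including the inverse step used in true-score equating: find the ability at which the form X TCC equals a target true score, then evaluate the form Y TCC at that ability. The item difficulties below are hypothetical, so the printed values will not reproduce the chapter's 6.14 and 8.32 pairing exactly.

import numpy as np
from scipy.optimize import brentq

def rasch_true_score(theta, difficulties):
    """Test characteristic curve: sum of Rasch item response functions at ability theta."""
    b = np.asarray(difficulties, dtype=float)
    return np.sum(1.0 / (1.0 + np.exp(-(theta - b))))

def irt_true_score_equate(true_score_x, b_form_x, b_form_y):
    """Find theta giving the target true score on form X, then return the form Y true score."""
    f = lambda t: rasch_true_score(t, b_form_x) - true_score_x
    theta = brentq(f, -6.0, 6.0)            # root of the monotonically increasing TCC
    return theta, rasch_true_score(theta, b_form_y)

# Hypothetical 15-item difficulty estimates for the two forms (form Y slightly easier)
rng = np.random.default_rng(13)
b_x = rng.normal(0.3, 1.0, 15)
b_y = b_x - 0.4
theta, ty = irt_true_score_equate(true_score_x=6.14, b_form_x=b_x, b_form_y=b_y)
print(round(theta, 2), round(ty, 2))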
This chapter introduced two types of scores used in psychological measurement and test-
ing. Examples were provided regarding the transformation of raw scores to standard
scores (including scale scores) and the advantages standard scores provide in commu-
nicating the results of test scores within the context of testing in general. Next, norms
were defined, and the specifics of planning a norming study were provided. The role of
the normal distribution was explained in relation to deriving and using normalized scale
scores.
Test score equating involves establishing scores that are equivalent (based on the
condition of exchangeability) on different tests or measurement instruments. Three types
of equating were discussed: (1) linear, (2) equipercentile, and (3) IRT-based or latent
trait. The distinction between score linking and equating was described. Score linking
was described as the transformation from a score on one test to a score on another test,
whereas score equating was defined as a special type of score linking with additional req-
uisite assumptions about equity, symmetry, and invariance of scores.
This chapter is foundational to understanding (1) types of score distributions and
their transformations to standard score metrics in a way that aids in communicating test
score results, (2) planning a norming study in a way that yields meaningful normative
scores, and (3) different designs and score transformation (linking) techniques for test
score equating. The exercise of developing standard scores or standardized scale scores
(e.g., norms) involves careful planning and execution of a norming study. Similarly, plan-
ning and conducting an equating study involves selecting the appropriate design and
score transformation technique to ensure that the resulting equated scores are appropri-
ate for their intended use. The material and examples in this chapter are good preparation
for designing and implementing a norming or horizontal equating study.
Norms. Test performance data for a group of examinees used as a reference for evaluat-
ing, interpreting, or placing in context an individual's test scores (Cohen & Swerdlik,
2010, p. 111).
Percentile rank scale. A type of normative scale that provides the percentage of exam-
inees in a specific group scoring below the midpoint of each score or score interval.
Raw score scale. A score metric that has no meaning on its own and requires supporting data
to translate it into meaningful information.
Scale aligning. The goal is to transform the scores from two different tests onto a common
scale.
Scale scores. A score scale with a specified metric that facilitates explanation of exam-
inee performance relative to a reference group (also known as derived scores).
Standard score. A raw score converted from one scale to another scale, where the latter
scale employs an arbitrary mean and standard deviation.
Test score linking. The transformation from a score on one test to a score on another test.
Unadjusted linear transformation. Relocating the raw score mean at the desired scale
score location in a way that ensures a uniform change in the size of the score units to
yield the desired scale score standard deviation. Only the mean and standard devia-
tion of the raw score distribution is changed.
Vertical equating. Equating of scores across test forms administered at different develop-
mental or grade levels (e.g., a test of reading achievement used at different levels of a
child's development or at different grade levels). If a child is formally classified or
enrolled in a particular grade level (e.g., based on his or her age) but is tested at
the previous grade level because of lagging progress, the child's score at the grade
or developmental level at which he or she is tested can be equated to scores at the
actual grade level at which he or she is enrolled.
Appendix
Measurement refers to rules for assigning numbers to objects. Researchers are able to
represent quantities of attributes numerically (through scaling) or to determine whether
objects fall into the same or different categories given a particular attribute (classifica-
tion). Although numerical scaling has dominated psychometric methods in the past,
innovations in software and increased computing technology (speed and power) have
opened analytic possibilities previously unrealizable. However, before capitalizing on the
new analytic possibilities afforded by computers and software, a review and update of
the mathematical and statistical foundations related to psychometric methods is essential
and is, therefore, the motivation for this Appendix.
A primary goal of psychometric methods is the measurement and scaling of attri-
butes. Attributes are identifiable qualities or characteristics represented by either numer-
ical elements or categorical classifications of objects of interest that can be measured.
During the process of scale or instrument development, careful consideration regarding
what terms define or constitute an attribute of an object is a crucial step. For example,
different words may mean different things to different people within or between different
cultures. Consider the case of the construct of intelligence and the manner in which it
has been defined and extensively used within a particular theoretical framework in the
United States. This theoretical framework is often inaccurate or lacks evidence of valid-
ity for people residing in other nations or even in the same nation! Nevertheless, given a
particular theoretical framework, the attributes that represent intelligence are evaluated
by examining the relationships among variables. The variables are mapped onto a specific
theoretical dimension using measurement operations and/or protocols. In this way, mea-
surements obtained theoretically reflect one unitary attribute.
A second goal of psychometrics focuses on the scaling of objects (e.g., people) into
classification schemes related to their preferences. Such outcomes are often based on a
person’s preference for certain products or services. A third goal is the measurement and
scaling of a person’s physiological–psychological response or threshold to a stimulus as
in a sensory perception measured by psychophysical scaling. Louis Thurstone’s law of
comparative judgment constituted the seminal work in this area by linking the stimulus of
objects (the psychophysical tradition) onto linear scales that tap such areas as sociability,
affective values, and the quality of written constructed responses to questions. Thurstone
is also credited with originating the mental testing tradition within psychometric methods.
Figure A.1 provides a taxonomy of psychometric methods from the 18th century forward.
Figure A.1. Taxonomy of psychometric methods: statistical methods (biometry and sociometry; Quetelet, 1796–1874) and experimental psychology (individual differences in stimulus/response) share common ground in psychological scaling methods for attributes, objects or persons, and stimuli; the associated analytic methods include the study of individual (between-subject) differences and within-subject change over time.
Measurement precision refers to the degree to which repeated measurements of the same
attribute under the same conditions are the same or highly similar. Repeatability is one characteristic assessed by indexes of
measurement reliability (a topic covered in Chapter 7). Measurement precision (or reli-
ability) is not to be confused with accuracy, which is the degree of conformity or agree-
ment, a quantity exhibited in relation to its actual (true) value. Accuracy of measurement
is assessed by quantitative evidence that is summarized by indexes of validity (the topic
covered in Chapters 3 and 4).
As an example, consider a set of scores obtained from a sample of 12th-grade students
on two parallel forms of a test designed to measure knowledge of mathematical concepts. An
examination of the responses on the two forms yielded a relationship such that those students
scoring high on form A also scored high on form B. Similarly, those students scoring low on
form A also scored low on form B. Thus, repeatability or consistency is exhibited between
forms A and B. Such repeatability is also known as score reliability. Similarly, consider the case
where a researcher is interested in whether two different instruments of the same length and
format measure severe clinical depression in a consistent and repeatable manner. A sample of
persons exhibiting severe clinical depression responds to the items comprising the two instru-
ments. An analysis of the responses on the two instruments demonstrated that those patients
scoring in the top quartile on the first instrument scored below the 50th percentile on the
second instrument. In this case, repeatability or consistency was not exhibited between the
two instruments designed to measure severe clinical depression.
Importantly, if a set of scores lacks repeatability or reliability, they provide no use-
ful information because there is no way for researchers to make inferences related to an
individual’s ability, achievement, attitude, or other attribute. The results of a measure-
ment can exhibit accuracy but lack precision, or they may exhibit precision but lack accuracy.
Evidence for accuracy and precision of obtained scores or classifications exists when the
outcomes of a measurement method or process demonstrate that the numerical score or
classification represents what it was theoretically intended to represent. Further, empiri-
cal evidence should be available to support the elements of accuracy and precision. When
accuracy and precision exist in the measurement process, at least one piece of evidence
for validity of the scores is substantiated (AERA, APA, & NCME, 1999; Kane, 2006).
Evidence for the objectivity of a particular scaling method (and the resulting scores
obtained) is demonstrated by the independent replication of results using a specific measurement protocol.
Figure A.3 (path diagram): individual items (e.g., fi1 item 1 through item 10; ci2 item 1 through item 25; stm1 item 1 through item 20) are summed into the fluid intelligence, crystallized intelligence (tests 2–4), and short-term memory (tests 1–3) composites, which in turn serve as indicators of latent constructs such as Crystallized Intelligence (Gc) and Short-Term Memory (Stm).
In GfGc theory, there are two major subcomponents, labeled fluid intelligence (Gf) and crystallized intel-
ligence (Gc). Fluid intelligence is defined as process oriented and crystallized intelligence
as knowledge or content oriented. Additionally, a general memory component is recog-
nized as part of the generalized theory of intelligence. In Figure A.3, the GfGc classification
scheme is represented by a set of cognitive, affective, and conative (i.e., the connection of
cognition and affect) trait complexes, along with the development of domain knowledge.
Figure A.3 also illustrates the connections between a theoretical model and actual data. The related dataset includes
a randomly generated set of item responses based on a sample size N = 1,000 persons.
The data file is available in SPSS (GfGc.sav), SAS (GfGc.sd7), or delimited file (GfGc.
dat) formats and are downloadable from the companion website (www.guilford.com/
price2-materials).
Short-term memory is composed of recall of information, auditory processing, and mathematical knowledge (see Table A.1). In Figure A.3, the small rectangles on
the far right represent individual items, which are summed to create linear composites
represented as the second, larger set of rectangles. The ovals in the diagram represent
latent constructs as measured by the second- and first-level observed variables. Table
A.1 (introduced in Chapter 1) provides an overview of the subtests, level of measure-
ment, and descriptions of the variables for a sample of 1,000 persons or examinees in
Figure A.3.
The tasks specific to psychological measurement are varied and often challenging.
Examples of tasks in psychological measurement include but are not limited to (1)
developing normative scale scores for measuring intelligence and short-term memory
ability across the lifespan, (2) developing a scale accurately reflecting a child’s reading
ability in relation to his or her socialization process, and (3) developing scaling mod-
els useful for evaluating mathematical achievement. Often these tasks are complex and
involve multiple variables interacting with one another. This section provides the defini-
tion of a variable, including the different types and the role they play in measurement
and probability.
To begin, consider how individual test items in each respective subtest in Figure A.3
and Table A.1 are used to acquire specific information from persons. Person responses
to individual items are summed to create a total test (also called a subtest in models
such as Figure A.3) score for each person in a sample of persons. The resulting sum of
a collection of items is known as a total score or linear composite. Linear composites
are depicted as the large rectangles in Figure A.3, labeled by test or subtest name. Any
scaling model that produces reliable and accurate scores on measures representing con-
structs can be used to study causal relationships with different constructs defined in
other theoretical models. For example, Figure A.3 can be expanded to include posited
relationships of the Gf and Gc components with other constructs such as personality or
educational achievement.
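A brief Python sketch (not part of the book's SPSS/SAS materials) illustrates how item responses are summed into a linear composite; the item values below are made up for illustration.

# One row of scored item responses (0/1) per person, e.g., the 10 items of fluid intelligence test 1.
item_responses = [
    [1, 0, 1, 1, 0, 1, 1, 1, 0, 1],   # person 1
    [0, 0, 1, 0, 1, 1, 0, 1, 1, 0],   # person 2
    [1, 1, 1, 1, 1, 0, 1, 1, 1, 1],   # person 3
]
# The linear composite (total or subtest score) is simply the sum of a person's item scores.
total_scores = [sum(person) for person in item_responses]
print(total_scores)   # [7, 5, 9]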
A variable is a measurable factor, characteristic, or attribute of an individual, system,
or process. Variables represent something that varies between individuals on a quality or
characteristic that can assume two or more different values. For example, the individual
test items (i.e., small rectangles) comprising a subtest in Figure A.3 are variables. Simi-
larly, the linear composites (i.e., subtests) in the figure are also variables. In mathematical
statistics, random variables are measurable functions that are classified as discrete or
continuous. A discrete random variable takes values from a countable set of specific val-
ues, each with some probability greater than zero (Probstat, n.d.). For example, in Figure
A.3 a discrete random variable is the sum of a set of individual item scores acquired from
a randomly sampled group of persons. Note that these are the scores actually observed
rather than all scores that are theoretically possible based on the model in Figure A.3. The
sum of items labeled fi1 item 1 through fi1 item 10 yields the subtest or test total score
labeled fluid intelligence test 1. For discrete variables, each score within the set of test
scores takes on a value from a countable set of actually observed values or scores—each
with some probability greater than zero. Conversely, a continuous random variable takes
on values from a theoretically uncountable or unlimited set. Therefore, the probability of
a single value is zero, but the probability of a set of values is greater than zero. Using a set of scores, we can model the probability of scores with continuous distributions fitted to random samples of persons from populations.
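The distinction can be made concrete with a short Python sketch (an illustration only; the scipy library and the binomial/normal examples are assumptions, not the book's data): a discrete variable assigns positive probability to single values, whereas a continuous variable assigns positive probability only to intervals.

from scipy import stats

# Discrete: the sum of 10 scored items, each answered correctly with probability .6.
discrete = stats.binom(n=10, p=0.6)
print(discrete.pmf(7))                           # P(total = 7) is greater than zero

# Continuous: a normally distributed score.
continuous = stats.norm(loc=50, scale=10)
print(continuous.pdf(55))                        # density at 55 (a height, not a probability)
print(continuous.cdf(60) - continuous.cdf(40))   # P(40 <= X <= 60) is greater than zero
# For the continuous variable, P(X = 55) exactly is zero; only intervals carry probability.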
Random variables are measurable functions obtained from probability spaces (i.e., a
theoretical distribution or density) that are then mapped onto a measurable space. The
measurable space is composed of the actual observations (i.e., sample space) of interest in
a study. The observations are assigned a probability distribution based on their behavior
or shape. In this way, probability theory provides a crucial link in the development of
statistical and psychometric models under varying conditions. For example, a researcher
may be interested in the fluid intelligence scores on the quantitative reasoning subtest.
Continuous Data
Data that can take on any values within a particular mathematical range are continuous.
In measuring length, for example, it is possible for an object (e.g., a board, wire, or rod)
to be 6 feet, 1 inch, or 6 feet, 2 inches, long or any conceivable length in between these
two points on the scale. Therefore, continuous data have no gaps in their units of scale.
Additional examples include weight, chronological age, and temperature. However, even
though a variable is continuous in theory, the process of measurement always reduces it to a
discrete level due to the accuracy and precision of the instrumentation used and the integrity
of the data acquisition/collection method. Therefore, continuous scales are in fact discrete
ones with varying degrees of precision or accuracy. Returning to Figure A.3, any of the
linear composites (i.e., variables representing total test or subtest score) may appear to
be continuous but are actually discrete, because a person can only obtain a numerical value equal to the sum of his or her item responses for the subtest (e.g., it is not possible for a person to obtain a score of 15.5 on a total test score). In practice, total test scores are often treated as continuous measures
with a certain level of precision. To this end, although continuous measures are only
approximate, such a level of approximation provides sufficient precision to be useful for
the application of psychometric methods.
In preparation for the remaining parts of this Appendix, the requisite symbols and opera-
tions are presented. The following symbolic notation and operations are used throughout
this text.
• N = size of a population.
• n = size of a sample from a population.
• Σ = summation of variables, where an example of variables is items or tests.
• Xi = variable X indexed by a lowercase i, where i represents an individual score for
a person, on variables such as items or tests.
• Xij = doubly scripted variable.
• $\sum_{i=1}^{5} X_i$ = limits of summation; for example, $X_1 + X_2 + X_3 + X_4 + X_5$. Application involves starting with i = 1 and proceeding by 1 until the fifth variable or number is reached.
• $\sum_{i=1}^{5} (X_i^2 - Y_i + 6)$ = complex terms can be included in summation operations; in this example, the expression in parentheses is computed five times, once for each pair of X and Y scores, and the results are summed.
• $\sum_{i=1}^{N}\sum_{j=1}^{N} X_{ij}$ = here, i represents the ith person and j the jth test or subtest; the double summation signifies the sum over all persons on all tests or subtests. The value above each summation symbol represents the last person and last test, respectively.
• $\sum_{i=1}^{N} c = Nc$ = summation of a constant equals the product of N times the constant (c = constant).
• $\sum_{i=1}^{N} cX_i = c\sum_{i=1}^{N} X_i$ = summation of scores or variables multiplied by a constant equals the constant times the sum of the scores (c = constant).
• $\sum_{i=1}^{N} (X_i + Y_i) = \sum_{i=1}^{N} X_i + \sum_{i=1}^{N} Y_i$ = when applying summation to more than one term, summation can be distributed to each term (a brief numeric check of these rules appears after this list).
• μ = mean of a population.
• X̄ = mean of a sample.
• σ = standard deviation of a population.
• s = standard deviation of a sample.
• σ² = variance of a population.
• s² = variance of a sample, also represented as var(X).
• p = proportion or percentage.
• q = 1 – p.
• P = probability.
• E = event or outcome in probability theory.
• F (X) = function of X; also the integral.
• f(X) = frequency of event or score X, also expressed as a frequency-based prob-
ability function of X.
• $\int_{-\infty}^{x} f(X)$ = frequency of event or score x, expressed as an indefinite integral for a continuous random variable with limits x and negative infinity.
• $\int_{-\infty}^{x} f(X)\,dX$ = frequency of event or score x, expressed as a definite integral for a continuous random variable with limits x and negative infinity.
• Lx = likelihood of the observed data or score based on the height of a distribution
function (e.g., normal) at a specific score.
• E = expectation operator.
• θ = random variable theta, representative of a parameter in Bayesian probability and item response theory.
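A brief numeric check of the summation rules above, written as a Python sketch (the score vectors X and Y are made up):

X = [2, 4, 6, 8, 10]
Y = [1, 3, 5, 7, 9]
N = len(X)
c = 6

print(sum(c for _ in range(N)) == N * c)                      # summing a constant gives Nc
print(sum(c * x for x in X) == c * sum(X))                    # a constant factors out of a sum
print(sum(x + y for x, y in zip(X, Y)) == sum(X) + sum(Y))    # summation distributes across terms
print(sum(x**2 - y + 6 for x, y in zip(X, Y)))                # a complex term summed over i = 1, ..., 5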
Using the number system for counting, measuring, and summarizing is so common today
that it seems hardly worth mentioning. In early cultures, the number system was created
for use as a symbolic and systematic way to describe or communicate about the real world
in an objective, precise, and consistent manner. The branch of mathematics that focuses
on the study of numbers or integers has a long history and remains important. However, theoretical research on the number system and its application to counting and measuring constitute two very different foci. Although using the number system is familiar to most people, a systematic introduction to forms of counting and to the relationship between collections of numbers and probability theory is important.
A logical starting point is to define the term data (as opposed to a datum, a single number). The term data refers to a collection of numbers, words, images, and so on; this usage began with early philosophers and is now accepted as convention within the social, behavioral, physical, and biological sciences. Descriptions of numerical data can be
communicated with a degree of precision and objectivity in two primary categories. First,
events or things (e.g., physical, psychological, or sociological attributes) that are counted
are summarized based on frequency of occurrence or the number of times an event (e.g.,
an attribute) is observed as having occurred. Second, events or things measured through
some scaling procedure yield scale (or scalar) values on a particular metric relevant to
the measurement task of interest. Psychometric and statistical modeling deals with both
forms of numerical data—frequency counts of events (both ordered and unordered), and
interval-level scale values such as normative data on psychological tests. Frequency dis-
tributions provide a tabular summary of how many times values (e.g., real numbers) on
a discrete variable occur for a set of subjects or examinees. The proportion of examinees
receiving a particular score is defined as the relative frequency of a score. The term relative refers to the position that a subject occupies within the cumulative (total) frequency distribution. Relative frequencies, discussed next, are foundational
to the classical, frequentist, or sampling theory approach to probability theory.
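The tabular logic of a frequency distribution and its relative frequencies can be sketched in a few lines of Python (the scores below are invented; the book's analyses use SPSS and SAS):

from collections import Counter

scores = [38, 40, 40, 41, 39, 40, 42, 38, 40, 41, 39, 40]
freq = Counter(scores)                       # frequency of each observed score
n = len(scores)

for score in sorted(freq):
    relative = freq[score] / n               # relative frequency = proportion of examinees
    print(score, freq[score], round(relative, 3))
print(sum(freq.values()) == n)               # the frequencies account for every examinee

The relative frequencies sum to 1 and serve as the sample-based probabilities used in the frequentist approach described next.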
$$P(E_i \text{ and } E_j) = P(E_i) \times P(E_j) \tag{A.1}$$
• P = probability.
• Ei = outcome of an event indexed by i.
• Ej = outcome of an event indexed by j.
• The probability of several independent events occurring jointly is the product of their separate probabilities.
$$P(E_i \text{ or } E_j) = P(E_i) + P(E_j) = p_i + p_j \tag{A.2}$$
• P = probability.
• Ei = outcome of an event indexed by i.
• Ej = outcome of an event indexed by j.
• The probability of occurrence of any one of several particular events is the sum of their individual probabilities, provided the events are mutually exclusive.
The sum of the probabilities across all possible outcomes must be exactly 1. Also, the probability that an event does not occur
is 1 minus the probability that the event does occur.
Next, consider a simple coin toss experiment (also known as a Bernoulli trial experi-
ment) with outcome 0 = tails; 1 = heads. This experiment can be described by the proba-
bility function f(xi) specifying the probabilities with which X can assume only the values
0 or 1. We assign a value xi to the ith outcome, and then we order the xi in ascending fash-
ion. The discrete random variable X is defined as that quantity which takes on the value
xi with probability p at each trial. To illustrate, if x1 = 0, and x2 = 1, and p1 = 1 − p, p2 = p,
and if p is not assigned a value, then over the long run of many independent trials p = .5.
Accordingly, this means that one-half of the time the outcome will be heads and one-half
of the time tails. The assumptions required for this type of experiment to yield valid outcomes are that the coin itself is fair or balanced and that the manner in which it is repeatedly tossed does not change across trials. The
total probability is unity (i.e., value of 1.0), and irrelevant events such as the coin rolling
out of sight or falling off of the surface are assigned probabilities of 0. Now consider the
following example using the intelligence test data from Figure A.3 and Table A.1. Next
we can use the frequency of scores provided in Table A.2 to answer the question, “What
is the probability that a score of 40 is obtained on crystallized intelligence test 1 based
on this sample?” The relative frequency (i.e., based on long-run frequency probability
theory) distribution for this subtest and sample is provided in Table A.2.
If we treat the intelligence test data as interval-level or continuous data, applying Equation A.4 to the data in Table A.2 yields a probability of .037 (e.g., a score of 40 occurred 37 times out of the 1,000 examinees, or 37/1,000 ≅ .037) of obtaining a score of 40.
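The long-run (relative frequency) notion of probability in the coin toss experiment, and the 37/1,000 calculation just given, can be illustrated with a short Python simulation (an illustration only, not the book's data):

import random

random.seed(1)
for n_trials in (10, 100, 10_000, 100_000):
    heads = sum(random.random() < 0.5 for _ in range(n_trials))
    print(n_trials, heads / n_trials)        # the relative frequency of heads approaches .5

print(37 / 1000)                             # sample-based probability of a score of 40, as in the text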
A continuous random variable is represented by values over some continuous region.
Also, a continuous random variable X as defined on the domain of real numbers is char-
acterized in Equation A.3 by its probability distribution function.
$$F(x) = P(X \le x), \qquad -\infty < x < \infty \tag{A.3}$$
• F = the function of x.
• P(X ≤ x) = the probability that a continuous random variable X is less than or equal to some value of x in its domain.
• −∞ < x < ∞ = the range of the function of x.

The symbols −∞ and ∞ represent the limits of the lower and upper bounds of the function of the variable. A familiar example is the standard normal (i.e., Gaussian) distribution. For further explanation of the differential and integral calculus applied to statistical
methods, see Calculus and Statistics by Michael Gemignani (1998) and Advanced Calculus
with Applications in Statistics by Andre Khuri (2003).
Equation A.3 illustrates that X is less than or equal to some value x of its domain.
If F(x) is an absolutely continuous function, the continuous analog of the discrete prob-
ability function is the density function in Equation A.4.
Also, by the absolute-continuity property, the continuous function can be represented
by a cumulative probability distribution (density) function for F(x) in Equation A.5.
$$f(x) = \frac{dF(x)}{dx} \tag{A.4}$$

$$F(x) = \int_{-\infty}^{x} f(x)\,dx \tag{A.5}$$

The integral symbol ∫ is defined as the summation of all quantities and differs from the symbol Σ in that ∫ represents the summation of a vast number of small quantities (i.e., dx) of infinitely small magnitude, as in calculating the total area under the normal (Gaussian) curve (see Figure A.4). The process of numerical integration allows us to calculate totals that otherwise we would be unable to estimate. In contrast, the symbol Σ represents the summation of a number of finite or discrete quantities. These two methods of numerical summation have implications for how psychometric scales are developed and how analytic methods are applied. The actual area under the curve may be calculated
by making the intervals infinitely small (no distance between the intervals) and then
computing the area using calculus methods such as Simpson’s rule or the trapezoid rule
(Gemignani, 1998).
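A Python sketch of the kind of numerical integration just described (an illustration, not the book's software), approximating an area under the standard normal curve with the trapezoid rule:

import math

def normal_pdf(x, mu=0.0, sigma=1.0):
    return math.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * math.sqrt(2 * math.pi))

def trapezoid(f, a, b, n=10_000):
    h = (b - a) / n
    total = 0.5 * (f(a) + f(b)) + sum(f(a + i * h) for i in range(1, n))
    return h * total

# Area between -1.96 and +1.96 standard deviations: approximately .95.
print(round(trapezoid(normal_pdf, -1.96, 1.96), 4))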
To make the idea of integration more concrete, consider an example from Figure A.3
and Table A.1 using a score of 40 obtained on crystallized intelligence test 1 based on the
sample of 1,000 persons. Phrased in a probabilistic way, we want to know, “What is the
probability that at least one person will score between 39 and 41, given that the range of
scores is between 4 and 50?” Using 3/66 (i.e., an area of .04545) as the cumulative prob-
ability distribution function for f(x) in our example, the definite integral or the probability
that a random variable (1 score) will fall within the interval 39 and 41 is derived in Equa-
tion A.6 and illustrated in Figure A.4.
Finally, if a random variable is defined only on some interval of the real line, then
values outside that interval represented by the cumulative probability distribution func-
tion for f(x) to either the left or right are defined as being either 0 or 1, respectively.
$$P(39 \le X \le 41) = \int_{39}^{41} f(x)\,dx = \int_{39}^{41} \frac{3}{66}\,x^{2}\,dx = \frac{3}{66}\left[\frac{x^{3}}{3}\right]_{39}^{41} = \frac{3}{66}\left(41^{3} - 39^{3}\right) = 3520; \qquad \frac{3520}{1000} \approx 3.52; \qquad \frac{3.52}{100} \approx .0352 \tag{A.6}$$
Figure A.4. A histogram of $F(x) = \int_{-\infty}^{x} f(x)\,dx$ based on score data from crystallized intelligence test 1 (mean = 35.23, SD = 8.609, N = 1,000). The horizontal line illustrates the intersection of a frequency of 37 on the Y-axis and a score of 40 on the X-axis. The probability of 1 person obtaining a score of 40 is approximately .037.
Now that probability density (distribution) functions have been introduced, we turn to maximum likelihood estimation (MLE). The method of maximum likelihood provides a general approach to estimating parameters; under the general linear model with normally distributed errors, it leads to the ordinary least squares solution in linear regression (see Chapter 2 for linear regression basics). MLE is used extensively in many statistical and psy-
chometric techniques. Extensive use is made of maximum likelihood because the method,
under many circumstances, often produces parameter estimates that exhibit smaller bias
(i.e., the expected value of all possible estimates equals the population parameters) and
smaller variance (i.e., values obtained from randomly different samples have small variance)
than other estimation methods (e.g., ordinary least squares or generalized least squares). An
exception to this statement is that in some scenarios maximum likelihood is not necessar-
ily the optimal method to use (e.g., very small sample sizes or non-normal distributions).
Therefore, the distributional characteristics unique to a specific set of data must be evaluated prior to deciding on a particular method of parameter estimation.
Maximum likelihood is useful for a wide array of statistical problems such as estimat-
ing parameters in IRT (introduced in Chapter 10) and logistic regression (see Chapter 4).
For example, MLE (or slight modifications of it) is useful in situations where the goal is to
estimate an unobservable or latent trait or attribute from sample data as in IRT. The goal
of MLE is to locate population parameters that will most probably generate a particular
sample estimate (under certain assumptions such as those in the normal distribution). For
example, the likelihood is conceptualized as the relative probability of drawing a certain
score from a distribution with known mean and variance. The distribution may be univari-
ate or multivariate normal or any other distribution. Equation A.7a provides the compo-
nents necessary for estimating the likelihood of a score with a known population mean and
variance. Next, an example is provided using population information.
To understand how Equation A.7a works, let’s assume that we have intelligence test
data that are normally distributed with a population mean of 100 and variance of 225. Next,
suppose we want to know the likelihood of obtaining a score of 115 on the intelligence test.
Inserting the mean and variance into Equation A.7a and carrying out the operations yields a
likelihood of .04402. Figure A.5 illustrates this result of applying Equation A.7a.
In Figure A.5, the likelihood is represented by the Y-axis and indexes the height of
the normal curve at a particular score.
Recall that the goal of MLE is to locate population parameters that have the greatest
probability of yielding a set of sample data. Recall from Equation A.1 that independent
events (the examinee scores in the present case) can be multiplied to ascertain a measure
of the joint probability. To accomplish this goal, Equation A.7a is expanded as in Equa-
tion A.7b. Application of Equation A.7b yields a single summary likelihood value that
represents a summary index of fit based on individual scores comprising a sample.
$$L_i = \frac{1}{\sqrt{2\pi\sigma^2}}\; e^{-.5\,(Y_i - \mu)^2/\sigma^2} \tag{A.7a}$$
• Li = likelihood of score i.
• μ = population mean.
• σ² = population variance.
• π = 3.14159265.
• $\frac{1}{\sqrt{2\pi\sigma^2}}$ = scaling term that allows the area under the curve to integrate (sum) to 1.
• $\frac{(Y_i - \mu)^2}{\sigma^2}$ = squared distance of an individual score from the population mean.
$$L = \prod_{i=1}^{N}\left\{\frac{1}{\sqrt{2\pi\sigma^2}}\; e^{-.5\,(Y_i - \mu)^2/\sigma^2}\right\} \tag{A.7b}$$
In practice, the values of the likelihood are small and difficult to work with. For this reason, the logarithm of the likelihood is used. For example, the logarithm of the value .04402 in Figure A.5 is −1.35. An additional benefit of using the logarithmic scale is that the logarithms of the examinees' likelihoods can be summed to yield a composite log likelihood value. For example, using Equation A.2 (the additive model of probability for independent events) provides a framework for summing individual likelihoods, resulting in an additive (linear) model. Equation A.7c illustrates how Equation A.7b is changed to include the logarithm.
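The mechanics of Equations A.7a through A.7c (A.7c is displayed below) can be sketched in Python; the sample scores are made up, and mu = 100 with sigma2 = 225 follow the worked example.

import math

def likelihood(y, mu, sigma2):
    # Height of the normal density at score y (the Equation A.7a form).
    return math.exp(-0.5 * (y - mu) ** 2 / sigma2) / math.sqrt(2 * math.pi * sigma2)

def log_likelihood(scores, mu, sigma2):
    # Sum of log likelihoods across a sample (the Equation A.7c form).
    return sum(math.log(likelihood(y, mu, sigma2)) for y in scores)

print(likelihood(115, mu=100, sigma2=225))   # likelihood (density height) for a single score of 115

sample = [92, 101, 118, 97, 110, 86, 105]
for candidate_mu in (90, 95, 100, 105, 110):
    print(candidate_mu, round(log_likelihood(sample, candidate_mu, 225), 3))
# The log likelihood is largest for the candidate mean closest to the sample mean,
# which is the intuition behind maximum likelihood estimation.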
$$\log L = \sum_{i=1}^{N}\log\left\{\frac{1}{\sqrt{2\pi\sigma^2}}\; e^{-.5\,(Y_i - \mu)^2/\sigma^2}\right\} \tag{A.7c}$$

Figure A.5. The likelihood (Li = .04402) for an IQ score of 115, based on a normal distribution with mean = 100 and variance = 225.

In some cases, measurement and statistical problems are very difficult to address within the frequentist or sampling theory probability framework. Under such circumstances, Bayesian probability and inference provide a powerful alternative. The history and development of Bayesian statistical methods (Hald, 1998; Bayes, 1763) are substantial and closely related to frequentist statistical methods. In fact, Gill (2002) notes that the fundamentals of Bayesian statistics are older than the current (i.e., classical or frequentist) paradigm. In some ways, Bayesian statistical thinking can be viewed as an extension of the traditional (i.e., frequentist) approach in that
it formalizes aspects of the statistical analysis that are left to uninformed judgment
by researchers in classical statistical analyses (Press, 2003). The formal relationship
between Bayesian (subjective) and classical (direct) probability theory is provided in
Equation A.8.
The goal of parametric statistical inference is to make statements about unknown
parameters that are not directly observable from observable random variables—the
behavior of which is influenced by these unknown parameters. In the Bayesian statisti-
cal approach, researchers view any unknown quantity (e.g., a population parameter)
as random, and these quantities are assigned a probability distribution (e.g., normal, Poisson, gamma, multinomial, binomial). The analytic focus is on the probability dis-
tribution that gives rise to or generates the observed data. In this way, population param-
eters are modeled as being random and then assigned a joint probability distribution
with the observed data thereby allowing researchers to summarize their current state
of knowledge about the model parameters. The result obtained in a Bayesian analysis
is a full probability model for population parameters and observed data. The utility
of the Bayesian approach is emphasized in Chapter 10 on IRT, where a probabilistic
model for responses to test items is presented. In comparison, in frequentist, or direct
probability and statistical theory, population parameters are assumed to be fixed (non-
random) and the data are viewed as being random—provided that random sampling has
occurred.
In the Bayesian framework, the sampling-based approach to estimation provides a solution for the random parameter vector θ by estimating the posterior density (distribution) of a parameter. This posterior distribution is defined as the product of the likelihood function (accumulated over all possible values of θ) and the prior density (i.e., distribution) of θ (Press, 2003; Gelman, Carlin, Stern, & Rubin, 2004).
To illustrate Bayes's theorem graphically, suppose that you are interested in the proportion of people in the United States who have been diagnosed with bipolar disorder. You denote this proportion as θ, and it can take on any value between 0 and 1. Next, using information from a national database, 30 out of 100 people are identified as having bipolar disorder. Two pieces of information are required: a range for the prior distribution and the likelihood, which is derived from the actual frequency distribution of the observed data. Using Equation A.9, Bayes's theorem multiplies the prior density and the likelihood to obtain the posterior distribution.
The process of Bayesian statistical estimation approximates the posterior density or distribution of, say, y, p(θ|y) ∝ p(θ)L(θ|y), where p(θ) is the prior distribution of θ, and p(θ|y) is the posterior density of θ given y. Continuing with our bipolar example, the prior density or belief (i.e., the solid curve) is for θ to lie between .35 and .45 and is unlikely to lie outside the range of .3 to .5 (Figure A.6).
The dashed line represents the likelihood, with θ being at its maximum at approximately .3, given the observed frequency distribution of the data. Applying Bayes's theorem involves multiplying the prior density by the likelihood. If either of these two values is near zero, the resulting posterior density will also be negligible (i.e., near zero, for example, for θ < .2 or θ > .6). Finally, the posterior density (i.e., the dotted-dashed line) covers a much narrower range and is more informative than either the prior or the likelihood alone.
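The prior-times-likelihood logic of Figure A.6 can be sketched with a simple grid approximation in Python; the normal-shaped prior centered near .40 is an assumption standing in for the prior curve described in the text, and the data (30 diagnoses out of 100) follow the example.

import math

grid = [i / 1000 for i in range(1, 1000)]                # candidate values of theta

def prior(theta, center=0.40, sd=0.03):                  # assumed prior shape
    return math.exp(-0.5 * ((theta - center) / sd) ** 2)

def binom_likelihood(theta, successes=30, n=100):
    return theta ** successes * (1 - theta) ** (n - successes)

unnormalized = [prior(t) * binom_likelihood(t) for t in grid]
constant = sum(unnormalized)                             # normalizing constant
posterior = [u / constant for u in unnormalized]

mode = grid[posterior.index(max(posterior))]
print(round(mode, 3))   # the posterior mode falls between the prior center (.40) and the sample proportion (.30)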
The proportionality symbol in Equation A.9 is interpreted as follows: If the posterior
density (distribution) is proportional to the likelihood of the observed data times the prior
imposed upon the data, the posterior density differs from the product of the likelihood times
the prior by a multiplicative constant. When the prior density for the data is multiplied
times the likelihood function, the result is improper, or “off” by a scaling constant. A
normalizing constant only rescales the density function and does not change the relative
frequency of the values on the random variable. Equation A.9 exemplifies the principle
that updated knowledge results from or is maximized by combining prior knowledge
with the actual data at hand. Finally, Bayesian sampling methods do not rely on asymp-
totic distributional theory and therefore are ideally suited for investigations where small
sample sizes are common (Price, Laird, Fox, & Ingham, 2009; Lee, 2004; Dunson, 2000;
Scheines, Hoijtink, & Boomsma, 1999).
An illustration of Bayes’s theorem is now provided to estimate a single-point prob-
ability with actual data using Equation A.10.
$$P(B|A) = \frac{P(A|B)\,P(B)}{P(A|B)\,P(B) + P(A|\bar{B})\,P(\bar{B})} \tag{A.10}$$
Consider the scenario where the proportion of practicing psychologists in the U.S.
population is .02, the proportion of practicing psychologists in the United States who
are female is .40, and the proportion of females among nonpracticing psychologists in
the United States is .60, then P (female | practicing psychologist) = .40, P (practicing
psychologists) = .02, P (female | not practicing psychologist) = .60, and P (not practicing
psychologists) = .98. Given these probabilities and applying Equation A.10 as shown in Equation A.11, the probability that a person is a practicing psychologist in the United States, given that she is female, is .0134.
Notice that the result obtained (P = .0134) is very different from the proportion of
practicing psychologists in the United States who are female (i.e., P (female | practicing psy-
chologist) = .40). In Bayesian terminology, the unconditional probabilities P(B) and P(B̄) in Equation A.10 are proportions (probabilities) and represent prior probabilities (i.e., what is currently known about the situation of interest). The conditional probabilities P(A|B) and P(A|B̄) are the probabilities actually observed in the sample, the product P(A|B) * P(B) is the likeli-
hood, and P(B|A) is the posterior probability. Alternatively, from a frequentist perspective,
the probability that a psychologist is female and is currently practicing is calculated
using the multiplication probability rule: P(A|B) * P(B) = (.40) * (.02) = .008. Notice
that this is the likelihood given the observed frequency (probability) distribution.
$$P(\text{practicing psychologist} \mid \text{female}) = \frac{(.40)(.02)}{(.40)(.02) + (.60)(.98)} = .0134 \tag{A.11}$$
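A quick numeric check of Equation A.11 in Python (the proportions are those given in the text):

p_practicing = 0.02                  # P(B)
p_not_practicing = 0.98              # P(not B)
p_female_given_practicing = 0.40     # P(A | B)
p_female_given_not = 0.60            # P(A | not B)

numerator = p_female_given_practicing * p_practicing
denominator = numerator + p_female_given_not * p_not_practicing
print(round(numerator / denominator, 4))   # approximately .0134, matching Equation A.11
print(round(numerator, 3))                 # .008, the likelihood (joint) term noted in the text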
Bayesian ideas have been incorporated into psychometric methods as a means of model-
ing the distribution that gives rise to a particular set of observed scores among individuals who have differing levels of true scores. Bayesian methods have been particularly
useful in statistical estimation, decision theory, and item response theory. Regarding
test theory and development, this probabilistic approach is very different from classical
or empirical probabilistic methods where the distribution of observed scores represents
an empirical probability distribution. In the Bayesian approach, the process of estimating
a person’s true score proceeds by making a priori assumptions about subjects’ unknown
true score distribution based on sampling distribution theory. For example, a function
such as the normal distribution function can be used as prior or subjective information in
the model and the probability of an observed score given the true score as the likelihood
distribution. Finally, the posterior distribution is derived across differing levels of sub-
jects’ true scores through open-form iterative numerical maximization procedures such as
MLE (introduced in Section A.10), iteratively reweighted least squares (IRLS—for ordinal
data), and restricted maximum likelihood (REML) and quasi-maximum likelihood or mar-
ginal maximum likelihood (MML). Using IRT as an illustration, we find that the method
of maximum likelihood estimation leads to parameter estimates (i.e., in IRT for items
and persons) that maximize the probability of having obtained a set of scores used in the
estimation process. Specifically, the MLE method (and variants of it) uses the observed
score data as the starting point for the iterative parameter estimation/maximization pro-
cess. The resulting parameter estimates have optimal item or score weights, and per-
son ability estimates, and are unbiased asymptotical estimates (i.e., q̂). Chapter 10,
on IRT, provides more detail on the process of open-form iterative numerical estimation
procedures.
The type of distribution that describes the way a variable maps onto a coordinate system
(i.e., 2-D or 3-D) has implications for the development and application of psychometric
scaling models and methods. The following section provides an overview of some distri-
butions commonly encountered in psychometrics and psychophysics.
Properties of random variables are numerically derived in terms of a density func-
tion. Five distribution functions of random variables common to psychometrics are (1) rect-
angular, (2) logistic, (3) logarithmic, (4) gamma, and (5) normal (Figure A.7).
These distributions are determined by two parameters: location (i.e., either the
arithmetic, geometric, or harmonic mean) and scale (i.e., the variance). The location
parameter positions the density function on the real number line X-axis, whereas the
dispersion (variance) parameter maps the spread or variation of the random variable. The
arithmetic mean (i.e., expectation or expected long-run value) of a random variable is
represented by the continuous density function in Equation A.12 and for the discrete case
in Equation A.13. Although Equations A.12 and A.13 appear to be essentially the same, Equation A.12 is helpful in understanding the principle of continuity underlying a set of measurements or scores taking on a range of real numbers. For example, although Equation A.13 is used extensively for the calculation of the mean of a set of scores, Equation A.12 reminds us that, theoretically, a continuous underlying process is usually assumed to give rise to the observed score values.
When the random variable is discrete, then Equation A.13 applies.
$$\mu_X = E(X) = \int_{-\infty}^{\infty} x\, f_X(x)\,dx \tag{A.12}$$

$$\mu_X = E(X) = \sum_{k} x_k\, p_X(x_k) \tag{A.13}$$
Figure A.7 (panels). Probability density and cumulative distribution functions for distributions common to psychometrics, including the rectangular (uniform), logistic, log-normal, gamma, normal, and Poisson distributions.
The expected value of a random variable is best understood within the context of the law of large numbers. For example, the expected value may or
may not occur in a set of empirical data. So, it is helpful to interpret the expected value
of a random variable as the long-run average value of the variable over many indepen-
dent repetitions of an experiment. Next, some properties are provided that are useful
when working with expectations. Specifically, Equations A.14 and A.15 illustrate alge-
braic properties used in conjunction with the expectation operator when manipulating
scores or variables.
In the case where variables X and Y are independent, Equation A.14 applies.
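The following Python sketch illustrates expectation-operator rules of the kind Equations A.14 and A.15 describe (linearity, and the product rule for independent variables); the simulated scores are arbitrary and serve only as an illustration.

import random

random.seed(7)
N = 100_000
X = [random.gauss(10, 2) for _ in range(N)]
Y = [random.gauss(-3, 5) for _ in range(N)]              # generated independently of X

def E(values):
    return sum(values) / len(values)                     # sample analog of the expected value

print(round(E([4 * x + 2 for x in X]), 1), round(4 * E(X) + 2, 1))        # E(aX + b) = aE(X) + b
print(round(E([x + y for x, y in zip(X, Y)]), 1), round(E(X) + E(Y), 1))  # E(X + Y) = E(X) + E(Y)
print(round(E([x * y for x, y in zip(X, Y)]), 1), round(E(X) * E(Y), 1))  # E(XY) = E(X)E(Y) when X, Y are independent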
As an extended example, we use a rigid rod analogy (from the topic of statics in phys-
ics) to conceptualize the properties of a variable. In Equation A.16, the function f(x) measures the density of a continuously measured rod mapped onto the X-axis. The kth
moment of the rod about the origin of its axis is provided in Equation A.16.
If actual relative frequencies of a variable are used (as is most always the case)
totaling N, then a = 0. This relationship means that the value of the arithmetic mean
depends on the value of a, the point from which it is measured. In Equation A.16, when k = 1, µ′1 is the first moment, or the mean, a term first used by A. Quetelet (1796–1874; Hald,
1998). Subsequently, Karl Pearson adopted this terminology for use in his work on the
coefficient of correlation. The term moment describes a deviation about the mean of a
distribution of measurements or scores. Similarly, a deviate is a single deviation about
the mean, and as such, deviates are defined as the first moments about the mean of a
distribution. The variance is the second moment of a real-valued random variable.

$$\mu_k' = \int_{-\infty}^{\infty} (x - a)^{k} f(x)\,dx, \qquad a = 0 \tag{A.16}$$

The
variance is defined as the average of the square of the distance of each data point from
the mean. For this reason, another common term for the variance is the mean squared
deviation. The skewness of a distribution of scores is the third moment and is a measure of asymmetry (i.e., left or right shift in the shape of the distribution) of a random variable. Finally, the fourth moment, kurtosis, is a measure of the degree to which scores in a distribution display an excessively flat or an excessively tall and peaked shape relative to the normal distribution. Once the first through fourth moments are known, the shape of a distribution for any set of scores can be determined.
At the outset of this appendix (and in Chapter 1), attributes were described as iden-
tifiable qualities or characteristics represented by either numerical elements or clas-
sifications. Studying differences between persons on attributes of interest constitutes a
diverse set of research problems. Whether studying individual differences on an indi-
vidual or group level, variation among attributes plays a central role in understanding
differential effects. In experimental studies, variability about group means is often the
preference. Whether a study is based on individuals or groups, research problems are
of interest only to the extent that a particular set of attributes (variables) exhibit joint
variation or covariation. If no covariation exists among a set of variables, conduct-
ing a study of such variables would be useless. To this end, the goal of theoretical and
applied psychometric research is to develop models that extract the maximum amount
of covariation among a set of variables. Subsequently, covariation is explained in light
of theories of social or psychological phenomena. Ultimately, this information is used to
develop scales that can extract an optimum level of variability between people related
to a construct of interest.
The variance of a random variable is formally known as the second moment about
the distribution of a variable and represents the dispersion about the mean. The variance
is defined as the expected value of the squared deviations about the mean of a random
variable and is represented as var(X) or s2X. The variance of a continuous random variable
is given in Equation A.17.
In the case where constants are applied to the variance, we have the properties shown
in Equation A.18.
Alternatively, Equation A.19 provides a formula for the variance of a distribution
of raw scores. In Equation A.19, each participant’s score is subtracted from the mean of
all scores in the group, squared, and then summed over all participants, yielding a sum
of squared deviations about the mean (i.e., sum of squares—the fundamental unit of
manipulation in the analysis of variance).
The variance is obtained by dividing the sum of squares by the sample size for the
group (N), yielding a measure of the average squared deviation of the set of scores. The
square root of the variance is the standard deviation, a measure of dispersion represented
in the original raw score units of the scale.

$$\mathrm{var}(X) = \int_{-\infty}^{\infty} [x - E(X)]^{2} f(x)\,dx = \int_{-\infty}^{\infty} x^{2} f(x)\,dx - [E(X)]^{2} = E(X^{2}) - [E(X)]^{2} \tag{A.17}$$

When calculating the standard deviation for
a sample, the denominator in Equation A.19 is changed to reflect the degrees of freedom
(i.e., N – 1) rather than N and is symbolized as s rather than s. The reason for using N – 1
in calculating the variance for a set of scores sampled from a population compared to N
is because of chance factors in sampling (i.e., sampling error). Specifically, we do not
expect the variance of a sample to be equal to the population variance (a parameter vs. a
statistic). In fact, the sample variance tends to underestimate the population variance. As
it turns out, dividing the sum of squares by N – 1 (in Equation A.19) provides the neces-
sary correction for the sample variance to become an unbiased estimate of the population variance. An unbiased estimate is one whose long-run average value, over repeated samples, equals the population variance. Finally, as the sample size grows large, the sample variance (s²) converges to the population variance (σ²).
When variables are scored dichotomously (i.e., 0 = incorrect/1 = correct), computa-
tion of the variance is slightly different. For example, the item-level responses on our
example test of crystallized intelligence 1 are scored correct as a 1 and incorrect as a 0
for each of the 25 items. Computation of the variance for dichotomous variables (i.e.,
proportion of persons correctly and incorrectly responding to a test item) is given in
Equation A.20.
$$\mathrm{var}(c) = 0, \qquad \sigma = \sqrt{\mathrm{var}(X)} \tag{A.18}$$

$$\sigma^{2} = \frac{\sum x^{2}}{N} = \frac{\sum (X - \bar{X})^{2}}{N} \tag{A.19}$$

• $\frac{\sum x^{2}}{N}$ = sum of the squared deviation scores (x = X − X̄) divided by the number of measurements in the population (N) or sample (N − 1).
• $\frac{\sum (X - \bar{X})^{2}}{N}$ = the sum of each raw score minus the mean of the raw score distribution, squared, divided by the number of measurements in the population (N) or sample (N − 1).

$$\sigma^{2} = p(1 - p) \tag{A.20}$$

The standard deviation and variance are useful for describing or communicating the dispersion of a distribution of scores for a set of observations. Both statistics are also useful in conducting linear score transformations by using the linear equation Y = a(X) + b
(e.g., see Chapter 11 on norming). Linear transformations are those in which each raw
score changes only by the addition, subtraction, multiplication, or division of a constant.
The original raw-score metric is changed to a standard score metric such as z (μ = 0, σ = 1), T (μ = 50, σ = 10), or IQ (μ = 100, σ = 15). Such transformations are useful when
creating normative scores for describing a person’s relative position to the mean of a
distribution (i.e., norms tables). Common forms of transformed scores used in psycho-
metrics include (1) normalized scores, (2) percentiles, (3) equal-interval scales, and
(4) age and/or grade scores. For example, a researcher may want to transform a raw
score of 50 from an original distribution exhibiting a mean of 70 and a standard devia-
tion of 8 to an IQ-scale metric (i.e., mean of 100/standard deviation of 15). Equation
A.21 can be used to accomplish this task. Using data on crystallized intelligence test 1 in Table A.1 and Figure A.3, conversion of a raw score of 40 to a standard score (i.e., z-score) in a distribution with a mean of 35.23 and a standard deviation of 8.60 is given by Equation A.21.

$$z = \frac{X - \mu}{\sigma} = \frac{40 - 35.23}{8.60} = .55 \tag{A.21}$$

Next, Equation A.22 illustrates a linear score transformation that changes the original raw score of 40 to an IQ score metric with a mean of 100 and standard deviation of 15.

$$X_t = \bar{X}_t + s_t z_o = 100 + 15(.55) = 108.25 \tag{A.22}$$
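The arithmetic of Equations A.21 and A.22 is easily verified in Python using the values quoted in the text:

raw_score = 40
mean_raw, sd_raw = 35.23, 8.60           # original raw-score distribution
mean_new, sd_new = 100, 15               # target IQ metric

z = (raw_score - mean_raw) / sd_raw      # Equation A.21
iq = mean_new + sd_new * z               # linear transformation to the IQ metric (Equation A.22)
print(round(z, 2), round(iq, 2))         # about .55 and 108.3 (108.25 if the rounded z of .55 is used)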
The third moment about a distribution of scores is the coefficient or index of skewness. The measure of skewness indexes the degree of asymmetry (degree of left/right shift on the X-axis) of a distribution of scores. Equation A.23 provides an index of skewness useful for inferential purposes (Glass & Hopkins, 1996; Pearson & Hartley, 1966). Note that the index in Equation A.23 can be adjusted for samples or populations in the manner that the z-score is calculated. For example, one can use the sample standard deviation or the population standard deviation depending on the research or psychometric task.
The fourth moment is the final moment about a distribution of scores, providing the ability to describe the shape of the distribution in its complete form. The fourth moment about a distribution of scores is the coefficient or index of kurtosis. Kurtosis indexes the degree of peakedness or flatness of a distribution, as reflected in a platykurtic (flat), mesokurtic (intermediate), or leptokurtic (tall, narrow) shape of a distribution of scores. Equation A.24 provides an index of kurtosis useful for inferential purposes (Glass & Hopkins, 1996; Pearson & Hartley, 1966).
The components of the transformation in Equation A.22 are as follows:
• Xt = transformed score.
• st = standard deviation of the transformed score distribution.
• zo = z-score transformation of the original observed score, based on the mean and standard deviation of the original raw-score distribution.
• X̄t = mean of the transformed score distribution.
$$g_1 = \frac{\sum_{i} z_i^{3}}{N} \tag{A.23}$$

$$g_2 = \frac{\sum_{i} z_i^{4}}{N} - 3 \tag{A.24}$$
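A Python sketch (not the book's software) computing the g1 and g2 indexes of Equations A.23 and A.24 from a made-up set of scores:

scores = [12, 15, 17, 18, 18, 19, 20, 21, 22, 22, 23, 24, 25, 28, 35]
N = len(scores)
mean = sum(scores) / N
sd = (sum((x - mean) ** 2 for x in scores) / N) ** 0.5   # population form of the standard deviation

z = [(x - mean) / sd for x in scores]
g1 = sum(zi ** 3 for zi in z) / N        # average cubed z-score: skewness
g2 = sum(zi ** 4 for zi in z) / N - 3    # average fourth-power z-score minus 3: kurtosis (0 for a normal curve)
print(round(g1, 3), round(g2, 3))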
The program below provides the SAS source code for computing assorted descrip-
tive statistics for fluid intelligence, crystallized intelligence, and short-term memory total
scores. The program also produces two output datasets that include the summary statis-
tics that can be used in additional calculation if desired.
LIBNAME X 'K:\Guilford_Data_2011';   /* points to the library containing the GfGc data */
DATA TEMP; set X.GfGc;               /* copies the GfGc dataset into a temporary working file */
RUN;
/* The PROC step that computes the descriptive statistics and output datasets follows in the full program. */
The previous section provides an explanation for the process whereby functions are used
to derive the density or distribution of a single random variable. These elements can be
extended to the case of multiple independent variables, each with its own respective den-
sity functions. To illustrate, the joint density function for several independent variables
is provided in Equation A.25.
In Equation A.25, F(x1) represents the density function of a single variable. Deriving
the variance of the sum of independent random variables is required when, for example,
the reliability of a sum of variables (i.e., a composite) is the goal. Equation A.26a pro-
vides the components for calculating the variance of a linear composite (e.g., the sum of
several variables or subtests) in order to derive an estimate of the variance of a composite.
Equations A.26b and A.26c provide an example using data from fluid intelligence tests 1 and 2 from Figure A.3.
$$F(x_1, \ldots, x_p) = \int_{-\infty}^{x_p} \cdots \int_{-\infty}^{x_1} f(x_1, \ldots, x_p)\, dx_1 \cdots dx_p \tag{A.25}$$
$$\sigma_Y^{2} = \sum \sigma_i^{2} + 2\sum \rho_{ij}\,\sigma_i\sigma_j, \qquad i \neq j \tag{A.26a}$$
$$r = \frac{\sum xy}{\sqrt{(\sum x^{2})(\sum y^{2})}}$$
• Σxy = sum of the products of the paired x and y scores.
• Σx² = sum of the squared x scores.
The correlation matrix for the first five items on fluid intelligence tests 1 and 2 from Figure A.3 is provided in Table A.3.
The covariance between any pair of items is given in Equation A.27b and is expressed
as the correlation between two items times their respective standard deviations. The
matrix presented in Table A.4 is a variance–covariance matrix because the item variances
are included along the diagonal of the matrix.
$$\mathrm{cov}_{ij} = r_{ij}\, s_i\, s_j \tag{A.27b}$$
Table A.3. Pearson Correlation Matrix for Items 1–5 on Fluid Intelligence Tests 1
and 2
FI FI FI FI FI FI FI FI FI FI
test 2, test 2, test 2, test 2, test 2, test 1, test 1, test 1, test 1, test 1,
item 1 item 2 item 3 item 4 item 5 item 1 item 2 item 3 item 4 item 5
FI test 2, item 1 1 0.24 0.22 0.25 0.29 0.22 0.19 0.26 0.20 0.19
FI test 2, item 2 — 1 0.36 0.43 0.39 0.28 0.27 0.30 0.24 0.28
FI test 2, item 3 — — 1 0.37 0.36 0.25 0.27 0.28 0.24 0.26
FI test 2, item 4 — — — 1 0.47 0.31 0.29 0.39 0.33 0.32
FI test 2, item 5 — — — — 1 0.29 0.31 0.32 0.26 0.34
FI test 1, item 1 — — — — — 1 0.35 0.45 0.32 0.40
FI test 1, item 2 — — — — — — 1 0.42 0.27 0.33
FI test 1, item 3 — — — — — — — 1 0.30 0.37
FI test 1, item 4 — — — — — — — — 1 0.38
FI test 1, item 5 — — — — — — — — — 1
Note. Standard deviation values are equal to 1 in a correlation matrix and are provided along the diagonal of the
matrix.
Table A.4. Covariance Matrix for Items 1–5 on Fluid Intelligence Tests 1 and 2
FI FI FI FI FI FI FI FI FI FI
test 2 test 2 test 2 test 2 test 2 test 1 test 1 test 1 test 1 test 1
item 1 item 2 item 3 item 4 item 5 item 1 item 2 item 3 item 4 item 5
FI test 2 item 1 0.10 0.03 0.03 0.04 0.04 0.06 0.05 0.07 0.06 0.05
FI test 2 item 2 — 0.16 0.06 0.08 0.07 0.09 0.10 0.11 0.08 0.10
FI test 2 item 3 — — 0.16 0.07 0.07 0.08 0.10 0.10 0.08 0.09
FI test 2 item 4 — — — 0.23 0.11 0.12 0.12 0.16 0.14 0.14
FI test 2 item 5 — — — — 0.23 0.12 0.14 0.13 0.11 0.15
FI test 1 item 1 — — — — — 0.69 0.26 0.33 0.23 0.29
FI test 1 item 2 — — — — — — 0.80 0.33 0.20 0.26
FI test 1 item 3 — — — — — — — 0.77 0.23 0.29
FI test 1 item 4 — — — — — — — — 0.73 0.29
FI test 1 item 5 — — — — — — — — — 0.78
Note. Bold numbers are variances of an item and are provided along the diagonal of the matrix.
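The link between Tables A.3 and A.4, and the composite-variance rule in Equation A.26a, can be checked with a few lines of Python (the values are read from the tables; everything else is illustrative):

r_12 = 0.24                       # correlation of FI test 2 item 1 with item 2 (Table A.3)
var_1, var_2 = 0.10, 0.16         # the two item variances (diagonal of Table A.4)

cov_12 = r_12 * var_1 ** 0.5 * var_2 ** 0.5     # Equation A.27b: covariance = r times the two SDs
print(round(cov_12, 3))                         # about 0.03, matching Table A.4

var_composite = var_1 + var_2 + 2 * cov_12      # Equation A.26a for a two-item composite
print(round(var_composite, 3))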
LIBNAME X 'K:\Guilford_Data_2011';
DATA temp; set X.GfGc;
RUN;
PROC CORR DATA=temp COV;   /* PROC CORR statement assumed here; the COV option requests covariances along with correlations */
VAR fi1_01 fi1_02 fi1_03 fi1_04 fi1_05 fi2_01 fi2_02 fi2_03 fi2_04 fi2_05;
TITLE 'COVARIANCES AND CORRELATIONS';
RUN;
As introduced earlier, the term moment describes deviations about the mean of a
distribution of scores. Similarly, a deviate is a single deviation about the mean, and such
deviates are defined as the first moments about the mean of a distribution. The second moments of a distribution are the squared deviations, whereas the third moments are the cubed deviations. Because standard scores (such as z-scores) are deviates with a mean of
zero, standard scores are actually first moments about a distribution, and therefore the
multiplication of two variables, say X and Y, results in the calculation of the product-
moment correlation coefficient.
Covariance

The covariance is defined as the average cross product of two sets of deviation scores and therefore can also be thought of as an unstandardized correlation. The equation for the covariance using raw scores is provided in Equation A.28. An important link between the correlation coefficient r and the covariance is illustrated in Equation A.29.

$$s_{XY} = \frac{\sum xy}{N} = \frac{\sum (X - \bar{X})(Y - \bar{Y})}{N} \tag{A.28}$$

$$r_{XY} = \frac{s_{XY}}{s_X\, s_Y} \tag{A.29}$$

The Pearson r is not well suited for describing a nonlinear relationship (i.e., a joint distributional shape that does not follow a straight line of best fit) between two variables. Using r in these situations can produce misleading estimates and tests of significance. Figure A.8 illustrates this nonlinearity of regression using the fluid intelligence test total score data. Note in the figure how, across the age span, scores on fluid intelligence are
slightly curvilinear, and as a person’s age increases their score plateaus. In Figure A.9, a
polynomial regression line (r-square = .46) describes or fits the data better than a straight
line (r-square = .42).
The SPSS syntax for producing the graph in Figure A.8 is provided below using the
dataset GfGc.SAV.
CURVEFIT
/VARIABLES=fi_tot WITH AGEBAND
/CONSTANT
/MODEL=LINEAR CUBIC
/PLOT FIT.
Figure A.8. Nonlinear regression of fluid intelligence total score (Y) on age (X).
Figure A.9. Comparison of linear versus nonlinear trend of fluid intelligence total score (Y)
on age (X).
Another way to evaluate the relationship between two variables is to examine the pat-
tern of the errors of estimation. Errors of estimation between X and Y should be approxi-
mately equal across the range of X and Y. Using the intelligence test example, we find that
unevenly distributed errors may arise when the estimation (or prediction) error between
ability scores (X) and actual scores (Y) is not constant across the continuum of X and
Y. Ultimately, heteroscedastic (i.e., unequal variability across the range of scores) errors of estimation are often due to differences among subjects on the underlying latent trait or construct representing X or Y. Such differences among subjects (and therefore measurements on
variables) are manifested through the actual score distributions, which in turn affect
the accuracy of the correlation coefficient. Again using our fluid intelligence test total
score data, Figure A.10 illustrates that the errors of regression are constant and normally
distributed. For example, notice that points in the graph (i.e., errors) are consistently
dispersed throughout the range of age and score for the subjects.
REGRESSION
/MISSING LISTWISE
/STATISTICS COEFF OUTS R ANOVA
/CRITERIA=PIN(.05) POUT(.10)
/NOORIGIN
/DEPENDENT fi_tot
/METHOD=ENTER AGEYRS
/SCATTERPLOT=(*ZRESID ,*ZPRED)
/RESIDUALS NORMPROB(ZRESID).
The normality of the distribution of a set of scores is an assumption central to tests of sta-
tistical significance and confidence interval estimation. Both X and Y variables should be
evaluated for normality (i.e., excessive univariate skewness and kurtosis) using standard
data screening methods. Recommended cutoff values for excessive univariate skewness
and kurtosis are provided in Tabachnick and Fidell (2007, pp. 79–81). These authors
Figure A.10. Scatterplot of regression standardized residuals (Y-axis) against regression standardized predicted values (X-axis) for the regression of fluid intelligence total score on age.
recommend using conventional and conservative alpha levels of .01 and .001 to evaluate
skewness and kurtosis with small to moderate samples. When the sample size is large (i.e.,
> 100), the shape of the distribution should be examined graphically since with a large
sample size the null hypothesis of normality will usually be rejected. Should the assump-
tion of normality be untenable, options available to researchers include transforming the
variable(s) or applying nonparametric or nonlinear analytic techniques. The primary con-
cern in conducting score transformations, however, is the issue of interpreting the results
of an analysis after the analysis is complete. Transformations may lead to difficult inter-
pretation and often do not lead to any improvement in meeting the assumption of normal-
ity. Another option is to consider using a nonparametric (i.e., assumption-free) analytic
method for the analysis. Choosing the best analytic model and technique given the data is
perhaps the wisest choice, particularly with the statistical software now available.
When two variables do not meet the linearity assumption and equal-interval level of
measurement requirement, the Pearson r is mathematically expressed by three special
formulas: Spearman's rank order correlation rS, the point–biserial correlation rpbis, and the phi coefficient rφ.
$$r_s = \frac{\sum (R_i - \bar{R})(S_i - \bar{S})}{\sqrt{\sum (R_i - \bar{R})^{2} \sum (S_i - \bar{S})^{2}}} \tag{A.30}$$
To provide an example using SPSS, the syntax below is used to derive the Spearman
correlation coefficient using the short-term memory total score categorized into low,
medium, and high categories, with a person’s age in years. Here age is treated as an ordinal
rather than interval measure to illustrate that as age increases short-term memory decreases.
Table A.5 provides the Spearman correlations for the relationship between short-term memory and age.
SPSS syntax for Spearman correlation coefficient using data file GfGc.SAV
NONPAR CORR
/VARIABLES=STM_TOT_CAT AGEYRS
/PRINT=SPEARMAN TWOTAIL NOSIG
/MISSING=PAIRWISE.
SAS program for Spearman correlation coefficient using data file GfGc.SD7
LIBNAME X 'K:\Guilford_Data_2011';
DATA TEMP; set X.GfGc;
RUN;
PROC CORR DATA=TEMP SPEARMAN;   /* PROC CORR statement assumed; SPEARMAN requests rank-order correlations */
VAR STM_TOT_CAT AGEYRS;         /* variable names assumed to match those used in the SPSS syntax above */
RUN;
The point–biserial correlation is used to assess the correlation between a dichotomous vari-
able (e.g., a test item with a 1 = correct/0 = incorrect outcome) and a continuous variable
(e.g., the total score on a test or another criterion score). The point–biserial coefficient
is not restricted by the assumption that the distribution underlying each level of the dichotomous variable or test item is normal. Therefore, it is more useful than the biserial coefficient (presented next), which assumes a normal distribution underlying both levels of the dichotomous variable. In test development and revision, the point–biserial is useful for examining the contribution of a test item to the total test score. Recommendations for using the point–biserial correlation in item evaluation are provided in Allen and Yen (1979, pp. 118–127). The formula for the point–biserial correlation is illustrated in Equation A.31. The corresponding standard error of r_pbis is given in Equation A.32.

r_{pbis} = \frac{\bar{X}_S - \bar{X}_m}{S_Y}\sqrt{\frac{P}{Q}}    (Equation A.31)

S_{r_{pbis}} = \frac{\sqrt{PQ}/Y - r_{pbis}^{2}}{\sqrt{N}}    (Equation A.32)

The biserial correlation is appropriate when both variables are continuous and normally distributed but one has been artificially reduced to two categories. For example, the situation may occur where a cutoff score or criterion is used to separate or classify groups of people on certain attributes. Mathematical corrections are made for the dichotomization of the one variable, thereby resulting in a correct Pearson correlation coefficient. Equation A.33 provides the formula for the biserial correlation. The corresponding standard error of r_bis is given in Equation A.34.

r_{bis} = \frac{\bar{X}_s - \bar{X}_m}{s_Y} \times \frac{pq}{z}    (Equation A.33)

S_{r_{bis}} = \frac{1}{\sqrt{N}} \cdot \frac{\sqrt{PQ}}{Y}    (Equation A.34)
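The sketch below illustrates the two coefficients numerically. It uses hypothetical item and total-score data (not the crystallized intelligence test), the point–biserial from scipy, and the standard ordinate-based biserial formula in which the two means are those of the groups scoring 1 and 0; it is a supplementary cross-check, not the book's computation.

import numpy as np
from scipy import stats

# Hypothetical data: item score (1 = correct, 0 = incorrect) and total test score
item  = np.array([1, 1, 1, 0, 1, 0, 1, 0, 0, 1, 1, 0])
total = np.array([28, 25, 27, 18, 24, 17, 26, 20, 15, 23, 29, 19])

# Point-biserial correlation (a Pearson r with one dichotomous variable)
r_pb, p_val = stats.pointbiserialr(item, total)

# Biserial estimate: (mean_1 - mean_0) * p * q / (s_Y * y), where y is the
# normal-curve ordinate at the threshold dividing proportions p and q
p = item.mean()
q = 1.0 - p
y = stats.norm.pdf(stats.norm.ppf(p))
s_y = total.std()  # population SD, so the identity in the last line holds exactly
r_bis = (total[item == 1].mean() - total[item == 0].mean()) * p * q / (s_y * y)

print(f"point-biserial r = {r_pb:.3f}")
print(f"biserial r       = {r_bis:.3f}")
print(f"check: r_pb * sqrt(p*q) / y = {r_pb * np.sqrt(p * q) / y:.3f}")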
The results presented in Table A.6 (introduced in Chapter 6) were produced with BILOG-MG and are taken from the phase I output of the program (Du Toit, 2003).
Table A.6. BILOG-MG Point–Biserial and Biserial Coefficients for the 25-Item
Crystallized Intelligence Test 2
Name         N     # Right    PCT     LOGIT    Pearson r (pt.–biserial)    Biserial r
ITEM0001 1000 0.00 0.00 99.99 0.00 0.00
ITEM0002 1000 995.00 99.50 -5.29 0.02 0.11
ITEM0003 1000 988.00 98.80 -4.41 0.09 0.30
ITEM0004 1000 872.00 87.20 -1.92 0.31 0.49
ITEM0005 1000 812.00 81.20 -1.46 0.37 0.54
ITEM0006 1000 726.00 72.60 -0.97 0.54 0.72
ITEM0007 1000 720.00 72.00 -0.94 0.57 0.76
ITEM0008 1000 826.00 82.60 -1.56 0.31 0.45
ITEM0009 1000 668.00 66.80 -0.70 0.48 0.62
ITEM0010 1000 611.00 61.10 -0.45 0.52 0.67
ITEM0011 1000 581.00 58.10 -0.33 0.51 0.64
ITEM0012 1000 524.00 52.40 -0.10 0.55 0.69
ITEM0013 1000 522.00 52.20 -0.09 0.67 0.85
ITEM0014 1000 516.00 51.60 -0.06 0.62 0.77
ITEM0015 1000 524.00 52.40 -0.10 0.53 0.67
ITEM0016 1000 482.00 48.20 0.07 0.56 0.71
ITEM0017 1000 444.00 44.40 0.22 0.60 0.76
ITEM0018 1000 327.00 32.70 0.72 0.57 0.74
ITEM0019 1000 261.00 26.10 1.04 0.49 0.66
ITEM0020 1000 241.00 24.10 1.15 0.46 0.64
ITEM0021 1000 212.00 21.20 1.31 0.53 0.75
ITEM0022 1000 193.00 19.30 1.43 0.47 0.68
ITEM0023 1000 164.00 16.40 1.63 0.46 0.69
ITEM0024 1000 122.00 12.20 1.97 0.37 0.59
ITEM0025 1000 65.00 6.50 2.67 0.34 0.65
Note. This table is a portion of BILOG-MG phase I output.
Phi Coefficient (Φ)
The phi coefficient is appropriate for use when two variables are qualitative (i.e., cat-
egorical) and/or dichotomous (as in test items scored 1 = correct/0 = incorrect). As
an example of how the phi coefficient may be useful, consider the situation where a
researcher is interested in whether there is statistical dependency between the variables
sex and short-term memory (here dichotomized into low and high categories). Examining this relationship requires the cell frequency counts within categories. Table A.7 illustrates how the phi coefficient is used to examine the association between sex and short-term memory using actual cell frequency counts within categories from the dataset PMPT.SAV. The phi coefficient is given in Equation A.35.

r_{\phi} = \frac{P_{XY} - P_X P_Y}{\sqrt{P_X Q_X P_Y Q_Y}}    (Equation A.35)
SPSS syntax and partial output for phi coefficient using data file GfGc.SAV
CROSSTABS
/TABLES=SEX BY STM_LOW_HIGH_CAT
/FORMAT= AVALUE TABLES
/STATISTIC=CHISQ CC PHI UC CORR
/CELLS= COUNT EXPECTED ROW COLUMN SRESID
/COUNT ROUND CELL.
Symmetric Measures(c)
                                                Value    Approx. Sig.
Nominal by Nominal    Phi                       -.116    .000
                      Cramer's V                 .116    .000
                      Contingency Coefficient    .116    .000
N of Valid Cases                                 1000
a. Not assuming the null hypothesis.
b. Using the asymptotic standard error assuming the null hypothesis.
c. Correlation statistics are available for numeric data only.
SAS program and partial output for phi coefficient and related coefficients using data
file GfGc.SAV
LIBNAME X 'K:\Guilford_Data_2011';
DATA TEMP; set X.GfGc;
RUN;
PROC FREQ;
TABLES stm_low_high_cat*sex
/CHISQ ALL OUT=X.nparm_corr_output;
run;
SAS PROC FREQ crosstabulation of STM_LOW_HIGH_CAT by SEX (GENDER)
Cell contents: Frequency / Percent / Row Pct / Col Pct

                          SEX = 1     SEX = 2     Total
STM_LOW_HIGH_CAT = 1      99          168         267
                          9.90        16.80       26.70
                          37.08       62.92
                          21.20       31.52
STM_LOW_HIGH_CAT = 2      368         365         733
                          36.80       36.50       73.30
                          50.20       49.80
                          78.80       68.48
Total                     467         533         1000
                          46.70       53.30       100.00
Finally, when the goal is to statistically test the association based on a cross-tabulation of two variables, a 2 × 2 contingency table can be created. Equation A.36 provides a way to conduct a statistical test of association using the important functional connection between Φ and χ².

\chi^{2} = N\Phi^{2}    (Equation A.36)

• N = sample size.
• Φ² = square of the phi coefficient from Equation 2.28.

By using Equation A.36, a researcher can test the phi coefficient against the null hypothesis of no association using the chi-square distribution. The degrees of freedom for the chi-square test are df = (r – 1)(k – 1), where r is the number of rows and k is the number of columns. Finally, when cell size is less than 10, Yates's correction for continuity should be applied. The correction is recommended in the case of small cell sizes because the chi-square statistic is based on frequencies of whole numbers and changes in discrete increments, whereas the chi-square table is based on a continuous distribution. Yates's correction is applied by subtracting .5 from each obtained frequency that is greater than the expected frequency and adding .5 to the frequencies that are less than expected. The cumulative effect is a reduction of each difference between obtained and expected frequency by .5.
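Using the cell counts from the crosstabulation shown above (99, 168, 368, and 365), the following Python sketch verifies Equation A.36 and shows the effect of Yates's correction. The computations are a supplementary cross-check and are not part of the original SPSS or SAS listings.

import numpy as np
from scipy.stats import chi2, chi2_contingency

# 2 x 2 table of short-term memory category (rows) by sex (columns), from the output above
table = np.array([[ 99, 168],
                  [368, 365]])
n = table.sum()

# Phi for a 2 x 2 table: (ad - bc) / sqrt(product of the marginal totals)
a, b = table[0]
c, d = table[1]
phi = (a * d - b * c) / np.sqrt(table[0].sum() * table[1].sum() *
                                table[:, 0].sum() * table[:, 1].sum())

chi_sq = n * phi ** 2                                   # Equation A.36
df = (table.shape[0] - 1) * (table.shape[1] - 1)
print(f"phi = {phi:.3f}, chi-square = {chi_sq:.2f}, df = {df}, p = {chi2.sf(chi_sq, df):.4f}")

# Pearson chi-square without and with Yates's continuity correction
chi_plain, p_plain, _, _ = chi2_contingency(table, correction=False)
chi_yates, p_yates, _, _ = chi2_contingency(table, correction=True)
print(f"uncorrected chi-square = {chi_plain:.2f}, Yates-corrected = {chi_yates:.2f}")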
In the case of larger contingency tables, as presented in Table A.8, Cramer's coefficient (Conover, 1999) is used as a measure of association. Cramer's V, together with the related contingency coefficient (symbolized as C), is the statistic of choice when the two variables consist of three or more categories and have no particular underlying distributional continuum.
The SPSS syntax below produces estimates of Cramer's V and the contingency coefficient from Table A.8. A partial listing of the output follows the syntax.
CROSSTABS
/TABLES=SEX BY STM_TOT_CAT
/FORMAT= AVALUE TABLES
/STATISTIC=CHISQ CC PHI CORR
/CELLS= COUNT EXPECTED ROW COLUMN SRESID
/COUNT ROUND CELL.
Symmetric Measures(c)
                                                Value    Approx. Sig.
Nominal by Nominal    Phi                        .106    .003
                      Cramer's V                 .106    .003
                      Contingency Coefficient    .106    .003
N of Valid Cases                                 1000
a. Not assuming the null hypothesis.
b. Using the asymptotic standard error assuming the null hypothesis.
c. Correlation statistics are available for numeric data only.
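For tables larger than 2 × 2, Cramer's V can be recovered from the Pearson chi-square as V = sqrt(chi-square / [N × min(r − 1, k − 1)]). The Python helper below is a general-purpose sketch; the 2 × 3 table it analyzes is hypothetical and is not the sex-by-memory table reported above.

import numpy as np
from scipy.stats import chi2_contingency

def cramers_v(table):
    """Cramer's V for an r x k contingency table of observed counts."""
    table = np.asarray(table, dtype=float)
    chi_sq, _, _, _ = chi2_contingency(table, correction=False)
    n = table.sum()
    return np.sqrt(chi_sq / (n * (min(table.shape) - 1)))

# Hypothetical 2 x 3 table (e.g., sex by low/medium/high short-term memory)
example = [[60, 120, 87],
           [90, 140, 103]]
print(f"Cramer's V = {cramers_v(example):.3f}")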
The polyserial r is a generalization of the biserial r and is used when one variable is continuous and the other is categorical with more than two categories. The aim when using r_poly is to estimate what the correlation would be if the two variables were
continuous and normally distributed. For example, a continuous variable such as a stan-
dardized test score might be correlated with a categorical outcome such as socioeconomic
status or an external criterion such as a national ranking having three or more discrete
levels. The point estimate versions of these statistics are special cases of the Pearson r
that attempt to overcome the artificial restriction of range created by categorizing vari-
ables that are assumed to be continuous and normally distributed. Also, two variables
may exist where one is composed of three categories but has been artificially reduced to
two categories and the other exists in three or more categories. This reduction may arise
when a cutoff score or criterion is used to separate or classify groups of people on certain
attributes. Equation A.37 provides the formula for polyserial r (Du Toit, 2003, p. 563).
Table A.9 provides an example of the polyserial correlation coefficient.
The PARSCALE program produces the contents of Table A.9 in its phase I output.
r_{poly} = \frac{r_{P,j}}{s_j} \sum_{k=0}^{M_j - 1} h(z_{jk})\,(t_{j,k+1} - t_{jk})    (Equation A.37)
Table A.9. PARSCALE Program Phase I Output for 14 Items on the Crystallized
Intelligence Test 3
Item    Response Mean/SD    Total Score Mean/SD    Pearson & Polyserial Correlation    Initial Slope    Initial Location
(For each item, the first line lists the means and the Pearson correlation; the second line lists the corresponding SDs, marked with an asterisk, and the polyserial correlation.)
1 2.91 30.10 0.41 1.05 -3.08
0.314* 5.550* 0.73
2 2.95 30.10 0.35 1.20 -2.82
0.265* 5.550* 0.77
3 2.32 30.10 0.51 0.73 -1.07
0.600* 5.550* 0.59
4 2.80 30.10 0.48 1.14 -1.76
0.534* 5.550* 0.75
5 2.50 30.10 0.52 0.89 -1.02
0.792* 5.550* 0.66
6 1.87 30.10 0.51 0.72 1.28
0.576* 5.550* 0.58
7 2.28 30.10 0.58 0.95 -0.22
0.873* 5.550* 0.69
8 2.02 30.10 0.65 1.04 0.49
0.773* 5.550* 0.72
9 2.21 30.10 0.72 1.71 0.24
0.919* 5.550* 0.86
10 2.07 30.10 0.69 1.29 0.41
0.883* 5.550* 0.79
11 1.60 30.10 0.62 1.05 1.65
0.741* 5.550* 0.73
12 1.66 30.10 0.58 0.95 1.49
0.847* 5.550* 0.69
13 1.59 30.10 0.66 1.25 1.48
0.763* 5.550* 0.78
14 1.34 30.10 0.53 1.02 1.24
0.666* 5.550* 0.72
The polychoric correlation is used when the two variables are each dichotomous or ordinal (in any combination), but both are assumed to have a continuous underlying metric (i.e., theoretically in the population). The polychoric correlation is based on the optimal scoring (or canonical correlation) of the standard Pearson correlation coefficient (Jöreskog & Sörbom, 1999a, p. 22; Kendall & Stuart, 1961, pp. 568–573). Equation A.38 illustrates the polychoric correlation coefficient (Du Toit, 2003, pp. 563–564).

r_{polychoric,j} = \frac{r_{P,j} \sum_{k=0}^{M_j - 1} h(z_{jk})}{s_j}    (Equation A.38)
Often in test development, the underlying construct that a set of items with response
outcomes of correct = 1/incorrect = 0 is designed to measure is assumed to be normally
distributed in the population of examinees. When this is the case, it is desirable to use a
correlation coefficient that exhibits the property of invariance (remains consistent) for
groups of examinees that have different levels of average ability (Lord & Novick, 1968,
p. 348). The tetrachoric correlation is appropriate in this case and is preferable to using the
phi coefficient. Tetrachoric correlation coefficients exhibit invariance properties that phi
coefficients do not. Specifically, the tetrachoric correlation is designed to remain invariant
for scores obtained from groups of participants of different levels of ability but that oth-
erwise have the same bivariate normal distribution for the two different test items. The
property of equality of bivariate distributional relationships between groups of examinees
is highly desirable. The correct use of the tetrachoric correlation assumes that the latent
distribution underlying each of the pair of variables in the analysis is continuous (Divgi,
1979). The tetrachoric correlation is used frequently in item-level factor analysis and
IRT to ensure the appropriate error structure of the underlying distribution is estimated.
Failure to correctly estimate the error structure has been shown to produce incorrect
standard errors and therefore incorrect test statistics (Muthen & Hofacker, 1988).
The equation for computing tetrachoric correlation is lengthy because of the inclu-
sion of various powers of r (Kendall & Stuart, 1961). Fortunately, several statistical com-
puting programs can perform the calculations, such as TESTFACT (Scientific Software
International, 2003a; specifically designed for conducting binary item factor analysis),
BILOG (Scientific Software International, 2003b), and Mplus (Muthen & Muthen,
2010), to name a few. For users unfamiliar with TESTFACT, BILOG, and Mplus, an
SPSS routine is available that uses the output matrix obtained from using the program
TETCORR (Enzmann, 2005). Also, one can use the Linear Structural Relations Program
(LISREL) to produce a polychoric correlation matrix that is very similar to the tetrachoric
correlation, only differing in the restriction that the means are 0 and the variances are 1
(Kendall & Stuart, 1961, pp. 563–573). Situations that call for avoiding the tetrachoric r include (a) when the split in frequencies of cases on either X or Y is very one-sided (e.g., 95–5 or 90–10), because the standard error is substantially inflated in these instances, and (b) when any cell has a frequency of zero. Equation A.39 provides the tetrachoric correlation.
R_{tet} = L(h, k, r) = \frac{1}{2\pi\sqrt{1 - r^{2}}} \int_{h}^{\infty}\!\int_{k}^{\infty} \exp\!\left(-\frac{x^{2} + y^{2} - 2rxy}{2(1 - r^{2})}\right) dx\,dy    (Equation A.39)
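Because Equation A.39 has no closed-form solution for r, the tetrachoric correlation is obtained iteratively. The Python sketch below is one minimal way to do this (it is not the TESTFACT, BILOG, or TETCORR algorithm): the thresholds are fixed from the marginal proportions, and r is chosen so that the bivariate normal distribution reproduces the observed proportion in the (1, 1) cell. The 2 × 2 cell counts are hypothetical.

import numpy as np
from scipy.stats import norm, multivariate_normal
from scipy.optimize import brentq

def tetrachoric(table):
    """Estimate the tetrachoric r for a 2 x 2 table of counts (rows = X, cols = Y, coded 0/1)."""
    table = np.asarray(table, dtype=float)
    n = table.sum()
    p_x1 = table[1].sum() / n          # proportion scoring 1 on X
    p_y1 = table[:, 1].sum() / n       # proportion scoring 1 on Y
    p_11 = table[1, 1] / n             # proportion scoring 1 on both
    h, k = norm.ppf(1 - p_x1), norm.ppf(1 - p_y1)   # latent thresholds

    def discrepancy(r):
        # P(X* > h, Y* > k) under a standard bivariate normal with correlation r
        joint = 1 - norm.cdf(h) - norm.cdf(k) + multivariate_normal.cdf(
            [h, k], mean=[0.0, 0.0], cov=[[1.0, r], [r, 1.0]])
        return joint - p_11

    return brentq(discrepancy, -0.99, 0.99)

# Hypothetical item-by-item table: rows are item A (0, 1), columns are item B (0, 1)
counts = [[40, 10],
          [15, 35]]
print(f"tetrachoric r = {tetrachoric(counts):.3f}")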
To illustrate the differences among the tetrachoric, polychoric, and Pearson correlation coefficients, Table A.10 compares the three coefficients for items 6 through 10 of the crystallized intelligence test 2.
TESTFACT program example syntax for Equation 2.34 producing the matrix in Table A.10
>TITLE
>EQUATION2_34.TSF - CRYSTALLIZED INTELLIGENCE SUBTEST 2,
ITEMS 6-10 FULL-INFORMATION ITEM FACTOR ANALYSIS
WITH TETRACHORIC CORRELATION COEFFICIENT
>PROBLEM NITEMS=5, RESPONSE=2;
>COMMENTS
Data layout:
COLUMNS 1 TO 5 --- ITEM RESPONSES
PRELIS polychoric program example syntax used to produce the polychoric matrix in Table A.10
The results from TESTFACT differ from the other matrices because of the advanced multidimensional numerical integration methods the program uses for estimation. Also, TESTFACT provides important linkages to item response theory and Bayes estimation (for small item sets) and is therefore particularly useful for producing correlation matrices for factor analysis of dichotomous items where an underlying normal distribution of a construct is assumed to exist. For the computational details of TESTFACT, see Du Toit (2003).
The correlation ratio, eta (η), is applicable for describing the relationship between X and Y in situations where there is a curvilinear relationship between two interval-level or continuous quantitative variables (i.e., curvilinear regression). A classic example is the regression of a performance or ability score on chronological age between ages 3 and 15.
The correlation ratio of y on x is provided in Equation A.40a.
The standard error of the correlation ratio is given in Equation A.40b.
\eta^{2}_{Y \cdot X} = 1 - \frac{SS_{\text{residual}}}{SS_{\text{total}}}    (Equation A.40a)

s_{\eta} = \frac{1 - \eta^{2}}{\sqrt{N - 1}}    (Equation A.40b)
As mentioned previously, departures from linearity between Y and X can have detrimental effects in theoretical and applied research. A useful test for assessing the degree of nonlinearity in the relationship between X and Y is the F test provided in Equation A.41a.
F = \frac{(\eta^{2}_{Y \cdot X} - R^{2})/(J - 2)}{(1 - \eta^{2}_{Y \cdot X})/(N - J)}    (Equation A.41a)

F = \frac{(.594 - .434)/(8 - 2)}{(1 - .594)/(1000 - 8)} = \frac{.02667}{.00043} = 62.79
An F-ratio of 62.79 exceeds F-critical (readers can verify this by referencing an F-table);
therefore the hypothesis that the regression of fluid intelligence on age is linear is rejected.
This result leads one to apply a nonlinear form of regression to estimate the relationship.
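The quantities in Equations A.40a and A.41a can be reproduced with a few lines of code. The Python sketch below uses hypothetical scores observed at J = 5 distinct ages (it does not reproduce the values of .594, .434, or 62.79 reported above): R² comes from the linear regression of Y on X, and η² from the breakdown of Y about the age-group means.

import numpy as np
from scipy.stats import f as f_dist

rng = np.random.default_rng(seed=7)

# Hypothetical data: J = 5 distinct ages with a curvilinear trend in the ability scores
ages = np.repeat([3, 6, 9, 12, 15], 40)
y = 20 + 6 * ages - 0.25 * ages ** 2 + rng.normal(0, 3, size=ages.size)
n, j = y.size, np.unique(ages).size

# R-squared from the linear regression of Y on X
r_squared = np.corrcoef(ages, y)[0, 1] ** 2

# Eta-squared = 1 - SS_residual / SS_total, residuals taken about the age-group means
ss_resid = sum(((y[ages == a] - y[ages == a].mean()) ** 2).sum() for a in np.unique(ages))
ss_total = ((y - y.mean()) ** 2).sum()
eta_squared = 1 - ss_resid / ss_total

# F test for departure from linearity (Equation A.41a)
f_stat = ((eta_squared - r_squared) / (j - 2)) / ((1 - eta_squared) / (n - j))
p_value = f_dist.sf(f_stat, j - 2, n - j)
print(f"R^2 = {r_squared:.3f}, eta^2 = {eta_squared:.3f}, "
      f"F({j - 2}, {n - j}) = {f_stat:.2f}, p = {p_value:.4f}")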
Extending the simple linear regression model to accommodate multiple predictor vari-
ables to estimate a criterion variable is straightforward. Furthermore, this extension is
applicable when the criterion is either continuous or categorical. The multiple predictor
equation in standard score (Z) form is provided in Equation A.42.
\hat{Z}_Y = \beta_1 Z_1 + \beta_2 Z_2 + \beta_3 Z_3    (Equation A.42)

In the raw score case, β is replaced with b. Both equations express the expected change in the criterion per one-unit change in a given predictor while holding all other predictors constant, as in the partial correlation explanations presented next.
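To make the relation between the standardized weights (β) and the raw-score weights (b) concrete, the Python sketch below fits the same two-predictor regression on raw and on z-scored variables with hypothetical data and confirms that β_j = b_j(s_Xj / s_Y). It is a supplementary illustration, not part of the book's examples.

import numpy as np

rng = np.random.default_rng(seed=3)

# Hypothetical scores: two predictors and a criterion
n = 300
x1 = rng.normal(100, 15, n)
x2 = rng.normal(50, 10, n)
y = 0.4 * x1 + 0.8 * x2 + rng.normal(0, 12, n)

def ols(design, outcome):
    """Ordinary least squares coefficients via numpy."""
    coefs, *_ = np.linalg.lstsq(design, outcome, rcond=None)
    return coefs

def zscore(v):
    return (v - v.mean()) / v.std()

# Raw-score solution: intercept, b1, b2
b = ols(np.column_stack([np.ones(n), x1, x2]), y)

# Standard-score solution: no intercept needed once every variable is z-scored
beta = ols(np.column_stack([zscore(x1), zscore(x2)]), zscore(y))

print("raw-score b:      ", np.round(b[1:], 3))
print("standardized beta:", np.round(beta, 3))
print("b * s_x / s_y:    ", np.round(b[1:] * np.array([x1.std(), x2.std()]) / y.std(), 3))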
The partial correlation between two variables partitions out or cancels the effect of a third
variable upon the ones being evaluated. For example, the correlation between weight and
height of males where age is allowed to vary would be higher than if age were not allowed
to vary (i.e., held constant or partitioned out of the relationship). Another example is the
correlation between immediate or short-term memory and fluid intelligence where age is
permitted to vary. The first-order partial correlation is given in Equation A.43.
Equation A.43 can be extended, as illustrated in Equation A.44, to calculate partial cor-
relations of any order. Notice that in Equation A.44, the combined effect of two variables
on the correlation of another set of variables is of interest. For example, a researcher may
want to examine the correlation between short-term memory and fluid intelligence while
controlling for the effect of crystallized intelligence and age.
r_{12.3} = \frac{r_{12} - r_{13}r_{23}}{\sqrt{(1 - r_{13}^{2})(1 - r_{23}^{2})}}    (Equation A.43)

r_{12.34} = \frac{r_{12.3} - r_{14.3}r_{24.3}}{\sqrt{(1 - r_{14.3}^{2})(1 - r_{24.3}^{2})}}    (Equation A.44)
Equation A.44 can be modified to express yet another version of partial correlation that is often used in multivariate analyses such as multiple linear regression. Equation A.45 expresses the unique contribution of adding successive predictors into a regression equation.
Note the difference between Equations A.44 and A.45, notably the elimination of
the first half of the term in the denominator. Because of this change, the partial correla-
tion is always larger than the semipartial correlation. In regression problems where the specific amount of influence that each predictor variable in a set exhibits on an outcome is of interest, the semipartial correlation (as opposed to the partial correlation coefficient) is
the preferred statistic. Using the semipartial correlation allows a researcher to determine
the precise amount of unique variance each predictor accounts for in the outcome vari-
able (i.e., y). Table A.11 illustrates Pearson, partial, and semipartial coefficients based
on a regression analysis using total scores for fluid intelligence and short-term memory
as predictors of crystallized intelligence. The SPSS syntax that produced this output is
provided in Table A.11.
REGRESSION
/DESCRIPTIVES MEAN STDDEV CORR SIG N
/MISSING LISTWISE
/STATISTICS COEFF OUTS R ANOVA ZPP
/CRITERIA=PIN(.05) POUT(.10)
/NOORIGIN
/DEPENDENT cri_tot
/METHOD=ENTER stm_tot fi_tot.
Below is a SAS program using PROC REG that produces several estimates of partial and semipartial correlation coefficients presented as squared correlations (i.e., the estimates will be the squares of the correlations reported in Table A.11 above).
SAS program source code that produced partial correlation presented as squared
partial correlations
LIBNAME X 'K:\Guilford_Data_2011';
DATA TEMP; set X.GfGc;
RUN;
PROC REG;
MODEL cri_tot=stm_tot fi_tot/PCORR1 PCORR2 SCORR1 SCORR2;
TITLE 'SQUARED PARTIAL & SEMI PARTIAL CORRELATION';
RUN;
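The same partial and semipartial quantities can be computed directly from the pairwise Pearson correlations. The Python sketch below applies Equation A.43 for the first-order partial correlation and the standard first-order semipartial (part) correlation; the three sets of scores are hypothetical stand-ins for the crystallized, short-term memory, and fluid totals, so it is a cross-check on the formulas rather than a replacement for the SPSS or SAS runs above.

import numpy as np
from scipy.stats import pearsonr

rng = np.random.default_rng(seed=11)

# Hypothetical totals: 1 = crystallized, 2 = short-term memory, 3 = fluid
fluid = rng.normal(50, 10, 500)
stm = 0.5 * fluid + rng.normal(0, 8, 500)
cri = 0.6 * stm + 0.3 * fluid + rng.normal(0, 9, 500)

r12 = pearsonr(cri, stm)[0]
r13 = pearsonr(cri, fluid)[0]
r23 = pearsonr(stm, fluid)[0]

# First-order partial correlation of 1 and 2, controlling 3 (Equation A.43)
partial_12_3 = (r12 - r13 * r23) / np.sqrt((1 - r13 ** 2) * (1 - r23 ** 2))

# First-order semipartial (part) correlation: 3 is partialled out of variable 2 only
semipartial_12_3 = (r12 - r13 * r23) / np.sqrt(1 - r23 ** 2)

print(f"r12 = {r12:.3f}, partial = {partial_12_3:.3f}, semipartial = {semipartial_12_3:.3f}")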
Recall that the correlation between two measures is their covariance divided by the product of their respective standard deviations; the correlation is actually a standardized covariance. Rearranging this definition as s_{XY} = r_{XY} s_X s_Y shows that the covariance is the product of the correlation coefficient r_{XY} and the two respective standard deviations, s_X and s_Y.
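For example, with hypothetical values r_XY = .50, s_X = 10, and s_Y = 5, the covariance is s_XY = (.50)(10)(5) = 25; dividing 25 by the product (10)(5) = 50 recovers the correlation of .50.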
This Appendix presented the mathematical and statistical foundations necessary for a thor-
ough understanding of how psychometric methods work. First, three goals for researchers
developing and using psychometric methods were presented. The three goals were then
considered in light of three important components related to developing and using psycho-
metric methods: precision, communication, and objectivity. Importantly, an illustration was
presented regarding how concepts can be represented within a conceptual model by using
operational and/or epistemological definitions and rules of correspondence. Figure A.3
illustrates a conceptual model integrating concepts and rules of correspondence that pro-
vide a framework for applying mathematical rules and operations onto a measurable space.
Examples of tasks in psychological measurement include but are not limited to (1) developing normative scale scores for measuring short-term memory ability across the lifespan,
(2) developing a scale to accurately reflect a child’s reading ability in relation to his or her
socialization process, and (3) developing scaling models useful for evaluating mathemati-
cal achievement. Often these tasks are complex and involve multiple variables interacting
with one another. In this section, the definition of a variable was provided, including the
different types and the role they play in measurement and probability. Finally, some distri-
butions commonly encountered in psychometric methods were provided.
Attributes were described as identifiable qualities or characteristics represented by
either numerical elements or classifications. Studying individual differences among peo-
ple on their attributes plays a central role in understanding differential effects. In experi-
mental studies, variability about group means is often the focus. Whether a study is based on individuals or groups, research problems are of interest only to the extent that a particular set of attributes (variables) exhibits joint variation or covariation. If no covaria-
tion exists among a set of variables, conducting a study of such variables would be use-
less. Importantly, the goal of theoretical and applied psychometric research is to develop
models that extract the maximum amount of covariation among a set of variables.
Central to psychometric methods is the idea of mathematically expressing the rela-
tionship between two or more variables. Most analytic methods in psychometrics and
statistics involve the mathematical relationship between two or more variables. The coef-
ficient of correlation provides a mathematical and statistical basis for researchers to be
able to estimate and test bivariate and multivariate relationships. A considerable portion
of this Appendix provided a treatment of the various coefficients of correlation and when
their use is appropriate.
Continuous. Data values from a theoretically uncountable or infinite set having no gaps
in its unit of scale.
Covariation. The degree to which two variables vary together.
Improper solution. The occurrence of zero or negative error variances in matrix algebra
and simultaneous equations estimation.
Independent events. Given two events A and B, A does not affect the probability of B.
Independent trial. In probability theory, a trial whose outcome neither affects nor is affected by the outcomes of the other trials in the sample space.
Independent variable. A predictor or moderator variable (X) that is under some form of
direct manipulation by the researcher.
Item response theory. Application of mathematical models to empirical data for measur-
ing attitudes, abilities, and other attributes. Also known as latent trait theory, strong
true score theory, or modern test theory.
Joint density function. The density function of two variables (X and Y) taken jointly; it can be expressed as the product of the conditional distribution of one variable and the marginal distribution of the other, and integrating it yields the marginal distributions for X and Y, respectively.
Kurtosis. A characteristic of the shape of a distribution in which the tails are excessively heavy or light relative to the normal distribution, commonly described as excessive “peakedness” or “flatness.” Also known as the fourth moment or cumulant of a distribution.
Latent. Variables that are unobservable characteristics of human behavior such as a
response to stimulus of some type.
Linear score transformation. A change in a raw score by multiplying the score by a
multiplicative component (b) and then adding an additive component (a) to it.
Mean squared deviation. The average of the squared deviations of a random variable from its expected value.
Measurable space. A space comprised of the actual observations (i.e., sample space)
of interest in a study.
Metric. A standard of measurement or a geometric function that describes the distances
between pairs of points in space.
Moment. The expected value of a power of the deviations of a random variable about a value c, where c is usually zero or the mean.
Multiplicative theorem of probability. The probability of several particular events occur-
ring successively or jointly is the product of their separate probabilities.
Objectivity. A property of the measurement process demonstrated by the independent
replication of results using a specific measurement method by different researchers.
Pearson product–moment coefficient of correlation. A measure of strength of linear
dependence between two variables, X and Y.
Posterior distribution. In Bayesian statistics, the distribution proportional to the product of the prior distribution and the likelihood.
Precision. The degree of mutual agreement among a series of individual measurements
on things such as traits, values, or attributes.
Probability distribution function. An equation that defines a continuous random vari-
able X.
Probability function. The probabilities with which X can assume only the value 0 or 1.
Probability space. A space from which random variables or functions are obtained.
Unbiased estimate. An estimator exhibiting the property that the difference between its expected value and the true parameter value is zero.
Variable. A measurable factor, characteristic, or attribute of an individual, system, or
process.
Variance. A measure of dispersion of a random variable achieved by averaging the
deviations of its possible values from its expected value.
Yates’s correction for continuity. (Yates’s chi-square test). Adjusts the Pearson chi-square
test to prevent overestimation of statistical significance when analyzing data based
on samples with small cell sizes (< 10).
References
Adams, R. J., Wilson, M. R., & Wang, W. C. (1997). The multidimensional random coefficients
multinomial logit. Applied Psychological Measurement, 21, 1–24.
Aiken, L. R. (2002). Attitudes and related psychosocial constructs: Theories, assessment and research.
Thousand Oaks, CA: Sage.
Akaike, H. (1973). Information theory and an extension of the maximum likelihood principle. In
B. N. Petrov & F. Csaki (Eds.), Proceedings of the 2nd International Symposium on Information
Theory (pp. 267–281). Budapest: Akademiai.
Albert, J. H. (1992). Bayesian estimation of normal ogive item response curves using Gibbs sam-
pling. Journal of Educational Statistics, 17, 261–269.
Albert, J. H., & Chib, S. (1993). Bayesian analysis of binary and polychotomous response data.
Journal of the American Statistical Association, 88, 669–679.
Allen, M. J., & Yen, W. M. (1979). Introduction to measurement theory. Belmont, CA: Wadsworth.
American Educational Research Association, American Psychological Association, & National
Council on Measurement in Education. (1985). Standards for educational and psychological
testing. Washington, DC: Authors.
American Educational Research Association, American Psychological Association, & National
Council on Measurement in Education. (1999). Standards for educational and psychological
testing (2nd ed.). Washington, DC: Authors.
American Educational Research Association, American Psychological Association, & National
Council on Measurement in Education. (2014). Standards for educational and psychological
testing (3rd ed.). Washington, DC: Authors.
Anastasi, A. (1986). Emerging concepts of test validation. Annual Review of Psychology, 37, 1–15.
Andrich, D. (1988). Rasch models for measurement. Newbury Park, CA: Sage.
Andrich, D. (2004). Controversy and the Rasch model: A characteristic of incompatible para-
digms? Medical Care, 42, 1–16.
Angoff, W. H. (1984). Scales, norms and equivalent scores. Princeton, NJ: Educational Testing
Service.
Atkins v. Virginia, 536 U.S. 304.
Baker, F. (1990). EQUATE computer program for linking two metrics in item response theory. Madison:
University of Wisconsin, Laboratory of Experimental Design.
Baker, F. B., & Kim, S.-H. (2004). Item response theory: Parameter estimation technique (2nd ed.).
New York: Marcel Dekker.
Bayes, T. (1763). An essay towards solving a problem in the doctrine of chance. Philosophical
Transactions of the Royal Society of London, 53, 370–418.
Bennett, J. F., & Hayes, W. I. (1960). Multidimensional unfolding: Determining the dimensionality
of ranked preference data. Psychometrika, 25, 27–43.
Benson, J. (1988). Developing a strong program of construct validation: A test anxiety example.
Educational Measurement: Issues and Practice, 17, 10–17.
Berk, R. A. (1984). A guide to criterion-referenced test construction. Baltimore: Johns Hopkins Uni-
versity Press.
Birnbaum, A. (1957). Efficient design and use of tests of mental ability for various decision making
problems (Series Report No. 58-16, Project No. 7755-23). Randolph Air Force Base, TX: USAF
School of Aviation Medicine.
Birnbaum, A. (1958a). On the estimation of mental ability for various decision making problems
(Series Report No. 15, Project No. 7755-23). Randolph Air Force Base, TX: USAF School of
Aviation Medicine.
Birnbaum, A. (1958b). Further considerations efficiency in tests of mental ability (Technical Report
No. 17, Project No. 7755-23). Randolph Air Force Base, TX: USAF School of Aviation
Medicine.
Birnbaum, A. (1968). Some latent trait models. In F. M. Lord & M. R. Novick (Eds.), Statistical
theories of mental test scores (pp. 397–479). Reading, MA: Addison-Wesley.
Birnbaum, M. H. (Ed.). (1998). Measurement, judgment, and decision making (2nd ed.). San Diego,
CA: Academic Press.
Bloom, B. S., Engelhart, M. D., Furst, E. J., Hill, W. H., & Krathwohl, D. R. (1956). Taxonomy of
educational objectives: The classification of educational goals: Handbook I. Cognitive domain.
New York: Longmans, Green.
Bock, R. D. (1972). Estimating item parameters and latent ability when responses are scored in two
or more nominal categories. Psychometrika, 37, 29–51.
Bock, D., Gibbons, R., & Muraki, E. (1988). Full information item factor analysis. Applied Psycho-
logical Measurement, 12(3), 261–280.
Bock, D., Gibbons, R., & Muraki, E. (1996). TESTFACT computer program. Chicago: Scientific
Software International.
Bock, R. D., & Aitkin, M. (1982). Marginal maximum likelihood estimation of item parameters:
Application of the EM algorithm. Psychometrika, 46, 443–445.
Bock, R. D., & Jones, L. V. (1968). The measurement and prediction of judgment and choice. San
Francisco: Holden-Day.
Bond, T. G., & Fox, C. M. (2001). Applying the Rasch model. Mahwah, NJ: Erlbaum.
Boring, E. G. (1950). A history of experimental psychology. New York: Appleton-Century-Crofts.
Bradlow, E. T., Wainer, H., & Wang, X. (1999). A Bayesian random effects model for testlets.
Psychometrika, 64, 153–168.
Brennan, R. L. (1983). Elements of generalizability theory. Iowa City, IA: American College Testing.
Brennan, R. L. (1998). Misconceptions at the intersection of measurement theory and practice.
Educational Measurement: Issues and Practice, 17(1), 5–29.
Brennan, R. L. (2010). Generalizability theory. New York: Springer.
Brown, T. A. (2006). Confirmatory factor analysis for applied research. New York: Guilford Press.
Browne, M. W., & Zhang, G. (2007). Developments in the factor analysis of individual time series.
In R. Cudeck & R. C. MacCallum (Eds.), Factor analysis at 100: Historical developments and
future directions (pp. 265–292). Mahwah, NJ: Erlbaum.
Bruce, V., Green, P. R., & Georgeson, M. A. (1996). Visual perception (3rd ed.). Mahwah, NJ:
Erlbaum.
Bush, R. R., & Mosteller, F. (1955). Stochastic models for learning. New York: Wiley.
Camilli, G. (1994). Origin of the scaling constant d = 1.7 in item response theory. Journal of Edu-
cational and Behavioral Statistics, 19, 293–295.
Campbell, D. T., & Fiske, D. W. (1959). Convergent and discriminant validation by the multitrait–
multimethod matrix. Psychological Bulletin, 56, 81–105.
Card, N. A., & Little, T. D. (2007). Longitudinal modeling of developmental processes. Interna-
tional Journal of Behavioral Development, 31(4), 297–302.
Carnap, R. (1950). Logical foundations of probability. Chicago: University of Chicago Press.
Carroll, J. B. (1993). Human cognitive abilities: A survey of factor analytic studies. Cambridge, UK:
Cambridge University Press.
Cattell, R. B. (1943). The description of personality: Basic traits resolved into clusters. Journal of
Abnormal and Social Psychology, 38, 476–506.
Cattell, R. B. (1966). The scree test for the number of factors. Multivariate Behavioral Research, 1,
245–276.
Cattell, R. B. (1971). Abilities: Their structure, growth and action. Boston: Houghton Mifflin.
Cizek, G. J., & Bunch, M. B. (2006). Standard setting: A guide to establishing and evaluating perfor-
mance standards on tests. Thousand Oaks, CA: Sage.
Cohen, J. (1988). Statistical power analysis for the behavioral sciences. Mahwah, NJ: Erlbaum.
Cohen, J., Cohen, P., West, S. G., & Aiken, L. S. (2003). Applied multiple regression/correlation
analysis for the behavioral sciences (3rd ed.). Mahwah, NJ: Erlbaum.
Cohen, R. J., & Swerdlik, M. (2010). Psychological testing and assessment: An introduction to test and
measurements (7th ed.). New York: McGraw-Hill.
Comrey, A. L., & Lee, H. B. (1992). A first course in factor analysis (2nd ed.). Mahwah,
NJ: Erlbaum.
Conover, W. J. (1999). Practical nonparametric statistics (3rd ed.). New York: Wiley.
Coombs, C. (1964). A theory of data. New York: Wiley.
Coombs, C. H. (1950). The concepts of reliability and homogeneity. Educational and Psychological
Measurement, 10, 43.
Costa, P. T., & McCrae, R. R. (1992). The revised NEO Personality Inventory (NEO-PI-R) and NEO
Five-Factor Inventory (NEO-FFI) professional manual. Odessa, FL: Psychological Assessment
Resources.
Crocker, L., & Algina, J. (1986). Introduction to classical and modern test theory. Boston: Harcourt
Brace Jovanovich.
Crocker, L., & Algina, J. (2006). Introduction to classical and modern test theory. Belmont, CA:
Wadsworth.
Cronbach, L. (1951). Coefficient alpha and the internal structure of tests. Psychometrika, 6,
297–334.
Cronbach, L. (1970). Essentials of psychological testing (3rd ed.). New York: Harper.
Cronbach, L. J. (1971). Test validation. In R. L. Linn (Ed.), Educational measurement (2nd ed.,
pp. 443–507). Washington, DC: Macmillan.
Cronbach, L. J. (1980). Selection theory for a political world. Public Personnel Management, 9(1),
37–50.
Cronbach, L. J., & Gleser, G. C. (1965). Psychological tests and personnel decisions. Urbana: Uni-
versity of Illinois Press.
Cronbach, L. J., Gleser, G. C., Nanda, H., & Rajaratnam, N. (1972). The dependability of behavioral
measurements: Theory of generalizability for scores and profiles. New York: Wiley.
Cronbach, L. J., & Meehl, P. E. (1955). Construct validity in psychological tests. Psychological Bul-
letin, 52, 281–302.
Cudeck, R. (2000). Exploratory factor analysis. In H. Tinsley & H. Brown (Eds.), Applied multivar-
iate statistical modeling and mathematical modeling (pp. 265–295). San Diego, CA: Academic
Press.
Darwin, C. (1859). On the origin of species by means of natural selection. London: Murray.
Dawes, R. M. (1972). Fundamentals of attitude measurement. New York: Wiley.
de Ayala, R. (2009). The theory and practice of item response theory. New York: Guilford Press.
Glass, G. V., & Hopkins, K. D. (1996). Statistical methods in education and psychology (3rd ed.).
Needham Heights, MA: Allyn & Bacon.
Glenberg, A. M., & Andrzejewski, M. E. (2008). Learning from data: An introduction to statistical
reasoning (3rd ed.). Hillsdale, NJ: Erlbaum.
Glutting, J., McDermott, P., & Stanley, J. C. (1987). Resolving differences among methods of
establishing confidence limits for test scores. Educational and Psychological Measurement,
47, 607.
Gregory, R. J. (2000). Psychological testing: History, Principles and Applications (3rd ed.). Needham
Heights, MA: Allyn & Bacon.
Groves, R. M., Fowler, F. J., Couper, M. P., Lepkowski, J. M., Singer, E., & Tourangeau, R. (2009).
Survey methodology (2nd ed.). New York: Wiley.
Guilford, J. P. (1954). Psychometric methods (2nd ed.). New York: McGraw-Hill.
Guilford, J. P. (1978). Fundamental statistics in psychology and education (4th ed.). New York:
McGraw-Hill.
Guion, R. (1977). Content validity: The source of my discontent. Applied Psychological Measurement,
1, 1–10.
Guion, R. (1998). Assessment, measurement and prediction for personnel decisions. Mahwah, NJ:
Erlbaum.
Gulliksen, H. (1950a). Intrinsic validity. American Psychologist, 5, 511–517.
Gulliksen, H. (1950b). The theory of mental tests. New York: Wiley.
Gulliksen, H. (1987). Theory of Mental Tests. Hillsdale, NJ: Erlbaum.
Guttman, L. (1941). The quantification of a class of attributes: A theory and method for scale con-
struction. In P. Horst (Ed.), The prediction of personal adjustment (pp. 321–348). New York:
Social Science Research Council.
Guttman, L. A. (1944). A basis for scaling qualitative data. American Sociological Review, 9,
139–150.
Guttman, L. (1946). An approach for quantifying paired comparisons and rank order. Annals of
Mathematical Statistics, 17, 144–163.
Haebara, T. (1980). Equating logistic ability scales by a weighted least squares method. Japanese
Psychological Research, 22, 144–149.
Hair, J. F., Anderson, R. E., Tatham, R. L., & Black, W. C. (1998). Multivariate data analysis
(5th ed.). Upper Saddle River, NJ: Prentice-Hall.
Haladyna, T. M. (2004). Developing and validating multiple-choice test items. Mahwah, NJ: Erlbaum.
Hald, A. (1998). A history of mathematical statistics from 1750 to 1930. New York: Wiley.
Hambleton, R. K., & Pitoniak, M. J. (2006). Setting performance standards. In R. L. Brennan
(Ed.), Educational measurement (4th ed., pp. 433–470). Westport, CT: American Council on
Education/Praeger.
Hambleton, R. K., & Plake, B. S. (1995). Using an extended Angoff procedure to set standards on
complex performance assessments. Applied Measurement in Education, 8, 41–56.
Hambleton, R. K., & Swaminathan, H. (1985). Item response theory: Principles and practice. Boston:
Kluwer.
Hambleton, R. K., Swaminathan, H., & Rogers, H. J. (1991). Fundamentals of item response theory
(Vol. 2). Newbury Park, CA: Sage.
Han, C. (2008). IRTEQ computer program, version 1.2.21.55. www.umass.edu/remp/software/irteqt.
Hattie, J. A. (1985). A methodological review: Assessing unidimensionality of tests and items.
Applied Psychological Measurement, 9, 139–164.
Hebb, D. O. (1942). The effects of early and late brain injury upon test scores, and the nature of
normal adult intelligence. Proceedings of the American Philosophical Society, 85, 275–292.
Heise, D. R. (1970). Chapter 14, The semantic differential and attitude research. In G. F. Summers
(Ed.), Attitude measurement (pp. 235–253). Chicago: Rand McNally.
Hocking, R. R. (1976). The analysis and selection of variables in linear regression. Biometrics, 32,
1–49.
Holland, P. W., & Dorans, N. J. (2006). Linking and equating. In R. L. Brennan (Ed.), Educational
measurement (4th ed., pp. 189–220). Westport, CT: Praeger.
Holland, P. W., & Hoskins, M. (2003). Classical test theory as a first-order item response theory: Appli-
cation to true-score prediction from a possibly non-parallel test. Psychometrika, 68, 123–149.
Horn, J. L. (1998). A basis for research on age differences in cognitive abilities. In J. J. McCardle &
R. W. Woodcock (Eds.), Human cognitive abilities in theory and practice (pp. 8–20). Mahwah,
NJ: Erlbaum.
Hosmer, D. W., & Lemeshow, S. (2000). Applied logistic regression (2nd ed.). New York: Wiley.
Hotelling, H. (1933). Analysis of complex statistical variables into principal components. Journal
of Educational Psychology, 24, pp. 417–441; 498–520.
Hotelling, H. (1936). Relations between two sets of variates. Biometrika, 28, 321–377.
Hoyt, C. (1941). Test reliability obtained by analysis of variance. Psychometrika, 6, 153–160.
Huberty, C. J. (1994). Applied discriminant analysis. New York: Wiley.
Jannarone, R. J. (1997). Models for locally dependent responses: Conjunctive item response
theory. In W. J. van der Linden & R. K. Hambleton (Eds.), Handbook of modern test theory
(pp. 465–480). New York: Springer.
Jöreskog, K., & Sörbom, D. (1996). LISREL8: User's reference guide. Chicago: Scientific Software International.
Jöreskog, K., & Sörbom, D. (1999a). LISREL8: New statistical features. Chicago: Scientific Software International.
Jöreskog, K., & Sörbom, D. (1999b). PRELIS2: User's reference guide. Chicago: Scientific Software International.
Kane, M. (2006). Validity. In R. L. Brennan (Ed.), Educational measurement (4th ed., pp. 17–64).
Westport, CT: Praeger.
Kane, M. T. (1982). A sampling model for validity. Applied Psychological Measurement, 6, 125–160.
Katz, R. C., Santman, J., & Lonero, P. (1994). Findings on the Revised Morally Debatable Behaviors
Scale. Journal of Psychology, 128, 15–21.
Kelderman, H. (1992). Computing maximum likelihood estimates of loglinear IRT models from
marginal sums. Psychometrika, 57, 437–450.
Kelderman, H. (1997). Loglinear multidimensional item response model for polytomously scored
items. In W. J. van der Linden & R. K. Hambleton (Eds.), Handbook of modern test theory
(pp. 287–303). New York: Springer.
Kelley, T. L. (1927). The interpretation of educational measurements. New York: World Book.
Kendall, M. G., & Stuart, A. (1961). The advanced theory of statistics: Vol. 2. Inference and relation-
ship. London: Charles Griffin.
Kerlinger, F. N., & Lee, H. (2000). Foundations of behavioral research (4th ed.). Belmont, CA: Cen-
gage Learning.
Khuri, A. (2003). Advanced calculus with applications in statistics (2nd ed.). New York: Wiley.
Kim, D., de Ayala, R. J., Ferdous, A. A., & Nering, M. L. (2007). Assessing relative performance of
local item independence (LID) indexes. Paper presented at the annual meeting of the National
Council on Measurement in Education, Chicago.
King, B., & Minium, E. (2003). Statistical reasoning in psychology and education (4th ed.). New
York: Wiley.
Kleinbaum, D. G., & Klein, M. (2004). Logistic regression (2nd ed.). New York: Springer-Verlag.
Kline, P. (1986). A handbook of test construction. New York: Methuen.
Kolen, M. J., & Brennan, R. L. (2004). Test equating, scaling and linking: Methods and practices (2nd
ed.). New York: Springer-Verlag.
Kolen, M. J., Hanson, B. A., & Brennan, R. L. (1992). Conditional standard errors of measurement
for scale scores. Journal of Educational Measurement, 29, 285–307.
Kothari, C. R. (2006). Research methodology: Methods and techniques (3rd ed.). New Delhi, India:
New Age International.
Lattin, J., Carroll, D. J., & Green, P. E. (2003). Analyzing multivariate data. Pacific Grove, CA:
Brooks/Cole.
Lawshe, C. H. (1975). A quantitative approach to content validity. Personnel Psychology, 28,
563–575.
Lazarsfeld, P. F., & Henry, N. W. (1968). Latent structure analysis. Boston: Houghton Mifflin.
Lee, P. M. (2004). Bayesian statistics: An introduction (3rd ed.). New York: Wiley.
Levy, P. S., & Lemeshow, S. (1991). Sampling of populations. New York: Wiley.
Likert, R. (1932). A technique for the measurement of attitudes. Archives of Psychology, 22(140),
1–55.
Linn, R. L., & Slinde, J. (1977). The determination of the significance of change between pre- and
post-testing periods. Review of Educational Research, 47, 121–150.
Lomax, R. (2001). Statistical concepts: A second course for education and the behavioral sciences (2nd
ed.). Mahwah, NJ: Erlbaum.
Lord, F. M. (1952). A theory of test scores [Monograph]. Psychometrika, 7(7), 1–84.
Lord, F. M. (1980). Applications of item response theory to practical testing problems. Hillsdale, NJ:
Erlbaum.
Lord, F. M., & Novick, M. (1968). Statistical theories of mental test scores. New York: Addison-Wesley.
Magnusson, D. (1967). Test theory. Reading, MA: Addison-Wesley.
Masters, G. N. (1982). A Rasch model for partial credit scoring. Psychometrika, 47(2), 149–174.
McAdams, D. P., & Pals, J. L. (2007). The role of theory in personality research. In R. Robins,
R. C. Fraley, & R. Kruger (Eds.), Handbook of research methods in personality psychology
(pp. 3–20). New York: Guilford Press.
McArdle, J. J. (2007). Five steps in the structural factor analysis of longitudinal data. In R. Robins &
R. C. MacCallum (Eds.), Factor analysis at 100: Historical developments and future directions.
Mahwah, NJ: Erlbaum.
McDonald, R. P. (1967). Non-linear factor analysis. [Psychometric Monograph No. 15]. Iowa City,
IA: Psychometric Society.
McDonald, R. P. (1982). Linear versus nonlinear models in item response theory. Applied Psycho-
logical Measurement, 6, 379–396.
McDonald, R. P. (1985). Factor analysis and related methods. Hillsdale, NJ: Erlbaum.
McDonald, R. P. (1999). Multidimensional item response models. In Test theory (pp. 309–324).
Mahwah, NJ: Erlbaum.
McDonald, R. P., & Ahlawat, K. S. (1974). Difficulty factors in binary data. British Journal of Math-
ematical and Statistical Psychology, 27, 82–99.
Mertler, C. A., & Vannatta, R. A. (2010). Advanced and multivariate statistical methods (4th ed.).
Glendale, CA: Pryczak.
Messick, S. (1988). The once and future issues of validity: Assessing the meaning and consequences
of measurement. In H. Wainer & H. I. Braun (eds.), Test validity (pp. 33–45). Hillsdale,
NJ: Erlbaum.
Messick, S. (1989). Validity. In R. L. Linn (Ed.), Educational measurement (3rd ed., pp. 13–103).
New York: Macmillan.
Messick, S. (1995). Standards of validity and the validity of standards in performance assessment.
Educational Measurement: Issues and Practice, 14(4), 5–8.
Miller, M. B. (1995). Coefficient alpha: A basic introduction from the perspectives of classical test
theory and structural equation modeling. Structural Equation Modeling: A Multidisciplinary
Journal, 2, 255–273.
Millman, J., & Greene, J. (1989). The specification and development of tests of achievement and
ability. In R. L. Linn (Ed.), Educational measurement (3rd ed.). New York: American Council
on Education and Macmillan.
Mills, C. N., & Melican, G. J. (1988). Estimating and adjusting cutoff scores: Features of selected
methods. Applied Measurement in Education, 1, 261–275.
Mislevy, R. J., & Bock, R. D. (2003). BILOG-MG 3: Item and scoring of binary items and one-, two-,
and three-parameter logistic models. Chicago: Scientific Software International.
Mokken, R. J., & Lewis, C. (1982). A nonparametric approach to the analysis of dichotomous item
responses. Applied Psychological Measurement, 6, 417–430.
Molenaar, I. W. (2002). Introduction to nonparametric item response theory (vol. 5). Thousand Oaks,
CA: Sage.
Molenaar, P. C. M. (2004). Five steps in the structural factor analysis of longitudinal data. In R.
Cudeck & R. C. MacCallum (Eds.), Factor analysis at 100: Historical developments and future
directions (pp. 99–130). Mahwah, NJ: Erlbaum.
Mosier, C. I. (1940). A modification of the method of successive intervals. Psychometrika, 5, 101–107.
Mulaik, S. A. (1987). A brief history of the foundations of exploratory factor analysis. Multivariate
Behavioral Research, 22, 267–305.
Muthen, B. O. (2007). Mplus computer program version 5.2. Los Angeles: Muthen & Muthen.
Muthen, B. O., & Hofacker, C. (1988). Testing the assumptions underlying tetrachoric correla-
tions. Psychometrika, 53(4), 563–578.
Muthen, B. O., & Muthen, L. (2010). Mplus computer program version 6.2. Los Angeles: Muthen &
Muthen.
Nandakumar, R., & Stout, W. (1993). Refinement of Stout’s procedure for assessing latent trait
unidimensionality. Journal of Educational Statistics, 18, 41–68.
Nedelsky, L. (1954). Absolute grading standards for objective tests. Educational and Psychological
Measurement, 14, 3–19.
Nunnally, J. C., & Bernstein, I. (1994). Psychometric theory (3rd ed.). New York: McGraw-Hill.
Osgood, C. E., Tannenbaum, P. H., & Suci, G. J. (1957). The measurement of meaning. Urbana:
University of Illinois Press.
Ostini, R., & Nering, M. L. (2006). Polytomous item response theory models. Thousand Oaks, CA: Sage.
Paxton, P. M., Curran, P., Bollen, K. A., Kirby, J. A., & Chen, F. (2001). Monte Carlo simulations in
structural equation models. Structural Equation Modeling, 8, 287–312.
Pearson, K. (1902). On the systematic fitting of curves to observations and measurements.
Biometrika, 1, 265–303.
Pearson Education, Inc. (2015). Stanford Achievement Test (10th ed.). San Antonio, TX: Author.
Pearson, E. S., & Hartley, H. O. (1966). Biometrika tables for statisticians. Cambridge, MA: Cambridge
University Press.
Pedhazur, E. J. (1982). Multiple regression in behavioral research: Explanation and prediction
(2nd ed.). Fort Worth, TX: Harcourt Brace Jovanovich.
Pedhazur, E. J., & Schmelkin, L. P. (1991). Measurement, design and analysis: An integrated approach.
Mahwah, NJ: Erlbaum.
Peters, C. L. O., & Enders, C. (2002). A primer for the estimation of structural equation models
with missing data. Journal of Targeting, Measurement and Analysis for Marketing, 11, 81–95.
Peterson, N. G., Kolen, M. J., & Hoover, H. D. (1989). Scaling, norming and equating. In R. L. Linn
(Ed.), Educational measurement (3rd ed). New York: American Council on Education/Macmillan.
Press, J. (2003). Subjective and objective Bayesian statistics: Principles, models, and applications. New
York: Wiley.
Price, L. R., Laird, A. R., Fox, P. T., & Ingham, R. (2009). Modeling dynamic functional neuroimag-
ing data using structural equation modeling. Structural Equation Modeling: A Multidisciplinary
Journal, 16, 146–172.
Price, L. R., Lurie, A., & Wilkins, C. (2001). EQUIPERCENT Computer Program. Applied Psycho-
logical Measurement, 25(4), 332–332.
Price, L. R., Raju, N. S., & Lurie, A. (2006). Conditional standard errors of measurement for com-
posite scores. Psychological Reports, 98, 237–252.
Price, L. R., Tulsky, D., Millis, S., & Weiss, L. (2002). Redefining the factor structure of the Wechsler
Memory Scale–III: Confirmatory factor analysis with cross-validation. Journal of Clinical and
Experimental Neuropsychology, 24(5), 574–585.
Probstat. (n.d.). Retrieved from http://pirun.ku.ac.th/~b5054069.
Raju, N. S., Price, L. R., Oshima, T. C., & Nering, M. (2007). Standardized conditional SEM: A case
for conditional reliability. Applied Psychological Measurement, 31(3), 169–180.
Rasch, G. (1960). Probabilistic models for some intelligence and attainment tests. Copenhagen: Dan-
ish Institute of Educational Research.
Raudenbush, S. W. (2001). Toward a coherent framework for comparing trajectories of individual
change. In L. Collins & A. Sayer (Eds.), Best methods for studying change (pp. 33–64).
Washington, DC: American Psychological Association.
Raykov, T. (1997). Estimation of composite reliability for congeneric measures. Applied Psychologi-
cal Measurement, 21, 173–184.
Raykov, T. (1998). Coefficient alpha and composite reliability with interrelated nonhomogeneous
items. Applied Psychological Measurement, 22(4), 375–385.
Raykov, T., & Marcoulides. G. A. (2011). Introduction to psychometric theory. New York: Routledge.
Reckase, M. D. (1985). The difficulty of test items that measure more than one ability. Applied
Psychological Measurement, 9(4), 401–412.
Reckase, M. D. (2009). Multidimensional item response theory. New York: Springer.
Rogosa, D. R., Brandt, D., & Zimowski, M. (1982). A growth curve approach to the measurement
of change. Psychological Bulletin, 92, 726–748.
Roskam, E. E. (1997). Models for speeded and timed-limited tests. In W. J. van der Linden & R. K.
Hambleton (Eds.), Handbook of modern test theory (pp. 187–208). New York: Springer.
Rulon, P. J. (1939). A simplified procedure for determining the reliability of a test by split-halves.
Harvard Educational Review, 9, 99–103.
Rudas, T. (2008). Handbook of probability: Theory and applications. Thousand Oaks, CA: Sage.
Samejima, F. (1969). Estimation of latent ability using a response pattern of graded scores. Psy-
chometrika Monograph, No. 17, pp. 1–97.
Samejima, F. (1972). A general model for free-response data. Psychometrika Monograph, No. 18.
Sax, G. (1989). Principles of educational and psychological measurement (3rd ed.). Belmont, CA:
Wadsworth.
Scheines, R., Hoijtink, H., & Boomsma, A. (1999). Bayesian estimation and testing of structural
equation models. Psychometrika, 64, 37–52.
Schmidt, F. L., Hunter, J. E., & Urry, V. W. (1976). Statistical power in criterion-related validity
studies. Journal of Applied Psychology, 61, 473–485.
Schumacker, R. E., & Lomax, R. G. (2010). A beginner’s guide to structural equation modeling (3rd
ed.). New York: Routledge.
Schwartz, G. (1978). Estimating the dimensions of a model. Annals of Statistics, 6, 461–464.
Scientific Software International. (2003a). TESTFACT version 2.0 computer program. Chicago:
Author.
Scientific Software International. (2003b). BILOG version 3.0 computer program. Chicago: Author.
Shadish, W. R., Cook, T. D., & Campbell, D. T. (2002). Experimental and quasi-experimental designs
for generalized causal inference. New York: Houghton Mifflin.
Shavelson, R. J., & Webb, N. M. (1991). Generalizability theory: A primer. Newbury Park, CA:
Sage.
Spearman, C. (1904). General intelligence: Objectively determined and measured. American Jour-
nal of Psychology, 15, 201–293.
Spearman, C. (1907). Demonstration of formulae for true measurement of correlation. American
Journal of Psychology, 18, 161–169.
Yen, W. (1993). Scaling performance assessments: Strategies for managing local item indepen-
dence. Journal of Educational Measurement, 30, 187–213.
Zieky, M. J., Peirie, M., & Livingston, S. A. (2008). Cutscores: A manual for setting standards of
performance on educational and occupational tests. Princeton, NJ: Educational Testing Service.
Zimmerman, D. W., & Williams, R. H. (1982). Gain scores in research can be highly reliable. Jour-
nal of Educational Measurement, 19, 149–154.
Zimmerman, D. W., Williams, R. H., & Zumbo, B. (1993). Gains scores in research can be highly
reliable. Journal of Educational Measurement, 19(2), 149–154.
Author Index
Adams, R. J., 344 Bock, R. D., 149, 335t, 343, 344, 363, 369
Ahlawat, K. S., 338 Bollen, K. A., 300
Aiken, L. R., 152, 153 Bond, T. G., 334, 366
Aiken, L. S., 76 Boomsma, A., 473
Aitkin, M., 335t, 363 Boring, E. G., 4
Akaike, H., 403t Bradlow, E. T., 333, 335t, 364
Albert, J. H., 335t Brandt, D., 251
Algina, J., 9, 146, 156, 187, 212, 214, 216, 218, 221, 263, Brennan, R. L., 22, 136, 139, 252, 258, 261f, 262, 263,
277n, 279n, 280n, 290, 299, 300, 308, 415, 419, 269t, 270, 287, 331, 424, 426, 427, 437, 439
428, 434–436, 443 Brown, T. A., 291, 292, 302, 319, 321
Allen, M. J., 22, 114, 497 Browne, M. W., 292
Anastasi, A., 127 Bruce, V., 147
Anderson, R. E., 102, 293f, 316f Bunch, M. B., 193, 194, 199
Andrich, D., 366 Bush, R. R., 333
Andrzejewski, M. E., 19f, 38f, 40f
Angoff, W. H., 180, 181, 186, 424, 428, 434
C
E H
Ebel, R. L., 65, 170, 172t, 185, 185t, 196, 199, 200 Haebara, T., 445
Eignor, D. R., 425 Hair, J. F., 102–104, 106, 138–140, 293f, 316f, 327
Enders, C. K., 162 Haladyna, T. M., 175, 176
Engelhart, M. D., 170 Hald, A., 452f, 469, 480, 488
Enzman, D., 506 Hambleton, R. K., 193, 197, 330, 331, 337, 345, 351,
354, 371, 440, 441, 443, 445
Han, C., 445
F
Hanson, B. A., 252
Fabrigar, L. R., 290, 291, 297, 301, 314, 316, 317, 322 Hartley, H. O., 485
Fechner, G. T., 147, 452f Hattie, J. A., 337
Feldt, L. S., 331 Hayes, W. I., 156
Ferdous, A. A., 347 Hebb, D. O., 454
Fidell, L., 85, 99, 107, 122, 123t, 125, 307t, 315, 494 Heise, D. R., 159
Fischer, G. H., 335t Henry, N. W., 335t
Fisher, R. A., 107 Hill, W. H., 170
Fiske, D. W., 60, 63, 134, 140 Hocking, R. R., 102
Flanagan, D. P., 6, 454 Hofacker, C., 506
Flynn, J. R., 6 Hoijtink, H., 473
Forrest, D. W., 4 Holland, P. W., 334, 425, 426
Fox, C. M., 334, 366 Hoover, H. D., 3
Fox, J.-P., 335t Hopkins, K. D., 106, 138, 485
L
N
Laird, A. R., 473
Lattin, J., 151, 156, 302 Nanda, H., 136
Lawshe, C. H., 126, 138 Nandakumar, R., 341
Lazarsfeld, P. F., 335t Nedelsky, L., 195
Lee, H. B., 178, 292, 299, 301, 322, 326 Nering, M. L., 252, 347, 404
Lee, P. M., 473 Novick, M., 101, 146, 160, 206, 208, 212, 214, 216, 217,
Lemeshow, S., 122, 123t, 124, 125, 174 219, 221, 244, 245, 248, 249, 334, 335t, 338, 398,
Levy, P. S., 174 506
Lewis, C., 335t Nunnally, J. C., 146, 147, 166, 228, 245
Tabachnick, B., 85, 99, 107, 122, 123t, 125, 307t, 315,
Q 494
Tannenbaum, P. H., 159
Quetelet, A., 452f, 480
Tatham, R. L., 102, 293f, 316f
Taylor, H. C., 113
R Thissen, D., 68, 333
Thompson, B., 292
Rajaratnam, N., 136
Thurstone, L. L., 148, 452f
Raju, N. S., 252, 331, 362
Torgerson, W., 142, 144, 146, 153, 160
Rasch, G., 333, 334, 335t, 366
Tulsky, D., 132
Raudenbush, S. W., 243
Raykov, T., 219, 315
Reckase, M. D., 335t, 344 U
Rogers, H. J., 331
Rogosa, D. R., 241 Urry, V. W., 65
Roskam, E. E., 365
Rubin, D. B., 471 V
Rudas, T., 212
Rulon, P. J., 231 Vannatta, R. A., 325
Russell, J. T., 113 Verhelst, N. D., 365
Verstralen, H. H. F. M., 365
von Davier, A., 426, 427, 439
S
Subject Index
C
Canonical function, 111t, 114t, 116f. See also Discriminant analysis
Categorical data, 173, 458
Categorization, 14, 14f, 61f, 458
Ceiling effects, 66, 102
Central limit theorem, 363–364
Central tendency, 32, 33–34. See also Mean; Median; Mode
Chi-square statistics, 344, 346–347, 347t, 370, 402
Choices, 150
Classes, 458
Classical approach, 461
Classical probability theory, 345
Classical test theory (CTT)
  compared to item response theory (IRT), 330–331, 441
  definition, 253, 287
  factor analysis and, 296, 312, 314
  generalizability coefficient and, 273–274
  generalizability theory and, 260, 261f, 273
  invariance property, 349–351, 350f, 351t
  item response theory and, 404
  overview, 10, 257–258, 329
  reliability and, 67, 204
  standard error of measurement and, 281
  strong true score theory and, 332–333
Classical true score model, 204, 253
Classification
  definition, 10
  discriminant analysis and, 106–114, 110t, 111t, 112t, 113f, 114t
  overview, 2
  purpose of a test and, 169t
  scaling models and, 162
  statistics and, 112t, 115t
  techniques for, 105–106, 106f
Classification table
  definition, 138
  logistic regression and, 122t, 124t
  overview, 109–110, 112t, 116t, 257
Cluster analysis, 289–290
Coefficient alpha
  composite scores based on, 238–239
  definition, 253
  estimating criterion validity and, 234–236, 235t, 236t, 237t
  overview, 233, 233–235, 234t, 253
Coefficient of contingency, 503, 503t
Coefficient of determination, 52–53, 53t, 55
Coefficient of equivalence, 229, 253
Coefficient of generalizability, 272–273, 287
Coefficient of multiple determination, 80–83, 82t, 83t, 102
Coefficient of reliability, 228–229, 240, 241t, 253. See also Reliability
Coefficient of stability, 228–229, 253
Coefficients, 261f
Common factor model, 291, 309–312, 313f, 325
Common factors, 291, 325
Communality, 309–312, 313f, 324, 325
Communication, 452–454, 453f, 454, 515
Comparative judgment, 148–150
Complex multiple-choice format, 176t. See also Test items
Components, 312, 314, 314t, 326. See also Principal components analysis (PCA)
Composite score
  coefficient alpha and, 238–239
  common standard score transformations or conversions, 423
  definition, 10, 253
  norms and, 423, 448
  overview, 7, 208
  reliability and, 223–228, 224t, 227t
Computer adaptive testing (CAT), 331, 404
Concepts, 129t, 261f
Conditional distribution, 94
Conditional probability theory, 345
Confidence interval
  definition, 256, 287
  generalizability theory and, 281
  overview, 245
  reliability and, 244, 246–248
Confidence limits, 245, 248, 254
Confirmatory bias, 126
Confirmatory factor analysis (CFA). See also Factor analysis
  construct validity and, 132
  definition, 138, 325
  overview, 290, 293f, 319, 325
  principal components analysis and, 315–316
  structural equation modeling and, 319–322, 320f, 321f, 322f, 323f
Congeneric tests, 219, 220t, 254
Consequential basis, 127t
Consistency. See also Reliability
Constant, 23, 55, 458, 515
Constant error, 204–205, 254
Construct validity. See also Constructs; Validity
  correlational evidence of, 130–131
  definition, 102, 138
  evidence of, 127–130, 127f, 129t
  factor analysis and, 131–134, 133t, 134f
  generalizability theory and, 136–137
  group differentiation studies of, 131
  overview, 10, 60, 126–127, 137, 141
  reliability and, 206
Constructs. See also Construct validity; Individual differences
  covariance and, 42
  definition, 10
  overview, 5–6
  test development and, 172–173
  units of measurement and, 18–19
  validity continuum and, 61f
Content analysis, 61f, 173, 199
Content validity. See also Validity
  definition, 103, 138
  limitations of, 126
  overview, 63, 125–126, 137, 141
Content validity ratio (CVR), 126, 138
Continuous data, 335t, 459
Continuous probability, 465–466
Continuous variable, 23–24, 55, 457, 515. See also Variance
Convenience sampling, 409. See also Sampling
Convergent validity evidence, 134–135
Conversions, 422–423
Correction for attenuation, 68–70, 76–77, 103
Correlated factors, 306–308
Correlation. See also Correlation coefficients; Multiple correlation; Partial correlation; Semipartial correlation
  item discrimination and, 186
  measures of, 495–503, 496t, 499t, 500t
  overview, 42–43, 44t, 45f, 488–491, 490t, 492, 513t, 514
  partial regression slopes, 90–92
Correlation coefficients. See also Correlation; Pearson correlation coefficient
  correction for attenuation and, 76
  estimating criterion validity and, 83t
  factor analysis and, 324
  semipartial correlation, 73–74, 75f
Correlation matrix, 294, 296–301, 296t, 297t, 298t
Correlation ratio, 509
Correlational evidence, 130–131. See also Evidence
Correlational studies, 127
Counterbalancing, 432–435
Counting, 460–461
Covariance
  definition, 55
  overview, 42, 45–47, 488–491, 490t, 492
Covariance matrix, 314, 490t
Covariance structural modeling, 46, 55, 133. See also Structural equation modeling (SEM)
Covariation, 481–484, 515
Cramer’s contingency coefficient, 503, 503t
Criterion, 61f, 166, 199
Criterion contamination, 64, 66, 103
Criterion content, 60
Criterion measure, 69–70
Criterion validity. See also Validity
  classification and selection and, 105–106, 106f
  definition, 103
  higher-order partial correlations and, 77–80, 79t
  high-quality criterion and, 63–66
  multiple linear regression and, 84, 84f, 85f
  overview, 63, 141
  partial correlation and, 70–77, 73t, 75f
  regression equation and, 85, 86t
  standard-setting approaches and, 194
  statistical estimation of, 66–68
Criterion-referenced test, 3, 10, 169t, 200. See also Norm-referenced test
Cross tabulation, 346t
Cross validation, 85, 103, 138
Crossed designs, 260, 266, 287
Cross-products matrices, 140
Cross-validation, 114
Crystallized intelligence. See also GfGc theory; Intellectual constructs
  correlation and, 45f
  criterion validity, 66–67
  factor analysis and, 292, 294, 294t, 295f, 296–301, 296t, 297t, 298t
  item response theory and, 346, 346f
  overview, 6–7, 7t, 8f, 455–456, 456t
  partitioning sums of squares, 54t
  reliability and, 204
  rules of correspondence and, 454–455, 455f
  scatterplot and, 45f
  standard error of estimate, 53f
  structural equation modeling and, 319–322, 320f, 321f, 322f, 323f
  subject-centered scaling and, 156–160, 157f, 158f, 159f
  subtests in the GfGc dataset, 23t
  test development and, 166–167, 168–172, 168f, 169t, 170t, 171t, 172t, 177, 191–192, 191t, 192f
  true score model and, 210t
  validity continuum and, 61–62
Cumulative probability distribution (density) function, 465–466, 515
Cumulative relative frequency distribution, 26, 36–37, 55
Cumulative scaling model, 156, 162. See also Scaling models
Cutoff score, 193, 198–199, 200
D
Data, 461
Data analysis, 61f
Data collection, 9, 322
Data layout, 373–374, 374f
Data matrix, 161, 161t, 163, 373–374, 374f
Data organization, 160–162, 161t
Data summary, 9
Data types, 458–459
Data-driven approach, 333–334. See also Sampling
Datum, 461, 515
Decision studies. See D-study
Decision theory, 105, 138, 475, 515
Decision-making process, 193–194
Examinee population, 173–174. See also Sampling
Expectation (mean) error, 212
Expected a posteriori (EAP), 364
Explication, 142
Exploratory factor analysis (EFA). See also Factor analysis
  construct validity and, 131
  definition, 139, 326
  overview, 290, 293f
  principal components analysis and, 315–316
Extended matching format, 176t
External stage, 129t
F
Facets
  definition, 287
  generalizability theory and, 266–271, 268t, 269f, 270t
  of measurement and universe scores, 259–260
  overview, 258
  two-facet designs, 281–284, 282t, 283t, 284t, 285t, 286t
Factor, 296, 326
Factor analysis
  applied example, 292, 294, 294t, 295f
  communality and uniqueness and, 309–312, 313f
  compared to principal components analysis, 315–318, 316f, 317t, 318t
  components, eigenvalues, and eigenvectors, 312, 314, 314t
  construct validity and, 131–134, 133t, 134f
  correlated factors and simple structure, 306–308
  correlation matrix and, 337–341, 338f
  errors to avoid, 322–325
  factor loadings and, 294, 296–301, 296t, 297t, 298t
  factor rotation and, 301–306, 302f, 303f, 304f, 305t, 306t, 307t
  history of, 291–292, 293f
  overview, 10, 289–291, 325
  structural equation modeling and, 319–322, 320f, 321f, 322f, 323f
  test development and, 180
Factor extraction, 297
Factor indeterminacy, 300, 326
Factor loading
  construct validity and, 133t
  definition, 139, 326
  overview, 133, 294, 296–301, 296t, 297t, 298t
Factor matrix, 293f, 301–302
Factor rotation, 301–306, 302f, 303f, 304f, 305t, 306t, 307t, 326
Factor-analytic studies, 127
False negative, 110, 113–114, 139
False positive, 110, 113–114, 139
Falsifiability, 333, 404
First moment, 480–481, 515
First-order partial correlation, 71, 76–77, 103. See also Partial correlation
Fixed facets of measurement, 260, 266, 287
Floor effects, 66, 103
Fluid intelligence. See also GfGc theory; Intellectual constructs
  correlation and, 45f
  estimating criterion validity and, 72
  factor analysis and, 292, 294, 294t, 295f, 296–301, 296t, 297t, 298t
  overview, 6–7, 7t, 8f, 455–456, 456t
  partitioning sums of squares, 54t
  regression and, 49, 50f
  reliability and, 204
  rules of correspondence and, 454–455, 455f
  scatterplot and, 45f
  standard error of estimate, 53f
  structural equation modeling and, 319–322, 321f, 322f, 323f
  subject-centered scaling and, 156–160, 157f, 158f, 159f
  subtests in the GfGc dataset, 23t
  test development and, 166–167, 168–172, 168f, 169t, 170t, 171t, 172t, 177
Forward selection, 125
Fourth moment, 481, 485, 515. See also Kurtosis
Frequency, 417t, 420t, 461, 515
Frequency distributions
  definition, 515
  graphing, 26–30, 27f, 28f, 40f
  overview, 24–26, 24t, 25t, 27f, 28f, 461, 464t
Frequency polygon, 26, 28–29, 40f, 56. See also Relative frequency polygon
Frequentist approach, 461
Frequentist probability, 515
F-test, 89, 509–510
G
G coefficient. See Coefficient of generalizability
Galton, Francis, 4, 10
General theory of intelligence (GfGc theory). See GfGc theory
Generalizability coefficient, 136–137, 258, 273–274, 287
Generalizability study, 139, 263. See also G-study
Generalizability theory. See also D-study; G-study
  analysis of variance and, 260–262, 261f
  classical test theory and, 260, 273–274
  construct validity and, 136–137
  definition, 254, 287
  facets of measurement and universe scores, 259–260
  overview, 10, 257–258, 286
  proportion of variance for the person effect and, 271–273
  purpose of, 258
  reliability and, 251–252
  single-facet crossed design and, 274–278, 275t, 276t
  single-facet design with multiple raters rating on two occasions, 280, 281t
  overview, 14–17, 15t, 16f, 21, 146
  subject-centered scaling and, 160
  unfolding technique and, 153
Intraindividual differences, 3, 42
Invariance property, 349–351, 350f, 351t, 441, 442f
Invariant comparison, 366
Item. See Test items
Item analysis, 180, 182, 183t, 184t, 191–192
Item characteristic curve (ICC), 332, 404
Item difficulty, 182, 183t, 184t, 257–258
Item discrimination, 184–186, 185t, 186t
Item facet, 262, 282, 287
Item format, 175, 200. See also Test items
Item homogeneity, 130–131, 139, 254
Item information, 373, 388–389, 389f
Item information function (IIF)
  definition, 404
  item response theory and, 358–362, 360t–361t, 361f
  three-parameter logistic IRT model and, 397–399, 398f
Item parameter estimates, 358–362, 360t–361t, 361f, 362–364
Item reliability index, 190–192, 191t, 192f, 200. See also Test items
Item response function (IRF), 332
Item response theory (IRT)
  assumptions of, 336–337
  Bayesian methods and, 475
  bookmark method and, 198–199
  compared to classical test theory (CTT), 330–331
  conceptual explanation of, 334, 336, 336f
  correlation matrix and, 337–341, 338f, 339f, 340f
  data layout, 373–374, 374f
  definition, 163, 405, 516
  dimensionality assessment specific to, 341–344
  invariance property, 349–351, 350f, 351t
  item parameter and ability estimation and, 362–364
  item response theory and, 344
  joint probability of based on ability, 351–358, 352f, 354t, 355t, 357f
  linear models and, 366–371, 368f, 369f, 370f
  local independence of items, 345–348, 346t, 347f
  logistic regression and, 366–371, 368f, 369f, 370f
  maximum likelihood estimation (MLE) and, 468
  model comparison approach and, 400–403, 403t
  observed score, true score, and ability, 445–447, 447f
  one-parameter logistic IRT model and, 374–381, 376f, 378f, 380t–381t, 381f
  overview, 10, 148, 329–330, 331–332, 404
  philosophical views on, 333–334, 335t
  Rasch model and, 366–373, 368f, 369f, 370f, 372t
  reliability and, 243, 252
  scaling and, 160
  standard error of ability, 358–362, 360t–361t, 361f
  strong true score theory and, 332–333
  test dimensionality and, 337
  test score equating and, 439–443, 442f, 443t, 444t
  three-parameter logistic IRT model and, 389–399, 393t–396t, 397f, 398f
  true score equating, 443, 445
  two-parameter logistic IRT model and, 381–389, 381f, 385t–386t, 387t–388t, 389f
  when traditional models of are inappropriate to use, 364–365
Item validity index, 191–192, 191t, 192f, 200. See also Test items; Validity
J
Joint density function, 487, 516
Joint maximum likelihood estimation (JMLE), 363, 405. See also Maximum likelihood estimation (MLE)
Joint probability, 351–358, 352f, 354t, 355t, 357f
Judgment scaling, 163
Judgments, 148–150, 150
Just noticeable difference (JND), 147, 163
K
Küder–Richardson 20, 233, 238–239, 253, 254
Küder–Richardson 21, 233, 238–239, 253, 254
Kurtosis, 410, 481, 485–486, 516
L
Language development, 66–67
Latent class analysis (LCA), 344, 405
Latent factor, 291, 326
Latent trait. See also Item response theory (IRT)
  definition, 405, 516
  item response theory and, 338, 439–440
  overview, 148, 331
Latent variable, 336f, 458. See also Variable
Least-squares criterion, 51–52, 56, 139
Likelihood ratio tests, 123t, 402
Likelihood value, 118, 139
Likert-type items, 178f, 404. See also Test items
Linear equation, 109, 264
Linear models, 366–371, 368f, 369f, 370f, 428–429
Linear regression. See also Regression; Simple linear regression
  assessing, 509–510
  generalizability theory and, 263–265
  overview, 47
  Pearson r and, 492–493, 492f, 493f
Linear scaling equation, 412
Linear transformation, 411–415, 413t, 415t, 482, 484, 485, 516
Linear z-scores, 416, 417t. See also z-score
Local independence, 331, 345–348, 346t, 347f, 405
Local norms, 419, 448
Location, 475
Log likelihood, 351–354
Multiple-group discriminant analysis, 114–116, 115t, 116f. See also Discriminant analysis
Multiplication theorem of probability, 461–462, 516
Multitrait–multimethod (MTMM) studies
  construct validity and, 127
  definition, 140
  overview, 134–135, 134t, 135t
Multivariate analysis of variance (MANOVA), 107, 140
Multivariate normality, 107
Multivariate relationships, 488–491, 490t
N
National Council on Measurement in Education (NCME), 59
Nedelsky method, 195–196
Nested designs, 260, 279–280, 287
Nominal scale. See also Measurement; Scaling
  definition, 56
  item response theory and, 404
  overview, 14–17, 15t, 16f, 17f, 21
Nonequivalent anchor test (NEAT) design, 427, 448
Nonlinear regression, 492–493, 492f, 493f
Nonmetric measurement, 153, 156, 163
Nonparametric model, 335t, 341
Nonprobability sampling, 174, 200. See also Sampling
Normal distribution, 39–42, 41f, 56, 148–150. See also Score distributions; Standard normal distribution
Normality of errors, 494–495
Normalized scale scores, 418–421, 420t, 421f, 422–423, 422f
Normalized standard scores
  definition, 448
  overview, 418–421, 420t, 421f, 422f
Normative population, 180–181, 200
Normative sample, 408, 448
Normative scores, 410, 415–416, 417t
Norming, 408, 408–410, 449. See also Norms
Norm-referenced test. See also Criterion-referenced test; Norms
  definition, 10, 200, 449
  overview, 3, 408
  standard-setting approaches and, 194
  test development and, 169t
Norms. See also Norming; Norm-referenced test
  definition, 449
  normalized standard or scale scores, 418–421, 420t, 421f, 422f
  overview, 1–2, 10, 407–408
  planning a norming study, 408–410
  test development and, 180–181
Numbers, 14–17, 15t, 16f, 17f
O
Object of measurement, 262, 288
Objectivity, 366, 405, 452–454, 453f, 516
Oblique rotational matrix. See also Rotational method
  definition, 326
  factor analysis and, 293f, 324
  overview, 302–306, 304f, 305t, 307t
Observations, 13–14, 14f, 24. See also Measurement observations
Observed score
  overview, 445–447, 447f
  true score model and, 209–210, 210t, 211f, 219–221, 220t
Observer (rater) facet, 282, 282t
Obtained score units, 248
Occasion facet, 262, 288
Odds ratio, 120, 122t, 140
One-facet design, 266–271, 268t, 269f, 270t
One-factor models, 344
One-parameter logistic IRT model
  for dichotomous item responses, 374–381, 376f, 378f, 380t–381t, 381f
  model comparison approach and, 400–403, 403t
  test score equating and, 439–440
Open-ended questions, 173
Order, 150–151
Ordered categorical scaling methods, 158. See also Scaling models
Ordinal, 56
Ordinal scale. See also Measurement; Scaling
  compared to interval levels of measurement, 19–20, 19f
  definition, 57
  overview, 14–17, 15t, 16f, 17f, 21, 146, 150–151
  subject-centered scaling and, 160
  Thurstone’s law of comparative judgment and, 148–150
  unfolding technique and, 153
Orthogonal rotational matrix. See also Rotational method
  definition, 326
  factor analysis and, 293f, 324
  overview, 302–306, 303f, 306t, 307t
P
Paired comparisons, 150–151, 151t, 152t, 163
Parallel forms method, 229
Parallel test, 214, 216–219, 254
Parameter, 33
Parameter estimates, 57, 124t, 394t–396t
Parametric factor-analytic methods, 341
Parametric statistical inference, 471
Partial correlation. See also Correlation; First-order partial correlation
  correction for attenuation and, 76–77
  estimating criterion validity and, 70–80, 73t, 75f, 79t, 83t
  overview, 511–512, 513t
Partial regression slopes, 90–92
Partially nested facet, 262, 288
Partitioning sums of squares, 54, 54t
Short-term memory. See also GfGc theory; Intellectual constructs
  factor analysis and, 292, 294, 294t, 295f
  generalizability theory and, 266t
  overview, 6–7, 7t, 8f, 455–456, 456t
  reliability and, 204
  rules of correspondence and, 454–455, 455f
  subject-centered scaling and, 156–160, 157f, 158f, 159f
  subtests in the GfGc dataset, 23t
  test development and, 166–167, 168–172, 168f, 169t, 170t, 171t, 172t, 191–192, 191t
  validity continuum and, 62
Sigma notation, 29–31, 57. See also Summation
Significance, 87–90, 89t, 90t, 92
Simple linear regression, 47, 57. See also Linear regression; Regression
Simple structure
  correlated factors and simple structure and, 306–308
  definition, 326
  factor analysis and, 306–308
  overview, 301–302
Single random variable, 486–487
Single-facet crossed design, 274–278, 275t, 276t
Single-facet design, 278–280, 281t
Single-facet person, 266–271, 268t, 269f, 270t
Skewness, 410, 481, 485–486, 517
Slope of a line, 47–48, 57, 90–92
Slope–intercept equation, 376–377, 378f
Smoothing techniques, 424
Spearman–Brown formula, 255
Spearman’s rank order correlation coefficient, 495–496, 496t
Specific objectivity, 371, 405
Specific variance, 310–312, 326
Split-half method, 204
Split-half reliability, 226, 253, 255
Square root of the reliability, 249
Squared multiple correlation, 76, 103
Stability of scores, 228–229. See also Reliability
Standard deviation
  definition, 517
  estimating criterion validity and, 79t
  overview, 34, 481–482
  variance and, 35–36
Standard error, 92, 99–100, 387t–388t, 433, 509
Standard error of ability, 358–362, 360t–361t, 361f
Standard error of equating, 433–435
Standard error of estimation, 244
Standard error of measurement (SEM), 244–249, 255, 263, 281, 288
Standard error of prediction, 244, 250–251, 255
Standard error of the estimate (SEE)
  definition, 57, 104, 255
  overview, 52–53, 53t
  regression analysis and, 94–95, 95f
Standard error of the mean, 410
Standard normal distribution, 42, 57, 143f, 144f. See also Normal distribution
Standard score
  definition, 449
  under linear transformation, 411–415, 413t, 415t
  overview, 408
Standard score conversion tables, 410
Standard setting, 193–194, 194, 201
Standardized regression equation, 93–94. See also Regression equation
Standardized regression slope, 104
Standardized regression slopes, 93
Standardized regression weights, 305
Standards for Educational and Psychological Testing, 60
Standards-referenced method, 194, 201
Statistic
  definition, 57
  generalizability theory and, 261f
  notation and operations overview, 459–460
  overview, 33, 55, 514
  planning a norming study and, 409–410
  reliability and, 231t
  subject-centered scaling and, 160
Statistical control, 70–71, 104
Statistical estimation, 66–68, 475, 517
Statistical foundations, 22–23
Statistical inference, 41
Statistical model, 263–265, 265t, 266t
Statistical power, 76
Stepwise selection, 125
Stimulus intensity, 143
Stimulus-centered scaling method. See also Scaling models
  definition, 164
  overview, 145t, 146, 147–148, 162
  test development and, 165
  Thurstone’s law of comparative judgment and, 149–150
Stratified random sampling, 174, 201. See also Sampling
Strong true score theory, 332–333, 406
Structural equation modeling (SEM). See also Covariance structural modeling
  confirmatory factor analysis and, 319–322, 320f, 321f, 322f, 323f
  definition, 58, 140, 327
  factor analysis and, 133, 289–290, 325
  overview, 46
Structural model, 320, 327
Structural stage, 129t
Subject-centered scaling method. See also Scaling models
  definition, 164
  overview, 145t, 146, 156–160, 157f, 158f, 159f, 162
  test development and, 165
Subjectivity, 126
Subject-matter experts (SMEs), 195–196, 198–199, 201
U
Unadjusted linear transformation, 411–412, 449. See also Linear transformation
Unbiased estimate, 482, 517
Unfolding technique, 153–156, 154t, 155f. See also Scaling models
Unidimensional model, 335t
Unidimensional scale, 142, 164
Unidimensional unfolding technique, 153–156, 154t, 155f, 164. See also Scaling models
Unidimensionality, 217, 331, 406
Unimodal distribution, 58
Unique factor, 309–312, 313f, 327
Units of measurement, 18–19. See also Measurement
Universe scores, 259–260, 262, 266, 288
Unobservable variables. See also Constructs
  covariance and, 42
  factor loadings and, 294, 296–301, 296t, 297t, 298t
  overview, 5–6
  units of measurement and, 18–19
Unobserved ability, 364
Unstandardized multiple regression equation, 87, 88. See also Regression equation
Unstandardized multiple regression equation (linear), 104
V
Valid negatives, 110, 113–114, 140
Valid positives, 110, 113–114, 140
Validation, 60, 104, 127, 129t
Validity. See also Construct validity; Content validity; Criterion validity
  classification and selection and, 105–106, 106f
  construct-related variance and, 206
  criterion validity, 63
  definition, 104, 255
  discriminant analysis and, 106–114, 110t, 111t, 112t, 113f, 114t
  high-quality criterion and, 63–66
  overview, 59–63, 61f, 102, 137, 141
  scaling and, 22
  test development and, 167–168, 190–192, 191t, 192f
  validity continuum and, 61f
Validity coefficient
  correction for attenuation and, 68–70
  definition, 104
  generalizability theory and, 136–137
  overview, 63
  reliability and, 67–68
Validity continuum, 61f
Values, 461
Variability
  definition, 58
  overview, 22, 23–30, 23f, 24t, 25t, 27f, 28f, 32, 34, 40f
  reliability and, 204–206, 205t
Variable
  definition, 11, 58, 517
  factor analysis and, 323, 323–324
  overview, 6, 23, 456–458
  research studies and, 9
  validity continuum and, 61f
Variance
  definition, 58, 517
  factor analysis and, 293f
  generalizability theory and, 261f, 265t
  normal distribution and, 41, 41f
  overview, 35–36, 36t
  planning a norming study and, 410
  reliability and, 223–225, 224t
  two-facet designs and, 285t, 286t
Variance component, 259, 288
Variance–covariance matrix, 133, 223–225, 224t, 316, 317f
Variance partition, 312, 313f
Variates, 107, 140
Variations, 481–484
Verbal intelligence, 67
Vertical equating, 427, 449
Vignette or scenario item set format, 176t. See also Test items
W
Wechsler Adult Intelligence Scale—Third Edition (WAIS-III), 67
Wechsler Adult Intelligence Scale—Fourth Edition (WAIS-IV), 1
Wechsler Memory Scale—Third Edition (WMS-III), 132
Working memory, 62
Y
Yates’s correction for continuity, 502–503, 517
Z
Z-distribution, 58
Zero-order correlation, 72, 104
z-score
  common standard score transformations or conversions, 422–423
  definition, 58
  normalized standard or scale scores, 418–421, 420t, 421f, 422f
  overview, 37–38, 37t, 40t
About the Author
Larry R. Price, PhD, is Professor of Psychometrics and Statistics at Texas State University, where he is also Director of the Initiative for Interdisciplinary Research Design and Analysis. This universitywide role involves conceptualizing and writing the analytic segments of large-scale competitive grant proposals in collaboration with interdisciplinary research teams. Previously, he served as a psychometrician and statistician at the Emory University School of Medicine (Department of Psychiatry and Behavioral Sciences and the Department of Psychology) and at The Psychological Corporation (now part of Pearson's Clinical Assessment Group). Dr. Price is a Fellow of the American Psychological Association, Division 5 (Evaluation, Measurement, and Statistics), and an Accredited Professional Statistician of the American Statistical Association.