

Psychometric Methods
Methodology in the Social Sciences
David A. Kenny, Founding Editor
Todd D. Little, Series Editor
www.guilford.com/MSS

This series provides applied researchers and students with analysis and research design books that
emphasize the use of methods to answer research questions. Rather than emphasizing statistical
theory, each volume in the series illustrates when a technique should (and should not) be used and
how the output from available software programs should (and should not) be interpreted. Common
pitfalls as well as areas of further development are clearly articulated.

RECENT VOLUMES

DOING STATISTICAL MEDIATION AND MODERATION


Paul E. Jose

LONGITUDINAL STRUCTURAL EQUATION MODELING


Todd D. Little

INTRODUCTION TO MEDIATION, MODERATION, AND CONDITIONAL
PROCESS ANALYSIS: A REGRESSION-BASED APPROACH
Andrew F. Hayes

BAYESIAN STATISTICS FOR THE SOCIAL SCIENCES


David Kaplan

CONFIRMATORY FACTOR ANALYSIS FOR APPLIED RESEARCH, SECOND EDITION


Timothy A. Brown

PRINCIPLES AND PRACTICE OF STRUCTURAL EQUATION MODELING, FOURTH EDITION


Rex B. Kline

HYPOTHESIS TESTING AND MODEL SELECTION IN THE SOCIAL SCIENCES


David L. Weakliem

REGRESSION ANALYSIS AND LINEAR MODELS:
CONCEPTS, APPLICATIONS, AND IMPLEMENTATION
Richard B. Darlington and Andrew F. Hayes

GROWTH MODELING: STRUCTURAL EQUATION
AND MULTILEVEL MODELING APPROACHES
Kevin J. Grimm, Nilam Ram, and Ryne Estabrook

PSYCHOMETRIC METHODS: THEORY INTO PRACTICE


Larry R. Price
Psychometric Methods
Theory into Practice

Larry R. Price

Series Editor’s Note by Todd D. Little

THE GUILFORD PRESS


New York London
Copyright © 2017 The Guilford Press
A Division of Guilford Publications, Inc.
370 Seventh Avenue, Suite 1200, New York, NY 10001
www.guilford.com

All rights reserved

No part of this book may be reproduced, translated, stored in a retrieval system, or transmitted, in any form or by any means, electronic, mechanical, photocopying, microfilming, recording, or otherwise, without written permission from the publisher.

Printed in the United States of America

This book is printed on acid-free paper.

Last digit is print number: 9 8 7 6 5 4 3 2 1

Library of Congress Cataloging-in-Publication Data

Names: Price, Larry R., author.


Title: Psychometric methods : theory into practice / Larry R. Price.
Description: New York : The Guilford Press, [2017] | Series: Methodology in the social sciences |
  Includes bibliographical references and index.
Identifiers: LCCN 2016013346 | ISBN 9781462524778 (hardback)
Subjects: LCSH: Psychometrics. | BISAC: SOCIAL SCIENCE / Research. | MEDICAL / Nursing /
  Research & Theory. | PSYCHOLOGY / Assessment, Testing & Measurement. | EDUCATION /
  Testing & Measurement. | BUSINESS & ECONOMICS / Statistics.
Classification: LCC BF39 .P685 2016 | DDC 150.1/5195—dc23
LC record available at https://lccn.loc.gov/2016013346
To my parents, wife, and former students
Series Editor’s Note

The term psychometrics has an almost mystical aura about it. Larry Price brings his vast
acumen as well as his kind and gentle persona to demystify for you the world of psycho-
metrics. Psychometrics is not just a province of psychology. In fact, the theory-to-practice
orientation that Larry brings to his book makes it clear how widely applicable the fun-
damental principles are across the gamut of disciplines in the social sciences. Because
psychometrics is foundationally intertwined with the measurement of intelligence, Larry
uses this model to convey psychometric principles for applied uses. Generalizing these
principles to your domain of application is extremely simple because they are presented
as principles, and not rules that are tied to a domain of inquiry.
Psychometrics is an encompassing field that spans the research spectrum from inspi-
ration to dissemination. At the inspiration phase, psychometrics covers the operational
characteristics of measurement, assessment, and evaluation. E. L. Thorndike (1918) once
stated, “Whatever exists at all exists in some amount. To know it thoroughly involves
knowing its quantity as well as its quality.” I interpret this statement as a callout to mea-
surement experts: Using the underlying principles of psychometrics, “figure out how to
measure it!” If it exists at all, it can be measured, and it is up to us, as principled psy-
chometricians, to divine a way to measure anything that exists. Larry’s book provides
an accessible presentation of all the tools at your disposal to figure out how to measure
anything that your research demands.
Thorndike’s contemporary, E. G. Boring (1923) once quipped, “Intelligence is what
the tests test.” Both Thorndike’s and Boring’s famous truisms have psychometrics at the
core of their intent. Boring’s remarks move us more from the basics of measurement to
the process of validation, a key domain of psychometrics. I have lost count of the many
different kinds of validities that have been introduced, but fortunately, Larry’s book enu-
merates the important ones and gives you the basis to understand what folks mean when
they use the word validity in any phase of the research process.


Being a good psychometrician is a form of recession-proof job security. The demand for well-trained psychometricians is higher now than at any time in history. Accountabil-
ity standards, evidence-based practice initiatives, and the like require that new measures
for assessment and evaluation be developed, and they require that many of the “standard”
measurement tools be revamped and brought up to the standards of modern measure-
ment principles. Larry Price’s book provides you with all of the necessary tools to become
a great psychometrician.
As always, “enjoy!”

Todd D. Little
On the road in Corvallis, Oregon

References

Boring, E. G. (1923). Intelligence as the tests test it. New Republic, 36, 35–37.
Thorndike, E. L. (1918). The nature, purposes, and general methods of measurement of educa-
tional products. In S. A. Courtis (Ed.), The measurement of educational products (17th Year-
book of the National Society for the Study of Education, Pt. 2, pp. 16–24). Bloomington, IL:
Public School.
Acknowledgments

Many individuals have positively affected my career. I express my sincere appreciation to those whose assistance was critical to the completion of this book. I would like to thank
Barbara Rothbaum, Richard Lewine, and Frank Brown, each of whom I was privileged to
collaborate with at Emory University School of Medicine, Department of Psychiatry and
Behavioral Sciences, early in my career. I am very appreciative and grateful to T. Chris
Oshima at Georgia State University, who served as my mentor during my graduate stud-
ies. I thank the late Nambury Raju for his wisdom and mentorship in so many aspects of
psychometrics and in particular item response theory. Also, I am thankful for the profes-
sional experience afforded me during my time as a psychometrician at the Psychological
Corporation and particularly to my colleagues J. J. Zhu, the late Charles Wilkins, Larry
Weiss, and Aurelio Prifitera.
I am sincerely appreciative of the editorial reviews and suggestions for improvement
provided by Elizabeth Belasco, Elizabeth Threadgill, and Gail Ryser. Thanks to numerous
graduate students whom I have taught and mentored for their reading of and reaction to
the manuscript. I am most appreciative of the insights and suggestions provided by the
reviewers during several iterations of the manuscript. My gratitude also goes to my Lord
Jesus Christ for inspiration and fortitude through this lengthy process.
I want to thank C. Deborah Laughton, Publisher, Research Methods and Statistics,
at The Guilford Press, for her superb support and guidance throughout the process of
completing the manuscript. Most certainly, without her guidance and support the manu-
script would not have reached its completion. Also, a big thank you to Katherine Sommer
at Guilford for her administrative support during the final stages of production. I also
express my sincere thanks to Series Editor Todd Little for his wisdom, support, and guid-
ance through it all.

Contents

1 • Introduction  1
1.1  Psychological Measurement and Tests  1
1.2  Tests and Samples of Behavior  3
1.3  Types of Tests  3
1.4  Origin of Psychometrics  4
1.5  Definition of Measurement  5
1.6  Measuring Behavior  5
1.7  Psychometrics and Its Importance to Research and Practice  7
1.8  Organization of This Book  9
Key Terms and Definitions  10

2 • Measurement and Statistical Concepts  13


  2.1  Introduction  13
  2.2  Numbers and Measurement  13
  2.3  Properties of Measurement in Relation to Numbers  14
  2.4  Levels of Measurement  20
  2.5  Contemporary View on the Levels of Measurement and Scaling  22
  2.6  Statistical Foundations for Psychometrics  22
  2.7  Variables, Frequency Distributions, and Scores  23
  2.8  Summation or Sigma Notation  29
  2.9  Shape, Central Tendency, and Variability of Score Distributions  31
2.10  Correlation, Covariance, and Regression  42
2.11  Summary  55
Key Terms and Definitions  55


3 • Criterion, Content, and Construct Validity  59


  3.1  Introduction  59
  3.2  Criterion Validity  63
  3.3  Essential Elements of a High-Quality Criterion  64
  3.4  Statistical Estimation of Criterion Validity  66
  3.5  Correction for Attenuation  68
  3.6  Limitations to Using the Correction for Attenuation  70
  3.7  Estimating Criterion Validity with Multiple Predictors:
Partial Correlation  70
  3.8  Estimating Criterion Validity with Multiple Predictors:
Higher-Order Partial Correlation  77
  3.9  Coefficient of Multiple Determination and Multiple Correlation  80
3.10  Estimating Criterion Validity with More Than One Predictor:
Multiple Linear Regression  84
3.11  Regression Analysis for Estimating Criterion Validity:
Development of the Regression Equation  85
3.12  Unstandardized Regression Equation for Multiple Regression  87
3.13  Testing the Regression Equation for Significance  87
3.14  Partial Regression Slopes  90
3.15  Standardized Regression Equation  93
3.16  Predictive Accuracy of a Regression Analysis  94
3.17  Predictor Subset Selection in Regression  101
3.18  Summary  102
Key Terms and Definitions  102

4 • Statistical Aspects of the Validation Process  105


  4.1  Techniques for Classification and Selection  105
  4.2  Discriminant Analysis  106
  4.3  Multiple-Group Discriminant Analysis  114
  4.4  Logistic Regression  117
  4.5  Logistic Multiple Discriminant Analysis:
Multinomial Logistic Regression  122
  4.6  Model Fit in Logistic Regression  125
  4.7  Content Validity  125
  4.8  Limitations of the Content Validity Model  126
  4.9  Construct Validity  126
4.10  Establishing Evidence of Construct Validity  127
4.11  Correlational Evidence of Construct Validity  130
4.12  Group Differentiation Studies of Construct Validity  131
4.13  Factor Analysis and Construct Validity  131
4.14  Multitrait–Multimethod Studies  134
4.15  Generalizability Theory and Construct Validity  136
4.16  Summary and Conclusions  137
Key Terms and Definitions  138

5 • Scaling  141
  5.1  Introduction  141
  5.2  A Brief History of Scaling  142
  5.3  Psychophysical versus Psychological Scaling  144
  5.4  Why Scaling Models Are Important  146
  5.5  Types of Scaling Models  146
  5.6  Stimulus-Centered Scaling  147
  5.7  Thurstone’s Law of Comparative Judgment  148
  5.8  Response-Centered Scaling  150
  5.9  Scaling Models Involving Order  150
5.10  Guttman Scaling  151
5.11  The Unfolding Technique  153
5.12  Subject-Centered Scaling  156
5.13  Data Organization and Missing Data  160
5.14  Incomplete and Missing Data  162
5.15  Summary and Conclusions  162
Key Terms and Definitions  162

6 • Test Development  165
  6.1  Introduction  165
  6.2  Guidelines for Test and Instrument Development  166
  6.3  Item Analysis  182
  6.4  Item Difficulty  182
  6.5  Item Discrimination  184
  6.6  Point–Biserial Correlation  186
  6.7  Biserial Correlation  188
  6.8  Phi Coefficient  189
  6.9  Tetrachoric Correlation  190
6.10  Item Reliability and Validity  190
6.11  Standard Setting  193
6.12  Standard-Setting Approaches  194
6.13  The Nedelsky Method  195
6.14  The Ebel Method  196
6.15  The Angoff Method and Modifications  196
6.16  The Bookmark Method  198
6.17  Summary and Conclusions  199
Key Terms and Definitions  199

7 • Reliability  203
  7.1  Introduction  203
  7.2  Conceptual Overview  204
  7.3  The True Score Model  206
  7.4  Probability Theory, True Score Model, and Random Variables  207

  7.5  Properties and Assumptions of the True Score Model  209


  7.6  True Score Equivalence, Essential True Score Equivalence,
and Congeneric Tests  219
  7.7  Relationship between Observed and True Scores  219
  7.8  The Reliability Index and Its Relationship to the Reliability
Coefficient  221
  7.9  Summarizing the Ways to Conceptualize Reliability  221
7.10  Reliability of a Composite  223
7.11  Coefficient of Reliability: Methods of Estimation Based
on Two Occasions  228
7.12  Methods Based on a Single Testing Occasion  230
7.13 Estimating Coefficient Alpha: Computer Program
and Example Data  234
7.14  Reliability of Composite Scores Based on Coefficient Alpha  238
7.15  Reliability Estimation Using the Analysis of Variance Method  240
7.16  Reliability of Difference Scores  241
7.17  Application of the Reliability of Difference Scores  243
7.18  Errors of Measurement and Confidence Intervals  244
7.19  Standard Error of Measurement  244
7.20  Standard Error of Prediction  250
7.21  Summarizing and Reporting Reliability Information  251
7.22  Summary and Conclusions  252
Key Terms and Definitions  253

8 • Generalizability Theory  257
  8.1  Introduction  257
  8.2  Purpose of Generalizability Theory  258
  8.3  Facets of Measurement and Universe Scores  259
  8.4  How Generalizability Theory Extends Classical Test Theory  260
  8.5  Generalizability Theory and Analysis of Variance  260
  8.6  General Steps in Conducting a Generalizability Theory Analysis  263
  8.7  Statistical Model for Generalizability Theory  263
  8.8  Design 1: Single-Facet Person-by-Item Analysis  266
  8.9  Proportion of Variance for the p × i Design  271
8.10  Generalizability Coefficient and CTT Reliability  273
8.11  Design 2: Single-Facet Crossed Design with Multiple Raters  274
8.12  Design 3: Single-Facet Design with the Same Raters
on Multiple Occasions  278
8.13  Design 4: Single-Facet Nested Design with Multiple Raters  279
8.14  Design 5: Single-Facet Design with Multiple Raters Rating
on Two Occasions  280
8.15  Standard Errors of Measurement: Designs 1–5  281
8.16  Two-Facet Designs  281
8.17  Summary and Conclusions  286
Key Terms and Definitions  287

9 • Factor Analysis  289
  9.1  Introduction  289
  9.2  Brief History  291
  9.3  Applied Example with GfGc Data  292
  9.4  Estimating Factors and Factor Loadings  294
  9.5  Factor Rotation  301
  9.6  Correlated Factors and Simple Structure  306
  9.7  The Factor Analysis Model, Communality, and Uniqueness  309
  9.8  Components, Eigenvalues, and Eigenvectors  312
  9.9  Distinction between Principal Components Analysis
and Factor Analysis  315
9.10  Confirmatory Factor Analysis  319
9.11  Confirmatory Factor Analysis and Structural Equation Modeling  319
9.12  Conducting Factor Analysis: Common Errors to Avoid  322
9.13  Summary and Conclusions  325
Key Terms and Definitions  325

10 • Item Response Theory  329


  10.1  Introduction  329
  10.2  How IRT Differs from CTT  330
  10.3  Introduction to IRT  331
  10.4  Strong True Score Theory, IRT, and CTT  332
  10.5  Philosophical Views on IRT  333
  10.6  Conceptual Explanation of How IRT Works  334
  10.7  Assumptions of IRT Models  336
  10.8  Test Dimensionality and IRT  337
  10.9  Type of Correlation Matrix to Use in Dimensionality Analysis  337
10.10  Dimensionality Assessment Specific to IRT  341
10.11  Local Independence of Items  345
10.12  The Invariance Property  349
10.13  Estimating the Joint Probability of Item Responses Based on Ability  351
10.14  Item and Ability Information and the Standard Error of Ability  358
10.15  Item Parameter and Ability Estimation  362
10.16  When Traditional IRT Models Are Inappropriate to Use  364
10.17  The Rasch Model  366
10.18  The Rasch Model, Linear Models, and Logistic Regression Models  366
10.19  Properties and Results of a Rasch Analysis  371
10.20  Item Information for the Rasch Model  373
10.21  Data Layout  373
10.22  One-Parameter Logistic Model for Dichotomous Item Responses  374
10.23  Two-Parameter Logistic Model for Dichotomous Item Responses  381
10.24  Item Information for the Two-Parameter Model  388
10.25  Three-Parameter Logistic Model for Dichotomous Item Responses  389
10.26  Item Information for the Three-Parameter Model  397

10.27  Choosing a Model: A Model Comparison Approach  400


10.28  Summary and Conclusions  404
Key Terms and Definitions  404

11 • Norms and Test Equating  407


  11.1  Introduction  407
  11.2  Norms, Norming, and Norm-Referenced Testing  408
  11.3  Planning a Norming Study  408
  11.4  Scaling and Scale Scores  410
  11.5  Standard Scores under Linear Transformation  411
  11.6  Percentile Rank Scale  415
  11.7  Interpreting Percentile Ranks  416
  11.8  Normalized z- or Scale Scores  418
  11.9  Common Standard Score Transformations or Conversions  422
11.10  Age- and Grade-Equivalent Scores  424
11.11  Test Score Linking and Equating  425
11.12  Techniques for Conducting Equating: Linear Methods  428
11.13 Design I: Random Groups—One Test Administered
to Each Group  429
11.14 Design II: Random Groups with Both Tests Administered
to Each Group, Counterbalanced (Equally Reliable Tests)  432
11.15 Design III: One Test Administered to Each Study Group, Anchor Test
Administered to Both Groups (Equally Reliable Tests)  435
11.16  Equipercentile Equating  436
11.17  Test Equating Using IRT  439
11.18  IRT True Score Equating  443
11.19  Observed Score, True Score, and Ability  445
11.20  Summary and Conclusions  447
Key Terms and Definitions  448

Appendix • Mathematical and Statistical Foundations 451


References 519
Author Index 531
Subject Index 537
About the Author 552

The companion website www.guilford.com/price2-materials presents datasets for all examples as well as PowerPoints of figures and key concepts.
1

Introduction

This chapter introduces psychological measurement and classification. Psychological tests are
defined as devices for measuring human behavior. Tests are broadly defined as devices for
measuring ability, aptitude, achievement, attitudes, interests, personality, cognitive function-
ing, and mental health. Psychometrics is defined as the science of evaluating the charac-
teristics of tests designed to measure psychological attributes. The origin of psychometrics
is briefly described, along with the seminal contributions of Francis Galton. The chapter
ends by highlighting the role of psychological measurement and psychometrics in relation
to research in general.

1.1 Psychological Measurement and Tests

During the course of your lifetime, most likely you have been affected by some form
of psychological measurement. For example, you or someone close to you has taken a
psychological test for academic, personal, or professional reasons. The process of psy-
chological measurement is carried out by way of a measuring device known as a test. A
psychological test is a device for acquiring a sample of behavior from a person. The term
test is used to broadly describe devices aimed toward measuring ability, aptitude, achieve-
ment, attitudes, interests, personality, cognitive functioning, and mental health. Tests are
often contextualized by way of a descriptor such as “intelligence,” “achievement,” or “per-
sonality.” For example, a well-known intelligence test is the Wechsler Adult Intelligence
Scale—Fourth Edition (WAIS-IV; 2008). A well-known achievement test is the Stanford Achievement Test (SAT; Pearson Education, 2015), and the NEO Five-Factor Inventory (NEO-FFI; Costa & McCrae, 1992) is a well-known instrument that measures
personality. Also, tests have norms (a summary of test results for a representative group


of subjects) or standards by which results can be used to predict other more important
behavior. Table 1.1 provides examples of common types of psychological tests.
Individual differences manifested by scores on such tests are real and often sub-
stantial in size. For example, you may have observed differences in attributes such as
personality, intelligence, or achievement based on the results you or someone close to
you received on a psychological test. Test results can and often do affect people’s lives in
important ways. For example, scores on tests can be used to classify a person as brain
damaged, weak in mathematical skills, or strong in verbal skills. Tests can also be used for
selection purposes in employment settings or in certain types of psychological counsel-
ing. Tests are also used for evaluation purposes (e.g., for licensure or certification in law,
medicine, and public safety professions).
Prior to examining the attributes of persons measured by tests, we must accurately
describe the attributes of interest. To this end, the primary goal of psychological measure-
ment is to describe the psychological attributes of individuals and the differences among them.
Describing psychological attributes involves some form of measurement or classification
scheme. Measurement is broadly concerned with the methods used to provide quanti-
tative descriptions of the extent to which persons possess or exhibit certain attributes.
Classification is concerned with the methods used to assign persons to one or another
of two or more different categories or classes (e.g., a major in college such as biology,
history, or English; diseased or nondiseased; biological sex [male or female]; or pass/fail
regarding mastery of a subject).

Table 1.1.  Types of Psychological Tests


Intelligence tests: measure an individual’s relative ability in global areas such as verbal comprehen-
sion, perceptual organization, or reasoning and thereby help determine potential for scholastic
work or certain occupations.
Aptitude tests: measure the capability for a relatively specific task or type of skill; aptitude tests are a
narrow form of testing.
Achievement tests: measure a person’s degree of learning, success, or accomplishment in a subject or task.
Personality tests: measure the traits, qualities, or behaviors that determine a person’s individuality;
such tests include checklists, inventories, and projective techniques.
Neuropsychological tests: measure cognitive, sensory, perceptual, and motor performance to deter-
mine the extent, locus, and behavioral consequences of brain damage.
Behavioral procedures: objectively describe and count the frequency of a behavior, identifying the
antecedents and consequences of the behavior.
Interest inventories: measure the person’s preference for certain activities or topics and thereby help
determine occupational choice.
Creativity tests: assess novel, original thinking and the capacity to find unusual or unexpected solu-
tions especially for vaguely defined problems.
Note. Adapted from Gregory (2000, p. 36). Copyright 2000. Reprinted by permission of Pearson Education, Inc.
New York, New York.

1.2 Tests and Samples of Behavior

A psychological test measures a sample of an individual’s behavior. These “samples of behavior” from people allow us to study differences among them. To this end, central
to psychological measurement and testing is the study of individual differences among
people. The process of acquiring a sample of behavior is based on a stimulus such as a
test question (paper and pencil or computer administered) or as a naturally occurring
behavior. Acquiring a sample of behavior may also take the form of responses to a ques-
tionnaire, oral responses to questions, or performance on a particular task.
Four essential components of test use are (1) acquiring a sample of behavior, (2) ensur-
ing that the sample of behavior is acquired in a systematic (standardized) manner (i.e.,
the same way for every person), (3) comparing the behavior of two or more people (i.e.,
studying individual differences), and (4) studying the performance of the same persons
over time (i.e., intraindividual differences). Depending on the goal of the measurement
process, the essential components above are used to measure the relevant information.
Tests differ on (1) the mode in which the material is presented (e.g., paper and pencil,
computerized administration, oral, in a group setting, in an individual setting), (2) the
degree to which stimulus materials are standardized, (3) the type of response format
(e.g., response from a set of alternatives vs. a constructed response), and (4) the degree to
which test materials are designed to simulate a particular context (American Educational
Research Association [AERA], American Psychological Association [APA], & National
Council on Measurement in Education [NCME], 1985, 1999; 2014, p. 3). In all cases, a
useful test accurately measures some attribute or behavior.

1.3 Types of Tests

Tests measuring cognitive ability, cognitive functioning, and achievement are classified
as criterion-referenced or norm-referenced. For example, criterion-referenced tests are
used to determine where persons stand with respect to highly specific educational objec-
tives (Berk, 1984). In a norm-referenced test, the performance of each person is inter-
preted in reference to a relevant standardization sample (Peterson, Kolen, & Hoover,
1989). Turning to the measurement of attitudes, instruments are designed to measure the
intensity (i.e., the strength of a person’s feeling), direction (i.e., the positive, neutral, or
negative polarity of a person’s feeling), and target (i.e., the object or behavior with which
the feeling is associated; Gable & Wolfe, 1998). Tests or instruments may be used to
quantify the variability between people (i.e., interindividual differences) at a single point
in time or longitudinally (i.e., how a person’s attitude changes over time). Tests and other
measurement devices vary according to their technical quality. The technical quality of
a test is related to the evidence that verifies that the test is measuring what it is intended
to measure in a consistent manner. The science of evaluating the characteristics of tests
designed to measure psychological attributes of people is known as psychometrics.

Science is defined here as a systematic framework that allows us to establish and organize
knowledge in a way that provides testable explanations and predictions about psycho-
logical measurement and testing.

1.4 Origin of Psychometrics

Charles Darwin’s On the Origin of Species (1859) advanced the theory that chance varia-
tions in species would facilitate selection or rejection by nature. Such chance variations
manifested themselves as individual differences. Specifically, Darwin was likely respon-
sible for the beginning of interest in the study of individual differences, as is seen in the
following quote from Origin of Species:

The many slight differences which appear in the offspring from the same parents . . . may be
called individual differences. . . . These individual differences are of the highest importance . . .
for they afford materials for natural selection to act on. (p. 125)

As a result of the interest in Darwin’s work, Francis Galton (1822–1911), Darwin’s half-cousin, contributed to measurement in genetics and heredity (Forrest, 1974). Galton
focused on the study of individual differences among people and the role genetics and
heredity played in these differences. Galton’s two most influential works were Heredi-
tary Genius (1869) and Inquiries into Human Faculty and Its Development (1883). The
second publication was largely about individual differences in mental faculties and is
credited as beginning the mental test movement (Boring, 1950). Through these works,
Galton became an influential contributor in the field of measurement (Forrest, 1974).
In fact, Galton’s conviction about measurement was so strong that he believed that
anything was measurable—including personality, beauty, efficacy of prayer, and even
the boringness of lectures.
Galton’s goal with regard to measurement was to “classify people according to their
natural gifts” (Forrest, 1974, p. 1) and to “ascertain their deviation from average” (Forrest,
1974, p. 11). For example, in 1884 at the International Health Exhibition in London,
Galton used his anthropometric and psychometric laboratory to measure a variety of
human physical and sensory characteristics. These characteristics included memory, dis-
crimination of color, steadiness of hand, strength, height (standing), height (sitting),
respiratory vital capacity, weight, arm span, visual acuity, and visual and auditory reaction
time, to name only a few. Galton also measured psychological characteristics; he called
the measurement of psychological characteristics psychometry. During the 1880s and
1890s, Galton measured at least 17,000 individuals on a variety of anthropometric and
psychometric characteristics. Based on his work on psychological measurement (psy-
chometry), Galton is recognized as the father of modern psychometrics. For example, he
is credited with innovations in psychometrics, such as application of the normal distribu-
tion to studying the distribution of human characteristics or attributes, and he pioneered
the idea of using the correlation coefficient.

1.5 Definition of Measurement

Previously, measurement was described as being concerned with the methods used to
provide quantitative descriptions of the extent to which persons possess or exhibit cer-
tain attributes. Following this idea, measurement is the process of assigning numbers
(i.e., quantitative descriptions) to persons in an organized manner, providing a way to
represent the attributes of the persons. Numbers are assigned to persons according to a
prescribed and reproducible procedure. For example, an intelligence test yields scores
based on using the same instructions, questions, and scoring rules for each person.
Scores would not be comparable if the instructions, questions, and scoring rules were
not the same for each person. In psychological measurement, numbers are assigned in
a systematic way based on a person’s attributes. For example, a score of 100 on an intel-
ligence test for one person and a score of 115 for another yields a difference of 15 points
on the attribute being measured—performance on an intelligence test. Another example
of measurement for classification purposes is based on a person’s sex. For example, the
biological sex of one person is female and the other is male, providing a difference in the
attribute of biological sex.
Measurement theory is a branch of applied statistics that describes and evaluates
the quality of measurements (including the response process that generates specific score
patterns by persons), with the goal of improving their usefulness and accuracy. Psycho-
metricians use measurement theory to propose and evaluate methods for developing
new tests and other measurement instruments. Psychometrics is the science of evaluating
the characteristics of tests designed to measure the psychological attributes of people.

1.6 Measuring Behavior

Although our interest in this book is in psychological measurement, we begin with some
clear examples of measurement of observed properties of things in the physical world.
For example, if we want to measure the length of a steel rod or a piece of lumber, we
can use a tape measure. Things in the physical world that are not directly observable are
measured as well. Consider measurement of the composition of the air we breathe—
approximately 21% oxygen and 79% nitrogen. These two gases are invisible to the human
eye, yet devices or tests have been developed that enable us to measure the composition
of the air we breathe with a high degree of accuracy. Another example is a clock used to
measure time; time is not directly observable, but we can and do measure it daily. In psy-
chological measurement, some things we are interested in studying are directly observable
(e.g., types of body movements in relation to a certain person’s demeanor; reaction time
to a visual stimulus; or perhaps to evaluate someone’s ability to perform a task to a certain
level or standard). More often in psychological measurement, the things we are interested
in measuring are not directly observable. For example, intelligence, personality, cogni-
tive ability, attitude, and reading ability are unobservable things upon which people vary
(i.e., they individually differ). We label these unobservable things as constructs. These

unobservable things (i.e., constructs) are intangible and not concrete, although the people
we are measuring are very real. In this case, we call the variable an intellectual construct.
The quantitative reasoning test under the construct of fluid intelligence (see Table 1.2) is
a variable because people’s scores vary on the test.
In this book we use the construct of intelligence to illustrate the application of psy-
chometric methods to real data. The construct of intelligence is unobservable, so how can
we measure it? Although a number of theories of intelligence have been forwarded over
time, in this book we use a model based on the multifactor form of the general theory
of intelligence (GfGc theory; Horn, 1998), which includes fluid and crystallized compo-
nents of intelligence and a short-term memory component. Why use G or GfGc theory of
intelligence versus one of the other theories? First, psychometric methods and the theory
and measurement of intelligence share a long, rich history (i.e., over a century). Second,
the G theory of intelligence, and variations of it such as GfGc and other multiple-factor
models, boast a substantial research base (in terms of quantity and quality). The research
base on the theory of general intelligence verifies that any given sample of people pos-
sesses varying degrees of ability on cognitively demanding tasks. For example, if a person
excels at cognitively challenging tasks, we say that he or she has an above-average level of
general intelligence (Flynn, 2007). Furthermore, empirical research has established that
the cognitive components of G theory are correlated (Flanagan, McGrew, & Ortiz, 2000).
For instance, people measured according to G theory have patterns of (1) large vocabu-
laries, (2) large funds of general information, and (3) good arithmetic skills. The use of
G theory throughout this book is in no way intended to diminish the legitimacy of other models or theories of intelligence, such as those describing people who exhibit an exceptional level of musical ability (i.e., musical G) or a high level of kindness, generosity, or tolerance (i.e., moral G; Flynn, 2007). Rather, use of G theory ideally provides a data
structure that enhances moving from measurement concepts to psychometric techniques
to application and interpretation.
Two components of G theory are crystallized and fluid intelligence (i.e., GfGc denotes
the fluid and crystallized components of G theory). To measure each component, we
use measurements of behavior that reflect certain attributes of intelligence as posited by
G theory. Specifically, we make inferences to the unobservable construct of intelligence
based on the responses to test items on several components of the theory. Table 1.2 pro-
vides each subtest that constitutes three components of the general theory of intelligence:
crystallized and fluid intelligence and short-term memory.
In Table 1.2, three components of the theory of general intelligence—fluid (Gf),
crystallized (Gc), and short-term memory (Gsm)—are used in examples throughout the
book to provide connections between a theoretical model and actual data. The related
dataset includes a randomly generated set of item responses based on a sample size N =
1,000 persons. The data file is available in SPSS (GfGc.sav), SAS (GfGc.sd7), or delimited
file (GfGc.dat) formats and is downloadable from the companion website (www.guilford.
com/price2-materials).
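
For readers who want to follow along in open-source software as well, the following is a minimal sketch (not from the book, whose own examples use SPSS and SAS syntax provided in the Appendix) of loading the companion data with Python's pandas library. The file names come from the text; the delimited layout and everything else about the file contents are assumptions.

```python
# Minimal sketch: reading the companion GfGc dataset with pandas.
# File names (GfGc.sav, GfGc.dat) are from the text; reading the .sav file
# requires the optional pyreadstat package, and a whitespace-delimited layout
# with a header row for GfGc.dat is an assumption.
import pandas as pd

gfgc = pd.read_spss("GfGc.sav")               # SPSS version of the file
# gfgc = pd.read_csv("GfGc.dat", sep=r"\s+")  # or the delimited version

print(gfgc.shape)  # expected to show 1,000 rows (persons) by the number of variables
```
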
In GfGc theory, fluid intelligence is operationalized as process oriented and crystal-
lized intelligence as knowledge or content oriented. Short-term memory is composed of

TABLE 1.2.  Subtests in the GfGc Dataset

Name of subtest                                   Test                              Number of items   Scoring

Fluid intelligence (Gf)
  Quantitative reasoning—sequential               Fluid intelligence test 1         10                0/1/2
  Quantitative reasoning—abstract                 Fluid intelligence test 2         20                0/1
  Quantitative reasoning—induction and deduction  Fluid intelligence test 3         20                0/1

Crystallized intelligence (Gc)
  Language development                            Crystallized intelligence test 1  25                0/1/2
  Lexical knowledge                               Crystallized intelligence test 2  25                0/1
  Listening ability                               Crystallized intelligence test 3  15                0/1/2
  Communication ability                           Crystallized intelligence test 4  15                0/1/2

Short-term memory (Gsm)
  Recall memory                                   Short-term memory test 1          20                0/1/2
  Auditory learning                               Short-term memory test 2          10                0/1/2/3
  Arithmetic                                      Short-term memory test 3          15                0/1

Note. Scaling key: 0 = no points awarded; 1 = 1 point awarded; 2 = 2 points awarded; 3 = 3 points awarded. Sample size is N = 1,000.

recall of information, auditory processing, and mathematical knowledge (see Table 1.2).
In Figure 1.1, GfGc theory is illustrated as a model, with the small rectangles on the far
right representing individual test items. The individual test items are summed to create
linear composite scores represented as the second larger set of rectangles. The ovals in
the diagram represent latent constructs as measured by the second- and first-level observed
variables. Table 1.2 provides an overview of the subtests, level of measurement, and
descriptions of the variables for a sample of 1,000 persons or examinees in Figure 1.1.
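
To make the idea of an equally weighted linear composite concrete, here is a minimal sketch (not from the book) that sums item scores into subtest composites and then into a broad Gf composite. The item counts follow Table 1.2, but the column names are hypothetical placeholders; the actual variable names live in the companion data files.

```python
# Minimal sketch: forming equally weighted composites by summing items.
# Item counts follow Table 1.2; the column names (e.g., "fi1_item01") are
# hypothetical and would need to match the companion data files.
import pandas as pd

gfgc = pd.read_csv("GfGc.dat", sep=r"\s+")  # delimited version of the dataset

fluid_items = {
    "fluid_test1": [f"fi1_item{i:02d}" for i in range(1, 11)],  # 10 items scored 0/1/2
    "fluid_test2": [f"fi2_item{i:02d}" for i in range(1, 21)],  # 20 items scored 0/1
    "fluid_test3": [f"fi3_item{i:02d}" for i in range(1, 21)],  # 20 items scored 0/1
}

# Sum each subtest's items into a subtest composite score
for subtest, items in fluid_items.items():
    gfgc[subtest] = gfgc[items].sum(axis=1)

# Sum the three subtest composites into a broad fluid-intelligence (Gf) composite
gfgc["gf_composite"] = gfgc[list(fluid_items)].sum(axis=1)
```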

1.7 Psychometrics and Its Importance to Research and Practice

As previously noted, psychological measurement and testing affect people from all walks
of life. Psychological measurement also plays an important role in research studies of
all types—applied and theoretical. The role measurement plays in the integrity of a research study cannot be overstated. For this reason, understanding psychological
measurement is essential to your being able to evaluate the integrity and/or usefulness of
scores obtained from tests and other instruments. If you are reading this book, you may
be enrolled in a graduate program in school or clinical psychology that will involve you
making decisions based on scores obtained from a test, personality inventory, or other

[Figure: path diagram of the general theory of intelligence. General intelligence (G) is linked to three second-level factors: fluid intelligence (Gf), crystallized intelligence (Gc), and short-term memory (Stm). Gf is measured by fluid intelligence tests 1–3 (items 1–10, 1–20, and 1–20), Gc by crystallized intelligence tests 1–4 (items 1–25, 1–25, 1–15, and 1–15), and Stm by short-term memory tests 1–3 (items 1–20, 1–10, and 1–15); each subtest is in turn measured by its individual items.]

Figure 1.1.  General theory of intelligence.



form of behavioral assessment. In these instances, measurement information is used in a way that directly affects people’s lives. To this end, you have a responsibility to acquire
a solid understanding of psychological measurement. Without a clear understanding of
psychological measurement, harm may come to patients, students, clients, and employ-
ees. However, when used appropriately, tests benefit test takers and users alike.
If you plan on conducting research on a regular basis (e.g., to have a career in behav-
ioral or psychological research), the information in this book will help you in your own
research and enable you to become literate in psychological measurement. Regardless of
the type of quantitative research you conduct, measurement is central to it. Consider the
following elements of conducting research (Crocker & Algina, 1986, p. 11):

1. Formulating a research question and hypothesis.


2. Specifying operational definitions for each variable in the hypothesis by deter-
mining how it should be controlled or measured during the study.
3. Developing or selecting the instruments and procedures to be used.
4. Testing the accuracy and sensitivity of the instruments and procedures to be
used.
5. Collecting the experimental data within the framework of an experimental
design that will permit the original question to be answered.
6. Summarizing the data mathematically and, when appropriate, conducting sta-
tistical tests to determine the likelihood that the observed results were due to
chance.

Psychological measurement (and test theory more specifically) has the most rel-
evance for points 2 through 4 above. However, the process of measurement must be
considered from the outset because the outcomes of the study are directly related to how
they are measured.

1.8 Organization of This Book

Psychological measurement and psychometrics form an extensive field with applied and theoretical components. This book is not targeted to those interested in intermediate or
advanced measurement theory. For an intermediate to advanced book focusing on mea-
surement theory I recommend Measurement, Judgment, and Decision Making, edited by
Michael Birnbaum (1998).
The organization of this book is as follows. In Chapter 2, measurement and sta-
tistical concepts are presented as a foundation for the remainder of the material in this
book. Chapter 3 introduces validity—arguably the most important property of scores
produced by a test. In Chapter 4, statistical aspects of the validation process are pre-
sented with a focus on statistical techniques for group classification and considerations

for establishing evidence of content validity. The final section of the chapter covers tech-
niques for establishing evidence of construct validity. Chapter 5 introduces scaling and
the fundamental role it plays in psychometrics. In Chapter 6, guidelines for test and
instrument development are introduced along with methods for evaluating the quality
of test items. Chapter 7 presents score reliability within the classical test theory (CTT)
framework. Chapter 8 introduces generalizability theory as an extension of the CTT
model for estimating the reliability of scores based on the scenario in which raters or
judges score persons. In Chapter 9, factor analysis is presented as an important tool for
studying the underlying structure of a test. Connections are made to the process of con-
struct validation (Chapter 4). Chapter 10 introduces item response theory and advanced
test theory, which are very useful for modeling a person’s true score (a.k.a. latent trait)
based on patterns of responses to test questions. The final chapter (11) covers the devel-
opment of norms and test equating. Examples of how standard scores (norms) are devel-
oped are provided, along with their utility in measurement and testing. The chapter
ends with an introduction to test score equating based on the linear, equipercentile, and item response theory true score methods. Example applications are provided using three
equating designs. Now we turn to Chapter 2, on measurement and statistical concepts, to
provide a foundation for the material presented in subsequent chapters.

Key Terms and Definitions

Classification. Concerned with the measurement methods used to assign persons to one
or another of two or more different categories or classes.
Composite score. A score created by summing the individual items on a test. Composite
scores may be equally weighted or unequally weighted.
Constructs. Unobservable things that are intangible and not concrete. For example, intel-
ligence is known as an intellectual construct.
Criterion-referenced test. Used to determine where persons stand with respect to highly
specific educational objectives.
Francis Galton. Known as the father of psychometrics due to his work in measurement of
human anthropometrics, differentiation, and abilities.
Measurement. The process of assigning numbers (i.e., quantitative descriptions) to per-
sons in an organized manner, providing a way to represent the attributes of the
persons.
Measurement theory. A branch of applied statistics that describes and evaluates the
quality of measurements with the goal of improving their usefulness and accuracy.
Norm-referenced test. A test where the performance of each person is interpreted in
reference to a well-defined standardization sample.
Psychological test. A device for acquiring a sample of behavior from a person.

Psychometricians. Persons trained in measurement theory aimed toward psychological measurement; they propose and evaluate methods for developing new tests and other
measurement instruments.
Psychometrics. The science of evaluating the characteristics of tests designed to measure
psychological attributes of people.
Psychometry. The measurement of psychological characteristics.

Variable. Characteristics or qualities in which persons differ among themselves. The char-
acteristics or qualities are represented numerically. For example, a test score is a
variable because people often differ in their scores.
2

Measurement and Statistical Concepts

This chapter presents measurement and statistical concepts essential to understanding the
theory and practice of psychometrics. The properties of numbers are described, with an
explanation of how they are related to measurement. Techniques for organizing, summariz-
ing, and graphing distributions of variables are presented. The standard normal distribu-
tion is introduced, along with the role it plays in psychometrics and statistics in general.
Finally, correlation and regression are introduced, with connections provided relative to the
fundamental role each plays in the study of variability and individual differences.

2.1 Introduction

We begin our study of psychometrics by focusing on the properties of numbers and how
these properties work together with four levels of measurement. The four levels of mea-
surement provide a clear guide regarding how we measure psychological attributes. For
the more mathematically inclined or for those who want a more in-depth treatment of the
material in this chapter, see the Appendix. Reviewing the Appendix is useful in extend-
ing or refreshing your knowledge and understanding of statistics and psychometrics.
The Appendix also provides important connections between psychometrics and statistics
beyond the material provided in this chapter. Source code from SPSS and SAS is included
in the Appendix to carry out analyses.

2.2 Numbers and Measurement

Measurement is the process of assigning numerals (a.k.a. numbers) to observations. This is not done arbitrarily but in a way that the numbers are meaningful. Numbers are


[Figure: a real number line marked from –6.5 to 6.5 in half-unit and unit steps, aligned with an IQ score scale running from 60 to 140.]

FIGURE 2.1.  Real number line and intelligence test score.

treated differently depending on their level or scale of measurement. They are used in
psychological measurement in two fundamental ways. First, numbers can be used to
categorize people. For example, for biological sex, the number “1” can be assigned to
reflect females and the number “2” males. Alternatively, the response to a survey ques-
tion may yield a categorical response (e.g., a person answers “Yes,” “No,” “Maybe,” or
“Won’t Say”). In the previous examples, there is no ordering, only categorization. A
second way numbers are useful to us in psychological measurement is to establish order
among people. For example, people can be ordered according to the amount or level of
psychological attribute they possess (e.g., the number “1” may represent a low level of
anxiety, and the number 5 may represent a high level of anxiety). However, in the order
property the size of the units between the score points is not assumed to be equal (e.g.,
the distance between 1 and 2 and the distance between 2 and 3 on a 5-point response
scale are not necessarily equal).
When we use real numbers, we enhance our ability to measure attributes by defin-
ing the basic size of the unit of measurement for a test. Real numbers are also continuous
because they represent any quantity along a number line (Figure 2.1). Because they lie
on a number line, their size can be compared. Real numbers can be positive or negative
and have decimal places after the point (e.g., 3.45, 10.75, or –25.12). To this end, a real
number represents an amount of something in precise units. For example, if a person
scores 100 on a test of general intelligence and another person scores 130, the two people
are precisely 30 IQ points apart (Figure 2.1).
A final point about our example of real number data expressed as a continuous
variable is that in Figure 2.1, although there are intermediate values between the whole
numbers, it is only the whole numbers that are used in analyses and reported.

2.3 Properties of Measurement in Relation to Numbers

Our understanding of scores produced from psychological measurement (e.g., from a test) is based on three properties of numbers and how we treat the number zero. For
example, a particular level of measurement (see Table 2.1) is determined by the pres-
ence or absence of the following four properties: (1) identity, (2) order and quantity,
(3) equality of intervals, and (4) absolute zero. In Table 2.1, we see that measurement
occurs at four levels: nominal, ordinal, interval, and ratio (Stevens, 1951b). Each level

Table 2.1.  Levels of Measurement and Their Applications

Nominal
  Statistics: Numbers of classes (classification), mode.
  Examples: Qualitative: sex, hair color, distinguishing labels/categories.
  Adaptations/recommendations: Scaling approaches and subsequent statistical procedures that are applicable to categorical (non-quantitative) or noninterval levels only (i.e., nonparametric).
  Practical recommendations: Cross-sectional and longitudinal categorical hierarchical scaling and modeling, Latent Class Analysis, Loglinear, Multi-way Contingency Table analytics and effect sizes, Classification and Discrimination analytic approaches.

Ordinal
  Statistics: Median, percentiles, order statistics.
  Examples: Quantitative: class rank, hardness of minerals, order of finish in a competitive running race.
  Adaptations/recommendations: Scaling approaches and statistical procedures (parametric or nonparametric) that are applicable to either quantitative or ordered categorical levels of measurement.
  Practical recommendations: Consideration of the shape of the distribution of the data acquired in consideration of the population of interest; plays a crucial role in whether or not to apply interval-level properties to scales that are somewhere in between the two.

Interval
  Statistics: Equality of intervals of scores along the score continuum.
  Examples: Quantitative: temperature (Celsius), standardized test scores.
  Adaptations/recommendations: Scaling approaches and statistical procedures (parametric or nonparametric) that are applicable to interval-level, quantitative, or ordered categorical levels of measurement.
  Practical recommendations: Consideration of the shape of the distribution of the data acquired in consideration of the population of interest; plays a crucial role in whether or not to apply interval-level properties to scales that are somewhere in between ordinal and interval.

Ratio
  Statistics: Equality of ratios.
  Examples: Quantitative: temperature (Kelvin).
  Adaptations/recommendations: Scaling approaches and subsequent statistical procedures (parametric) that are applicable to either ratio-level or quantitative levels of measurement.
  Practical recommendations: Consideration of the shape of the distribution of the data acquired in consideration of the population of interest; plays a crucial role in whether or not to apply interval-level properties to scales that are somewhere in between the two.

Note. Adapted from Stevens (1951b). Copyright 1951 by Wiley. Adapted by permission.

of measurement includes criteria or rules for how numbers are assigned to persons in
relation to the attribute being measured. Also, the different levels of measurement convey
different amounts of information.
Harvard University psychologist S. S. Stevens conducted the most extensive experi-
mentation on the properties and systems of measurement. Stevens’s work produced a useful
definition of measurement and levels of measurement that are currently the most widely
used in the social and behavioral sciences. Stevens defines measurement as “the assign-
ment of numerals to objects or events according to rules” (1951b, p. 22). Stevens does not
mention the property of the numbers (i.e., identity, order, equal intervals, absolute zero);
instead, his definition states that numbers are assigned to objects or events according to rules.
However, it is the rules that provide the operational link between the properties of numbers
and the rules for their assignment in the Stevens tradition. Figure 2.2 illustrates the link
between the properties of numbers and the rules for their assignment.
To illustrate the connection between Stevens’s work on numbers and the properties
of numerical systems, we begin with the property of identity. The property of identity
allows us to detect the similarity or differentness among people. We can consolidate these
contrasting terms into “distinctiveness.” The most basic level of measurement (nominal)
allows us to differentiate among categories of people according to their distinctiveness
(e.g., for two persons being measured, one is female and the other is male, or one person
has red hair and the other blonde hair). Notice that in the examples of the identity prop-
erty in combination with the nominal level of measurement, no ordering exists; there is
only classification on the basis of the distinctiveness of the attribute being measured (see
Table 2.1). As we see in Figure 2.3, when only the identity property exists in the mea-
surement process, the level of measurement is nominal. Another example of the identity
property and the nominal level of measurement is provided in Figure 2.4, where a person

[Figure: levels of measurement and their associated properties. Nominal: identity. Ordinal: identity, order. Interval: identity, order, equal intervals. Ratio: identity, order, equal intervals, absolute zero.]

FIGURE 2.2.  Properties and levels of measurement.



I need to wash my hands five times before I can eat.

     Yes          No          Maybe          Won’t Say

FIGURE 2.3.  Item displaying nominal measurement.

[Figure: the four response options (“Yes,” “No,” “Maybe,” “Won’t say”) arranged with no ordering implied.]

FIGURE 2.4.  Graphic illustrating no order in response alternatives. From de Ayala (2009, p. 239). Copyright 2009 by The Guilford Press. Reprinted by permission.

responds to a survey question by selecting one of four options. The options are discrete
categories (i.e., only identity is established, not order of any kind).
Next, if two persons share a common attribute, but one person has more of the attri-
bute than the other, then the property of order is established (i.e., the ordinal level of
measurement). Previously, in the nominal level of measurement, only identity or dis-
tinctiveness was a necessary property reflected by the numbers. However, in the ordinal
level of measurement, the properties of identity and quantity must exist. Figure 2.5 illus-
trates an ordinal scale designed to measure anxiety that captures the properties of identity
and order. On the scale in the figure, the number 1 identifies the lowest level of anxiety
expressed by the qualitative descriptor “never,” and the number 5 identifies the highest
level of anxiety expressed by the qualitative descriptor “always.”
Before continuing with properties of numbers and measurement levels, the following
section provides important information related to the quantity property of measurement
and its relationship to units of measurement.

I enjoy being in large group social settings.

1 = Never     2 = Rarely     3 = Sometimes     4 = Usually     5 = Always

FIGURE 2.5.  Item displaying ordinal level of measurement.



Units of Measurement
The property of quantity requires that units of measurement be specifically defined. We are
familiar with how things are measured according to units in physical measurement. For
example, if you want to measure the length of a wall, you use a tape measure marked in inches
or centimeters. The length of the wall is measured by counting the number of units from one
end of the wall to the other. Consider the psychological attribute of intelligence—something
not physically observable. How can we measure intelligence (e.g., what are the units we can
use, and what do these units actually represent)? For example, the units are the responses
to a set of questions included on a test of verbal intelligence, but how sure are we that the
responses to the questions actually represent intelligence? Based on these ideas, you begin to understand that the measurement of attributes that are not directly observable (in a physical sense) presents one of the greatest challenges in psychometrics.

Defining Units of Measurement


In measuring physical objects (e.g., a wall or table), standard measures such as pounds or centi-
meters are used. Standard measures are useful for three reasons. First, the units of measurement
were originally somewhat subjectively assigned and then they became a working or common
standard. Second, standard measures are general enough to apply broadly to other objects
beyond walls and tables. Third, units of measurement can be used to measure different features
of objects (e.g., the weight and length of a board or the weight and volume of a bag of sand).
Units of measurement in psychological measurement (e.g., intelligence test scores)
are only applicable to the first point above. For example, the units of measurement used
in intelligence testing were/are subjectively or arbitrarily determined regarding their size,
but they are linked to specific dimensions of intelligence (e.g., verbal intelligence or
quantitative reasoning). Thus, sometimes we talk in terms of an “intelligence score met-
ric.” Finally, one example of a type of measurement specific to psychological processes
that meets the criteria of a standard unit or score is reaction time. For example, since time
is measured in well-established standard units, it can be used to measure more than one
type of psychological process manifested by a person’s reaction to a stimulus.
Building on the information about units of measurement and the quantity prop-
erty, ordinal measurement is introduced next. In the ordinal level of measurement, larger
numbers represent a greater level of an attribute (anxiety in our example), but equal
intervals between the numbers on the scale are not assumed. Finally, there is no absolute
zero on an ordinal scale (e.g., the nature of the scale and the assumptions underlying the way it is constructed with numbers do not allow one to verify that there is no amount of an attribute).
The third property, equality of score intervals, exists if equal differences between the measurements represent the same amount of the attribute being measured. Consider the example of two persons with intelligence (IQ) scores of 100 and 120 (see Figure 2.1). The property of equality of intervals (i.e., equal units) is met if the distance between 100 and 120 has the same meaning as the distance between 80 and 100. That is, for the equal interval property to hold, a 20-point difference must represent the same amount of the attribute at different points along the score scale. Finally, notice that when equal intervals exist, the property of order is also met.

Clarifying the Difference between Ordinal and Interval Levels of Measurement

The difference between the ordinal and interval levels of measurement can be seen in Figure 2.6. In the figure, length measured using a ruler representing real numbers on a number line is compared to length measured using only ranks based on whole numbers (not real numbers). Because the lengths of the bars in Figure 2.6 are measured in centimeters, direct comparisons can be made between the bars in those units. However, when the same bars are measured using only ranks, the only type of statement that can be made is that "bar A" is shorter (or has less length or amount of something) than "bar B."
The fourth property, absolute zero, concerns the meaning and use of the number zero. For example, if a person's score on our test of intelligence is zero, the implication is that there is an absence of the attribute being measured. However, understanding what the number zero means can be confusing in psychological measurement. Specifically, the number zero may be expressed in absolute terms or relative terms. For example, absolute zero can occur on a test of visual perception when a person records zero errors during the test. In this

Length measured using a ruler: 1, 2, 3, 6 (centimeters). Length measured using ranks: 1, 2, 3, 4 (whole numbers indicating ordered categories).

FIGURE 2.6.  Length measured using two different measurement rules. Adapted from Glenberg and Andrzejewski (2008, p. 11). Copyright 2008 by Lawrence Erlbaum Associates. Adapted by permission. Application: The lengths of the bars can be directly compared by using, say, centimeters. However, when the bars are measured using only ranks, we can only say that "bar A" is shorter than "bar B." In psychological measurement, we might say that a person ranked according to "bar A" has "less" of some attribute than a person ranked according to "bar B."

Three temperatures are marked on parallel number lines for the Celsius scale (0°, 50°, 100°) and the Kelvin scale, with absolute zero appearing only on the Kelvin scale.

FIGURE 2.7.  Three temperatures represented on the Celsius and Kelvin scales. From King and Minium (2003). Copyright 2003 by Wiley. Reprinted by permission. Applications: The zero point on the Celsius scale does not actually reflect a true absence of temperature (i.e., a measurement of zero degrees on the Celsius scale corresponds to approximately 273° on the Kelvin scale). However, the difference between 0° and 50° Celsius reflects the same distance as the difference between 273° and 323° Kelvin. So, the Kelvin and Celsius scales both exhibit the property of an interval scale, but only the Kelvin scale displays the property of absolute zero.

case, absolute zero has meaning because of the psychophysical properties of visual perception
and how a person responds. The previous example was clear in part because the thing being
measured (sensory reaction to a visual stimulus) was directly observable and zero had an
absolute meaning. Using another example we are all familiar with, Figure 2.7 illustrates how
absolute and relative meanings of zero are used in the measurement of temperature.
Now we turn to the unobservable construct of intelligence for an example of the
meaning of zero being relative. Consider the case where a person scores zero on an intel-
ligence test. Does this mean that the person has a complete absence of intelligence (i.e.,
according to the absolute definition of zero)? The previous interpretation is likely untrue
since the person probably has some amount of intelligence. The point to understand is
that a score of zero is relative in this case; that is, the score is relative to the specific type of
intelligence information the test was designed to measure (i.e., according to a particular
theory of intelligence). This example does not mean that the same person would score
zero on a different test of intelligence that is based on a different theory.

2.4 Levels of Measurement

The levels of measurement proposed by S. S. Stevens (1946) that are widely used today are
nominal, ordinal, interval, and ratio. Notice that one can apply the previously mentioned
kinds of measurement in relation to Stevens’s levels of measurement for a comprehensive mea-
surement scheme. The defining elements, along with some commonly accepted conventions
or practical applications of Stevens’s levels of measurement, are presented in Table 2.1.

Nominal
The nominal scale represents the most unrestricted assignment of numerals to objects.
That is, numbers are used simply to label or classify objects. The appropriate statistic
to use with this scale is the number or frequency of “cases.” For example, the number
of cases may represent the number of students within a particular teacher’s class. Such
counts may be graphically displayed using bar graphs representing frequency counts of
students within the class or ethnic groups within a defined geographical region of a coun-
try. For example, a numerical coding scheme organized according to the nominal level of
measurement may be biological sex of female = 1 and male = 2.
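Because only classification is permitted at the nominal level, counting the cases in each category is the natural summary. The short Python sketch below is an added illustration (the book's worked computer examples use SPSS syntax); the variable and its coding scheme are hypothetical.

from collections import Counter

# Hypothetical nominal variable: biological sex coded 1 = female, 2 = male
sex_codes = [1, 2, 2, 1, 1, 1, 2, 1, 2, 1]

# Frequency ("number of cases") in each category, the appropriate statistic for nominal data
frequencies = Counter(sex_codes)
labels = {1: "female", 2: "male"}

for code, count in sorted(frequencies.items()):
    print(f"{labels[code]} (code {code}): {count} cases")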

Ordinal
The ordinal scale is derived from the rank ordering of scores. The scores or numbers in
an ordinal scale are not assumed to be real numbers as previously defined (i.e., there are no
equally spaced units of measurement between each whole number on the scale—more on
this later). Examples in the behavioral sciences include using a Likert-type scale to measure
attitude or a rating scale to measure a teacher’s performance in the classroom. Examples of
other constructs often measured on an ordinal level include depression, ability, aptitude,
personality traits, and preference. Strictly speaking, the permissible descriptive statistics to use with ordinal scales do not include the mean and standard deviation, because these statistics mathematically imply more than mere rank ordering of objects. Formally, use of the mean and standard deviation implies that the intervals between successive integers representing individuals' standing on the latent trait are mathematically equal. However, if the empirical data are approximately normally distributed and the number of scale points exceeds four, treating ordinal data as interval produces statistically similar, if not identical, results. Ultimately, researchers should be able to defend their actions (and any conclusions they draw from them) mathematically, philosophically, and psychometrically.

Interval
The interval scale represents a scale whose measurements possess the characteristic of
“equality of intervals” between measurement points. For example, on temperature scales,
equal intervals of temperature are derived by noting equal volumes of gas expansion. An
arbitrary or relative zero point is established for a particular scale (i.e., Celsius or Fahr-
enheit), and the scale remains invariant when a constant is added. The intelligence test
scores used throughout this book are based on an interval level of measurement.

Ratio
The ratio scale represents a scale whose measurements possess the characteristic of
“equality of intervals” between measurement points. For example, on temperature scales,
equal intervals of temperature are derived by noting equal volumes of gas expansion. An
absolute zero point exists for a particular scale (i.e., temperature measured in Kelvin),
and the scale remains invariant when a constant is added. Ratio scales are uncommon
in psychological measurement because the complete absence of an attribute, expressed
as absolute zero, is uncommon. However, a ratio scale may be used in psychophysical
measurement when the scale is designed to measure response to a visual stimulus or
auditory stimulus. In this case, a measurement of zero will have a clear meaning.

2.5 Contemporary View on the Levels of Measurement and Scaling

Based on the evolution of measurement and scaling over the past half-century, Brennan
(1998) revisited Stevens’s (1946, 1951b) framework and provided a revised interpretation
of scaling and the levels of measurement to reflect what has been learned through practice.
Scaling is defined as “the mathematical techniques used for determining what numbers
should be used to represent different amounts of a property or attribute being measured”
(Allen & Yen, 1979, p. 179). Broadly speaking, Brennan argues that scaling is assumed by
many to be a purely objective activity when in reality it is subjective, involving value-laden
assumptions (i.e., scaling does not occur in a “psychometric vacuum”). These value-laden
assumptions have implications for the validity of test scores, a topic covered in Chapter
3. Brennan states that the rules of measurement and scaling methodology are inextricably
linked such that “the rules of measurement are generally chosen through the choice of
a scaling methodology” (1998, p. 8). Brennan maintains that measurement is not an end
unto itself, but rather is a means to an end—the end being sound decisions about what it is
that we are measuring (e.g., intelligence, student learning, personality, proficiency). Based
on Brennan’s ideas, we see that psychometrics involves both subjective and objective rea-
soning and thought processes (i.e., it is not a purely objective endeavor).

2.6 Statistical Foundations for Psychometrics

At the heart of the measurement of individuals is the concept of variability. For exam-
ple, people are different or vary on psychological attributes or constructs such as intel-
ligence, personality, or memory. Because of variability, in order to learn anything from
data acquired through measurement, the data must be organized. Descriptive statistical
techniques exist as a branch of statistical methods used to organize and describe data.
Descriptive statistical techniques include ways to (1) order and group scores into distribu-
tions that describe observations/scores, (2) calculate a single number that summarizes a
set of observations/scores, and (3) represent observations/scores graphically. Descriptive
statistical techniques can be applied to samples and populations, although most often
they are applied to samples from populations. Inferential statistical techniques are used
to make educated guesses (inferences) about populations based on random samples from
the populations. Inferential statistical techniques are the most powerful methods available
to statisticians and psychometricians. The following sections of the chapter provide a review of basic descriptive statistical techniques useful to psychometrics. Additionally, the
correlation, covariance, and simple linear regression are introduced. Readers with a sound
understanding of applied statistics may wish to skip this chapter. Alternatively, readers who
want more depth on the material presented in this section should see the Appendix of this book
for a more rigorous treatment of the material in this chapter.

2.7 Variables, Frequency Distributions, and Scores

Measurements acquired on a variable or variables are part of the data collection process.
Naturally, these measurements will differ from one another. A variable refers to a property
whereby members of a group differ from one another (i.e., measurements change from one
person to another). A constant refers to property whereby members of a group do not dif-
fer from one another (e.g., all persons in a study or taking an examination are female; thus,
biological sex is constant). Variables are defined as quantitative or qualitative and are related to
the levels of measurement presented in Tables 2.1 and 2.2. Additionally, quantitative variables
may be discrete or continuous. A discrete variable can take specific values only. For example,
the values obtained in rolling a die are 1, 2, 3, 4, 5, or 6. No intermediate or in-between values
are possible. Although the underlying variable measurements (the numbers observed in the
die-rolling example) may be theoretically continuous, all sets of real or empirical data in the
die example are discrete. A continuous variable may take any values within a defined range
of values. The possible range of values belongs to a continuous series. For example, between any
two values of the variable, an infinitely large number of in-between values may occur (e.g.,

Table 2.2.  Subtests in the GfGc Dataset

Name of subtest                                    Test                               Number of items   Scoring
Fluid intelligence (Gf)
  Quantitative reasoning—sequential                Fluid intelligence test 1                 10          0/1/2
  Quantitative reasoning—abstract                  Fluid intelligence test 2                 20          0/1
  Quantitative reasoning—induction and deduction   Fluid intelligence test 3                 20          0/1
Crystallized intelligence (Gc)
  Language development/vocabulary                  Crystallized intelligence test 1          25          0/1/2
  Lexical knowledge                                Crystallized intelligence test 2          25          0/1
  Listening ability                                Crystallized intelligence test 3          15          0/1/2
  Communication ability                            Crystallized intelligence test 4          15          0/1/2
Short-term memory (Gsm)
  Recall memory                                    Short-term memory test 1                  20          0/1/2
  Auditory learning                                Short-term memory test 2                  10          0/1/2/3
  Arithmetic                                       Short-term memory test 3                  15          0/1
Note. Scaling key: 0 = no points awarded; 1 = 1 point awarded; 2 = 2 points awarded; 3 = 3 points awarded. Sample size is N = 1,000.

weight, chronological time, height). In this book, the data used in examples are based on
discrete variables that are scores for a finite sample of 1,000 persons on an intelligence test.

Frequency Distributions
To introduce frequency distributions, suppose you are working on a study examining
the correlates of crystallized and fluid intelligence. As a first step, you want to know
how a group of individuals performed on the language development (vocabulary) subtest
of crystallized intelligence. The vocabulary subtest is one of four subtests comprising
crystallized intelligence in the GfGc dataset used throughout this book. Table 2.2 (intro-
duced in Chapter 1) provides the subtests in the GfGc dataset used throughout this book
(the shaded row is the language development/vocabulary test). We see that this subtest
is composed of 25 items scored as 0 = no credit, 1 = 1 point, 2 = 2 points. The scores/
points on each of the 25 items are summed for each person to create a total score on the
language development subtest for each person tested.
The score data for 100 persons out of the total GfGc dataset of 1,000 persons on the language development/vocabulary test are provided in Table 2.3. Before proceeding, an important note on terminology when working with data and frequency distributions is provided to help you avoid confusion. Specifically, the terms measurement, observation, and score are often used interchangeably and refer to a single value or datum in a cell.

Table 2.3.  Language Development/Vocabulary Test Scores for 100 Individuals


Person Score Person Score Person Score Person Score Person Score
1 19 23 33 45 36 67 38 89 42
2 23 24 33 46 36 68 38 90 42
3 23 25 33 47 36 69 38 91 42
4 26 26 33 48 36 70 38 92 42
5 26 27 33 49 36 71 39 93 43
6 26 28 33 50 36 72 39 94 43
7 27 29 33 51 37 73 39 95 43
8 27 30 34 52 37 74 39 96 44
9 27 31 34 53 37 75 39 97 44
10 27 32 34 54 37 76 39 98 45
11 30 33 34 55 37 77 39 99 47
12 30 34 34 56 37 78 40 100 49
13 30 35 34 57 37 79 40 — —
14 30 36 34 58 37 80 40 — —
15 30 37 34 59 37 81 40 — —
16 31 38 34 60 37 82 40 — —
17 31 39 34 61 37 83 40 — —
18 31 40 36 62 37 84 41 — —
19 31 41 36 63 38 85 41 — —
20 31 42 36 64 38 86 41 — —
21 31 43 36 65 38 87 41 — —
22 33 44 36 66 38 88 41 — —

A frequency distribution is a tabulation of the number of occurrences of each score value. Constructing a frequency distribution involves counting the number of occurrences
of each score. The sum of the frequencies in the distribution should equal the number of per-
sons in the sample (or population). As we see in Table 2.4, the sum of the frequencies is 100.
Closer inspection of Table 2.4 reveals that the frequency distribution summarizes the
scores in a way that highlights important characteristics about the scores. For example,
we see that the range of scores is 19 to 49 (sorted from low to high) and the majority of
the scores are clustered in the middle between scores of 33 and 38. Notice also in Table
2.4 that column 2 represents the number of times each score in column 1 occurs. Col-
umns 1 and 2 in the table constitute a frequency distribution in their own right. However,
using the information in columns 1 and 2, we can derive three other frequency distri-
butions: the relative frequency, the cumulative frequency, and the cumulative relative
frequency distributions.
The relative frequency of a score (see the third column in Table 2.4) is expressed
as a proportion (or percentage) and is defined as the proportion of observations (measure-
ments) in the distribution at a particular score value. In Table 2.4, the relative frequency
distribution is a listing of the relative frequencies of each X-score value (interpreted as

Table 2.4.  Frequency Distribution for 100 Individuals


Score    Frequency    Relative frequency    Cumulative frequency    Cumulative relative frequency
X        f(X)         p(X)                  cf(X)                   cp(X)
19 1 0.01 1 0.01
23 2 0.02 3 0.03
26 3 0.03 6 0.06
27 4 0.04 10 0.10
30 5 0.05 15 0.15
31 6 0.06 21 0.21
33 8 0.08 29 0.29
34 10 0.10 39 0.39
36 11 0.11 50 0.50
37 12 0.12 62 0.62
38 8 0.08 70 0.70
39 7 0.07 77 0.77
40 6 0.06 83 0.83
41 5 0.05 88 0.88
42 4 0.04 92 0.92
43 3 0.03 95 0.95
44 2 0.02 97 0.97
45 1 0.01 98 0.98
47 1 0.01 99 0.99
49 1 0.01 100 1.00
100 1.00

percentages). We see in the table that the relative frequency for a score is derived by
taking the score value’s frequency and dividing it by the total number of measurements
(e.g., 100). For example, the score 34 has a relative frequency of 0.10 (10%) because a
score of 34 occurs 10 times out of 100 observations or measurements (i.e., 10/100 = 0.10;
0.10 × 100 = 10%). Also, note that the relative frequency column in Table 2.4 sums to 1.00 (as it should, since the column consists of proportions). Relative
frequency distributions provide more information than raw frequency distributions (e.g.,
only columns 1 and 2 in Table 2.4) and are often preferable since information about the
number of measurements is included with frequency of score occurrence. In random
samples, relative frequency distributions provide another advantage. For example, using
long-run probability theory (see the Appendix for more detail), we see that the proportion
of observations at a particular score level is an estimate of the probability of a particular
score occurring in the population. For this reason, in random samples, relative frequen-
cies are treated as probabilities (e.g., the probability that a particular score will occur in
the population is the score’s relative frequency).
The fifth column in Table 2.4 is the cumulative relative frequency distribution.
This distribution is created by tabulating the relative frequencies of all measurements at
or below a particular score. Cumulative relative frequency distributions are often used
for calculating percentiles, a type of information useful in describing the location of a
person’s score relative to others in the group.
The grouped frequency distribution is another form of frequency distribution used when there is a large number of different scores and when listing and describing individual scores using the frequency distribution in Table 2.4 is less than ideal. Table 2.5
illustrates a grouped frequency distribution using the same data as in Table 2.4.
Examining the score data in Table 2.5, we are able to more clearly interpret the pat-
tern of scores. For example, we see that most of the individuals scored between 34 and
39 points on the vocabulary subtest (in fact, 48% of the people scored in this range!). We
also can easily see that 21 scores fell in the range of 34 to 36 and that this range of scores
contains the median or 50th percentile.
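The tabulations in Tables 2.4 and 2.5 can be reproduced directly from raw scores. The Python sketch below is a minimal added illustration of the logic (the book's own computing examples use SPSS); only the first ten scores from Table 2.3 are used here, so the proportions differ from the full tables.

from collections import Counter

scores = [19, 23, 23, 26, 26, 26, 27, 27, 27, 27]  # first 10 scores from Table 2.3
n = len(scores)

# Frequency, relative frequency, cumulative frequency, cumulative relative frequency
freq = Counter(scores)
cum = 0
print("X    f(X)  p(X)   cf(X)  cp(X)")
for x in sorted(freq):
    cum += freq[x]
    print(f"{x:<5}{freq[x]:<6}{freq[x]/n:<7.2f}{cum:<7}{cum/n:.2f}")

# Grouped frequency distribution with class intervals of width 3 starting at 19 (as in Table 2.5)
width = 3
grouped = Counter((x - 19) // width for x in scores)
for k in sorted(grouped):
    low = 19 + k * width
    print(f"{low}-{low + width - 1}: {grouped[k]}")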

Graphing Frequency Distributions


Graphs depict important features of distributions more clearly than do tables (e.g., as in
Tables 2.3, 2.4, or 2.5). Here we cover two types of graphs appropriate for relative fre-
quency distributions: the histogram and the frequency polygon. In a relative frequency
histogram, the heights of bars represent relative frequencies of scores (often contained
within class intervals). Another type of histogram is based on the grouped relative
frequency distribution (Figure 2.8). Characteristics of high-quality grouped frequency
distributions include (1) using between 8 and 15 intervals; (2) using class intervals of
2, 3, 5, or multiples of 5; and (3) starting the first interval at or below the first score.
The SPSS syntax for creating the frequency distribution and histogram for the data in
Table 2.5 is provided next along with the associated output table produced by SPSS.

Table 2.5.  Grouped Frequency Distribution for 100 Individuals


Class interval    Frequency    Relative frequency    Cumulative frequency    Cumulative relative frequency
19–21 1 0.01 1 0.01
22–24 2 0.02 3 0.03
25–27 7 0.07 10 0.10
28–30 5 0.05 15 0.15
31–33 14 0.14 29 0.29
34–36 21 0.21 50 0.50
37–39 27 0.27 77 0.77
40–42 15 0.15 92 0.92
43–45 6 0.06 98 0.98
46–48 1 0.01 99 0.99
49–51 1 0.01 100 1.00
Total 100 1.00    

FIGURE 2.8.  Grouped relative frequency distribution histogram for 100 individuals (from Table 2.5 data). Application: The height of each bar represents a score's relative frequency. When histograms are used for grouped frequency distributions, a bar is located over each class interval. For example, based on the data in Table 2.5, we see that the interval of scores 34 to 36 contains 21 observations or measurements, so the bar spanning that class interval reaches a height of 21 on the frequency (Y) axis.

FIGURE 2.9.  Relative frequency polygon for 100 individuals (from Table 2.4 data).

SPSS syntax for frequency distribution and histogram for data in Table 2.5

FREQUENCIES VARIABLES=Score
/HISTOGRAM
/ORDER=ANALYSIS.

In Figure 2.8, the class interval width is set at 3 points. Figure 2.9 depicts a frequency
polygon. The relative frequency polygon maps the frequency count (vertical or Y-axis) by
the score in the distribution (horizontal or X-axis). The frequency polygon differs from
the histogram in that a “dot” or single point is placed over the midpoint so that the height
of the dot represents the relative frequency of the class interval.
The adjacent dots are connected to form a continuous distribution representing the score data. The line represents a continuous variable; the height of the line at each score value represents the number of times that score occurs. For example, a score of 31 occurs 6 times in the dataset in Table 2.4, and a score of 34 occurs 10 times.

SPSS syntax for creating relative frequency polygon

GRAPH
/LINE(SIMPLE)=COUNT BY Score.
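For readers working outside SPSS, the same two graphs can be produced in Python, assuming the matplotlib package is installed; this sketch is an added illustration and is not part of the book's SPSS examples. Only the first ten scores are listed; substitute the full set of 100 scores to reproduce Figures 2.8 and 2.9.

import matplotlib.pyplot as plt
from collections import Counter

scores = [19, 23, 23, 26, 26, 26, 27, 27, 27, 27]  # substitute the full set of 100 scores

# Grouped histogram with class intervals of width 3 starting at 19 (compare Figure 2.8)
plt.figure()
plt.hist(scores, bins=range(19, 52, 3), edgecolor="black")
plt.xlabel("Score")
plt.ylabel("Frequency")
plt.title("Grouped frequency histogram")

# Frequency polygon: one point per score value at its frequency (compare Figure 2.9)
freq = Counter(scores)
xs = sorted(freq)
plt.figure()
plt.plot(xs, [freq[x] for x in xs], marker="o")
plt.xlabel("Score")
plt.ylabel("Frequency")
plt.show()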

Histograms versus Polygons


It is much easier to visualize the shape of a distribution of a set of scores using a graph versus
a tabular representation (i.e., a frequency table). Graphs such as histograms and polygons
are often used when two or more groups are compared on a set of scores such as our lan-
guage development or vocabulary test. The choice between using a histogram and a polygon
depends on preference; however, the type and nature of the variable also serve as a guide for
when to use one type of graph rather than another. For example, when a variable is discrete,
score values can only take on whole numbers that can be measured exactly—and there are
no intermediate values between the score points. Even though a variable may be continuous in
theory, the process of measurement always reduces the scores on a variable to a discrete level
(e.g., a discrete random variable; see the Appendix for a rigorous mathematical treatment of
random variables and probability). In part, this is due to the accuracy and/or precision of the
instrumentation used and the integrity of the data acquisition/collection method. Therefore,
continuous scales are in fact discrete ones with varying degrees of precision or accuracy.
Returning to Figure 2.1, for our test of general intelligence any of the scores may appear
to be continuous but are actually discrete because a person can only obtain a numerical value
based on the sum of his or her responses across the set of items on a test (e.g., it is not pos-
sible for a person to obtain a score of 15.5 on a total test score). The frequency histogram
can also be used with variables such as zip codes or family size (i.e., categorical variables
with naturally occurring discrete structures). Alternatively, the nature of the frequency poly-
gon technically suggests that there are intermediary score values (and therefore a continuous
score scale) between the points and/or dots in the graph. The intermediary values on the
line in a polygon can be estimated using the intersection of the X- and Y-axes anywhere on
the line. An example of a continuously measured variable from psychological measurement
is reaction time to a visual stimulus. In this case, the score values can range from zero (no reaction at all) upward, and, theoretically, two values can differ by an infinitesimally small amount of time.

2.8 Summation or Sigma Notation

The previous section showed how to describe the shape of the distribution of a variable
using tabular (frequency table) and graphic (histogram and polygon) formats. This sec-
tion introduces central tendency and variability, two characteristics that describe the cen-
ter and width of a distribution expressed as how different the scores are from one another.
Before discussing these two concepts, an explanation is provided on the notation used in
psychometrics and statistics—summation or sigma notation.
Sigma notation is a form of notation used to sum an identified number of quantities
(e.g., scores or other measurements). To illustrate summation notation, we use the first
10 scores from our sample of 100 people in Table 2.3 from our test of language develop-
ment; here are the scores in ascending order:

Person
1 2 3 4 5 6 7 8 9 10
Score
19 23 23 26 26 26 27 27 27 27

In sigma notation, the direction to sum the number of scores for these 10 people is given
in Expression 1:

Expression 1. Sigma notation


\sum_{i=1}^{n} X_i

• X = the variable being measured (e.g., intelligence test score, depression score, achievement score, etc.).
• i = a position index; it positions the scores that contribute to the sum; the starting value of the index (location of the first score) is indicated by i = 1 at the bottom of Σ. The final value of the index is the last score that contributes to the sum and is located atop Σ.
• n = when this notation is used atop Σ, it indicates that the final value of the position index is the final nth score. Note that when the index is missing, it is assumed that you are to sum all of the scores.

The shorthand notation above can be expanded by writing out all of the Xs for all
of the scores of the index between the starting value of the index and the final value as
illustrated below in Equation 2.1.
Finally, the notation ΣX is defined as "the sum of all the measurements of X"; for our example set of 10 scores, ΣX = 251.
Another frequent use of summation notation is provided next. In Expression 2, we
see that each score is squared as a first step, and then summation occurs.

Equation 2.1. Expanded summation notation


\sum_{i=1}^{n} X_i = X_1 + X_2 + X_3 + X_4 + X_5 + X_6 + X_7 + X_8 + X_9 + X_{10}

For our example data, application of Equation 2.1 yields:

\sum_{i=1}^{n} X_i = 19 + 23 + 23 + 26 + 26 + 26 + 27 + 27 + 27 + 27 = 251

Expression 2. Sigma notation using squared numerical values

\sum X^2 = 19^2 + 23^2 + 23^2 + 26^2 + 26^2 + 26^2 + 27^2 + 27^2 + 27^2 + 27^2 = 6363

Expression 3. Sigma notation: the square of the summed scores

\left( \sum X \right)^2 = (19 + 23 + 23 + 26 + 26 + 26 + 27 + 27 + 27 + 27)^2 = (251)^2 = 63001

Expression 4. Sigma notation using a constant

\sum (X - c)^2 = (19-3)^2 + (23-3)^2 + (23-3)^2 + (26-3)^2 + (26-3)^2 + (26-3)^2 + (27-3)^2 + (27-3)^2 + (27-3)^2 + (27-3)^2
= 256 + 400 + 400 + 529 + 529 + 529 + 576 + 576 + 576 + 576 = 4947

Expression 3 above provides yet another summation example often encountered in psychometrics and statistics: squaring the summed scores.
Notice that there is a clear distinction between Expressions 2 and 3; for example, the
“sum of squared scores” does not equal the “square of the summed scores.” Remember the
order of operations rule to always conduct the operations within the parentheses first before
carrying out the operation outside of the parentheses (i.e., work from the inside to outside).
Next, we turn to the situation where a constant is applied to our scores to see how this is handled in summation notation. A constant is a value applied to each score that is unchanging. Suppose we want to subtract a constant of 3 from each of our scores, and then
proceed with summing the squares of the difference enclosed within parentheses. This is
illustrated using summation notation in Expression 4.
Becoming familiar with sigma and summation notation requires a little practice. You
are encouraged to practice using the expressions and equations above with single integer
values. Given that sigma notation is used extensively in psychometrics and statistics,
familiarity with it is essential.
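For readers who want to check the expressions above numerically, the short Python sketch below reproduces each result for the ten example scores (this is an added illustration, not part of the book's SPSS examples).

scores = [19, 23, 23, 26, 26, 26, 27, 27, 27, 27]

sum_x = sum(scores)                              # Expression 1: sum of the scores
sum_x_squared = sum(x ** 2 for x in scores)      # Expression 2: sum of the squared scores
square_of_sum = sum(scores) ** 2                 # Expression 3: square of the summed scores
c = 3
sum_sq_dev = sum((x - c) ** 2 for x in scores)   # Expression 4: subtract a constant, square, then sum

print(sum_x, sum_x_squared, square_of_sum, sum_sq_dev)  # 251 6363 63001 4947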

2.9 Shape, Central Tendency, and Variability of Score Distributions

Shape
The shape of a distribution is described as symmetric, positively skewed, or negatively skewed. In Figure 2.10, panels A–F, illustrations of various shapes of distributions are
provided using distributions of continuous variables.
A symmetric distribution can be divided into two mirror halves. The distributions
in the top half of Figure 2.10 (A, B, C) are symmetric. Alternatively, skewed distributions

The six panels (A–F) plot relative frequency (Y-axis) against score value (X-axis) for distributions of various shapes.

FIGURE 2.10.  Distributions with various shapes.

are not able to be divided into mirror halves. Positively skewed distributions are those
with low frequencies that trail off with positive numbers to the right. The distributions in
the bottom half of Figure 2.10 (E, F) are positively skewed. If the tail of the distribution
is directed toward positive numbers, the skew is positive. For example, home prices in
major metropolitan cities are often positively skewed because professional athletes pur-
chase homes at very high prices, producing a positive skew in the distribution of home
prices. In Figure 2.10, panel D illustrates a negatively skewed distribution. Finally, the
modality is the number of clearly identifiable peaks in the distribution. Distributions
with a single peak are unimodal and distributions with two peaks are bimodal. For
example, in Figure 2.10, panels C and F illustrate bimodal distributions, whereas panels
A, D, and E illustrate unimodal distributions. Panel B in Figure 2.10 illustrates a rect-
angular distribution. Notice that this type of distribution does not have a well-defined
mode and includes a large number of score values at the same frequency. For example,
consider tossing a fair die. The relative frequency (i.e., probability) of rolling any value
(1, 2, 3, 4, 5, 6) on the die face is 1/6. This pattern of frequencies produces a rectangular
distribution because all of the relative frequencies are the same. Panel E in Figure 2.10
illustrates a type of distribution where the relative frequency of rare events occurs (e.g., in
occurrences of rare diseases). For example, most people will never contract an extremely
rare disease, so the relative frequency is greatest at a value of zero. However, some people
have contracted the disease, and these people create the long tail trending to the right.
Score distributions differ in terms of their central tendency and variability. For exam-
ple, score distributions can vary only in the central tendency (center) or only in their
variability (spread), or both. Examining the graphic display of distributions for different
groups of people on a score distribution provides an informative and intuitive way to
learn about how and to what degree groups of people differ on a score. Central tendency
and variability are described in the next section.

Central Tendency
Central tendency is the score value at the center (position at the center of the X-axis) that
marks the center of the distribution of scores. Knowing the center of a distribution for a
set of scores is important for the following reasons. First, since a measure of central ten-
dency is a single number, it is a concise way to provide an initial view of the set of scores.
Second, measures of central tendency can quickly and easily be compared. Third, many
inferential statistical techniques use a measure of central tendency to test hypotheses of
various types. In this section we cover three measures of central tendency: the mean,
median, and mode.

Mean
The mean is a measure of central tendency appropriate for data acquired at an interval/
ratio level; it is equal to the sum of all the values (e.g., scores or measurements) of a vari-
able divided by the number of values (e.g., scores or measurements). The formula for the
mean is provided in Equations 2.2a and 2.2b.
A statistic is computed for measurements or scores in a sample; a parameter is a
value computed for scores or measurements in a population.

Median
The median is a measure of central tendency that is defined as the value of a variable
that is at the midpoint of the measurements or scores. For example, the median is the
value at which half of the scores or measurements on a variable are larger and half of the
scores are smaller. The median is appropriate for use with ordinal and interval/ratio-level

Equation 2.2a. Mean of population

\mu = \frac{\sum X}{N}

• ΣX = sum of measurements.
• N = total number of measurements in the population.

Equation 2.2b. Mean of sample

\bar{X} = \frac{\sum X}{n}

• ΣX = sum of measurements.
• n = total number of measurements in the sample.

Equation 2.3. Median

Md = L_m + w\left( \frac{n/2 - f_{cum}}{f_m} \right)

• L_m = lower limit of the class interval that contains the median.
• w = width of the class interval.
• n = total number of measurements on the variable in the dataset.
• f_cum = number of measurements falling below the interval containing the median.
• f_m = number of measurements or scores within the interval containing the median.

measurements or scores. The median is also less sensitive to extreme scores or outliers
(e.g., when the distribution is skewed). For this reason, it better represents the middle of
skewed score distributions. Finally, the median is not appropriate for nominal measure-
ments because quantities such as larger and smaller do not apply to variables measured
on a nominal level (e.g., categorical data like political party affiliation or biological sex).
The formula for the median is provided in Equation 2.3.
To illustrate Equation 2.3 with the grouped data from Table 2.5, the median falls in the class interval 34–36 (exact limits 33.5–36.5), so

Md = L_m + w\left( \frac{n/2 - f_{cum}}{f_m} \right) = 33.5 + 3\left( \frac{50 - 29}{21} \right) = 36.5

Mode
The mode is the score or measurement that occurs most frequently. For example, for
the data in Table 2.4, 37 is the score that most frequently occurs in the distribution (i.e.,
occurring 12 times).
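A quick way to check measures of central tendency on raw (ungrouped) scores is with Python's standard statistics module, and the grouped-data median of Equation 2.3 can be written as a small function. The sketch below is an added illustration only and uses the ten example scores introduced earlier.

import statistics

scores = [19, 23, 23, 26, 26, 26, 27, 27, 27, 27]
print(statistics.mean(scores))    # 25.1
print(statistics.median(scores))  # 26.0
print(statistics.mode(scores))    # 27

def grouped_median(lower_limit, width, n, f_cum, f_m):
    """Median for grouped data (Equation 2.3)."""
    return lower_limit + width * ((n / 2 - f_cum) / f_m)

# Table 2.5: the median interval 34-36 has exact limits 33.5-36.5, n = 100,
# 29 scores fall below the interval and 21 scores fall within it
print(grouped_median(33.5, 3, 100, 29, 21))  # 36.5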

Variability
Measures of variability for a set of scores or measurements provide a value of how spread
or dispersed the individual scores are in a distribution. In this section the variance and
standard deviation are introduced.

Variance
Variance is the degree to which measurements or scores differ from the mean of the
population or sample. The population variance is the average of the squared deviations of
each score from the population mean "mu" (µ). The symbol for the population variance is sigma squared (σ²). Recall that Greek letters are used to denote population parameters.
Equation 2.4 provides an example of the population variance.
Equation 2.5 illustrates the sample variance. As you notice, there are only two
changes from Equation 2.4 for the population variance. The first change is that deviations are taken from the sample mean (rather than the population mean). The second
difference is that the sum of squares is divided by n – 1. Using n – 1 makes the sample
variance an unbiased estimate of the population variance—provided the sample is drawn
or acquired in a random manner (see the Appendix for more detail).
Finally, the population standard deviation (σ) is the square root of the population
variance; the sample standard deviation (s) is the square root of the sample variance.
To illustrate calculation of the population and sample variance, 10 scores are used
from the data in Table 2.4 and are provided in Table 2.6.

Equation 2.4. Population variance


\sigma^2 = \frac{\sum (X - \mu)^2}{N} = \frac{SS(X)}{N}

• Σ(X − µ)² = sum of the squared deviations of each score from the population mean.
• N = total number of measurements or scores in the population.
• SS(X) = sum of the squared deviations of each score from the mean (a.k.a. sum of squares).

Equation 2.5. Sample variance

s^2 = \frac{\sum (X - \bar{X})^2}{n - 1} = \frac{SS(X)}{n - 1}

• Σ(X − X̄)² = sum of the squared deviations of each score from the sample mean.
• n − 1 = total number of measurements or scores in the sample minus 1.
• SS(X) = sum of the squared deviations of each score from the mean (a.k.a. sum of squares).

Table 2.6.  Computation of the Population Variance


Score (X)    µ    X − µ    (X − µ)²
19 35 −16 256
23 35 −12 144
30 35 −5 25
33 35 −2 4
36 35 1 1
38 35 3 9
40 35 5 25
42 35 7 49
44 35 9 81
45 35 10 100
N = 10    Σ(X − µ) = 0    Σ(X − µ)² = SS(X) = 694

For the data in Table 2.6, application of Equation 2.4 yields a population variance of

\sigma^2 = \frac{\sum (X - \mu)^2}{N} = \frac{SS(X)}{N} = \frac{694}{10} = 69.4

For the same data, application of Equation 2.5 yields a sample variance of

s^2 = \frac{\sum (X - \bar{X})^2}{n - 1} = \frac{SS(X)}{n - 1} = \frac{694}{9} = 77.11

In summary, the variance is a measure of the width of a distribution equal to the mean of the squared deviations (i.e., the mean square deviation). Although the standard
deviation is useful for understanding and reporting the dispersion of scores in many
cases, the variance is more useful for intermediate through advanced statistical tech-
niques (e.g., analysis of variance or regression). Later in this chapter we will examine
how the variance (and the sum of squares) is used in regression analysis.
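The two estimators differ only in their divisor (N versus n − 1), which is easy to verify numerically. The following Python sketch (an added illustration, not the book's SPSS code) reproduces the values above using the standard statistics module.

import statistics

scores = [19, 23, 30, 33, 36, 38, 40, 42, 44, 45]  # Table 2.6 scores

n = len(scores)
mean = sum(scores) / n                        # 35.0
ss = sum((x - mean) ** 2 for x in scores)     # sum of squares SS(X) = 694.0

print(ss / n)                                 # population variance: 69.4
print(ss / (n - 1))                           # sample variance: 77.11...
print(statistics.pvariance(scores))           # population variance from the standard library
print(statistics.variance(scores))            # sample (unbiased) variance from the standard library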

Percentiles
In psychological measurement, individual measurements or scores are central to our
study of individual differences. Percentiles are used to provide an index of the relative
standing for a person with a particular score relative to the other scores (persons) in a
distribution. Percentiles reflect relative standing and are therefore classified as ordinal
values. Percentiles do not reflect how far apart scores are from one another. Scores that reside
at or near the top of the distribution are highly ranked or positioned, whereas scores
that reside at or near the bottom of the distribution exhibit a low ranking. For example,
consider the scores in Table 2.4 (on p. 25). The percentiles relative to the raw scores in
column 1 are located in the last column of the table. We see in the last column (labeled
cumulative relative frequency) that a person with a score of 36 is located at the 50th per-
centile of the distribution. A person with a score of 43 is located at the 95th percentile (0.95). The percentile rank is another term used to express the percentage of people scor-
ing below a particular score. For example, based on the same table, for a person scoring
43, his or her percentile rank is 95. The person’s standing is interpreted by stating that the
person scored higher or better than 95% of the examinees.
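Percentile ranks can be read directly from a cumulative relative frequency distribution or computed from raw scores. The Python sketch below is an added illustration; it defines percentile rank as the percentage of scores at or below a given value, which matches the cumulative relative frequencies in Table 2.4 (other definitions, such as the percentage strictly below, are also in use).

def percentile_rank(scores, value):
    """Percentage of scores at or below the given value (one common definition)."""
    at_or_below = sum(1 for x in scores if x <= value)
    return 100 * at_or_below / len(scores)

scores = [19, 23, 23, 26, 26, 26, 27, 27, 27, 27]  # illustrative scores only
print(percentile_rank(scores, 26))  # 60.0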

z-Scores
Raw scores in a distribution provide little information because (1) they say little about a person's standing relative to other persons' scores in the distribution and (2) the meaning of zero changes from distribution to distribution. For example, in Table 2.3 (on p. 24) a score of 36 tells little about where
this person stands relative to others in the distribution. However, if we know that a
score of 47 is approximately two standard deviations above the mean, we know that this score is rela-
tively high in this particular distribution of scores. Turning to another example, con-
sider that a person takes three different tests (on the same subject) and that on each test
the points awarded for each item differ (see Table 2.7). The total number of points on
each test is 100, but the number of points awarded for a correct response to each item differs
by test (e.g., see row 4 of Table 2.7). It appears from the data in Table 2.7 that the person
is performing progressively worse on each test moving from test 1 to test 3. However,
this assumes that a score of zero has the same meaning for each test (and this may not
be the case). For example, assume that the lowest score on test 1 is 40 and that the
lowest score on test 2 is zero. Under these circumstances, zero on test 1 is interpreted
as 40 points below the lowest score. Alternatively, on test 2, zero is the lowest score.
The previous example illustrates why raw scores are not directly comparable. However,
if we rescale or standardize the value of zero so that it means the same thing in every dis-
tribution, we can directly compare scores from different tests with different distributions.
Transforming raw scores to z-scores accomplishes this task. Returning to Table 2.7, if we
create difference scores by subtracting the person’s score from the mean of the distribution,
we see that if the raw score equals the mean, then the difference score equals zero; this is true
regardless of the distributional characteristics of the three tests. So, the difference score always
has the same meaning regardless of the characteristics of the distribution of scores. Notice
that based on difference scores, the person’s performance improved from test 1 to test 2,
even though the raw score is lower! Applying the z-score transformation in Equation 2.6a
to each of the score distributions of tests 1–3 yields the values in the last row of Table 2.7.

Table 2.7.  Descriptive Statistics for Three Verbal Intelligence Tests


Summary statistics Test 1 Test 2 Test 3
Person’s score 81 70 60
Mean (µ) 85 60 55
Difference score −4 10 5
Points per question 2 5 1
Questions above mean −2 2 5
Standard deviation (σ) 5 20 2
z-score (standard deviation above mean) −0.8 0.5 2.5

Equation 2.6a. z-score transformation


z = \frac{X - \mu}{\sigma}

• X = raw score.
• µ = mean of the population of scores.
• σ = standard deviation of the population of scores.

Note. Sample statistics can be substituted for population parameters, depending on how the score distributions are sampled or acquired.

Alternatively, if one wants to obtain a raw score from a known z-score, Equation 2.6b
serves this purpose.
Finally, based on inspection of the raw and difference scores in Table 2.7, it may
appear that the person performed worse on test 3 than on test 2. However, this is not true
because each question is worth 1 point on test 3 but 5 points on test 2. So, relative to
the mean of each distribution, the person performed better on test 3 than on test 2. This
pattern or trend is captured in the z-scores in Table 2.7. For example, inspection of the
z-scores created using Equation 2.6a in the table illustrates that the person’s performance
improved or increased relative to others in the distribution of scores for each of the three
tests. To summarize, difference scores standardize the meaning of zero across distributions,
and z-scores standardize the unit of measurement.
To make the relationship between raw and z-scores clearer, Table 2.8 provides a set
of 20 scores from the language development/vocabulary test in the GfGc dataset. We treat
these scores as a population and apply the sigma notation to these data to illustrate deri-
vation of the mean, standard deviation, sum of squares, and the parallel locations in the
score scale between z- and raw scores. Figure 2.11 illustrates the position equivalence for
a raw score of 45 and a z-score of 1.14.

For the raw scores: ΣX = 690, ΣX² = 25430; µ_X = ΣX/N = 690/20 = 34.5; SS(X) = ΣX² − Nµ_X² = 25430 − 20(34.5)² = 1625; σ_X = √(SS(X)/N) = √(1625/20) = 9.

For the z-scores: Σz = 0, Σz² = 19; µ_z = Σz/N = 0/20 = 0; SS(z) = Σz² − Nµ_z² = 19 − 20(0)² = 19; σ_z = √(SS(z)/N) = √(19/20) ≈ 1.

Note. Based on Glenberg and Andrzejewski (2008).



Equation 2.6b. z-score to raw score formula

X = \mu + z\sigma

Table 2.8.  Distributions of Raw


and z-Scores for 20 People
X z-score
18 –1.78
20 –1.57
22 –1.35
24 –1.14
24 –1.14
30 –0.49
30 –0.49
33 –0.16
33 –0.16
37 0.27
37 0.27
37 0.27
39 0.49
39 0.49
40 0.59
42 0.81
43 0.92
45 1.14
48 1.46
49 1.57
N = 20 N = 20
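The z-scores in Table 2.8 can be reproduced by applying Equation 2.6a to each raw score. The Python sketch below is an added illustration; note that the tabled values are matched when the standard deviation is computed with n − 1 in the denominator (about 9.25), whereas treating the 20 scores as a population (divisor N) gives a standard deviation of about 9 and slightly different z-scores.

raw = [18, 20, 22, 24, 24, 30, 30, 33, 33, 37, 37, 37, 39, 39, 40, 42, 43, 45, 48, 49]

n = len(raw)
mu = sum(raw) / n                                 # 34.5
ss = sum((x - mu) ** 2 for x in raw)              # sum of squares = 1625.0
sd_population = (ss / n) ** 0.5                   # about 9.0 (divisor N)
sd_sample = (ss / (n - 1)) ** 0.5                 # about 9.25 (divisor n - 1)

# Equation 2.6a applied to each raw score; Table 2.8's printed values are matched by sd_sample
z = [round((x - mu) / sd_sample, 2) for x in raw]
print(round(sd_population, 2), round(sd_sample, 2))
print(z)  # e.g., a raw score of 45 maps to 1.14 and a raw score of 18 maps to -1.78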

Normal Distributions
There are many varieties of shapes of score distributions (e.g., see Figure 2.10 for a
review). One commonly encountered type of distribution in psychological measurement
is the normal distribution. Normal distributions share three characteristics. First, they
are symmetric (the area to the left and right of the center of the distribution is the same).
Second, they are often bell-shaped (or a close approximation to a bell-type shape). When
the variance of the distribution is large, the height of the bell portion of the curve is lower
(i.e., the curve is much flatter) than an ideal bell-shaped curve. Similarly, when the vari-
ance is small, the height of the bell portion of the curve is more peaked and the width of
the curve is narrower than an ideal bell-shaped curve. Third, the tails of the distribution
extend to positive and negative infinity. Figure 2.12 illustrates three different varieties
of normal distributions. In the figure, the tails touch the X-axis, signifying that we have
discernible lower and upper limits to the distributions rather than the purely theoretical
depiction of the normal distribution where the tails never actually touch the X-axis.
The left panel is a frequency polygon of the raw scores (mean = 34.5, σ = 9); the right panel is the corresponding polygon of the z-scores (mean = 0, σ = 1), obtained by applying z = (X − µ)/σ to each score.
FIGURE 2.11.  Frequency polygons for raw and z-scores for Table 2.8 data. For example, a raw score of 45 equals a z-score of 1.14. Adapted from
Glenberg and Andrzejewski (2008). Copyright 2008 by Lawrence Erlbaum Associates. Adapted by permission.

Height (Y, ordinate) is plotted against scores (X, abscissa) for three curves sharing the same mean, median, and mode: an ideal bell-shaped curve, a more peaked distribution with small variance, and a flatter distribution with large variance.

FIGURE 2.12.  Normal distributions with same mean but different variances.

All normal distributions can be described by Equation 2.7.

Equation 2.7. Normal distribution


 1  −( X −µ)2/2σ2
U= E
 2πσ 
2

• p = 3.1416.
• u = height of the normal curve.
• s2 = variance of the distribution.
• e = 2.7183.
• X = score value.
• m = mean of score distribution.

Any value can be inserted for the mean (µ) and variance (σ²), so an infinite number of normal curves can be derived using Equation 2.7.
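Equation 2.7 is straightforward to evaluate directly. The Python sketch below (an added illustration) computes the height of the normal curve at a few score values, using a mean and standard deviation chosen only for the example.

import math

def normal_height(x, mu, sigma):
    """Height (ordinate) of the normal curve at x, per Equation 2.7."""
    return (1 / math.sqrt(2 * math.pi * sigma ** 2)) * math.exp(-(x - mu) ** 2 / (2 * sigma ** 2))

# Example values chosen only for illustration: mean 100, standard deviation 15
for x in (70, 85, 100, 115, 130):
    print(x, round(normal_height(x, 100, 15), 5))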
There are at least two reasons why normal distributions are important to psychological
measurement and the practice of psychometrics. First, the sampling distributions of many statistics (such as the mean) are normally distributed. For example, if the mean test score
is calculated based on a random sample from a population of persons, the sampling distri-
bution of the mean is normally distributed as well. Statistical inference is based on this fact.
Second, many variables in psychological measurement follow the normal distribution (e.g.,
intelligence, ability, achievement) or are an approximation to it. The term approximate means
that although, from a theoretical perspective, the tails of the normal distribution extend to
infinity on the left and right, when using empirical (actual) score data, there are in fact limits

in the upper and lower end of the score continuum. Finally, we can use z-scores in combina-
tion with normal distributions to answer a variety of questions about the distribution.

Standard Normal Distribution


Normal distributions can be transformed into a standard normal distribution (also called
the unit normal distribution) using the z-score transformation. The z-score distribution
is so called because it has a mean of zero and a standard deviation of 1.0. Likewise, the
standard normal distribution has a mean of zero and a standard deviation of 1.0. How-
ever, if the original distribution of raw scores is not normally distributed, the z-score trans-
formation will not automatically normalize the score distribution (i.e., the z-scores will
not be normally distributed). Therefore, it is important to take into account the shape of
the original score distribution prior to conducting any analysis with transformed scores.
Chapter 11 on test norming and equating addresses this issue in more detail.
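Once scores are expressed as z-scores, areas under the standard normal curve can be used to answer questions such as "what proportion of examinees score at or below 130 when µ = 100 and σ = 15?" The sketch below is an added illustration; it writes the standard normal cumulative distribution function using the error function from Python's math module, with the mean and standard deviation chosen only for the example.

import math

def std_normal_cdf(z):
    """Proportion of the standard normal distribution at or below z."""
    return 0.5 * (1 + math.erf(z / math.sqrt(2)))

mu, sigma = 100, 15          # illustrative mean and standard deviation
x = 130
z = (x - mu) / sigma         # z = 2.0
print(round(std_normal_cdf(z), 4))  # about 0.9772: roughly 98% of scores fall at or below 130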

2.10 Correlation, Covariance, and Regression

Fundamental to psychological measurement is the study of how people differ or vary with respect to their behavior or psychological attributes. Some of the most interesting
questions in psychological measurement revolve around the causes and consequences of
differences among people on psychological attributes such as intelligence, aptitude, and
psychopathology. The study of relations between two attributes or variables (e.g., fluid and
crystallized intelligence) requires procedures for (1) measuring and defining the variables
and (2) statistical techniques for describing the nature of relationships between them. For
this reason, psychological measurement is concerned primarily with individual differences
between persons and with how such differences (expressed as variability) may contribute
to understanding the cause and consequence of behavior(s). Behaviors are directly observ-
able, and we can study the associations among two or more behaviors (measured as vari-
ables) for a group of people using the correlation or covariance. Psychological attributes
are not directly observable, but assuming that individual differences exist among people
on an unobservable attribute, we can use the correlation or covariance to study associa-
tions among the attributes of interest. The GfGc data used in the examples throughout
this text include measures of verbal ability, quantitative ability, and memory—all of which
are not directly observable, but for which differences among persons are assumed to exist.
Examining differences between persons is defined as the study of interindividual differ-
ences. Alternatively, change occurring for a single person over time is known as the study
of intraindividual change. In either context, studying interindividual differences and intra-
individual change relies on correlation or covariance-based techniques.

Correlation
A multitude of questions in psychology and other fields can be investigated using corre-
lation. At the most basic level, correlation is used to study the relationship between two

variables. For example, consider the following questions. Is there a relationship between
verbal ability and quantitative reasoning? Is there a relationship between dementia and
short-term memory? Is there a relationship between quantitative reasoning and math-
ematical achievement? The correlation coefficient provides an easily interpretable measure
of linear association to answer these questions. For example, correlation coefficients have
a specific range of −1 to +1. Using the correlation, we can estimate the strength and direction
of the relationship for the example questions above, which in turn helps us to understand
individual differences relative to these questions. In this section, we limit our discussion
to the linear relationship between two variables. The Appendix provides alternative measures
of correlation appropriate (1) for ranked or ordinal data, (2) when the relationship between
variables is nonlinear (i.e., curvilinear), and (3) for variables measured on the nominal level.
The Pearson correlation coefficient is appropriate for interval- or ratio-level variables, assumes a linear relationship, and is symbolized using r (for a statistic) and ρ (rho) for a population parameter. The correlation coefficient (r or ρ) has the following properties:

1. The range of the statistic r or parameter ρ is −1.0 to +1.0.

2. The sign (+ or −) corresponds to the sign of the slope of the regression line (e.g., lines with positive slopes indicate a positive correlation, lines with negative slopes indicate a negative correlation, and lines with zero slopes indicate no correlation).

3. The strength of the relationship is provided by the absolute value of the correlation coefficient. For example, an r or ρ of +1.0 or −1.0 indicates a perfect linear relationship (positive or negative); intermediate values (e.g., −.50 or +.50) indicate a moderate relationship; and an r or ρ of zero indicates no linear relationship.

Equations 2.8a and 2.8b illustrate the Pearson correlation coefficient using raw scores
from Table 2.9.
Figure 2.13 illustrates the X-Y relationship for the score data in Table 2.9. The
graph in Figure 2.13 is known as a scattergraph or scatterplot and is essential to
understanding the nature of the X-Y relationship (e.g., linear or nonlinear). Based on
Figure 2.13, we see that the X-Y relationship is in fact linear (i.e., the points follow a diagonal
line, and as each score value of X increases there is a corresponding positive change in the
Y-scores). Figure 2.14 illustrates this linear relationship by adding the regression line to
the scatterplot.

Equation 2.8a. Correlation coefficient: Raw score formula

$$r = \frac{N_P \sum XY - \sum X \sum Y}{\sqrt{\left[N_P \sum X^2 - \left(\sum X\right)^2\right]\left[N_P \sum Y^2 - \left(\sum Y\right)^2\right]}}$$

Equation 2.8b. Correlation coefficient based on Table 2.9

$$r = \frac{N_P \sum XY - \sum X \sum Y}{\sqrt{\left[N_P \sum X^2 - \left(\sum X\right)^2\right]\left[N_P \sum Y^2 - \left(\sum Y\right)^2\right]}} = \frac{10(29130) - (360)(762)}{\sqrt{\left[10(13668) - 129600\right]\left[10(62282) - 580644\right]}}$$

$$= \frac{291300 - 274320}{\sqrt{[136680 - 129600][622820 - 580644]}} = \frac{16980}{\sqrt{[7080][42176]}} = \frac{16980}{\sqrt{298606080}} = \frac{16980}{17280.22} = .983$$

Note. For correlation computations, rounding to a minimum of three decimal places is recommended.
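
To make the raw-score computation concrete, here is a minimal Python sketch (not from the text; the variable names are illustrative) that reproduces Equation 2.8b from the ten score pairs in Table 2.9.

```python
import math

# Fluid (X) and crystallized (Y) intelligence scores for the 10 persons in Table 2.9
x = [20, 25, 30, 33, 37, 39, 41, 43, 45, 47]
y = [42, 50, 60, 68, 76, 82, 85, 89, 98, 112]

n = len(x)                                   # N_P, the number of persons
sum_x, sum_y = sum(x), sum(y)                # 360 and 762
sum_x2 = sum(v ** 2 for v in x)              # 13668
sum_y2 = sum(v ** 2 for v in y)              # 62282
sum_xy = sum(a * b for a, b in zip(x, y))    # 29130

# Raw-score formula for the Pearson correlation (Equations 2.8a and 2.8b)
numerator = n * sum_xy - sum_x * sum_y
denominator = math.sqrt((n * sum_x2 - sum_x ** 2) * (n * sum_y2 - sum_y ** 2))
r = numerator / denominator

print(round(r, 3))  # 0.983, matching Equation 2.8b
```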

Table 2.9.  Scores for 10 Persons on Fluid and Crystallized Tests

Person  Fluid             X²      Crystallized       Y²      XY
(np)    intelligence (X)          intelligence (Y)
1       20                400     42                 1764    840
2       25                625     50                 2500    1250
3       30                900     60                 3600    1800
4       33                1089    68                 4624    2244
5       37                1369    76                 5776    2812
6       39                1521    82                 6724    3198
7       41                1681    85                 7225    3485
8       43                1849    89                 7921    3827
9       45                2025    98                 9604    4410
10      47                2209    112                12544   5264

nX = 10                           nY = 10
ΣX = 360                          ΣY = 762
(ΣX)² = 129600                    (ΣY)² = 580644
ΣX² = 13668                       ΣY² = 62282                ΣXY = 29130
μX = 36                           μY = 76.2
σX = 8.41                         σY = 20.54

FIGURE 2.13.  Scatterplot of fluid and crystallized intelligence total scores. Correlation (r) is .983.

FIGURE 2.14.  Scatterplot of fluid and crystallized intelligence with regression line. Correlation
(r) is .983; r² = .966.

Covariance
The covariance is defined as the average cross product of two sets of deviation scores. The
covariance retains the original units of measurement for two variables and is expressed in
deviation score form or metric (a deviation score being a raw score minus the mean of the
distribution of scores). Because of its raw score metric, the covariance is an unstandardized
version of the correlation. The covariance is useful in situations when we want to conduct
an analysis and interpret the results in the original units of measurement. For example,
we may want to evaluate the relationships among multiple variables (e.g., three or more
variables), and using a standardized metric like the correlation would provide mislead-
ing results because the variables are not on the same metric or level of measurement. In
this case, using the covariance matrix consisting of more than two variables makes more
sense. In fact, the multivariate technique structural equation modeling (SEM; used in
a variety of psychometric analyses) typically employs the covariance matrix rather than
the correlation matrix in the analysis. Thus, SEM is also called covariance structure
modeling. The equation for the covariance using raw scores is provided in Equations
2.9a and 2.9b for the population and sample. The Appendix provides examples of how to
derive the covariance matrix for more than two variables.
An important link between the correlation coefficient r and the covariance is illus-
trated in Equations 2.10a and 2.10b.

Equation 2.9a. Covariance: Population parameter

$$\sigma_{XY} = \frac{\sum (X - \mu_X)(Y - \mu_Y)}{N} = \frac{1698}{10} = 169.8$$

• X − μX = deviation score on measure X.
• Y − μY = deviation score on measure Y.
• X, Y = raw scores on the two measures.
• μX = mean on measure X.
• μY = mean on measure Y.
• σXY = covariance for measures X and Y.

Equation 2.9b. Covariance: Sample statistic

$$s_{XY} = \frac{\sum (X - \bar{X})(Y - \bar{Y})}{N - 1} = \frac{1698}{9} = 188.66$$

• X − X̄ = deviation score on measure X.
• Y − Ȳ = deviation score on measure Y.
• X, Y = raw scores on the two measures.
• X̄ = mean on measure X.
• Ȳ = mean on measure Y.
• sXY = covariance for measures X and Y.

Equation 2.10a. Relationship between correlation and covariance: Population

$$\rho_{XY} = \frac{\sigma_{XY}}{\sigma_X \sigma_Y}$$

• σX = square root of the variance for score X.
• σY = square root of the variance for score Y.
• σXY = covariance.

Equation 2.10b. Relationship between correlation and covariance: Sample

$$r_{XY} = \frac{s_{XY}}{s_X s_Y}$$

• sX = square root of the variance for score X.
• sY = square root of the variance for score Y.
• sXY = covariance.
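
As a quick check on Equations 2.9 and 2.10, the following Python sketch (illustrative, with assumed variable names) computes the population and sample covariances for the Table 2.9 data and then recovers the correlation by standardizing the population covariance.

```python
import math

x = [20, 25, 30, 33, 37, 39, 41, 43, 45, 47]   # fluid intelligence (X)
y = [42, 50, 60, 68, 76, 82, 85, 89, 98, 112]  # crystallized intelligence (Y)

n = len(x)
mean_x, mean_y = sum(x) / n, sum(y) / n        # 36 and 76.2

# Sum of the cross products of the deviation scores
cross = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))   # 1698

cov_pop = cross / n          # Equation 2.9a: 169.8
cov_samp = cross / (n - 1)   # Equation 2.9b: 1698/9, approximately 188.67

# Population standard deviations (square roots of the variances)
sd_x = math.sqrt(sum((a - mean_x) ** 2 for a in x) / n)
sd_y = math.sqrt(sum((b - mean_y) ** 2 for b in y) / n)

# Equation 2.10a: the correlation is the standardized covariance
r = cov_pop / (sd_x * sd_y)
print(round(cov_pop, 1), round(cov_samp, 2), round(r, 3))  # 169.8 188.67 0.983
```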

Regression
Recall that use of the correlation concerned the degree or magnitude of relation between
variables. Sometimes the goal is to estimate or predict one variable from knowledge of
another (notice that this remains a relationship-based question, as was correlation). For
example, research may suggest that fluid intelligence directly affects crystallized intelli-
gence to some degree. Based on this knowledge, and using scores on the fluid intelligence
test in the GfGc data, we find that our goal may be to predict the crystallized intelligence
score from the fluid intelligence score. To address problems of predicting one variable
from knowledge of another we use simple linear regression.
The rules of linear regression are such that we can derive the line that best fits our
data (i.e., best in a mathematical sense). For example, if we want to predict Y (crystal-
lized intelligence) from X (fluid intelligence), the method of least squares locates the
line in a position such that the sum of squares of distances from the points to the line
taken parallel to the Y-axis is at a minimum. Application of the least-squares crite-
rion yields a straight line through the scatter diagram in Figure 2.13 (illustrated in
Figure 2.14).
Using the foundations of plane geometry, we can define any straight line by specify-
ing two constants, called the slope of the line and its intercept. The line we are interested
in is the one that we will use to predict values of Y (crystallized intelligence) given values
of X (fluid intelligence). The general equation for a straight line is stated as: The height of
the line at any point X is equal to the slope of the line times X plus the intercept. The equation
for deriving the line of best fit is provided in Equation 2.11.
To apply the regression equation to actual data, we need values for the constants a
and b. Computing the constant for b is provided in Equations 2.12 and 2.13 using data
from Table 2.9.
The equation for calculating the intercept is illustrated in Equation 2.14 using data
from Table 2.9.

Equation 2.11. Regression line for predicting Y from X

$$\hat{Y} = bX + a$$

• Ŷ = predicted value of Y.
• b = slope of the regression line.
• a = intercept of the line.

Equation 2.12. Slope calculation using correlation coefficient

$$b = r\frac{s_Y}{s_X} = .983\left(\frac{20.54}{8.41}\right) = 2.40$$

• r = correlation between X and Y.
• sY = standard deviation of Y.
• sX = standard deviation of X.

Equation 2.13. Slope calculation using raw scores

$$b = \frac{\sum XY - \dfrac{\sum X \sum Y}{N}}{\sum X^2 - \dfrac{\left(\sum X\right)^2}{N}}$$

• ΣXY = sum of the products of X and Y.
• ΣXΣY = sum of the X-scores times the sum of the Y-scores.
• ΣX² = sum of the squared X-scores.
• (ΣX)² = square of the sum of the X-scores.
• N = sample size.

Equation 2.14. Intercept

$$a = \bar{Y} - b\bar{X} = 76.2 - 2.40(36) = 76.2 - 86.43 = -10.23$$

• Ȳ = mean of the Y-scores.
• X̄ = mean of the X-scores.
• b = slope of the regression line.
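
The slope and intercept computations in Equations 2.12 through 2.14 can be verified with a few lines of Python; this is a sketch using the summary statistics reported in Table 2.9, so small rounding differences (e.g., 2.40 versus 2.398) mirror those in the text.

```python
# Summary statistics from Table 2.9
r, sd_x, sd_y = .983, 8.41, 20.54
mean_x, mean_y = 36, 76.2

# Equation 2.12: slope from the correlation and the two standard deviations
b = r * (sd_y / sd_x)                 # about 2.40

# Equation 2.14: intercept from the means and the slope
a = mean_y - b * mean_x               # about -10.23

# Equation 2.13: the same slope from the raw scores
x = [20, 25, 30, 33, 37, 39, 41, 43, 45, 47]
y = [42, 50, 60, 68, 76, 82, 85, 89, 98, 112]
n = len(x)
b_raw = (sum(xi * yi for xi, yi in zip(x, y)) - sum(x) * sum(y) / n) / (
    sum(xi ** 2 for xi in x) - sum(x) ** 2 / n)   # about 2.398

print(round(b, 2), round(a, 2), round(b_raw, 3))  # 2.4 -10.23 2.398
```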

Returning to Equation 2.11, now that we know the constants a and b, the equation for
predicting crystallized intelligence (Ŷ) from fluid intelligence (X) is given in Equation 2.15a.
To verify that the equation is correct for a straight line, we can choose two values of
X and compute their respective Ŷ from the preceding regression equation, as follows in
Equations 2.15b and 2.15c.
Figure 2.15 illustrates the regression line for the data in Table 2.9. The “stars” rep-
resent predicted scores on crystallized intelligence for a person who scores (1) 30 on the
fluid intelligence test and (2) 43 on the fluid intelligence test.

Equation 2.15a. Prediction equation using constants derived from Table 2.11

$$\hat{Y} = bX + a = 2.40(X) - 10.23$$

Equation 2.15b. Prediction equation using a fluid intelligence score of 30

$$\hat{Y} = bX + a = 2.40(30) - 10.23 = 61.77$$

Equation 2.15c. Prediction equation using a fluid intelligence score of 43

$$\hat{Y} = bX + a = 2.40(43) - 10.23 = 92.97$$

Note. Equations 2.15b and 2.15c can include the subscript i (e.g., Yi, Ŷi, or Xi) to denote that the prediction equation applies to persons with specific scores on the predictor variable X.

FIGURE 2.15.  Line of best fit (regression line) for Table 2.9 data. When fluid intelligence = 30,
crystallized intelligence is predicted to be 61.77. When fluid intelligence = 43, crystallized intel-
ligence is predicted to be 92.97. Highlighting these two points verifies that they do in fact
determine a line (i.e., the drawn regression line).

Error of Prediction
We call the difference between the actual value of Y and its predicted value Ŷ the error
of prediction (sometimes also called the residual). The symbol used for the error of
prediction is e. Thus, the error of prediction for the ith person is ei and is obtained by
Equation 2.16.
The errors of prediction are illustrated in Figure 2.16. The errors of prediction are
shown as arrows from the regression line to the data point.

Equation 2.16. Error of prediction

$$e_i = Y_i - \hat{Y}_i$$

• Yi = observed score for person i on variable Y.
• Ŷi = predicted score for person i on variable Y.

FIGURE 2.16.  Errors of prediction, shown with crystallized intelligence on the Y-axis and fluid
intelligence on the X-axis. For person 10, Y observed = 112, Y predicted = 102.57, and
e10 = Y10 − Ypredicted = 112 − 102.57 = 9.43 points.

For example, we see that a person with an observed score of 112 on crystallized intel-
ligence will have a predicted score of 102.57 based on our prediction equation previously
developed with a slope of 2.40 and intercept of −10.23. Note that negative errors are indi-
cated by points below the regression line and positive errors are indicated by points above
the regression line. So, errors of prediction are defined as the vertical distance between
the person’s data point and the regression line.
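
Given the rounded prediction equation Ŷ = 2.40X − 10.23, the errors of prediction for the Table 2.9 data can be listed with a short, illustrative Python sketch; the final line reproduces the residual of approximately 9.43 points for person 10 shown in Figure 2.16.

```python
x = [20, 25, 30, 33, 37, 39, 41, 43, 45, 47]   # fluid intelligence (X)
y = [42, 50, 60, 68, 76, 82, 85, 89, 98, 112]  # crystallized intelligence (Y)

b, a = 2.40, -10.23                            # slope and intercept from Equations 2.12 and 2.14

for i, (xi, yi) in enumerate(zip(x, y), start=1):
    y_hat = b * xi + a                         # predicted crystallized intelligence score
    e = yi - y_hat                             # Equation 2.16: error of prediction (residual)
    print(f"person {i}: predicted = {y_hat:.2f}, error = {e:+.2f}")

# The last line printed shows person 10: predicted = 102.57, error = +9.43
```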

Determining the Best Line of Fit and the Least-Squares Criterion


In general, a regression line that minimizes the sum of the errors of prediction will be the
best regression line. However, there are many plausible regression lines where the sum of
the errors of prediction is zero. For example, any line that passes through the mean of Y
and mean of X will have errors of prediction summing to zero. To overcome this dilemma,
we apply the least-squares criterion to determine the best regression line. The best
regression line according to the least-squares criterion is the line for which (1) the sum of the
errors of prediction is zero and (2) the sum of the squared errors of prediction is smaller
than for any other possible line.

Standard Error of the Estimate


The size of the errors of prediction provides a measure of the adequacy of the estimates
made using a regression line. The standard error of the estimate, that is, the standard
deviation of the errors of prediction, reflects the size of those errors and therefore serves
as a measure of the goodness of fit of the regression line. Equation 2.17 provides
the standard error of the estimate for our data in Table 2.9.
The standard error of the estimate can also be computed using the correlation, the
standard deviation of Y, and the sample size as in Equation 2.18.
Finally, since the standard error of the estimate is the standard deviation of the errors
of prediction, approximately 95% of all points in a scatterplot will lie within two standard
errors of the regression line (i.e., 47.5% above and 47.5% below). Figure 2.17 illustrates
the 95% confidence region for the data in Table 2.9.

Coefficient of Determination
When using regression, we want an idea of how accurate our prediction is likely to be.
The square of the correlation coefficient (r²), known as the coefficient of determination,
measures the extent to which one variable determines the magnitude of another. Recall
that when the correlation coefficient is close to −1 or +1, our prediction will be accurate.
In this case, r² will be close to 1, and 1 − r² will be close to zero. The relationship between
r, r², 1 − r², and √(1 − r²) is provided in Table 2.10.

Equation 2.17. Standard error of the estimate

$$s_e = \sqrt{\frac{\sum (e_i - \bar{e})^2}{N - 2}} = \sqrt{\frac{\sum e_i^2}{N - 2}} = \sqrt{\frac{145.278}{8}} = \sqrt{18.159} = 4.26$$

• ei = error of prediction for person i.
• ē = mean of the errors of prediction (zero for the least-squares regression line).
• N = sample size.
• Σ = summation operator.

Equation 2.18. Standard error of the estimate from the correlation coefficient

$$s_e = s_Y\sqrt{1 - r^2}\sqrt{\frac{N - 1}{N - 2}} = 20.54\sqrt{1 - .966}\sqrt{\frac{9}{8}} = 20.54(.184)(1.06) = 20.54(.195) = 4.02$$

Note. The discrepancy between the se in Equations 2.17 and 2.18 is due to using the r-square that is not adjusted for sample size. If you use the adjusted r-square of .961 in Equation 2.18, the resulting se is 4.29.
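
Both routes to the standard error of the estimate (Equations 2.17 and 2.18) are easy to reproduce; the sketch below (illustrative only) fits the least-squares line at full precision for Equation 2.17 and uses the rounded values from the text for Equation 2.18.

```python
import math

x = [20, 25, 30, 33, 37, 39, 41, 43, 45, 47]
y = [42, 50, 60, 68, 76, 82, 85, 89, 98, 112]
n = len(x)

# Least-squares slope and intercept at full precision
mean_x, mean_y = sum(x) / n, sum(y) / n
b = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y)) / sum((xi - mean_x) ** 2 for xi in x)
a = mean_y - b * mean_x

# Equation 2.17: standard error of the estimate from the residuals
ss_error = sum((yi - (b * xi + a)) ** 2 for xi, yi in zip(x, y))    # about 145.28
se_residuals = math.sqrt(ss_error / (n - 2))                        # about 4.26

# Equation 2.18: standard error of the estimate from r-square, s_Y, and n (values from the text)
r_squared, sd_y = .966, 20.54
se_from_r = sd_y * math.sqrt(1 - r_squared) * math.sqrt((n - 1) / (n - 2))   # about 4.02

print(round(se_residuals, 2), round(se_from_r, 2))  # 4.26 4.02
```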

FIGURE 2.17.  95% confidence region for standard error of estimate.

Table 2.10.  Relationship between r, r², 1 − r², and √(1 − r²)

r       r²       1 − r²    √(1 − r²)
.00     .00      1.0000    1.0000
.25     .0625    .9375     .9682
.50     .2500    .7500     .8660
.75     .5625    .4375     .6614
.90     .8100    .1900     .4359
1.00    1.0000   .0000     .0000

Regression and Partitioning Sums of Squares


This final section of the chapter shows how the sums of squares introduced earlier in the
chapter can be used to derive r² within the regression framework. Also, understanding how
the sum of squares is partitioned specific to
an outcome variable (e.g., crystallized intelligence) aids in interpreting how the analysis
of variance works, an analytic technique used to answer research questions about dif-
ferences between groups. To facilitate the presentation, consider the data in Table 2.11,
which includes the same data as in Table 2.9. Equation 2.19 illustrates how to derive r²
using the sum of squares.

Table 2.11.  Partition of Sum of Squares for Regression

Person  Fluid             Crystallized
(np)    intelligence (X)  intelligence (Y)   Ŷ         Yi − Ȳ     (Yi − Ȳ)²   e         e²        Ŷ − Ȳ      (Ŷ − Ȳ)²
1       20                42                 37.827    −34.200    1169.640    4.173     17.413    −38.373    1472.478
2       25                50                 49.819    −26.200    686.440     0.181     0.033     −26.381    695.976
3       30                60                 61.810    −16.200    262.440     −1.810    3.277     −14.390    207.067
4       33                68                 69.005    −8.200     67.240      −1.005    1.010     −7.195     51.767
5       37                76                 78.598    −0.200     0.040       −2.598    6.751     2.398      5.752
6       39                82                 83.395    5.800      33.640      −1.395    1.946     7.195      51.767
7       41                85                 88.192    8.800      77.440      −3.192    10.186    11.992     143.797
8       43                89                 92.988    12.800     163.840     −3.988    15.905    16.788     281.842
9       45                98                 97.785    21.800     475.240     0.215     0.046     21.585     465.901
10      47                112                102.581   35.800     1281.640    9.419     88.711    26.381     695.976
                                                       SStotal = 4217.600     SSerror = 145.278   SSregression = 4072.323

Equation 2.19. r-square calculation based on sum of squares

$$r^2 = \frac{SS_{\text{regression}}}{SS_{\text{total}}} = \frac{SS_{\text{total}} - SS_{\text{error}}}{SS_{\text{total}}}$$

For the data in Table 2.11,

$$r^2 = \frac{SS_{\text{regression}}}{SS_{\text{total}}} = \frac{4072.323}{4217.600} = .966$$

Note. Sum-of-squares regression is also sometimes called sum-of-squares "explained" because the regression component of the equation is the part of Y that X "explains." Sum-of-squares error is the part of the equation that X is unable to explain in Y.
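
The partition in Table 2.11 and the r-square calculation in Equation 2.19 can be reproduced with the following minimal Python sketch (variable names are illustrative); it recovers SS_total, SS_error, SS_regression, and an r² of approximately .966.

```python
x = [20, 25, 30, 33, 37, 39, 41, 43, 45, 47]
y = [42, 50, 60, 68, 76, 82, 85, 89, 98, 112]
n = len(x)

mean_x, mean_y = sum(x) / n, sum(y) / n
b = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y)) / sum((xi - mean_x) ** 2 for xi in x)
a = mean_y - b * mean_x
y_hat = [b * xi + a for xi in x]                                 # predicted scores, as in Table 2.11

ss_total = sum((yi - mean_y) ** 2 for yi in y)                   # about 4217.60 (SS_total)
ss_error = sum((yi - yh) ** 2 for yi, yh in zip(y, y_hat))       # about 145.28 (SS_error)
ss_regression = sum((yh - mean_y) ** 2 for yh in y_hat)          # about 4072.32 (SS_regression)

r_squared = ss_regression / ss_total                             # Equation 2.19: about .966
print(round(ss_total, 2), round(ss_error, 2), round(ss_regression, 2), round(r_squared, 3))
```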

2.11 Summary

This chapter presented measurement and statistical concepts essential to understanding


the theory and practice of psychometrics. The presentation focused on concepts transi-
tioning to application. We began our study of psychometrics by introducing the proper-
ties of numbers, defining measurement and how the properties of numbers work together
with four levels of measurement. The four levels of measurement provide a clear guide
regarding how we measure psychological attributes. The number zero was described rel-
ative to its meaning and interpretation in psychological measurement. Techniques for
organizing, summarizing, and graphing distributions of variables were provided with
suggestions about when to use tables alone or tables together with graphs. Normal distributions were
described and illustrated, and the standard normal distribution was introduced, along
with a discussion of the role it plays in psychometrics and statistics in general. The z-score
was introduced, together with its application in relation to the standard normal distri-
bution. Finally, correlation and regression were introduced with connections provided
relative to the fundamental role each plays in the study of variability and individual dif-
ferences. Applications of correlation and regression were provided using the GfGc data. A
sound understanding of the material in this chapter provides the requisite foundation for
understanding the material in subsequent chapters. For example, in Chapter 3 validity is
introduced, and correlation and regression-based techniques for estimating the validity
coefficient are presented. Readers interested in a more advanced treatment of the material
in this chapter are encouraged to see the Appendix.

Key Terms and Definitions


Absolute zero. Temperature at which a thermodynamic system has the lowest energy.

Analysis of variance. A statistical method to test differences between two or more means.
Also used to test variables between groups.
Bimodal distribution. A distribution exhibiting two most frequently occurring scores.

Coefficient of determination. The proportion of variation in Y that is associated with dif-


ferences in X (predicted from the regression equation). Also known as r2.
Constant. A characteristic that may take on only one value.

Continuous variable. A variable consisting of connected elements (e.g., temperature,


or score values that are expressed and reported as numbers with an infinitesimal
amount of intermediate values).
Covariance. Average cross product of two sets of deviation scores. The unstandardized
correlation coefficient.
Covariance structure modeling. Another term for structural equation modeling.

Cumulative relative frequency distribution. A graph illustrating how many cases or


persons lie below the upper limit of each class interval.

Descriptive statistical techniques. Methods for organizing and summarizing measure-


ments or observations.
Discrete variable. A variable consisting of distinct or unconnected classes or elements (e.g.,
biological sex, or score values that are only expressed and reported as whole numbers).
Error of prediction (or residual). The discrepancy between the actual value of Y and the
predicted value of Ŷ .
Frequency polygon. A graph that consists of a series of connected dots above the
midpoint of each possible class interval. The heights of the dots correspond to the
frequency or relative frequency.
Grouped frequency distribution. A graph showing the number of observations for the
possible categories or score values in a dataset.
Histogram. A graph that consists of a series of rectangles, the heights of which represent
frequency or relative frequency.
Identity. An equation which states that two expressions are equal for all values of any
variables that occur.
Inferential statistical techniques. Techniques whose purpose is to draw a conclusion
about conditions that exist in a population from studying a sample.
Intercept. The point on the Y-axis at which a straight line crosses it.

Interval. Ordered measurements made on a constant scale such that it is possible to


assess the size of the differences between them. No absolute or natural zero point
exists.
Interval scale. A scale exhibiting all of the properties of an ordinal scale, and a given
distance between the measures has the same meaning anywhere on the scale.
Least-squares criterion. A mathematical procedure that yields a line that minimizes the
sum of the squares of the discrepancies between observed and predicted values of Y.
Mean. The sum of all scores divided by the total number of scores.

Measurement. The process of assigning numerals to observations.

Median. The value that divides the distribution into halves.

Mode. The score that appears with greatest frequency in a distribution.

Negatively skewed distribution. A distribution in which the tail slants to the left.

Nominal. A form of categorical data where the order of the categories is not significant.

Nominal scale. A measurement scale that consists of mutually exclusive and exhaustive
categories differing in some qualitative aspect.
Normal distribution. A mathematical abstraction based on an equation with certain
properties. The equation describes a family of normal curves that vary according to
the mean and variance of a set of scores.
Ordinal. Categorical data for which there is a logical ordering to the categories based
on relative importance or order of magnitude.

Ordinal scale. A scale exhibiting the properties of a nominal scale, but in addition the
observations or measurements may be ranked in order of magnitude (with nothing
implied about the difference between adjacent steps on the scale).
Parameter. A descriptive index of a population.

Pearson correlation coefficient. Measures the linear relationship between two variables
on an interval or ratio level of measurement.
Percentile. A point on the measurement scale below which a specified percentage of the
cases in a distribution falls.
Population. The complete set of observations about which a researcher wishes to draw
conclusions.
Positively skewed distribution. A distribution in which the tail slants to the right.

Random sample. Sample obtained in a way that ensures that all samples of the same
size have an equal chance of being selected from the population.
Ratio. Data consisting of ordered, constant measurements with a natural origin or zero
point.
Ratio scale. A scale having all the properties of an interval scale plus an absolute zero
point.
Real numbers. The size of the unit of measurement is specified, thereby allowing any
quantity to be represented along a number line.
Sample. A subset of a population.

Sampling distribution. A theoretical relative frequency distribution of scores that would


be obtained by chance from an infinite number of samples of a particular size drawn
from a given population.
Sampling distribution of the mean. A theoretical relative frequency distribution of all
values of the mean ( X ) that would be obtained by chance from an infinite number of
samples of a particular size drawn from a population.
Scaling. The development of systematic rules and meaningful units of measurement for
quantifying empirical observations.
Scatterplot. A plot of the values of Y versus the corresponding values of X.

Sigma notation. A form of notation used to sum scores or other measurements.

Simple linear regression. The regression of Y on X (only one predictor and one outcome).

Slope of the line. Specifies the amount of increase in Y that accompanies one unit of
increase in X.
Standard error of the estimate. A measure of variability for the actual Y values around
the predicted value Ŷ .
Standard normal distribution. A symmetric, unimodal distribution that has a mean of 0
and a standard deviation of 1 (e.g., in a z-score metric).
Statistic. A descriptive index of a sample.

Structural equation modeling. A family of related procedures involving the analysis of


covariance structures among variables.
Symmetric distribution. Distribution curve where the left and right sides are mirror
images of each other.
Unimodal distribution. A distribution with a single mode.

Variability. The spread of scores or observations in a distribution.

Variable. A characteristic that may take on different values.

Variance. The mean of the squares of the deviation scores.


z-score. How far a score is away from the mean in standard deviation units. It is one type
of standard score.
z-score distribution. A statistical distribution with areas of the unit normal distribution
expressed in z-score units.
3

Criterion, Content, and Construct Validity

This chapter introduces validity, including the statistical aspects and the validation process.
Criterion, content, and construct validity are presented and contextualized within the com-
prehensive framework of validity. Four guidelines for establishing evidence for the validity
of test scores are (1) evidence based on test response processes, (2) evidence based on
the internal structure of the test, (3) evidence based on relations with other variables, and
(4) evidence based on the consequences of testing. Techniques of estimation and interpre-
tation for score-based criterion validity are introduced.

3.1 Introduction

Validity is a term commonly encountered in most, if not all, disciplines. In psychologi-


cal measurement, one definition of validity is a judgment or estimate of how well a test
or instrument measures what it is supposed to measure. For example, researchers are
concerned with the accuracy of answers regarding their research questions. Answering
research questions in psychological or behavioral research involves using scores obtained
from tests or other measurement instruments. To this end, the accuracy of the scores
is crucial to the relevance of any inferences made. Over the past 50 years, the American
Educational Research Association (AERA), the American Psychological Association (APA),
and the National Council on Measurement in Education (NCME) have facilitated work
by a committee of scholars on advancing the interpretation and application of the mul-
tifaceted topic of validity. The most recent result of their work is the Standards for Edu-
cational and Psychological Testing (AERA, APA, & NCME, 1999, 2014). The AERA, APA,
and NCME standards describe validity as “the degree to which accumulated evidence
and theory support specific interpretations of test scores entailed by proposed uses of
a test” (1999, p. 184). The term evidence presents a related but slightly different view

of validity—a view that espouses validation as a process in test or instrument develop-


ment. Note that a test score is meaningless until one draws inferences from it based on
the underlying proposed use of the test or instrument. In relation to test development,
the validation process involves developing an interpretative argument based on a clear
statement of the inferences and assumptions specific to the intended use of test scores.
The AERA, APA, and NCME standards present a clear set of four guidelines for
establishing evidence. The four guidelines articulate that establishing evidence for the
validity of test scores includes (1) test response processes, (2) the internal structure of
the test, (3) relations with other variables, and (4) the consequences of testing. Based
on these four guidelines, we see that validation involves considering the interpretation,
meaning, and decision-based outcomes of test score use, which in turn involves societal
values and consequences. Given the breadth and complexity of validity, Samuel Messick
(1989) concluded that construct validation as an isolated technique was inaccurate and
that construct validation is actually the “base on which all other forms of validity rest”
(Fiske, 2002, p. 173). I take the same position in this chapter, where validity is presented
as a unified, multifaceted topic that can be understood by examining the contribution
of specific components to a unified model of construct validity. Specifically, criterion,
content, and construct validity are presented, and the role these components play in the
unified concept of validity is described. Throughout Chapters 3 and 4, you should keep
in mind that establishing evidence for criterion and content validity contributes to construct
validation more generally.
Validity in psychometrics refers to how well scores on a test accurately measure
what they are intended (i.e., designed) to measure. For example, when applied to test
scores, validity may refer to the accuracy with which the scores measure (1) cognitive
ability, (2) a personality attribute, (3) the degree of educational attainment, or (4) clas-
sification of persons related to mastery of subject material on a test used for certification
and licensure. As articulated by the Standards for Educational and Psychological Testing,
validation refers to the development of various types of evidence to support an inter-
pretative argument for the proposed uses of scores acquired from tests or measurement
instruments. Samuel Messick (1995) espoused a comprehensive approach to validity and
described it as a socially salient value that assumes a “scientific and political role that by
no means can be fulfilled by a simple correlation coefficient between test scores and a
purported criterion (e.g., statistically based criterion-related validity) or by expert judg-
ments that test content is relevant for the proposed test use (i.e., content-related valid-
ity)” (p. 6). Consequently, the validity of test scores is not simply expressed by a single
statistical summary measure (e.g., the correlation between a test score and an exter-
nal criterion) but rather by multifaceted evidence acquired from criterion, content, and
construct-related issues. Validity is arguably the most important topic in psychometrics
(Waller, 2006, pp. 9–30). The result of this multifaceted approach is that the validity of
test scores can be viewed along a continuum ranging from weak to acceptable to strong.
Figure 3.1 illustrates the validity continuum in relation to criterion, content, and con-
struct validity, along with explanations of each component provided by the AERA/APA/
NCME (1999) standards.

Validity continuum: weak to acceptable to strong, based on the collective evidence from the content, criterion, and construct components.

Content
● The appropriateness of a given content domain is related to the specific inferences to be made from test scores.
● Themes, wordings, and format of items, tasks, or questions on a test.
● Evidence based on logical or empirical analysis of the adequacy with which the test content represents the content domain and of the relevance of the content domain to the proposed interpretation of test scores.
● Evidence based on expert judgments of the relationship between parts of the test and the construct.

Criterion
● External variables that include criteria that the test is expected to predict, as well as relationships to other tests hypothesized to measure related or different constructs.
● Categorical variables such as group membership are relevant when the underlying theory of a proposed test use suggests that group differences should be present or absent if a proposed test interpretation is to be supported.
● Measures other than test scores, such as performance criteria, are often used in employment settings.

Construct
● Analysis of the internal structure of a test indicates the degree to which the relationships among test items and test components conform to the construct on which the proposed test score interpretations are based.
● The conceptual framework for a test may imply a single dimension of behavior, or it may posit several components that are each expected to be homogeneous but distinct from each other. The extent to which item interrelationships bear out the presumptions of the framework is relevant to validity.

FIGURE 3.1. Validity continuum. Bulleted information is from AERA, APA, and NCME (1999, pp. 11–13).

To clarify the role validity plays in test score use and interpretation, consider the follow-
ing three scenarios based on the general theory of intelligence used throughout this book.

1. The test is designed to measure inductive quantitative reasoning (a subtest con-


tained in fluid intelligence), but the multiple-choice items contained qualifiers
(e.g., most, some, usual) in the keyed responses and absolutes (e.g., all, never,
every) in the answer-choice distractors. Result: The test was made easier than
intended due to the qualifiers in the test items. Therefore, the scores produced
by the measurement actually are indicative of testwiseness rather than inductive
quantitative reasoning skill.
2. The test is designed to measure communication ability (a subtest contained
in crystallized intelligence), but the test items require a high level of reading
skill and vocabulary. Result: The test was made harder than intended due to the
required level of reading skill and vocabulary in the test items. Therefore, the
scores produced by the measurement actually are indicative of reading skill and
vocabulary levels rather than communication ability. This problem may be fur-
ther confounded by educational access in certain sociodemographic groups of
children.
3. The test is designed to measure working memory (a subtest contained in
short-term memory) by requiring an examinee to complete a word-pair asso-
ciation test by listening to one word, then responding by providing a second
word that completes the word pair, but the test items require a high level of
reading skill and vocabulary. Result: The test was made harder than intended
due to the required level of vocabulary in the test items (i.e., the words pre-
sented to the examinee by the examiner). Therefore, the scores produced by
the measurement actually are indicative of vocabulary level rather than short-
term working memory.

From the scenarios presented, you can see how establishing evidence for the validity of
test scores relative to their interpretation can be substantially undermined. The points
covered in the three scenarios illustrate that establishing validity evidence in testing
involves careful attention to the psychometric aspects of test development and in some
instances the test administration process itself.
Recall that validity evidence involves establishing evidence based on (1) test
response processes, (2) the internal structure of the test, (3) relations with other vari-
ables, and (4) the consequences of testing. Each component of validity addresses dif-
ferent but related aspects in psychological measurement. However, the three types of
validity are not independent of one another; rather, they are inextricably related. The
degree of overlap among the components may be more or less, depending on (1) the
purpose of the test, (2) the adequacy of the test development process, and (3) subsequent
score interpretations.
Using the quantitative reasoning (i.e., inductive reasoning) test described in scenario 1,
evaluation of content and criterion validity is concerned
with the question, “To what extent do the test items represent the traits being measured?”
A trait is defined as “a relatively stable characteristic of a person . . . which is manifested
to some degree when relevant, despite considerable variation in the range of settings and
circumstances” (Messick, 1989, p. 15). For example, when a person is described as being
sociable and another as shy, we are using trait names to characterize consistency within
individuals and also differences between them (Fiske, 1986). For additional background
on the evolution and use of trait theory, see McAdams & Pals (2007).
One example of the overlap that may occur between content and criterion-related
validity is the degree of shared relationship between the content (i.e., expressed in the test
items) and an external criterion (e.g., another test that correlates highly with the induc-
tive reasoning test). Construct validity addresses the question, “What traits are measured
by the test items?” In scenario number 1, the trait being measured is inductive reasoning
within the fluid reasoning component of general intelligence. From this example, you
can see that construct, criterion, and content validity concern the representativeness of
the trait relative to (1) trait theory, (2) an external criterion, and (3) the items comprising
the test designed to measure a specific trait such as fluid intelligence.

3.2 Criterion Validity

Criterion validity emerged first among the three components of validity. The criterion
approach to establishing validity involves using correlation and/or regression techniques
to quantify the relationship between test scores and a true criterion score. A true crite-
rion score is defined as the score on a criterion corrected for its unreliability. In criterion-
related validity studies, the process of validation involves addressing the question, “How
well do test scores estimate criterion scores?" The criterion can be performance on a task
(e.g., job performance—successful or unsuccessful), or the existence of a psychological
condition such as depression (e.g., yes or no), or academic performance in an educational
setting (e.g., passing or failing a test). The criterion may also be a matter of degree in the
previous examples (i.e., not simply a “yes” or “no” or “pass” or “fail” outcome); in such
situations, the criterion takes on more than two levels of the outcome. In the criterion
validity model, test scores are considered valid for any criterion for which they provide
accurate estimates (Gulliksen, 1987). To evaluate the accuracy of the criterion validity
approach, every examinee included in a validation study has a unique value on the crite-
rion. Therefore, the goal in acquiring evidence of criterion-related validity is to estimate
an examinee’s score on the criterion as accurately as possible.
Establishing criterion validity evidence occurs either by the concurrent or the predic-
tive approach. In the concurrent approach, criterion scores are obtained at the same time
(or approximately) as the scores on the test under investigation. Critical to the efficacy of
the criterion validity model is the existence of a valid criterion. The idea is that if an accu-
rate criterion is available, then it serves as a valid proxy for the test currently being used.
Alternatively, in the predictive approach, the goal is to accurately estimate the future perfor-
mance of an examinee (e.g., in an employment, academic, or medical setting). Importantly,
if a high-quality criterion exists, then powerful quantitative methods can be used to estimate
a validity coefficient (Cohen & Swerdlik, 2010; Cronbach & Gleser, 1965). Given the
utility of establishing criterion-related validity evidence, what are the characteristics of a
high-quality criterion? I address this important question in the next section.

3.3 Essential Elements of a High-Quality Criterion

Characteristics of a high-quality criterion include the following elements. First, the


criterion must be relevant. Relevance is defined by examinee traits or attributes that
are observable and measureable. For example, related to the general theory of intelli-
gence, an example of a relevant criterion for crystallized intelligence (e.g., measuring
the language development component of intelligence) is a test that taps the same trait
or attribute as the language development test but comprises an entirely different set
of test items. Second, the test serving as the criterion must be valid (i.e., the test and
the scores it yields should have research-based evidence of its validity). Ideally, the
test we use as a criterion to estimate validity will be on the same level of measurement
(e.g., on an interval level). Sometimes, however, criterion measures are based on sub-
jective-type rating scales or other measurements. In such cases, any test or criterion
measure based on subjective ratings or expert judgments should meet the require-
ments for the rigor of such ratings or judgments (e.g., see AERA, APA, & NCME,
1999, Standard 1.7, p. 19). Third, the criterion must be reliable if it is to be useful
for producing validity evidence. An essential element of reliability is that the scores
on tests are consistent when they are obtained under similar testing conditions. In
fact, score reliability is a necessary but not sufficient condition to establish validity
evidence. To this end, score reliability plays a central role in developing interpretative
validity evidence in general and for the estimation of validity coefficients specifically.
Fourth, a high-quality criterion is uncontaminated. Criterion contamination occurs
when the criterion measure, at least in part, consists of the same items that exist
on the test under study. There are challenges and limitations to implementing the
criterion-related approach, and these are presented next.

Challenges and Limitations to the Criterion Model


Conceptually, establishing criterion validity evidence has two advantages: (1) the crite-
rion is relevant to interpretation of the proposed uses and interpretations of test scores,
and (2) the technique is objective (i.e., once a criterion is specified, and data on exam-
inees is acquired, the validity coefficient can be estimated using correlation/regression
techniques). However, the challenges to the criterion validity model include the follow-
ing issues.

Challenge 1: The Criterion Problem


The main challenge in applying the criterion model in the process of establishing validity
evidence is finding an adequate, high-quality criterion. For example, in intelligence test-
ing, different theories abound, raising the question of whether a satisfactory theoretical
model even exists (Cronbach, 1980; Guion, 1998). As another example, in educational
achievement testing, it is difficult, if not impossible, to locate a criterion that is more
relevant and accurate than the test actually being used. Furthermore, specification of the
criterion involves value judgments and consideration of the consequences for examinees
based on using the criterion for placement and/or selection decisions. Although the cri-
terion validity model provides important advantages to certain aspects of the validation
process, the fact remains that validating the criterion itself is often viewed as an inher-
ent weakness in the approach (Gregory, 2000; Ebel & Frisbie, 1991). To illustrate, in
attempting to validate a test to be used as a criterion, there must be another test that
can serve as a reference for the relevance of the criterion attempting to be validated; to
this end, a circular argument ensues. Therefore, validation of the criterion itself is the pri-
mary shortcoming of the approach. One strategy in addressing this challenge is to use the
content validity model as a supplement to the criterion validity model. This strategy makes
sense because, as noted earlier, establishing a comprehensive validity argument involves all three
components (criterion, content, and construct), with construct validity actually subsuming the
criterion and content aspects. The content validity model is discussed later in the chapter.
Next we turn to challenge 2—sample size.

Challenge 2: Sample Size


A particularly challenging problem in estimating a validity coefficient relates to sample
size and its contribution to sampling error. For example, during the process of conduct-
ing a validation study, a researcher may have access to only a small sample of examinees
(e.g., < 50) with which to estimate the criterion validity coefficient. However, previous
research has demonstrated that although a predictor (i.e., the test being evaluated) may
have an acceptable level of validity in the population, with samples smaller than 50 exam-
inees the validity level in the sample will be adequate less than 35% of the time (Schmidt,
Hunter, & Urry, 1976). Another artifact of correlation-based techniques is restriction of
range—challenge number 3.

Challenge 3: Restriction of Range


Because the criterion validity coefficient is derived using correlation techniques, any
restriction of range in the predictor or criterion (or both) will attenuate the validity
coefficient. For example, suppose a test is being used to diagnose patients as clinically
depressed, but no evidence exists regarding the validity of the test. If this test is used
to diagnose patients, any patients who score such that they are classified as not being
depressed are unable to be used in a validation study. This situation results in a restriction
of range on the predictor because no scores are available for patients who were classified
as being not depressed.
Another cause of attenuated validity coefficients can be ascribed to the predictor being
correlated with some other measure that is also correlated with the criterion. Continuing
with our depression example, consider the situation where a criterion validity study is
being conducted on the test of clinical depression and patients participating in the study
have been included based on their scores on a test of anxiety. Because anxiety is often
related to clinical depression, a range restriction may incidentally occur in the predictor.

Restriction of range may also occur when the predictor or criterion tests exhibit
floor or ceiling effects. A floor effect occurs when a test is exceptionally difficult, resulting
in most examinees scoring very low. Conversely, a ceiling effect occurs when a test is
exceptionally easy, resulting in most examinees scoring very high.

Challenge 4: Criterion Contamination


Criterion contamination occurs when any person who can influence or affect an examin-
ee’s score on the criterion has access to information on the examinee’s predictor scores. To
this end, no person or persons should have access to examinees’ scores on predictor vari-
ables to be used in a criterion validity study. To provide an example, suppose that a pro-
fessor wants to conduct a criterion validity study of undergraduate performance where
the predictor is the crystallized intelligence test total score and the criterion is measured
as students’ first-year grade point average. The sample consists of 100 students in a large
lecture-based seminar course. Next, consider the situation where the professor knows
the scores of his or her students on the crystallized intelligence test. Such knowledge
can influence the professor's expectations of the students' performance on the criterion
and, in turn, the professor's behavior. For example, the professor may view the students'
work in class as better than it really is. Because it is not
possible to statistically adjust for criterion contamination, it is important that situations
likely to result in this artifact be avoided when planning criterion validity studies. Next,
the reliability of scores produced by the predictor and criterion is presented as another
challenge in estimating criterion-related score validity.

Challenge 5: Reliability of the Predictor and Criterion


Reliability and validity share a close relationship in psychometrics. For example, the
reliability of the predictor and criterion variables directly influences the validity coef-
ficient. Chapter 7 (reliability) emphasizes that high score reliability is an important property
of a test. This same point holds true for predictor and criterion tests in validity
studies. However, it is not always true that the predictive power of a test peaks when a
high level of internal consistency reliability is observed. This subject is addressed next.

3.4 Statistical Estimation of Criterion Validity

This section introduces the psychometric and statistical aspects related to the estimation of
criterion-related validity. To facilitate understanding, we use the GfGc data to merge theory
with application. Recall that one of the subtests of crystallized intelligence tests measures
the language development component according to the general theory of intelligence. Sup-
pose a psychologist wants to concurrently evaluate the criterion validity between the lan-
guage development component (labeled as “cri1_tot” in the GfGc dataset) of crystallized
intelligence and an external criterion. The criterion measure is called the Highly Valid Scale
of Crystallized Intelligence (HVSCI; included in the GfGc dataset). Furthermore, suppose
that based on published research, the HVSCI correlates .92 with the verbal intelligence
(VIQ) composite measure on the Wechsler Scale of Intelligence for Adults—Third Edition
(WAIS-III; Wechsler, 1997b). The validity evidence is firmly established for the WAIS-III
VIQ composite as reported by published research. To this end, the correlation evidence
between the HVSCI and the WAIS-III VIQ provides evidence that the HVSCI meets one
aspect of the criteria discussed earlier for a high-quality criterion. Although the general
theory of intelligence differs from the Wechsler theory underlying the WAIS-III (e.g., the
WAIS-III is based on a different theory and has different test items), the VIQ composite score provides
a psychometrically valid external criterion by which the language development test can be
evaluated. Finally, because there is strong evidence of a relationship between the HVSCI
and the WAIS-III VIQ (i.e., the correlation between the HVSCI and VIQ is .92), we will use
the HVSCI as our external criterion in the examples provided in this section.
The criterion validity of the language development test can be evaluated by calcu-
lating the correlation coefficient using examinee scores on the language development
subtest and scores on the HVSCI. For example, if we observe a large, positive correlation
between scores on the language development subtest and scores on the HVSCI, we have
evidence that scores on the two tests converge, thereby providing one source of validity
evidence within the comprehensive context of validity. The correlation between the lan-
guage development total score and examinee scores on the HVSCI is .85 (readers should
verify the value of .85 using the GfGc N = 1000 dataset). The .85 coefficient is a form of
statistical validity evidence referred to as the validity coefficient. The concurrent validity
coefficient provides one type of criterion-based evidence (i.e., statistical) in the approach
to evaluating the validity of scores obtained on the language development test.
Recall that test (score) reliability affects the value of the validity coefficient. One way
to conceptualize how score reliability affects score validity is based on the unreliability
(i.e., 1 – rxx) of the scores on the test. To deal with the influence of score reliability in vali-
dation studies, we can correct for the error of measurement (i.e., the unreliability of the
test scores). The upper limit of the validity coefficient is constrained by the square root of
the reliability of each test (i.e., the predictor and the criterion). By taking the square root
of each test’s reliability coefficient, we are using reliability indexes rather than reliability
coefficients (e.g., in Equation 3.1a on page 68). From classical test theory, the reliability
index is defined as the correlation between true and observed scores (e.g., see Chapter 7
for details). To apply this information to our example, the reliability coefficient is .84 for
crystallized intelligence test 1 and .88 for the HVSCI external criterion (readers should
verify this by calculating the coefficient alpha internal consistency reliability for each test
using the GfGc data). Knowing this information, we can use Equation 3.1a to estimate
the theoretical upper limit of the validity coefficient with one predictor.
Inserting the information for our two tests into Equation 3.1a provides the fol-
lowing result in Equation 3.1b.
Once the correlation between the two tests is corrected for their reliability, we see
that the upper limit of the validity coefficient is .86. However, this upper limit is purely

Equation 3.1a. Upper limit of validity coefficient

$$r_{xy} = \sqrt{(r_{xx})(r_{yy})}$$

• rxy = upper limit of the validity coefficient.
• rxx = reliability of the test being evaluated.
• ryy = reliability of the test serving as the criterion.

Equation 3.1b. Upper limit of validity coefficient

$$r_{xy} = \sqrt{(.84)(.88)} = \sqrt{.739} = .86$$

• rxy = upper limit of the validity coefficient.
• rxx = reliability of the test being evaluated.
• ryy = reliability of the test serving as the criterion.

Note. The square root of the reliability coefficient is the reliability index. From classical test theory, this index is the correlation between true scores and observed scores (see Chapter 7).

theoretical because in practice we are using fallible measures (i.e., tests that are not per-
fectly reliable). To further understand how the upper limit on the validity coefficient is
established, we turn to an explanation of the correction for attenuation.

3.5 Correction for Attenuation

The correction for attenuation introduced in 1907 by Charles Spearman (1863–1945)


provides a way to estimate the correlation between a perfectly reliable test and a perfectly
reliable criterion. Formally, Spearman defined the correction as (a) the corrected correla-
tion between true scores in each of the two measures and (b) the correlation between
the two measures when each is increased to infinite length (i.e., mathematically, as the
number of test items increases, the reliability coefficient continues to approach 1.0; an
infinitely long test will exhibit a perfect reliability of 1.0). The correction for attenua-
tion for predictor and criterion scores is provided in Equation 3.2a (Gulliksen, 1987,
pp. 101–105; Thissen & Wainer, 2001; Guilford, 1978, p. 487). Equation 3.2b illustrates
the application of 3.2a with the example reliability information using the GfGc dataset.

Equation 3.2a. Validity coefficient corrected for attenuation in the test and criterion

$$r_{\infty\omega} = \frac{r_{xy}}{\sqrt{r_{xx}\,r_{yy}}}$$

• r∞ω = correlation between the true score components of test scores x and y.
• rxy = correlation between test score x and criterion score y.
• ryy = reliability of test y (the criterion).
• rxx = reliability of test x.

Equation 3.2b. Estimated validity coefficient based on a perfect test and criterion

$$r_{\infty\omega} = \frac{r_{xy}}{\sqrt{r_{xx}\,r_{yy}}} = \frac{.85}{\sqrt{(.84)(.88)}} = \frac{.85}{.86} = .98$$

In practice, we use tests that include a certain amount of error—a situation that
is manifested as the unreliability of test scores. For this reason, we must account for
the amount of error in the criterion when estimating the validity coefficient. To correct
validity coefficients for attenuation in the criterion measure only but not the predictor,
Equation 3.3a is used (Guilford, 1954, p. 401, 1978, p. 487; AERA, APA, and NCME,
1999, pp. 21–22). Equation 3.3b illustrates the application of 3.3a with our example
reliability information.

Equation 3.3a. Validity coefficient corrected for attenuation in the criterion only

$$r_{x\omega} = \frac{r_{xy}}{\sqrt{r_{yy}}}$$

• rxω = validity coefficient corrected for attenuation in the criterion only.
• rxy = correlation between test score x and criterion score y.
• ryy = reliability of test y (the criterion).

Equation 3.3b. Validity coefficient corrected for attenuation in the criterion with example data

$$r_{x\omega} = \frac{r_{xy}}{\sqrt{r_{yy}}} = \frac{.85}{\sqrt{.88}} = \frac{.85}{.94} = .91$$
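
The upper limit of the validity coefficient (Equation 3.1) and the two attenuation corrections (Equations 3.2 and 3.3) reduce to a few arithmetic steps; the sketch below uses the reliabilities (.84 and .88) and the observed validity coefficient (.85) from the running example. Values that differ in the second decimal from the text (e.g., .99 versus .98) reflect rounding at intermediate steps.

```python
import math

r_xy = .85   # observed correlation between the language development test and the HVSCI
r_xx = .84   # reliability of the predictor test
r_yy = .88   # reliability of the criterion (HVSCI)

# Equation 3.1a: theoretical upper limit of the validity coefficient
upper_limit = math.sqrt(r_xx * r_yy)                  # about .86

# Equation 3.2a: correction for attenuation in both the test and the criterion
r_corrected_both = r_xy / math.sqrt(r_xx * r_yy)      # about .99 (.98 in Equation 3.2b after rounding)

# Equation 3.3a: correction for attenuation in the criterion only
r_corrected_criterion = r_xy / math.sqrt(r_yy)        # about .91

print(round(upper_limit, 2), round(r_corrected_both, 2), round(r_corrected_criterion, 2))
```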

3.6 Limitations to Using the Correction for Attenuation

Effective use of the correction for attenuation requires accurate estimates of score reliability.
For example, if the reliability of the scores on a test or the criterion is underestimated, the cor-
rected coefficient will be overestimated. Conversely, if the reliability of the test or the criterion
is overestimated, the corrected coefficient will be underestimated. To err on the conservative
side, you can use reliability coefficients for the test and criterion that are overestimated in the
correction formula. This point raises the question, "Which type of reliability
estimate is best to use when correction formulas are to be applied?" For example, when using
internal consistency reliability methods such as coefficient alpha, the reliability of true scores
is often underestimated. Because of the underestimation problem with coefficient alpha, alternate
forms of reliability estimates are recommended for use in attenuation correction formulas. Finally,
correlation coefficients fluctuate based on the degree of sampling and measurement error.
The following recommendations are offered regarding the use and interpretation
of correction attenuation formulas. First, when conducting validity studies researchers
should make every attempt to reduce sampling error by thoughtful sampling protocols
paired with rigorous research design (e.g., see the section on challenges to the criterion
validity model earlier in the chapter). Second, large samples are recommended since this
action aids in reducing sampling error. Third, corrected validity coefficients should be
interpreted with caution when score reliability estimates are low (i.e., the reliability of
either the predictor or criterion or both is low).

3.7 Estimating Criterion Validity with Multiple Predictors: Partial Correlation

Establishing validity evidence using the criterion validity model sometimes involves using
multiple predictor variables (e.g., several tests). Central to the multiple-variable problem relative
to test or score validity is the question, "Am I actually studying the relationships among
the variables that I believe I am studying?" Answering this question requires thoughtful
reasoning, and statistical control provides one way to do so. In a validation study,
statistical control means controlling the influence of a "third" or "additional" predictor (e.g.,
test) by accounting for (partialling out) its relationship with the primary predictor (e.g., test)

of interest in order to more accurately estimate its effect on the criterion. The goal in statisti-
cal control is to (1) maximize the systematic variance attributable to the way examinees
respond to test items (e.g., artifacts of the test or testing conditions that cause examinees to
score consistently high or low); (2) minimize error variance (e.g., error attributable to the
content of the test or instrument or the research design used in a study); and (3) control
extraneous variance (e.g., other things that increase error variance such as elements specific
to the socialization of examinees). Chapter 7 on score reliability based on classical test theory
summarizes the issues that contribute to the increase in variability of test scores.
In validation studies, multiple predictor variables (tests) are often required in order to
provide a comprehensive view of the validity of test scores. For example, consider the sce-
nario where, in addition to the primary predictor variable (test), there is a second predictor
variable that correlates with the primary predictor variable and the criterion. To illustrate, we
use as the criterion the Highly Valid Scale of Intelligence (HVSCI) in the GfGc dataset and the
language development subtest component of crystallized intelligence as the primary predic-
tor. Suppose research has demonstrated that fluid intelligence is an important component
that is related to language development. Therefore, accounting for fluid intelligence provides
a more accurate picture of the relationship between language development and the HVSCI.
The result is an increase in the integrity of the validity study. Armed with this knowledge, the
graphic identification subtest of fluid intelligence (labeled “fi2_tot” in the GfGc dataset) is
introduced with the goal of evaluating the relationship between the criterion (HVSCI) and the
primary predictor for a group of examinees whose graphic identification scores are similar. To
accomplish our analytic goal, we use the first-order partial correlation formula illustrated
in Equation 3.4a.

Equation 3.4a. First-order partial correlation coefficient

$$r_{YX_1 \cdot X_2} = \frac{r_{YX_1} - r_{YX_2}\, r_{X_1X_2}}{\sqrt{1 - r_{YX_2}^2}\,\sqrt{1 - r_{X_1X_2}^2}}$$

• rYX1·X2 = first-order partial correlation coefficient.
• rYX1 = correlation between criterion Y and predictor X1.
• rYX2 = correlation between criterion Y and predictor X2.
• rX1X2 = correlation between predictor X1 and predictor X2.
• r²YX2 = square of the correlation between criterion Y and predictor X2.
• r²X1X2 = square of the correlation between predictor X1 and predictor X2.
• r²XY = coefficient of determination.
Note. The variable following the multiplication dot (·) is the variable being "partialled."

The first-order partial correlation refers to one of two predictors being statistically
controlled and involves three variables (i.e., criterion, predictor 1, and predictor 2). Alter-
natively, the zero-order correlation (i.e., Pearson) involves only two variables (i.e., a
criterion and one predictor). To apply the first-order partial correlation, we return to our
GfGc data and use the HVSCI as the criterion (Y; labeled HVSCI), the language develop-
ment total score (based on the sum of the items on crystallized intelligence test 1) as
the primary predictor (X1; labeled cri1_tot), and a second predictor (X2; labeled fi2_tot)
based on a measure of fluid intelligence (i.e., the graphic identification subtest total score).
Equation 3.4b illustrates the use of the first-order partial correlation using the GfGc data;
we see that the result is .759. To illustrate how to arrive at this result using SPSS, syntax is
provided below (readers should conduct this analysis and verify their work with the par-
tial output provided in Table 3.1). The top panel of Table 3.1 contains the Pearson (zero-order)
correlations, and the bottom panel contains the first-order partial correlation.

SPSS partial correlation syntax

PARTIAL CORR
/VARIABLES=HVSCI cri1_tot BY fi2_tot
/SIGNIFICANCE=TWOTAIL
/STATISTICS=CORR
/MISSING=LISTWISE.

Continuing with our example, we see from the SPSS output in Table 3.1 that language
development and graphic identification are moderately correlated (.39). Using this informa-
tion, we can answer the question, “What is the correlation (i.e., validity coefficient) between
language development (the primary predictor) and HVSCI (the criterion) given the exam-
inees’ scores (i.e., ability level) on graphic identification?” Using the results of the analysis,
we can evaluate or compare theoretical expectations based on previous research related to

Equation 3.4b. First-order partial correlation coefficient with GfGc data

$$r_{YX_1 \cdot X_2} = \frac{.799 - (.428)(.392)}{\sqrt{1 - .428^2}\,\sqrt{1 - .392^2}} = \frac{.631}{(.904)(.920)} = .759$$

Table 3.1.  SPSS Partial Correlation Output


Correlations
Control Variables HVSCI cri1_tot fi2_tot
-none = zero order HVSCI Correlation 1.000 .799 .428
correlationa Significance (2-tailed) . .000 .000
df 0 998 998
cri1_tot Correlation .799 1.000 .392
Significance (2-tailed) .000 . .000
df 998 0 998
fi2_tot Correlation .428 .392 1.000
Significance (2-tailed) .000 .000 .
df 998 998 0
fi2_tot (adjusted correlation HVSCI Correlation 1.000 .759
between HVSCI and cri1_tot Significance (2-tailed) . .000
with fi2_tot partialled) df 0 997
cri1_tot Correlation .759 1.000
Significance (2-tailed) .000 .
df 997 0
a. Cells contain zero-order (Pearson) correlations.

the previous question (e.g., “Does our analysis concur with previous research or theoretical
expectations?”). As you can see, the partial correlation technique provides a way to evaluate
different and sometimes more complex score validity questions beyond the single-predictor
case. Inspection of Equation 3.4a and Table 3.1 reveals that when two predictors are substantially
correlated with each other and with the criterion, the usefulness of the second predictor diminishes
because the predictors are explaining much of the same variance in the criterion.
The results of the partial correlation analysis can be interpreted as follows. Controlling
for examinee scores (i.e., their ability) on the graphic identification component of fluid intel-
ligence, we see that the correlation between HVSCI and language development is .759. Notice
that the zero-order correlation between HVSCI and language development is .799. By par-
tialing out or accounting for the influence of graphic identification, the correlation between
HVSCI and language development reduces to .759. Although the language development and
graphic identification tests are moderately correlated (.39), the graphic identification test
adds little to the relationship between language development and the HVSCI. To this end,
the graphic identification component of fluid intelligence contributes little above and beyond
what language development contributes alone to the HVSCI. However, the first-order partial
correlation technique allows us to isolate the contribution each predictor makes to the HVSCI
in light of the relationship between the two predictors.
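Readers who prefer to check the partial correlation outside of SPSS can reproduce the value of .759 directly from the zero-order correlations in Table 3.1. The Python sketch below (added for illustration; it assumes only the three correlations reported by SPSS) applies Equation 3.4a.

Python sketch for Equation 3.4a

import math

# Zero-order correlations from Table 3.1.
r_y1 = 0.799  # HVSCI with language development (cri1_tot)
r_y2 = 0.428  # HVSCI with graphic identification (fi2_tot)
r_12 = 0.392  # cri1_tot with fi2_tot

# Equation 3.4a: partial correlation of Y and X1, controlling X2.
r_y1_given_2 = (r_y1 - r_y2 * r_12) / (
    math.sqrt(1 - r_y2 ** 2) * math.sqrt(1 - r_12 ** 2))
print(round(r_y1_given_2, 3))  # ~0.759, matching the SPSS partial correlation output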
Equation 3.5a illustrates another way to understand how we arrived at the result of
.759 in Equations 3.4a and 3.4b. Equation 3.5 illustrates the semipartial correlation and
allows for partitioning the correlation in a way that isolates the variance in the HVSCI

Equation 3.5a. Semipartial correlation coefficient

$$r_{YX_1 \cdot X_2} = \frac{r_{YX_2} - r_{YX_1}\, r_{X_2X_1}}{\sqrt{1 - r_{X_2X_1}^2}}$$

• rYX1·X2 = semipartial correlation coefficient.
• rYX2 = correlation between criterion Y and predictor X2.
• rX2X1 = correlation between predictor X2 and predictor X1.
• r²YX2 = square of the correlation between criterion Y and predictor X2.
• r²X2X1 = square of the correlation between X2 and predictor X1.
• r²XY = coefficient of determination, or proportion of variance accounted for in Y by X.
• r²YX1·X2 = coefficient of determination, or proportion of variance accounted for in Y by X1 after controlling for X2.
Note. The variable following the multiplication dot (·) is the variable being "partialed."

(Y) accounted for by language development (X1) after the effect of graphic identification
(X2) is partialed or controlled.
Applying the correlation coefficients from our example data, we have the result in
Equation 3.5b. Note that the result below agrees with Equation 3.4b. Therefore, we have
illustrated a second way to arrive at the same conclusion but the semipartial correlation
provides a slightly different way to isolate or understand the unique and nonunique rela-
tionships among the predictor variables in relation to the criterion.
Figure 3.2 provides a Venn diagram depicting the results of our analysis in Equation
3.5b.

Equation 3.5b. Semipartial correlation coefficient with example data

$$r_{YX_1 \cdot X_2} = \frac{r_{YX_2} - r_{YX_1}\, r_{X_2X_1}}{\sqrt{1 - r_{X_2X_1}^2}} = \frac{.428 - .799(.392)}{.920} = .0876 \;\Rightarrow\; r_{YX_1 \cdot X_2}^2 = .0076$$

Expressed as a percentage, r² = .0076 corresponds to about 0.76% of the variance in Y.
[Figure 3.2 appears here: a Venn diagram with overlapping circles for Y (HVSCI), X1 (language development), and X2 (graphic identification), showing the portion of variance in Y accounted for by language development after the effect of graphic identification is partialled out or removed, and the variance in Y accounted for in common by graphic identification and language development.]

FIGURE 3.2.  Venn diagram illustrating the semipartial correlation. The circles represent percentages (e.g., each circle represents 100% of each variable). This allows for conversion of correlation coefficients into the proportion-of-variance metric, r². The r² metric can then be converted to percentages to aid interpretation.

A final point relates to the size of the sample required to ensure adequate statistical
power for reliable results. A general rule of thumb regarding the necessary sample size
for conducting partial correlation and multiple regression analysis is minimally 15 sub-
jects per predictor variable when (1) there are between 3 and 25 predictors and (2) the
squared multiple correlation, R2 = .50 (Stevens, 2003, p. 143). The sample size require-
ment for partial correlation and regression analysis also involves consideration of (1) the
anticipated effect size and (2) the alpha-level used to test the null hypothesis that R2 =
0 in the population (Cohen, Cohen, West, & Aiken, 2003, pp. 90–95). Benchmarks for effect
size in terms of the proportion of variance accounted for, R² (see Figure 3.2 as an example),
are .02 (small), .13 (medium), and .26 (large) (Cohen et al., 2003, p. 93). These
sample guidelines are general, and a sample size/power analysis should be conducted as
part of a validation study to ensure accurate and reliable results. Finally, remember that,
in general, sampling error is reduced as sample size increases.
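These rules of thumb are easy to operationalize. The Python sketch below (an illustrative helper added here, not part of the cited sources) encodes the 15-cases-per-predictor guideline from Stevens (2003) and the R² effect size benchmarks from Cohen et al. (2003); treat its output as a starting point rather than a substitute for a formal power analysis.

Python sketch for the sample size and effect size guidelines

def minimum_n(num_predictors, cases_per_predictor=15):
    """Rule-of-thumb minimum sample size (Stevens, 2003)."""
    return num_predictors * cases_per_predictor

def r2_effect_size(r_squared):
    """Cohen et al. (2003) benchmarks for R-squared."""
    if r_squared >= 0.26:
        return "large"
    if r_squared >= 0.13:
        return "medium"
    if r_squared >= 0.02:
        return "small"
    return "below the 'small' benchmark"

print(minimum_n(3))          # 45 examinees for a three-predictor model
print(r2_effect_size(0.26))  # "large"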

Correction for Attenuation and First-Order Partial Correlation


To refine the first-order partial correlation, we can apply the correction for attenuation,
resulting in an improved validity coefficient. Equation 3.6a provides the correction for
attenuation, and Equation 3.6b illustrates this step with our example data.
Equation 3.6b illustrates estimation of the partial correlation corrected for attenu-
ation in the criterion and two predictor variables using reliability and correlation coef-
ficients from our example data.

Equation 3.6a. Correction for attenuation applied to the first-order partial correlation

$$r^{*}_{YX_2 \cdot X_1} = \frac{\rho_{X_1} r_{YX_2} - r_{YX_1} r_{X_1X_2}}{\sqrt{\rho_{X_1}\rho_{Y} - r_{YX_1}^2}\,\sqrt{\rho_{X_1}\rho_{X_2} - r_{X_1X_2}^2}}$$

• r*YX2·X1 = partial correlation corrected for attenuation.
• ρX1 = reliability of language development—predictor X1.
• ρY = reliability of HVSCI—criterion Y.
• ρX2 = reliability of graphic identification—predictor X2.
• rYX2 = correlation between criterion Y (HVSCI) and predictor X2 (graphic identification).
• rYX1 = correlation between criterion Y and predictor X1.
• rX1X2 = correlation between predictor X1 and X2.
• r²YX1 = correlation between criterion Y and predictor X1, squared.
• r²X1X2 = correlation between predictor X1 and predictor X2, squared.

Equation 3.6b. First-order partial correlation corrected for attenuation in the criterion and predictor variables

$$r^{*}_{YX_2 \cdot X_1} = \frac{(.87)(.42) - (.79)(.39)}{\sqrt{(.87)(.88) - .64}\,\sqrt{(.87)(.91) - .15}} = \frac{.36 - .31}{\sqrt{.76 - .64}\,\sqrt{.79 - .15}} = \frac{.05}{(.34)(.8)} = .18$$

Equation 3.6c. Correction for attenuation for the criterion only

$$r^{*}_{YX_2 \cdot X_1} = \frac{r_{YX_2} - r_{YX_1} r_{X_1X_2}}{\sqrt{\rho_{Y} - r_{YX_1}^2}\,\sqrt{1 - r_{X_1X_2}^2}} = \frac{.42 - (.80)(.39)}{(.49)(.62)} = \frac{.11}{.30} = .36$$

We see in Equation 3.6b that correcting for the attenuation using all three variables
substantially changes the partial correlation for graphic identification’s predictive validity
to .18. In practical testing situations, the predictors will never be completely reliable (i.e.,
100% free from measurement error), so it is more reasonable to correct for attenuation in
the criterion only. The result of this approach is provided in Equation 3.6c.
As Equation 3.6c shows, correcting for attenuation in just the criterion makes a sub-
stantial change from the case where we corrected for attenuation in all three variables.
Specifically, the validity coefficient derived based on the partial correlation corrected for
attenuation in the criterion only is .36 (substantially higher than .18).
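The attenuation-corrected partial correlation in Equation 3.6b can also be computed directly. The Python sketch below (added for illustration) uses the correlations and reliability estimates given above; because it carries full precision rather than the rounded intermediate terms shown in Equation 3.6b, its result (about .19) differs slightly from the .18 reported in the text.

Python sketch for Equation 3.6a

import math

# Correlations and score reliabilities used in Equation 3.6b.
r_y1, r_y2, r_12 = 0.79, 0.42, 0.39       # Y-X1, Y-X2, and X1-X2 correlations
rel_y, rel_x1, rel_x2 = 0.88, 0.87, 0.91  # reliabilities of HVSCI, X1, X2

numerator = rel_x1 * r_y2 - r_y1 * r_12
denominator = (math.sqrt(rel_x1 * rel_y - r_y1 ** 2)
               * math.sqrt(rel_x1 * rel_x2 - r_12 ** 2))
print(round(numerator / denominator, 2))  # ~0.19 at full precision (.18 in the text)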

3.8 Estimating Criterion Validity with Multiple Predictors: Higher-Order Partial Correlation

The first-order partial correlation technique can be expanded to include more than a
single variable. For example, you may be interested in controlling the influence of an
additional predictor variable that is related to the primary predictor variable. In this sce-
nario, the higher-order partial correlation technique provides a solution. To illustrate,
consider the case where you are conducting a validity study with the goal of evaluating
the criterion validity of the HVSCI using a primary predictor of interest, but now you
have two predictors that previous research has indicated influence the criterion validity
of the HVSCI. Building on the first-order partial correlation technique, the equation for
higher-order partial correlation is presented in Equation 3.7.

Equation 3.7. Higher-order partial correlation

$$R_{YX_1 \cdot X_2X_3} = \frac{R_{YX_1 \cdot X_2} - R_{YX_3 \cdot X_2}\, R_{X_1X_3 \cdot X_2}}{\sqrt{1 - R_{YX_3 \cdot X_2}^2}\,\sqrt{1 - R_{X_1X_3 \cdot X_2}^2}}$$

• RYX1·X2X3 = higher-order (second-order) partial correlation between Y and X1 with predictors X2 and X3 partialed (removed).
• RYX1·X2 = correlation between Y and X1 with predictor X2 partialed (removed).
• RYX3·X2 = correlation between Y and predictor X3 with predictor X2 partialed (removed).
• RX1X3·X2 = correlation between predictor X1 and predictor X3 with predictor X2 partialed (removed).
• R²YX3·X2 = squared correlation between criterion Y and predictor X3 with predictor X2 partialed (removed).
• R²X1X3·X2 = squared correlation between predictor X1 and predictor X3 with predictor X2 partialed (removed).
• 1 − R²YX3·X2 = proportion of variance unaccounted for between criterion Y and predictor X3 with predictor X2 partialed (removed).
• 1 − R²X1X3·X2 = proportion of variance unaccounted for between predictor X1 and predictor X3 with predictor X2 partialed (removed).
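Because Equation 3.7 is built entirely from first-order partial correlations, it can be verified with a short script. The Python sketch below (added for illustration) chains Equation 3.4a to produce the required first-order partials from the zero-order correlations reported in the SPSS output that follows (Table 3.2b) and then applies Equation 3.7.

Python sketch for Equation 3.7

import math

def partial_corr(r_ab, r_ac, r_bc):
    """First-order partial correlation of a and b, controlling c (Equation 3.4a)."""
    return (r_ab - r_ac * r_bc) / (
        math.sqrt(1 - r_ac ** 2) * math.sqrt(1 - r_bc ** 2))

# Zero-order correlations from Table 3.2b.
r_y1, r_y2, r_y3 = 0.799, 0.428, 0.393  # HVSCI with cri1_tot, fi2_tot, stm3_tot
r_12, r_13, r_23 = 0.392, 0.363, 0.480  # correlations among the predictors

# First-order partials needed by Equation 3.7 (each controlling X2).
r_y1_2 = partial_corr(r_y1, r_y2, r_12)  # Y with X1, controlling X2
r_y3_2 = partial_corr(r_y3, r_y2, r_23)  # Y with X3, controlling X2
r_13_2 = partial_corr(r_13, r_12, r_23)  # X1 with X3, controlling X2

# Second-order partial: Y with X1, controlling X2 and X3.
r_y1_23 = (r_y1_2 - r_y3_2 * r_13_2) / (
    math.sqrt(1 - r_y3_2 ** 2) * math.sqrt(1 - r_13_2 ** 2))
print(round(r_y1_23, 3))  # ~0.746, matching the SPSS output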

To provide an example of the higher-order partial correlation technique, we begin


with an analysis using SPSS. We use the same criterion and predictors as before but will
include an additional predictor (X3; a short-term memory component consisting of audi-
tory and visual learning). Tables 3.2a and 3.2b provide SPSS output, including the means
and standard deviations and zero-order (Pearson) correlations for the criterion (HVSCI)
and the predictors (X1—language development; X2 —graphic identification; X3 —auditory and
visual components of short-term memory). The following SPSS syntax generated the data
in Tables 3.2a and 3.2b.

SPSS syntax for partial correlation

PARTIAL CORR
/VARIABLES=HVSCI cri1_tot BY fi2_tot stm3_tot
/SIGNIFICANCE=TWOTAIL
/STATISTICS=DESCRIPTIVES CORR
/MISSING=LISTWISE.

Table 3.2a.  Means and Standard Deviations for


Predictors and Criterion
Descriptive Statistics
Mean Std. Deviation N

HVSCI 80.24 21.89 1000

cri1_tot 35.23 8.61 1000

fi2_tot 9.02 5.44 1000

stm3_tot 16.37 4.12 1000

Table 3.2b.  Higher-Order Partial Correlations for Predictors and Criterion


Correlations
Control Variables HVSCI cri1_tot fi2_tot stm3_tot
-none-a HVSCI Correlation 1.000 .799 .428 .393
Significance (2-tailed) . .000 .000 .000
df 0 998 998 998
cri1_tot Correlation .799 1.000 .392 .363
Significance (2-tailed) .000 . .000 .000
df 998 0 998 998
fi2_tot Correlation .428 .392 1.000 .480
Significance (2-tailed) .000 .000 . .000
df 998 998 0 998
stm3_tot Correlation .393 .363 .480 1.000
Significance (2-tailed) .000 .000 .000 .
df 998 998 998 0
fi2_tot & HVSCI Correlation 1.000 .746
stm3_tot Significance (2-tailed) . .000
df 0 996
cri1_tot Correlation .746 1.000
Significance (2-tailed) .000 .
df 996 0
a. Cells contain zero-order (Pearson) correlations.

Reviewing the results presented in Tables 3.2a and 3.2b, we see that the higher-order
partial correlation between the criterion HVSCI and language development (cri1_tot) is
.746 (lightly shaded) after removing the influence of graphic identification (fi2_tot) and
short-term memory (stm3_tot). In Chapter 2, correlation and regression were introduced
as essential to studying the relationships among variables. When we have more than
two variables, regression provides a framework for estimating validity. The next section

illustrates how using multiple regression techniques helps explain how well a criterion is
predicted from a set of predictor variables.

3.9 Coefficient of Multiple Determination and Multiple Correlation

The coefficient of multiple determination (R²Y·1,...,m, or simply R²) provides an answer to the


question, “How well is a criterion predicted from a set of predictor variables?” To derive
R2, several components related to the criterion Y and multiple predictor variables (Xks)
are required. Specifically, we need the components that comprise the total sum of squares
(introduced in Chapter 2) in the criterion Y (SSY). The sum of squares in Y is defined in
Equation 3.8.
The total sum of squares in Y can be partitioned into the sum of squares derived from
the regression of Y on the Xks (i.e., SSregression) and the sum of squares derived from the dif-
ference between the observed Y scores and the predicted Y scores (i.e., SSresidual). Equation
3.9 provides the relationship among SSY, SSregression, and SSresidual (Draper & Smith, 1998,
pp. 28–33).
Using the sums of squares, we find that the coefficient of multiple determination,
R²Y·1,...,m or R², provides an overall measure of the predictive accuracy of the set
of predictors relative to the criterion. With regard to the mechanics of the regression
equation, Equation 3.10 illustrates the coefficient of multiple determination.
The relationship between R²Y·1,...,k and the sums of squares is provided in Equation 3.11.
Finally, the size of the coefficient of multiple determination is affected by (1) reliability of
the predictors in the regression model, (2) reliability of the relevant predictors not in the
model, and (3) total variation (i.e., standard deviation/variance) in Y.

Equation 3.8. Total sum of squares in Y

$$SS_Y = \frac{n \sum Y_i^2 - \left(\sum Y_i\right)^2}{n}$$

• SSY = sum of the squared differences between each examinee's score on Y and the mean of Y.
• ΣYi² = sum of the squared scores on Y for each examinee in a sample.
• n = sample size.
• (ΣYi)² = sum of the Y scores across all examinees, then this sum squared.

Equation 3.9. Partitioning the sum of squares in regression

$$SS_Y = SS_{\text{regression}} + SS_{\text{residual}}$$

or

$$\sum (Y_i - \bar{Y})^2 = \sum (Y_i' - \bar{Y})^2 + \sum (Y_i - Y_i')^2$$

(sum of squares about the mean) = (sum of squares due to regression) + (sum of squares about regression)

df: (n − 1) = (k) + (n − k − 1)

• SSY = sum of the squared differences between each examinee's score on Y and the mean of Y.
• Yi′ = predicted score on Y for each examinee in a sample.
• Ȳ = mean of the criterion Y for the sample.
• Yi = score on Y for an examinee.
• Σ = summation operator.
• df = degrees of freedom.
• n = sample size.
• k = number of predictors.

Equation 3.10. Coefficient of multiple determination

$$R^2_{Y \cdot 1,\ldots,k} = \beta_1 r_{YX_1} + \beta_2 r_{YX_2} + \cdots + \beta_k r_{YX_k}$$

• R²Y·1,...,k = coefficient of multiple determination (i.e., R²).
• β1 = standardized regression coefficient for predictor 1.
• rYX1 = correlation between Y and predictor 1.
• β2 = standardized regression coefficient for predictor 2.
• rYX2 = correlation between Y and predictor 2.
• βk = final standardized regression coefficient in the equation.
• rYXk = correlation between Y and the final predictor in the equation.

Equation 3.11. Coefficient of multiple determination expressed as sums of squares

$$R^2_{Y \cdot 1,\ldots,k} = \frac{SS_{\text{regression}}}{SS_Y}$$

and

$$SS_{\text{residual}} = (1 - R^2)\, SS_Y = SS_Y - SS_{\text{regression}}$$

As a prelude to the next section and to illustrate estimation of the partial and semipartial
correlations with our example data, the results of a multiple linear regression (MLR) analy-
sis are presented in Tables 3.3a, 3.3b, and 3.3c. The SPSS syntax that generated the output
tables is presented after Tables 3.3a through 3.3c. The mechanics of multiple regression analy-
sis are presented in more detail in the next section, but for now results are presented in order

Table 3.3a.  Multiple Regression Model Summary


Model Summary
Adjusted R Std. Error of
Model R R Square Square the Estimate

1 .811a .658 .657 12.81874


a. Predictors: (Constant=intercept), Gsm short-term memory: auditory
and visual components (stm3_tot), Gc measure of vocabulary (cri1_tot),
Gf measure of graphic identification (fi2_tot).

Table 3.3b.  Sums of Squares and Analysis of Variance Statistics


ANOVAb
Sum of
Model Squares dfc Mean Square Fd Sig.
1 Regression 314832.424 3 104944.141 638.656 .000a

(k)
Residual 163662.932 996 164.320

(n – k
– 1)
Total 478495.356 999

(n – 1)
a. Predictors: (Constant=intercept), Gsm short-term memory: auditory and visual components
(stm3_tot), Gc measure of vocabulary (cri1_tot), Gf measure of graphic identification (fi2_tot).
b. Dependent (criterion) Variable: Highly valid scale of crystallized intelligence—external
criterion measure of crystallized IQ (HVSCI).
c. Degrees of freedom (df)-related information for sample size and predictor variables has
been added in parentheses to aid interpretation.
d. F = mean square regression divided by mean square residual (e.g., 104944.14/164.32 = 638.65).

Table 3.3c.  Regression and Partial and Part Correlation Coefficients


Model           B       Std. Error   Beta    t        Sig.   95% CI Lower   95% CI Upper   Zero-order   Partialb   Partc
1 (Constant)a   4.406   2.068                2.131    .033   .348           8.463
  cri1_tot      1.854   .052         .729    35.342   .000   1.751          1.957          .799         .746       .655
  fi2_tot       .424    .088         .106    4.816    .000   .251           .597           .428         .151       .089
  stm3_tot      .409    .115         .077    3.562    .000   .184           .634           .393         .112       .066
a. Dependent (criterion) Variable: Highly valid scale of crystallized intelligence—external
criterion measure of crystallized IQ (HVSCI). The term “Constant” is the intercept in the
regression equation.
b. The column under “Correlations” labeled “Partial” is the first-order and higher partial
correlation which represents the correlation between the criterion (HVSCI) and the predictor
variables as presented in Equation 3.7.
c. The column under “Correlations” labeled “Part” is the semipartial correlation which
represents the correlation between the criterion (HVSCI) and each predictor variable after
the removal of each predictor (uniquely) from one another. This allows for evaluation of
the relationship between the criterion (HVSCI) and each predictor after the removal of the
other predictors but does not include the correlation between HVSCI and the predictor being
partialed out or removed.

to (1) provide a connection to the analysis of variance (ANOVA) via the sums of squares
(Table 3.3b) and (2) highlight the partial and semipartial correlation coefficients with our
example data. Table 3.3c is particularly relevant to the information on partial and semipartial
correlation coefficients and how multiple predictors contribute to the relationship with the
criterion—in light of their relationship to one another. As an instructive exercise, readers should
use the sums of squares in Table 3.3b and insert them into Equation 3.11 to verify the R2 value
presented in Table 3.3a produced by SPSS.
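A quick way to complete that exercise outside of SPSS is shown below. The Python sketch (added for illustration) divides the regression sum of squares by the total sum of squares from Table 3.3b, as in Equation 3.11.

Python sketch for Equation 3.11

ss_regression = 314832.424  # from Table 3.3b
ss_total = 478495.356       # from Table 3.3b

r_squared = ss_regression / ss_total
ss_residual = ss_total - ss_regression
print(round(r_squared, 3))    # ~0.658, matching Table 3.3a
print(round(ss_residual, 3))  # ~163662.932, matching Table 3.3b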

SPSS syntax for production of Tables 3.3a, 3.3b, and 3.3c

REGRESSION
/DESCRIPTIVES MEAN STDDEV CORR SIG N
/MISSING LISTWISE
/STATISTICS COEFF OUTS CI (95) R ANOVA ZPP
/CRITERIA=PIN (.05) POUT (.10)
/NOORIGIN
/DEPENDENT HVSCI
/METHOD=ENTER cri1_tot fi2_tot stm3_tot.
Note. The variable entry method is "ENTER," where all predictors are entered into the equation at
the same time. Other predictor variable entry methods are discussed later in the chapter.

3.10 Estimating Criterion Validity with More Than One Predictor: Multiple Linear Regression

The correlation techniques presented so far are useful for estimating criterion validity by
focusing on the relationship between a criterion and one or more predictor variables—
when measured at the same point in time. However, many times the goal in validation
studies is to predict the outcome on some criterion in the future. For example, consider the
following question: “What will an examinee’s future score be on the HVSCI given our
knowledge of their scores on language development, graphic identification, and auditory
and visual short-term memory?” A related question is, “How confident are we about the
predicted scores for an examinee or examinees?” To answer questions like these, we turn
to multiple linear regression (MLR), introduced briefly at the end of the previous section.
When tests are used for prediction purposes, the first step required is the devel-
opment of a regression equation (introduced in Chapter 2). In the case of multiple
predictor variables, a multiple linear regression equation is developed to estimate the
best-fitting straight line (i.e., a regression line) for a criterion from a set of predictor vari-
ables. The best-fitting regression line minimizes the sum of squared deviations from the
best-fitting straight line. For example, Figure 3.3a illustrates the regression line based on
a two-predictor multiple regression analysis using HVSCI as the criterion, and language
development (X1 − cri1_tot) and graphic identification (X2 − fi2_tot) as the predictor
variables.
Figure 3.3b illustrates the discrepancy between the observed HVSCI criterion scores
(i.e., the circular dots) for the 1,000 examinees and their predicted scores (i.e., the solid
straight line of best fit), based on the regression equation developed from our sample data.

FIGURE 3.3a.  Regression line of best fit with 95% prediction interval. The dashed lines repre-
sent the 95% prediction interval based on the regression equation. The confidence interval is inter-
preted to mean that in (1 − α) or 95% of the sample confidence intervals that would be formed
from the multiple random samples, the population mean value of Y for a given value of X will be
included.

FIGURE 3.3b.  Regression line of best fit with observed versus predicted Y values and the 95%
prediction interval.

3.11 Regression Analysis for Estimating Criterion Validity: Development of the Regression Equation

The concepts we have developed in the previous sections (and in Chapter 2) provide
a solid foundation for proceeding to the development of a multiple linear regression
equation. However, before proceeding, we review the assumptions of the multiple linear
regression (presented in Table 3.4). Since the model is linear, several assumptions are
relevant to properly conduct an analysis. The model assumptions should be evaluated
with any set of data prior to conducting a regression analysis because violations of the
assumptions can yield inaccurate parameter estimates (i.e., intercept, regression slopes,
and standard errors of slopes). Moderate violations of the assumptions weaken the regres-
sion analysis but do not invalidate it completely. Therefore, researchers need a degree of
judgment specific to violations of the assumptions and their impact on the parameters to
be estimated in a regression analysis (e.g., see Tabachnick & Fidell, 2007, or Draper &
Smith, 1998, for detailed guidance).
For reasons of brevity and simplicity of explanation, we focus on the sample regres-
sion and prediction equations rather than the population equations. However, the
equation elements can be changed to population parameters under the appropriate cir-
cumstances (e.g., population focused and the design of the study includes randomization
in the sampling protocol and model cross validation). In the population equations, the
notation changes to Greek letters (i.e., for population parameters) rather than English
letters. The following sections cover (1) the unstandardized and standardized multiple
regression equations, (2) the coefficient of multiple determination, (3) multiple correla-
tion, and (4) tests of statistical significance. Additionally, the F-test for testing the signifi-
cance of the multiple regression equation is presented.

Table 3.4.  Assumptions and Violations of Assumptions of Multiple Linear


Regression
Assumption Effect of assumption violation How to check assumption
Regression of Y on the Bias in partial slopes and inter- Residual plot of the errors of prediction, ei,
Xks is linear cept; expected change in Y is not and values of Yi predicted; points in the graph
a constant and depends on value should be scattered in a rectangular shape
of Xk around zero
Independence of Influences standard errors of
residuals model
Residual means equal Bias in Y´
zero
Homogeneity of vari- Bias in sres2; may inflate standard Residual plot of the errors of prediction, ei,
ance of residuals errors or result in nonnormal and values of Yi predicted; points in the graph
conditional distributions should be randomly scattered around zero
Normality of residuals Less precise partial slopes and Residual plot
coefficient of determination
Values of Xk are fixed (a) Extrapolating beyond the As a research design issue, assumes that
range of Xk combinations: predic- the scores on the predictor variables are the
tion errors larger, may also bias only ones applicable to the regression equa-
partial slopes and intercept; (b) tion (e.g., the predictors are not considered
interpreting within the range of Xk as being random variables)
combinations: smaller effects than
in (a); if other assumptions met,
minimal effect
Nonmulticollinearity Regression coefficients can be Checked using the tolerance statistic;
of the Xks quite unstable across samples ranges from 0 to 1 (1 being best); values of
(as standard errors are larger); < .1 indicative of multicollinearity
R2 may be significant, yet none
of the predictors is significant;
restricted generalizability of the
model
Outliers Extreme scores influence the Mahalanobis distance values are calculated
regression coefficients and there- for scores on all predictors using examinee
fore the accuracy of the resulting ID as dependent (criterion) variable (e.g.,
equation using the regression procedure in SPSS; this
results in a new variable being created in
the dataset named “MAH_1”); Mahalanobis
distance values are created, a chi-square
table of critical values can be used to
evaluate whether the Mahalanobis distance
values are significant (e.g., the degrees of
freedom to use in the chi-square table is
the number of predictors in the regression
analysis). Procedures such as Explore in SPSS
facilitates the identification of Mahalanobis
distance values using the newly created
variable MAH_1 as the dependent variable
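Although Table 3.4 describes the SPSS procedure for obtaining Mahalanobis distance values, the same quantity can be computed directly from the predictor scores. The Python sketch below (added for illustration; it uses NumPy and assumes a cases-by-predictors array) returns squared Mahalanobis distances analogous to the MAH_1 variable that SPSS saves; these can then be compared against chi-square critical values with degrees of freedom equal to the number of predictors, as described in the table.

Python sketch for Mahalanobis distance screening

import numpy as np

def mahalanobis_distances(X):
    """Squared Mahalanobis distance of each case from the predictor centroid."""
    X = np.asarray(X, dtype=float)
    centered = X - X.mean(axis=0)
    inv_cov = np.linalg.inv(np.cov(centered, rowvar=False))
    # Row-wise quadratic form: d_i = x_i' S^{-1} x_i for each centered row x_i.
    return np.einsum("ij,jk,ik->i", centered, inv_cov, centered)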

3.12 Unstandardized Regression Equation for Multiple Regression

The unstandardized multiple regression equation is developed from sample data as


illustrated in Equation 3.12.
Here we want to predict an examinee’s future score on the criterion. To do so requires
application of the sample prediction equation (provided in Equation 3.13).

3.13 Testing the Regression Equation for Significance

Testing the statistical significance of the overall regression equation involves the hypothe­
sis in Equation 3.14 (i.e., the hypothesis that R2 is zero in the population; note the Greek
letter representing population parameters).
If the null hypothesis in Equation 3.14 is rejected, then at least one of the predictors is
statistically significant. Conversely, if the hypothesis is not rejected, then the overall test indi-
cates that none of the predictors plays a significant role in the equation. The statistical test of
the hypothesis that R2 is zero in the population is provided in Equation 3.15a.
Inserting values from Tables 3.5a and 3.5b into Equation 3.15b, we see that the result
concurs with the SPSS output.

Equation 3.12. Unstandardized regression equation for a sample

Yi = a + b1X1i + b2X2i + ... + bmXmi + ei

• Yi = score on the criterion variable for subject i.


• X1i = score on predictor variable 1 for examinee i.
• X2i = score on predictor variable 2 for examinee i.
• b1 = sample partial slope for the regression line for Y
predicted by Xk after removing the influence of other
predictors.
• b2 = sample partial slope for the regression line for Y
predicted by Xk after removing the influence of other
predictors.
• a = sample intercept.
• ei = examinee-specific errors of prediction or residuals
(part of Y not predicted by the X’s).
• ­i = index value for examinees 1 . . . n in a sample.

Equation 3.13. Sample prediction equation with multiple predictors

Yi′ = a + b1X1i + b2X2i + ... + bmXmi

• Yi′ = predicted score on the criterion variable for subject i.


• X1i = score on predictor variable 1 for examinee i.
• X2i = score on predictor variable 2 for examinee i.
• b1 = sample partial slope for the regression line for Y predicted
by Xk after removing the influence of other predictors.
• b2 = sample partial slope for the regression line for Y pre-
dicted by Xk after removing the influence of other
predictors.
• a = sample intercept.
• ei = e xaminee-specific errors of prediction or residuals
(part of Y not predicted by the X’s).
• i = index value for examinees 1 . . . n in a sample.

Equation 3.14. Hypothesis tests for the overall regression equation

$$H_0: \rho^2_{Y \cdot 1,\ldots,m} = 0$$
$$H_1: \rho^2_{Y \cdot 1,\ldots,m} > 0$$

• H0 = null hypothesis.
• H1 = alternative hypothesis.
• ρ²Y·1,...,m = population coefficient of multiple determination (R² in the sample).

To illustrate Equations 3.15a and 3.15b, the results of a multiple regression analysis
using SPSS syntax below are presented in Tables 3.5a and 3.5b.

SPSS syntax for multiple linear regression

REGRESSION
/DESCRIPTIVES MEAN STDDEV CORR SIG
/MISSING LISTWISE
/STATISTICS COEFF OUTS CI (95) R ANOVA COLLIN TOL CHANGE ZPP
/CRITERIA=PIN (.05) POUT (.10) CIN (95)
/NOORIGIN
/DEPENDENT HVSCI
/METHOD=ENTER cri1_tot fi2_tot stm3_tot

/SCATTERPLOT= (HVSCI ,*ZPRED) (*ZPRED ,*ZRESID)


/RESIDUALS DURBIN HISTOGRAM(ZRESID) NORMPROB(ZRESID)
/CASEWISE PLOT(ZRESID) OUTLIERS(3)
/SAVE PRED ZPRED MAHAL COOK ICIN RESID ZRESID.

Note. The variable entry method is “ENTER” where all predictors are entered into the equation at
the same time.

Equation 3.15a. F-test for the overall regression equation

$$F = \frac{R^2 / m}{(1 - R^2)/(n - m - 1)}$$

• F = F-ratio to be used in determining statistical significance based on the F-distribution.
• R² = coefficient of multiple determination in a sample.
• m = number of predictors.
• n = sample size.

Equation 3.15b. F-test for the overall regression equation with example data

$$F = \frac{.658/3}{(1 - .658)/(1000 - 3 - 1)} = \frac{.219}{.342/996} = \frac{.219}{.0003} = 638.65$$
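The F-ratio in Equation 3.15b can be reproduced from R² alone. The Python sketch below (added for illustration) applies Equation 3.15a with the values from the SPSS output; the small difference from 638.656 reflects rounding R² to three decimal places.

Python sketch for Equation 3.15a

r_squared = 0.658  # from Table 3.5a
m = 3              # number of predictors
n = 1000           # sample size

f_stat = (r_squared / m) / ((1 - r_squared) / (n - m - 1))
print(round(f_stat, 1))  # ~638.8; SPSS reports 638.656 using the unrounded R-squared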

Table 3.5a.  Overall Test of the Multiple Regression Equation


Model Summaryb
Change Statistics
Adjusted Std. Error R
R R of the Square F Sig. F Durbin-
Model R Square Square Estimate Change Change df1 df2 Change Watson
1 .811a .658 .657 12.81874 .658 638.656 3 996 .000 1.827
a. Predictors: (Constant), Gsm short-term memory: auditory and visual components, Gc
measure of vocabulary, Gf measure of graphic identification
b. Dependent Variable: Highly valid scale of crystallized intelligence—external criterion
measure of crystallized IQ
Variable entry procedure = ENTER (all predictors entered simultaneously)

Table 3.5b.  Overall Test of the Multiple Regression Equation


ANOVAb
Model Sum of Squares df Mean Square F Sig.
1 Regression 314832.424 3 104944.141 638.656 .000a
Residual 163662.932 996 164.320
Total 478495.356 999
a. Predictors: (Constant), Gsm short-term memory: auditory and visual
components, Gc measure of vocabulary, Gf measure of graphic identification
b. Dependent Variable: Highly valid scale of crystallized intelligence—external
criterion measure of crystallized IQ

Regarding the statistics in Tables 3.5a and 3.5b, R is the multiple correlation (i.e.,
a single number representing the correlation among the three predictors and the crite-
rion). Notice that R is fairly large within the context of a correlation analysis (i.e., the
range of 0 to 1.0). Next, we see from Table 3.5b that the overall regression equation
is statistically significant at a probability of less than .001 (p < .001). The significant R
is interpreted as meaning that at least one variable is a significant predictor of HVSCI
(readers should verify this by referring to the F-table of critical values in a statistics
textbook). Table 3.5b provides the sum of squares (from Equations 3.9 and 3.11),
degrees of freedom, means square (explaining different parts of the regression model),
the F-­statistic, and the significance (“Sig.” signifying the probability value associated
with the F-statistic).

3.14 Partial Regression Slopes

The partial regression slopes in a multiple regression equation are directly related to the
partial and semipartial correlation presented earlier in the chapter. To illustrate, we use
our example data to calculate the partial slopes in our regression analysis. The equations
for estimating the partial slopes for b1 (the test of language development; cri1_tot) and b2
(the test of graphic identification) are derived in Equations 3.16a and 3.16b.
To determine if the partial slope(s) is(are) statistically significant from zero in the
population, the standard error of a regression slope is required. The hypothesis tested is
population-­based and is provided in Equation 3.17.
The statistical significance of the partial regression slope is evaluated using critical
values of the t-distribution and associated degrees of freedom (where df = n – m – 1; n =
sample size and m = number of predictors). The standard error for the partial regression
coefficient is provided in Equation 3.18. Finally, the t-test for significance of the partial
regression slope is provided in Equation 3.19.

Equation 3.16a. Partial regression slope for b1 (predictor 1)

$$b_1 = \frac{(r_{YX_1} - r_{YX_2}\, r_{X_1X_2})\, s_Y}{(1 - r_{X_1X_2}^2)\, s_{X_1}} = \frac{[.79 - (.42)(.39)]\,(21.8)}{(1 - .39^2)\,(8.6)} = \frac{13.64}{7.29} = 1.88$$

• b1 = unstandardized sample partial slope for the language development test.
• rYX1 = correlation between the criterion (HVSCI) and language development (predictor 1).
• rYX2 = correlation between the criterion (HVSCI) and graphic identification (predictor 2).
• rX1X2 = correlation between language development (predictor 1) and graphic identification (predictor 2).
• sY = sample standard deviation of the criterion (HVSCI).
• sX1 = sample standard deviation of language development (predictor 1).
• r²X1X2 = the squared correlation between language development (predictor 1) and graphic identification (predictor 2).
Note. SPSS regression output yields a b1 coefficient of 1.89. The difference is due to the number of decimal places used throughout the hand calculations versus the SPSS calculations.

Equation 3.16b. Partial regression slope for b2 (predictor 2)

$$b_2 = \frac{(r_{YX_2} - r_{YX_1}\, r_{X_2X_1})\, s_Y}{(1 - r_{X_2X_1}^2)\, s_{X_2}} = \frac{[.42 - (.79)(.39)]\,(21.8)}{(1 - .39^2)\,(5.4)} = \frac{2.44}{4.58} = .53$$

• b2 = unstandardized sample partial slope for the graphic identification test.
• rYX2 = correlation between the criterion (HVSCI) and graphic identification (predictor 2).
• rYX1 = correlation between the criterion (HVSCI) and the language development test (predictor 1).
Note. SPSS regression output yields a b2 coefficient of .54. The difference is due to the number of decimal places used throughout the hand calculations versus the SPSS calculations. The remaining element definitions are the same as in Equation 3.16a for b1.
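Both partial slopes can be obtained from the zero-order correlations and standard deviations alone, as the following Python sketch (added for illustration) shows; carrying full precision, it yields values close to the SPSS coefficients of 1.89 and .54 noted above.

Python sketch for Equations 3.16a and 3.16b

# Zero-order correlations (Table 3.1) and standard deviations (Table 3.2a).
r_y1, r_y2, r_12 = 0.799, 0.428, 0.392
s_y, s_x1, s_x2 = 21.89, 8.61, 5.44

b1 = (r_y1 - r_y2 * r_12) * s_y / ((1 - r_12 ** 2) * s_x1)
b2 = (r_y2 - r_y1 * r_12) * s_y / ((1 - r_12 ** 2) * s_x2)
print(round(b1, 2), round(b2, 2))  # ~1.90 and ~0.55 (SPSS: 1.89 and .54)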

Equation 3.17. Hypothesis test for a regression coefficient

$$H_0: \beta_k = 0$$
$$H_1: \beta_k \neq 0$$

• H0 = null hypothesis; H1 = alternative hypothesis.
• βk = population regression coefficient for predictor k.

Equation 3.18. Standard error of a regression slope

$$s(b_k) = \frac{s_{\text{residual}}}{\sqrt{(n - 1)\, s_k^2\, (1 - R_k^2)}}$$

• s(bk) = standard error of a regression slope.
• sresidual = standard error of the estimate (the square root of the mean square residual), reflecting the part of Y not explained by the regression of Y on the Xks.
• n − 1 = sample size minus 1.
• s²k = variance of predictor k.
• R²k = coefficient of multiple determination defined as the overlap between predictor Xk and the remaining predictors (so 1 − R²k is the tolerance of predictor k).

Equation 3.19. Significance test of a regression slope coefficient

$$t = \frac{b_k}{s(b_k)}$$

• t = calculated t-value for the predictor based on the data.
• bk = unstandardized regression coefficient.
• s(bk) = standard error of a regression slope.

Note. To facilitate understanding of Equations 3.18 and 3.19, read-


ers may find it helpful to review Tables 3.3a–3.3c regarding the
role of the sum of squares and degrees of freedom in regression
analysis.
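Equations 3.18 and 3.19 can be checked against the SPSS output for the three-predictor model. The Python sketch below (added for illustration) uses the standard error of the estimate from Table 3.5a, the standard deviation of cri1_tot from Table 3.2a, and the tolerance for cri1_tot that SPSS reports when collinearity statistics are requested (shown later in Table 3.7c).

Python sketch for Equations 3.18 and 3.19

import math

see = 12.81874        # standard error of the estimate (Table 3.5a)
n = 1000              # sample size
s_x1 = 8.61           # standard deviation of cri1_tot (Table 3.2a)
tolerance_x1 = 0.806  # 1 - R^2 of cri1_tot regressed on the other predictors (Table 3.7c)
b1 = 1.854            # unstandardized slope for cri1_tot (Table 3.3c)

se_b1 = see / math.sqrt((n - 1) * s_x1 ** 2 * tolerance_x1)
t_b1 = b1 / se_b1
print(round(se_b1, 3), round(t_b1, 1))  # ~0.052 and ~35.3, matching Table 3.3c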

3.15 Standardized Regression Equation

The regression of Y on X (or multiple X’s) can also be expressed in a z-score metric by
transforming raw scores on Y and X (see Chapter 2 for a review of how to transform a
raw score to a z-score). After transformation, the means and variances are now expressed
on a 0 and 1 metric. A result of this transformation is that the regression slopes are now
standardized (i.e., the standardized regression slopes) and are equal to rXY, the Pearson
correlation. Since the scores on Y and the multiple X’s are standardized, no intercept is
required for the regression equation. Equations 3.20–3.22 illustrate (1) the standardized
prediction equation, (2) the unstandardized regression equation, and (3) the sample pre-
diction equation using the example intelligence test data.

Equation 3.20. Standardized prediction equation for a sample

$$z_{Y_i}' = b_1^* z_{1i} + b_2^* z_{2i} + \cdots + b_m^* z_{mi}$$

• zYi′ = predicted score on the criterion variable expressed on a z-score metric for subject i.
• b*1, b*2, . . . , b*m = standardized regression slopes (beta weights) for predictors 1 through m.
• z1i, z2i, . . . , zmi = scores on the predictor variables expressed on a z-score metric for examinee i.
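A convenient link between the two metrics is that each standardized slope equals the corresponding unstandardized slope rescaled by the ratio of the predictor's standard deviation to the criterion's standard deviation. The Python sketch below (added for illustration) applies this rescaling to the three-predictor coefficients in Table 3.3c and recovers the Beta column.

Python sketch relating unstandardized and standardized slopes

b = {"cri1_tot": 1.854, "fi2_tot": 0.424, "stm3_tot": 0.409}  # Table 3.3c
s_x = {"cri1_tot": 8.61, "fi2_tot": 5.44, "stm3_tot": 4.12}   # Table 3.2a
s_y = 21.89                                                   # HVSCI (Table 3.2a)

# beta_k = b_k * (s_xk / s_y)
betas = {name: b[name] * s_x[name] / s_y for name in b}
print({name: round(value, 3) for name, value in betas.items()})
# ~0.729, 0.105, and 0.077 -- the Beta values reported in Table 3.3c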

Equation 3.21. Unstandardized sample regression equation

$$a = \bar{Y} - b_1\bar{X}_1 - b_2\bar{X}_2 = 80.24 - (1.88)(35.23) - (.53)(9.02) = 80.24 - 66.23 - 4.78 = 9.2$$
• a = sample intercept.
• Y = mean score for the sample on the criterion (HVSCI).
• b1 = u nstandardized sample partial slope for the language
development test.
• X1 = mean for the sample on the language development test.
• b2 = u nstandardized sample partial slope for the graphic
identification test.
• X2 = mean for the sample on the graphic identification test.

Note. SPSS regression output yields values for the a intercept of .855.
Difference is due to the number of decimal places used throughout the
hand calculations versus SPSS calculations.

Equation 3.22. Sample prediction equation with parameter estimates from Equation 3.21

$$Y_i' = b_1X_{1i} + b_2X_{2i} + \cdots + b_mX_{mi} + a = 1.88(7) + .53(6) + 9.2 = 13.16 + 3.18 + 9.2 = 25.54$$

In Equation 3.21, consider the scenario where an examinee has a language develop-
ment score of 7 and a graphic identification score of 6. Using the results from the previous
calculations in Equation 3.21 for b1, b2, and a, we can calculate the examinee’s predicted
score on HVSCI as shown in Equation 3.22.
Using the following syntax (below) to run the regression analysis yields a predicted
value of 25.1 for our examinee. To have SPSS save the predicted values for every exam-
inee (and the 95% prediction interval), the “/SAVE PRED” line in the syntax provided
below is required. A comparison of the predicted value for this examinee reveals that our
regression coefficients and intercept are in agreement (within decimal places/rounding
differences). The correlation between the actual (observed) HVSCI scores and the
predicted scores for the 1,000 examinees is .81. To evaluate the degree of association
between the actual and predicted scores, you can run the correlation between the saved
(predicted) scores for all 1,000 examinees and their actual scores.

SPSS multiple regression syntax

REGRESSION
/DESCRIPTIVES MEAN STDDEV CORR SIG N
/MISSING LISTWISE
/STATISTICS COEFF OUTS CI(95) R ANOVA COLLIN TOL CHANGE ZPP
/CRITERIA=PIN(.05) POUT(.10)
/NOORIGIN
/DEPENDENT HVSCI
/METHOD=ENTER cri1_tot fi2_tot
/SCATTERPLOT=(HVSCI ,*ZPRED)
/SAVE PRED.

Note. The variable entry method is “ENTER” where all predictors are entered into the equation at
the same time.

3.16 Predictive Accuracy of a Regression Analysis

Arguably, the most critical question related to predictive validity when using regression
analysis is, “How accurate is the regression equation in terms of observed scores versus
scores that are to be predicted by the equation?” Answering this question involves using
the standard error of the estimate (SEE), a summary measure of the errors of prediction
based on the conditional distribution of Y for a specific value of X (see Figure 3.4).

FIGURE 3.4.  Conditional distribution of Y given specific values of predictor X. The criterion (Y)
HVSCI is regressed on the predictor (X; cri1_tot). Notice that the distribution appears the same for
each value of the predictor. The standard score (Z) to raw score equivalence on the language develop-
ment test (predictor X) is approximately: –3 = 8.0; –2 = 18.0; –1 = 26.0; 0 = 35.0; 1 = 44.0; 2 = 50.0.

Next, to illustrate the role of the SEE, we use the simple linear regression of Y
(HVSCI) on X (cri1_tot). For a sample with a single predictor, the standard error of the
estimate is provided in Equation 3.23a.
Using Equation 3.23a, we can calculate the standard error of the estimate for the
simple linear regression model. To provide a connection with the output produced in
SPSS, we conduct a regression analysis for estimating the sample coefficients for the

Equation 3.23a. Standard error of the estimate

$$s_{Y \cdot X} = \sqrt{\frac{\sum (Y - Y')^2}{N - k - 1}} = \sqrt{\frac{SS_{\text{residual}}}{N - k - 1}}$$

• sY·X = sample standard error of the estimate for the regression of Y on X.
• Y − Y′ = difference between an observed score (on the criterion) and the predicted score.
• Σ(Y − Y′)² = sum of the squared differences (errors of prediction) between the observed and predicted scores on the criterion.
• N = sample size.
• k = number of independent or predictor variables.
• SSresidual = sum of the squared residuals, where a residual is defined as the difference between an examinee's observed score on the criterion and the predicted score.

regression of Y (HVSCI) on X (the predictor language development; cri1_tot). The SPSS


syntax below includes all of the options necessary (1) for evaluating the assumptions
of the linear regression model (e.g., outliers, diagnostics by each examinee, graphs of
residuals for evaluating homogeneity of variance) and (2) for estimating predicted scores
on Y, including the 95% confidence around the predicted score for Y. Recall that interpre-
tation of the 95% confidence interval means that in 1 − a or 95% of the sample confidence
intervals that would be formed from multiple samples, the population mean value of Y for
a given value of X will be included.

SPSS syntax for simple linear regression

REGRESSION
/DESCRIPTIVES MEAN STDDEV CORR SIG N
/MISSING LISTWISE
/STATISTICS COEFF OUTS CI(95) R ANOVA
/CRITERIA=PIN(.05) POUT(.10) CIN(95)
/NOORIGIN
/DEPENDENT HVSCI
/METHOD=ENTER cri1_tot
/SCATTERPLOT=(HVSCI ,*ZPRED) (*ZPRED ,*ZRESID)
/RESIDUALS DURBIN HISTOGRAM(ZRESID) NORMPROB(ZRESID)
/CASEWISE PLOT(ZRESID) OUTLIERS(3)
/SAVE PRED ZPRED MAHAL COOK ICIN RESID ZRESID.

Note. The variable entry method is “ENTER” where all predictors are entered into the equation at
the same time.

As an exercise, we can insert the information from Table 3.6a into Equation 3.23b and
verify that the standard error of the estimate is 13.176, as provided in the output
included in Table 3.6b.
Next, we calculate an examinee’s predicted score using the regression coefficients
and the intercept (in Table 3.6c). Specifically, we will predict a score on the HVSCI for an
examinee whose actual score is 25 on the HVSCI and 12 on the language development

Table 3.6a.  Analysis of Variance Summary Table Providing the Sum of Squares
ANOVAb
Model Sum of Squares df Mean Square F Sig.
1 Regression 305224.404 1 305224.404 1758.021 .000a
Residual 173270.952 998 173.618
Total 478495.356 999
a. Predictors: (Constant), Gc measure of vocabulary
b. Dependent Variable: Highly valid scale of crystallized intelligence—external criterion
measure of crystallized IQ (HVSCI)

Equation 3.23b. Standard error of the estimate using estimates from GfGc data

$$s_{Y \cdot X} = \sqrt{\frac{\sum (Y - Y')^2}{N - k - 1}} = \sqrt{\frac{SS_{\text{residual}}}{N - k - 1}} = \sqrt{\frac{173270.952}{1000 - 1 - 1}} = \sqrt{\frac{173270.952}{998}} = 13.176$$
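The same calculation can be run in Python (added here for illustration) to confirm the value reported by SPSS.

Python sketch for Equation 3.23b

import math

ss_residual = 173270.952  # from Table 3.6a
n, k = 1000, 1            # sample size and number of predictors

see = math.sqrt(ss_residual / (n - k - 1))
print(round(see, 3))  # ~13.176, matching Table 3.6b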

Table 3.6b.  Regression Model Summary


Model Summaryb
Adjusted Std. Error of
Model R R Square R Square the Estimate Durbin–Watson
1 .799a .638 .638 13.17643 1.719
a. Predictors: (Constant), Gc measure of vocabulary (cri1_tot – language
development)
b. Dependent Variable: Highly valid scale of crystallized intelligence—external
criterion measure of crystallized IQ (HVSCI)

(cri1_tot) test using Equation 3.13. The sample prediction equation using information
from the SPSS output (Table 3.6c) is given in Equation 3.24.
Because we created the predicted scores on HVSCI for all examinees in the GfGc data-
set (e.g., see the last line highlighted in the SPSS syntax “/SAVE PRED”), we can check if
the result in Equation 3.24 agrees with SPSS. The predicted score of 33.07 (now included
in the dataset as a new variable) is 8.07 points higher than the observed score of 25. This
discrepancy is due to the imperfect relationship (i.e., a correlation of .79) between language

Table 3.6c.  Regression Coefficients for Single Predictor Model


Coefficientsa
95.0%
Unstandardized Standardized Confidence
Coefficients Coefficients Interval for B
Std. Lower Upper
Model B Error Beta t Sig. Bound Bound
1 (Constant/ 8.712 1.756 4.961 .000 5.266 12.158
intercept)
cri1_tot 2.030 .048 .799 41.929 .000 1.935 2.125
(language
development)
a. Dependent Variable: Highly valid scale of crystallized intelligence—external criterion
measure of crystallized IQ (HVSCI)

Equation 3.24. Sample prediction equation for a single prediction

Yi′ = a + b1 X1i = 8.712 + 2.03(12) = 8.712 + 24.36 = 33.07

development and HVSCI in our sample of 1,000 examinees. Finally, using the sum of
squares presented in Table 3.6a, we can calculate R2—the total variation in Y that is pre-
dictable using the predictor or predictors in a simple linear or multiple linear regression
equation. Recall that R² is calculated by dividing the sum of squares regression by the sum
of squares total in Table 3.6a. For example, using the sums of squares in Table 3.6a, the
result is 305224.404/478495.356 = .638. Notice that .638 is the same value reported in Table 3.6b, the
regression model summary table.
Recall from the previous section that the correlation between HVSCI (Y) and lan-
guage development (X) was .79, an imperfect relationship. To address the question of
how accurate the predicted scores using the regression equation are requires application
of the standard error of a predicted score provided in Equation 3.25.
Equation 3.25 is also used to create confidence intervals for (1) predicted scores for
each examinee in a sample or (2) the mean predicted score for all examinees. The SPSS
syntax provided earlier in this section includes the options to produce the predicted
scores for each examinee and the associated 95% prediction intervals.
As mentioned earlier, it is common to have multiple predictor variables in a predictive
validity study. Estimation of the standard error of prediction for the multiple regression is more
complex and involves matrix algebra. Fortunately, the computer executes the calculations

Equation 3.25. Standard error of a predicted score: Single-predictor case

$$s_{Y'} = s_{Y \cdot X}\,\sqrt{1 + \frac{1}{N} + \frac{(X - \bar{X})^2}{\sum (X - \bar{X})^2}}$$

• sY′ = sample standard error of prediction for the regression of Y on X.
• sY·X = sample standard error of the estimate.
• X − X̄ = difference between an observed predictor score and the mean predictor score.
• (X − X̄)² = difference between an observed predictor score and the mean predictor score, squared.
• N = sample size.
• k = number of independent or predictor variables.
• Σ(X − X̄)² = sum of the squared deviations of the predictor scores from their mean.

for us. To understand the details of how the calculations are derived, readers are referred
to Pedhazur (1982, pp. 68–96) or Tabachnick and Fidell (2007). The standard error of pre-
diction for multiple linear regression is provided in Equation 3.26 (Pedhazur, 1982, p. 145).
The SPSS syntax for conducting multiple linear regression provided next includes
the options to produce the predicted scores for each examinee and the associated 95%
prediction intervals.

SPSS syntax for multiple linear regression with prediction intervals

REGRESSION
/DESCRIPTIVES MEAN STDDEV CORR SIG N
/MISSING LISTWISE
/STATISTICS COEFF OUTS CI(95) R ANOVA COLLIN TOL CHANGE ZPP
/CRITERIA=PIN(.05) POUT(.10) CIN(95)
/NOORIGIN
/DEPENDENT HVSCI
/METHOD=ENTER cri1_tot fi2_tot stm3_tot
/SCATTERPLOT=(HVSCI ,*ZPRED) (*ZPRED ,*ZRESID)
/RESIDUALS DURBIN HISTOGRAM(ZRESID) NORMPROB(ZRESID)
/CASEWISE PLOT(ZRESID) OUTLIERS(3)
/SAVE PRED ZPRED MAHAL COOK ICIN RESID ZRESID.

Note. The variable entry method is “ENTER” where all predictors are entered into the equation at
the same time.

Equation 3.26. Standard error for a prediction equation: Multiple linear regression

S^2_{Y'} = S^2_{Y \cdot 12 \ldots k} \left[ 1 + \mathbf{p}' (\mathbf{X}'\mathbf{X})^{-1} \mathbf{p} \right]

• S²_{Y′} = squared sample standard error (error variance) for predicting Y from multiple X's.
• S²_{Y·12...k} = squared sample standard error of a predicted criterion score for predictors 1 to k.
• p′ = transpose of p; a vector of raw scores on the predictors.
• 1 = the constant 1 associated with the intercept in the matrix formulation.
• X = N × k matrix of deviation scores on the k independent variables.
• X′ = transpose of X.
• (X′X)⁻¹ = inverse of X′X.
• p = column vector of an examinee's scores on the predictor variables and a 1 for the intercept.
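For readers who want to see the matrix arithmetic behind Equation 3.26, the following NumPy sketch works through the p′(X′X)⁻¹p term using a small, entirely hypothetical dataset (not the GfGc data) and the augmented raw-score form of the same result.

import numpy as np

# Hypothetical illustration of the matrix form in Equation 3.26 (not the GfGc data).
rng = np.random.default_rng(1)
n, k = 50, 3
X_raw = rng.normal(size=(n, k))                      # raw scores on k predictors
y = 2 + X_raw @ np.array([1.5, 0.4, 0.3]) + rng.normal(scale=3, size=n)

X_aug = np.column_stack([np.ones(n), X_raw])         # add a 1 for the intercept
b, *_ = np.linalg.lstsq(X_aug, y, rcond=None)
resid = y - X_aug @ b
s2_y12k = resid @ resid / (n - k - 1)                # squared standard error of estimate

p = np.concatenate([[1.0], X_raw[0]])                # one examinee's scores, plus the 1
s2_pred = s2_y12k * (1 + p @ np.linalg.inv(X_aug.T @ X_aug) @ p)
print(np.sqrt(s2_pred))                              # standard error of that predicted score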

Tables 3.7a through 3.7c provide a partial output from SPSS syntax for multiple lin-
ear regression analysis.
To make the information more concrete, we will calculate an examinee’s predicted
score using the regression coefficients and the intercept in Table 3.7c by inserting the
coefficients into Equation 3.27. Specifically, we will predict a score on the HVSCI for an
examinee whose actual score is 25 on the HVSCI, 12 on language development, 0.0 on

Table 3.7a.  Regression Model Summary (Model Summary, footnote b)

                                                        Change Statistics
Model   R      R Square   Adjusted R Square   Std. Error of the Estimate   R Square Change   F Change   df1   df2   Sig. F Change   Durbin–Watson
1       .811a  .658       .657                12.81874                     .658              638.656    3     996   .000            1.827

a. Predictors: (Constant), Gsm short-term memory: auditory and visual components, Gc measure of vocabulary, Gf measure of graphic identification
b. Dependent Variable: Highly valid scale of crystallized intelligence—external criterion measure of crystallized IQ

Table 3.7b.  Analysis of Variance Summary (ANOVA, footnote b)

Model           Sum of Squares   df    Mean Square   F         Sig.
1  Regression   314832.424       3     104944.141    638.656   .000a
   Residual     163662.932       996   164.320
   Total        478495.356       999

a. Predictors: (Constant), Gsm short-term memory: auditory and visual components, Gc measure of vocabulary, Gf measure of graphic identification
b. Dependent Variable: Highly valid scale of crystallized intelligence—external criterion measure of crystallized IQ

Table 3.7c.  Regression Coefficient Summary (Coefficients, footnote a)

                          Unstandardized Coefficients   Standardized                   95.0% Confidence Interval for B   Correlations                   Collinearity Statistics
Model                     B        Std. Error           Coefficients (Beta)   t        Sig.    LL       UL               Zero-order   Partial   Part    Tolerance   VIF
1  (Constant)b            4.406    2.068                                      2.131    .033    .348     8.463
   Cri1_tot               1.854    .052                 .729                  35.342   .000    1.751    1.957            .799         .746      .655    .806        1.240
   Fi2_tot (graphic)      .424     .088                 .106                  4.816    .000    .251     .597             .428         .151      .089    .715        1.398
   Stm3_tot               .409     .115                 .077                  3.562    .000    .184     .634             .393         .112      .066    .733        1.364

a. Dependent Variable: Highly valid scale of crystallized intelligence—external criterion measure of crystallized IQ.
b. Constant = intercept.

Equation 3.27. Sample prediction equation for multiple predictors

Y'_i = a + b_1 X_{1i} + b_2 X_{2i} + b_3 X_{3i}
     = 4.406 + 1.854(12) + .424(0) + .409(9) = 30.33

graphic identification, and 9.0 on the short-term memory test using Equation 3.27. The
sample prediction equation with the sample values applied to the regression coefficients
from the SPSS output (Table 3.7c) is illustrated in Equation 3.27.
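As an illustrative check (in Python, outside of SPSS), the coefficients in Table 3.7c reproduce the predicted score and let us compare the two prediction equations against the examinee's actual score of 25.

# Predicted HVSCI from the multiple regression coefficients in Table 3.7c.
a, b1, b2, b3 = 4.406, 1.854, 0.424, 0.409
x1, x2, x3 = 12, 0, 9      # language development, graphic identification, short-term memory
y_hat_multi = a + b1 * x1 + b2 * x2 + b3 * x3
print(round(y_hat_multi, 2))   # approximately 30.33, as in Equation 3.27

# Compare prediction errors against the examinee's actual HVSCI score of 25.
y_hat_single = 33.07
print(round(abs(y_hat_single - 25), 2), round(abs(y_hat_multi - 25), 2))   # about 8.07 vs. about 5.3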
Notice that the predicted score of 30.33 is about 2.7 points closer to the examinee's actual HVSCI score of 25 than the score of 33.07 predicted with the single-predictor equation. From Table 3.7a, we see that the standard error of the estimate for the multiple regression equation is 12.82 (compared to 13.17 in the single-predictor model). Therefore, adding short-term memory to the regression equation (fluid intelligence–based graphic identification contributed nothing for this examinee because the score on it was 0) increased predictive accuracy. As this example shows, multiple regression is often desirable in conducting validity studies, but how should you go about selecting the predictors to include in a regression model? The next section addresses this important question.

3.17 Predictor Subset Selection in Regression

In behavioral research, many predictor variables are often available for constructing a regression equation for predictive validity purposes, and these predictors are typically correlated with one another as well as with the criterion. The driving factor dictating variable selection should be substantive knowledge of the topic under study. Model parsimony is also desirable: the goal is to identify the smallest number of predictor variables from the total set that explains the maximum variance in the criterion variable. Focusing on parsimony also improves the sample size–to–predictor ratio, because the fewer the predictors, the smaller the sample size required for reliable results. Moreover, Lord and Novick (1968, p. 274) note that adding many predictor variables seldom improves the regression equation, because the incremental variance accounted for by new variables becomes very small after a certain point.
When the main goal of a regression analysis is to obtain the best possible equation,
several variable entry procedures are available. These techniques include (1) forward
entry, (2) backward entry, (3) stepwise methods, and (4) all possible regressions opti-
mization. The goal of variable selection procedures is to maximize the variance explained
in the criterion variable by the set of predictors. The techniques may or may not be used
in consideration of theory (e.g., in a confirmatory approach). One other technique of
variable entry is the enter technique, where all predictors are entered into the model

simultaneously (with no predetermined order). This technique (used in the examples ear-
lier in the chapter) produces the unique contribution of each predictor with the criterion
in addition to the relationship among the predictors. For a review and application of these
techniques, readers are referred to Cohen et al. (2003, pp. 158–162), Draper and Smith
(1998), and Hocking (1976).
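To make the idea of variable-entry procedures concrete, the sketch below implements a bare-bones forward-selection loop in Python with NumPy. It is illustrative only—the data are hypothetical, and in practice SPSS's REGRESSION entry methods perform this selection—but it shows the core logic: at each step, add the candidate predictor that most increases R² in the criterion, and stop when the gain becomes negligible.

import numpy as np

def r_squared(X, y):
    """R-square from an ordinary least-squares fit of y on X (plus an intercept)."""
    X_aug = np.column_stack([np.ones(len(y)), X])
    b, *_ = np.linalg.lstsq(X_aug, y, rcond=None)
    resid = y - X_aug @ b
    return 1 - resid @ resid / ((y - y.mean()) @ (y - y.mean()))

def forward_select(X, y, min_gain=0.01):
    """Greedy forward entry: add the predictor giving the largest R-square gain."""
    selected, best_r2 = [], 0.0
    while True:
        gains = {}
        for j in range(X.shape[1]):
            if j in selected:
                continue
            gains[j] = r_squared(X[:, selected + [j]], y) - best_r2
        if not gains:
            break
        j_best = max(gains, key=gains.get)
        if gains[j_best] < min_gain:
            break
        selected.append(j_best)
        best_r2 += gains[j_best]
    return selected, best_r2

# Hypothetical example: 4 candidate predictors, one criterion.
rng = np.random.default_rng(0)
X = rng.normal(size=(500, 4))
y = 0.8 * X[:, 0] + 0.3 * X[:, 2] + rng.normal(size=500)
print(forward_select(X, y))   # typically selects predictors 0 and 2 for these hypothetical data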

3.18 Summary

This chapter introduced validity, provided an overview of the validation process, and presented statistical techniques for estimating validity coefficients. Validity was defined as a judgment or estimate of how well a test or instrument measures what it is supposed to measure. Ultimately, we are concerned with the accuracy of the answers to our research questions. Answering research questions in psychological and/or behavioral research involves using scores obtained from tests or other measurement instruments. To this end, the accuracy of the scores is crucial to the relevance of any inferences made. Criterion, content, and construct validity were
presented and contextualized within the comprehensive framework of validity, with crite-
rion and content forms of score validity serving to inform construct validity. Four guidelines
for establishing evidence for the validity of test scores were discussed: (1) evidence based on
test response processes, (2) evidence based on the internal structure of the test, (3) evidence
based on relations with other variables, and (4) evidence based on the consequences of test-
ing. The chapter presented statistical techniques for estimating criterion validity, along with
applied examples using the GfGc data. Chapter 4 presents additional techniques for estab-
lishing score validity. Specifically, techniques for classification and selection and for content
and construct validity are presented together with applied examples.

Key Terms and Definitions

Analysis of variance. A statistical technique for determining the statistical differences among means; it can be used with more than two groups.
Ceiling effect. A phenomenon arising from the diminished utility of a tool of assessment in
distinguishing examinees at the high end of the ability, trait, or other attribute being
measured (Cohen & Swerdlik, 2010, p. 317).
Coefficient of multiple determination. A measure of the proportion of the variance of the
dependent variable about its mean that is explained by the independent or predictor
variables. The coefficient can vary between 0 and 1. If the regression model is prop-
erly estimated, the higher the coefficient of multiple determination (R2), the greater the
explanatory power of the regression equation (Hair, Anderson, Tatham, & Black, 1998,
p. 143).
Construct validity. An evidence-based judgment about the appropriateness of inferences
drawn from test scores regarding individual standings on a variable defined as a
construct.

Content validity. An evidence-based judgment regarding how adequately a test or other measurement instrument samples behavior representative of the universe of behavior it was designed to sample.
Correction for attenuation. A corrective technique that adjusts the validity coefficient for
measurement error on the predictor, criterion, or possibly both.
Criterion contamination. Occurs when the criterion measure, at least in part, consists of
the same items that exist on the test under study.
Criterion validity. A type of validity that is demonstrated when a test is shown to be effec-
tive in estimating an examinee’s performance on some outcome measure (Gregory,
2000, p. 98).
Cross validation. Procedure of dividing a sample into two parts: the analysis sample
used to estimate the discriminant function(s) or logistic regression model, and the
holdout sample used to validate the results (Hair et al., 1998, p. 241).
Degrees of freedom. The number of scores in a frequency distribution of scores that are
free to vary.
First-order partial correlation. A measure of the relationship between a single indepen-
dent or predictor variable and the dependent or criterion variable, holding all other
independent or predictor variables constant. The first-order partial correlation is often
used synonymously with partial correlation.
Floor effect. A phenomenon arising from the diminished utility of a tool of assessment in
distinguishing examinees at the low end of the ability, trait, or other attribute being
measured (Cohen & Swerdlik, 2010, p. 248).
Higher-order partial correlation. A measure of the relationship between two or more
independent or predictor variables and the dependent or criterion variable, holding
all other independent or predictor variables constant.
Multiple correlation. A linear combination of independent or predictor variables that
maximally correlate with the criterion or dependent variable.
Multiple linear regression. The analysis of relationships between more than one
independent variable and one dependent variable to understand how each pre-
dictor predicts the dependent or criterion variable (Cohen & Swerdlik, 2010,
p. 245).
Prediction equation. An equation used to predict scores on a criterion from a single or
multiple predictor variable.
Regression equation. The best-fitting straight line for estimating the criterion from a pre-
dictor or set of predictors.
Score validity. A judgment regarding how well test scores measure what they purport to
measure. Score validity affects the appropriateness of the inferences made and any
actions taken.
Squared multiple correlation. A linear combination of independent or predictor vari-
ables (squared) that maximally correlate with the criterion or dependent variable.

Standard error of the estimate. A summary measure of the errors of prediction based
on the conditional distribution of Y for a specific value of X.
Standardized regression slope. The slope of a regression line that is in standard score
units (e.g., z-score units).
Statistical control. Controlling the variance by accounting for (i.e., partialing out) the
effects of some variables while studying the effects of the primary variable (i.e., test)
of interest.
Sum of squares regression. Sum of the squared differences between the mean and
predicted values of the dependent or criterion variable for all observations (Hair et
al., 1998, p. 148).
Sum of squares total. Total amount of variation that exists to be explained by the
independent or predictor variables. Created by summing the squared differences
between the mean and actual values on the dependent or criterion variables (Hair et
al., 1998, p. 148).
Systematic variance. An orderly progression or pattern, with scores obtained by an exam-
inee changing from one occasion to another in some trend (Ghiselli, 1964, p. 212).
t-distribution. A family of curves each resembling a variation of the standard normal
distribution for each possible value of the associated degrees of freedom. The
t-distribution is used to conduct tests of statistical significance in a variety of analysis
techniques.
Trait. A relatively stable characteristic of a person which is manifested to some degree
when relevant, despite considerable variation in the range of settings and circum-
stances (Messick, 1989, p. 15).
True criterion score. The score on a criterion corrected for its unreliability.

Unstandardized multiple regression equation. The best-fitting straight line for estimat-
ing the criterion from a set of predictors that are in the original units of measurement.
Validation. A process that involves developing an interpretative argument based on a
clear statement of the inferences and assumptions specific to the intended use of test
scores.
Validity. A judgment or statistical estimate based on accumulated evidence of how well
scores on a test or instrument measure what they are supposed to measure.
Validity coefficient. A correlation coefficient that provides a measure of the relationship
between test scores and scores on a criterion measure.
Zero-order correlation. The correlation between two variables (e.g., the Pearson cor-
relation based on X and Y ).
4

Statistical Aspects of the Validation Process

This chapter continues with the topic of validity, including the statistical aspects and the
validation process. Statistical techniques based on classification and selection of individu-
als are presented within the context of predictive validity. Content validity is presented with
applications for its use in the validation process. Finally, construct validity is introduced
along with several statistical approaches to establishing construct evidence for tests.

4.1 Techniques for Classification and Selection

Many, if not most, tests are used to make decisions in relation to some aspect of people’s
lives (e.g., selection for a job or classification into a diagnostic group). Related to the
criterion validity techniques already introduced in Chapter 3, another predictive valid-
ity technique is based on how tests are used to arrive at decisions about selection and/or
classification of individuals into selective groups. Examples include tests that are used for
the purpose of (1) predicting or distinguishing among examinees who will matriculate
to the next grade level based on passing or failing a prescribed course of instruction,
(2) making hiring decisions (personnel selection) in job settings, and (3) determining
which psychiatric patients require hospitalization. Tests used for selection and/or clas-
sification are based on decision theory. In the decision theory framework, a predictive
validation study has the goal of determining who will likely succeed or fail on some crite-
rion in the future. For example, examinees who score below a certain level on a predictor variable (test) can be screened out of employment or admission to an academic program of study, or placed into a treatment program based on a diagnostic outcome. Another use of decision-
classification validity studies is to determine if a test correctly classifies examinees into
appropriate groups at a current point in time. For example, a psychologist may need a


[Figure 4.1, a decision-tree flowchart, appears here. It classifies multivariate methods by asking whether there are “criterion” and “predictor” variable sets (dependence vs. independence methods), how many criterion variables there are, and whether the variables are metric. The dependence branch leads to multiple regression/correlation, predictive discriminant analysis (PDA), descriptive discriminant analysis (DDA), logistic regression, multivariate analysis of variance or covariance, and canonical regression/correlation; the independence branch leads to factor analysis, cluster analysis, and metric or nonmetric multidimensional scaling.]

Figure 4.1.  Classification of multivariate methods. Adapted from Huberty (1994, p. 27). Copyright 1994 by Wiley. Adapted by permission.

test that accurately classifies patients into levels of depression such as mild, moderate,
and severe in order to begin an appropriate treatment program; or the psychologist may
need a test that accurately classifies patients as being either clinically depressed or not. In
educational settings, a teacher may need a test that accurately classifies students as being
either gifted or not for the purpose of placing the students into a setting that best meets
their needs. Figure 4.1 illustrates the multivariate techniques useful for conducting pre-
dictive validation studies. Highlighted techniques in Figure 4.1 depict the techniques of
classification presented in this section.

4.2 Discriminant Analysis

Discriminant analysis (DA; Hair et al., 1998; Glass & Hopkins, 1996, p. 184) is a widely
used method for predicting a categorical outcome such as group membership consisting
of two or more categories (e.g., medical diagnosis, occupation type, or college major).

DA was originally developed by Ronald Fisher (1935) for the purpose of classifying
objects into one of two clearly defined groups (Pedhazur, 1982, p. 692). The technique
has been generalized to accommodate classification into any number of groups (i.e.,
multiple discriminant analysis, or MDA). The goal of DA is to find uncorrelated lin-
ear combinations of predictor variables that maximize the ratio of between-groups to within-groups variance as measured by the sum-of-squares and cross-products matrices (Stevens,
2003). The sum-of-squares and cross-products matrix is a precursor to the variance–
covariance matrix in which deviation scores are not yet averaged (see Chapter 2 and the
Appendix for a review of the variance–covariance matrix). The resulting uncorrelated
(weighted) linear combinations are used to create discriminant functions, which are
variates of the predictor variables selected for their discriminatory power used in the
prediction of group membership. The predicted value of a discriminant function for
each examinee is a discriminant z-score. The discriminant scores for examinees are
created so that the mean score on the discriminant variable for one group differs maxi-
mally from the mean discriminant score of the other group(s).
Given that the goal of DA is to maximize the ratio of between-groups to within-groups variance, the procedure has close connections with multivariate analysis of variance
(MANOVA). In fact, DA is sometimes used in conjunction with MANOVA to study
group differences on multiple variables. To this end, DA is a versatile technique that gen-
erally serves two purposes: (1) to describe differences among groups after a multivari-
ate analysis of variance (MANOVA) is conducted (descriptive discriminant analysis
[DDA]; Huberty, 1994) and (2) to predict the classification of subjects or examinees into
groups based on a combination of predictor variables or measures (predictive discrimi-
nant analysis [PDA]; Huberty, 1994). Note that since DA is based on the general linear
model (e.g., multiple linear regression and MANOVA), the assumptions required for the
correct use of DA are the same. In this chapter, we focus on PDA because it aligns with
predictive validation studies. Also noteworthy is that if randomization is part of the
research design when employing DA, causal inference is justified, providing the proper
experimental controls are included.
DA assumes that multivariate normality exists for the sampling distributions of the
linear combinations of the predictor variables. For a detailed exposition of screening for
assumptions requisite to using DA, see Tabachnick and Fidell (2007). When the assump-
tions for MLR (and DA) are untenable (particularly multivariate normality), logistic
regression can be used instead to accomplish the same goal sought in DA or MDA.
The specific mathematical details of DA and MDA involve matrix algebra and are not
presented here due to space limitations; readers are referred to Pedhazur (1982, pp. 692–
710) and Huberty (1994) for a complete treatment and examples. Using DA to predict
which classification group subjects or examinees fall into based on an optimized linear
combination of predictor variables is the focus of the present section.
To illustrate the concepts and interpretation of DA specific to predictive validity
studies we will use the GfGc data in two examples. In our first example, suppose we want
to determine an examinee’s academic success measured as successful matriculation from
10th to 11th grade based on their scores on fluid, crystallized, and short-term memory

acquired at the start of their freshman year. When conducting a DA, the process begins
by finding the discriminant function with the largest eigenvalue, resulting in maximum
discrimination between groups (Huberty, 1994; Stevens, 2003). An eigenvalue represents
the amount of shared variance between optimally weighted dependent (criterion) and
independent (predictor) variables. The sum of the eigenvalues derived from a correlation
matrix equals the number of variables. The number of discriminant functions that can be derived equals the smaller of (a) the number of groups minus one and (b) the number of predictors; thus, when the outcome includes more than two groups (and at least two predictors are used), a second eigenvalue is derived. The second eigenvalue yields the second most discriminating function between groups. Discriminant functions 1 and 2 are uncorrelated with one another, thereby
providing unique components of the outcome variable. Application of DA and MDA requires
that scores on the outcome variable be available or known ahead of time. In our example,
the outcome is successful matriculation from 10th to 11th grade (labeled as “matriculate”
in the GfGc dataset). These optimal weights serve as elements in a linear equation that is used
to classify examinees for which the outcome is not known.
Using the information on the outcome variable matriculate and scores on fluid,
crystallized, and short-term memory for examinees, we can derive an optimal set of
weights using DA and Equation 4.1a. The result of Equation 4.1a is the production of
the first discriminant function (recall that a second discriminant function is also cre-
ated based on a second equation). With the weights derived from fitting the equation
to the observed data, status on the outcome variable (Y; matriculation) in Equation
4.1a can be calculated for examinees whose status is unknown. You can see the utility of
this technique in predicting the outcome for examinees knowing certain characteristics
about them (e.g., information about different components of their intelligence). To
review, the difference between linear regression and discriminant analysis is that mul-
tiple linear regression (MLR) is used to predict an examinee’s future score on a criterion
measured on a continuous metric (such as intelligence or undergraduate grade point
average) from a set of predictors, whereas DA is used to predict the future classification
of examinees into distinct groups (e.g., for diagnostic purposes, education attainment,
or employment success).
Next, we can use the following SPSS syntax to conduct a discriminant analysis. Selected
parts of the output are used to illustrate how the technique works with fluid intelligence
total scores, crystallized intelligence total scores, and short-term memory total scores.

SPSS syntax for two-group discriminant analysis

DISCRIMINANT
/GROUPS=matriculate(0 1)
/VARIABLES=fi_tot cri_tot stm_tot
/ANALYSIS ALL
/SAVE=CLASS SCORES PROBS
/PRIORS EQUAL
/STATISTICS=MEAN STDDEV UNIVF BOXM COEFF RAW TABLE CROSSVALID
/PLOT=COMBINED MAP
/CLASSIFY=NONMISSING POOLED.

Equation 4.1a. Linear equation for deriving discriminant weights

Y_{DF1} = aX_{10} + bX_{11} + cX_{12} + dX_{13}

• Y_{DF1} = first discriminant function for the regression of Y on the predictor variables.
• a = discriminant weight for the intercept.
• b = discriminant weight for the fluid intelligence test total score.
• c = discriminant weight for the crystallized intelligence test total score.
• d = discriminant weight for the short-term memory test total or composite score.
• aX_{10} = product of the weight for the intercept on discriminant function 1.
• bX_{11} = product of the weight for variable 1 on discriminant function 1 and an examinee's original value on variable 1.
• cX_{12} = product of the weight for variable 2 on discriminant function 1 and an examinee's original value on variable 2.
• dX_{13} = product of the weight for variable 3 on discriminant function 1 and an examinee's original value on variable 3.

To illustrate application of Equation 4.1a, we use a score of 30 (fluid intelligence total score), 48 (crystallized intelligence total score), and 26 (short-term memory total
score) for a single examinee in Equation 4.1b and the results of the SPSS discriminant
analysis. Using the unstandardized coefficients in Table 4.1c and inserting these weights into Equation 4.1a, we see that an examinee with these scores has a discriminant function
z-score of –2.39. This score classifies the examinee into the “nonmatriculating” group.
The discriminant score of –2.39 can be verified by inspecting the GfGc dataset because
the syntax that produced the output in Tables 4.1a–e includes the “SAVE” option. This option creates two new variables in the GfGc dataset with the discriminant
score and classification probability for every examinee.
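The discriminant score in Equation 4.1b can also be verified directly from the unstandardized coefficients and group centroids in Table 4.1c; the short Python check below is illustrative and not part of the SPSS run.

# Discriminant z-score from the unstandardized coefficients in Table 4.1c.
constant, w_fi, w_cri, w_stm = -5.688, 0.001, 0.072, -0.007
fi, cri, stm = 30, 48, 26
d_score = constant + w_fi * fi + w_cri * cri + w_stm * stm
print(round(d_score, 2))      # approximately -2.38 to -2.39, depending on rounding

# Classify by the nearer group centroid (Table 4.1c: no = -1.244, yes = 1.220).
centroids = {"no": -1.244, "yes": 1.220}
print(min(centroids, key=lambda g: abs(d_score - centroids[g])))   # 'no' (nonmatriculating)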
Recall that in predictive validity studies the goal is to accurately predict how examin-
ees will perform or are classified in the future. The classification table that results from a
DA or MDA provides rich information about the accuracy of the DA or MDA. To facilitate
interpretation of the classification table, we can use the terminology of Pedhazur and

Equation 4.1b. Derivation of first discriminant function for an examinee

Y_{DF1} = aX_{10} + bX_{11} + cX_{12} + dX_{13} = (−5.69) + (.001)X_1 + (.072)X_2 + (−.007)X_3
        = (−5.69) + (.001)(30) + (.072)(48) + (−.007)(26)
        = (−5.69) + .03 + 3.45 − .18
        = −2.39

Schmelkin (1991, p. 40): (1) valid positives (VP), (2) valid negatives (VN), (3) false
positives (FP), and (4) false negatives (FN). Valid positives and their percentages (Table
4.1e) are those examinees who were predicted to matriculate and did matriculate (i.e.,
VP summarized as 492; 97.4%). Valid negatives (Table 4.1e) are those examinees who
were predicted not to matriculate and did not matriculate (i.e., VN summarized as 448;
90.5%). False positives (Table 4.1e) are examinees who are predicted to matriculate but
did not actually matriculate (i.e., FP summarized as 47; 9.5%). False negatives consist of
examinees predicted not to matriculate who actually do matriculate (i.e., FN summarized as 13; 2.6%). Figure 4.2 illustrates the information provided in the classification table by
graphing the relationship among the four possible outcomes in our example.
By creating the horizontal (X-axis) and vertical (Y-axis) lines in Figure 4.2, four areas
are represented (i.e., FN, VP, VN, and FP). Partitioning the relationship between crite-
rion and predictors allows for inspection and evaluation of the predictive efficiency of
a discriminant analysis. Predictive efficiency is an evaluative summary of the accuracy of
predicted versus actual performance of examinees based on using DA. The selection ratio

Table 4.1a.  Eigenvalue and Overall Test of Significance


for Discriminant Analysis
Eigenvalues
Canonical
Function Eigenvalue % of Variance Cumulative % Correlation
1 1.521a 100.0 100.0 .777
a. First 1 canonical discriminant functions were used in the analysis. There is only
one discriminant function because there are only 2 categories in the criterion. A
canonical function summarizes the relationship between two linear composites.
Each canonical function has two canonical variates, one for the set of independent
variables and one for the set of dependent variables.

Wilks’ Lambda
Test of Function(s) Wilks’ Lambda Chi-square df Sig.
1 .397 921.265 3 .000
Note. This is the test of significance for the discriminant function.

Table 4.1b.  Canonical Functions and


Structure Matrix
Standardized Canonical Discriminant
Function Coefficients
Function
1
sum of fluid intelligence .014
tests 1 - 3
sum of crystallized 1.014
intelligence tests 1 - 4
sum of short term memory -.044
tests 1 - 3
Note. Standardized coefficients are analogous to
beta (β) coefficients in multiple regression.
These coefficients suffer from the same shortcomings
as in multiple regression (e.g., lack stability and are
affected by the variability of the variables with which
they are associated).

Structure Matrix
Function
1
sum of crystallized intelligence tests 1 - 4 .999
sum of short term memory tests 1 - 3 .404
sum of fluid intelligence tests 1 - 3 .312
Pooled within-groups correlations between discriminating
variables and standardized canonical discriminant functions.
Variables ordered by absolute size of correlation within
function.

Table 4.1c.  Discriminant Function Coefficients


Unstandardized Canonical Discriminant Function
Coefficients
Function
1
sum of fluid intelligence tests 1 - 3 .001
sum of crystallized intelligence tests 1 - 4 .072
sum of short term memory tests 1 - 3 -.007
(Constant) -5.688

Functions at Group Centroids


Function
successfully move from 1
grade 10th to 11th grade
no -1.244
yes 1.220
Notes. These are unstandardized canonical discriminant
functions evaluated at group means. Centroids are
mean discriminant z-scores for all examinees within
a category (e.g., for a two-category DA, there are 2
centroids).

Table 4.1d.  Classification Statistics


Prior Probabilities for Groups
successfully move from Cases Used in Analysis
grade 10th to 11th grade Prior Unweighted Weighted
no .500 495 495.000
yes .500 505 505.000
Total 1.000 1000 1000.000

The default prior for group classification is .50/.50. The prior can be changed to
meet the requirements of the analysis.
Classification Function Coefficients
successfully move
from grade 10th to
11th grade
no yes
sum of fluid intelligence tests 1 - 3 .048 .051
sum of crystallized intelligence tests 1 - 4 .229 .405
sum of short term memory tests 1 - 3 .426 .409
(Constant)                                        -14.280      -28.265
Notes. These are Fisher’s linear discriminant functions. This is a method
of classification in which a linear function is defined for each group.
Classification is performed by calculating a score for each observation on
each group’s classification function and then assigning the observation to
the group with the highest score.

Table 4.1e.  Classification Table


Classification Resultsb,c
successfully move Predicted Group Membership
from grade 10th to
11th grade no yes Total
Original Count no 448 47 495
yes 13 492 505
% no 90.5 9.5 100.0
yes 2.6 97.4 100.0
Cross-validateda Count no 446 49 495
yes 13 492 505
% no 90.1 9.9 100.0
yes 2.6 97.4 100.0
a. Cross validation is done only for those cases in the analysis. In cross validation, each case
is classified by the functions derived from all cases other than that case.
b. 94.0% of original grouped cases correctly classified.
c. 93.8% of cross-validated grouped cases correctly classified.

[Figure 4.2 appears here. It plots the criterion (Y) against the predictors (X), with a horizontal cutting score on the criterion (Yc; above the line = successful matriculation = 1) and a vertical cutting score on the predictors (Xc; examinees scoring at or above fi2_tot = 31, cri1_tot = 82, and stm_tot = 34 fall to the right). The four resulting regions are A = valid positives (492, or 97.4%), D = false negatives (13, or 2.6%), C = valid negatives (448, or 90.5%), and B = false positives (47, or 9.5%). D + A = base rate (BR); C + B = 1 − BR; A + B = selection ratio (SR); D + C = 1 − SR.]

Figure 4.2.  Predictive efficiency from discriminant analysis. FN, false negative; FP, false positive; VN, valid negative; VP, valid positive. Total N = 1,000 examinees.

pertains to the proportion of examinees selected (i.e., those to the right of Xc on the X-axis) regardless of their “true” status on the criterion. The base rate is the proportion of examinees who are successful (i.e., above the horizontal Yc line on the Y-axis) regardless of their status (scores) on the predictor(s). Taylor and Russell (1939) defined predictive efficiency as the number of valid positives divided by the total number of examinees selected (i.e., A/(A + B)). Using this formula in our example, we find that the predictive efficiency is .91. Based on the bivariate normal distribution of the relationship (i.e., the correlation) between a predictor and criterion, Taylor and Russell developed tables that tabulate the success ratio as a function of the validity coefficient, the base rate, and the selection ratio. To aid in planning validity studies, the Taylor–Russell tables provide for calculation of the success ratio under three conditions: (1) the selection ratio and base rate are constant but the correlation between Y and X varies, (2) the selection ratio and the correlation between Y and X are constant but the base rate varies, and (3) the base rate and the correlation between Y and X are constant but the selection ratio varies (Pedhazur & Schmelkin, 1991, p. 42). Table 4.2 illustrates a portion of the Taylor–Russell tables.
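It may also help to compute the quantities just defined from the classification counts in Table 4.1e. The short Python check below is illustrative only; A, B, C, and D refer to the regions in Figure 4.2 (valid positives, false positives, valid negatives, and false negatives, respectively).

# Counts from the DA classification table (Table 4.1e).
A, B, C, D = 492, 47, 448, 13          # VP, FP, VN, FN
N = A + B + C + D                       # 1,000 examinees

selection_ratio = (A + B) / N           # proportion of examinees selected
base_rate = (A + D) / N                 # proportion who actually matriculate
predictive_efficiency = A / (A + B)     # valid positives among those selected

print(round(selection_ratio, 2), round(base_rate, 3), round(predictive_efficiency, 2))
# approximately 0.54, 0.505, and 0.91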
As can be seen in Table 4.2, when conducting predictive validity studies where the
goal is selection or classification, relying only on the validity coefficient is insufficient.
For example, the complex interplay among false negative, valid positive, valid negative,
and false positive must be considered in relation to the goal of the validation study so

Table 4.2.  Excerpt from the Taylor and Russell Tables

Scenario a (SR = .50, BR = .50):   rxy = .20 → .56    rxy = .30 → .60    rxy = .60 → .70
Scenario b (SR = .50, rxy = .40):  BR = .10 → .16     BR = .30 → .41     BR = .70 → .81
Scenario c (BR = .50, rxy = .40):  SR = .10 → .78     SR = .30 → .69     SR = .70 → .58

Note. Tabled values are the resulting success ratios. From Pedhazur and Schmelkin (1991, p. 42). Copyright 1991 by Lawrence Erlbaum Associates. Reprinted by permission.

that any unintended consequences on examinees are minimized (e.g., see the AERA,
APA, & NCME, 1999, Standards for a review). Allen and Yen (1979, pp. 101–108) pro-
vide an excellent discussion with examples regarding the use of the Taylor and Russell
tables in relation to the four outcomes FN, VP, VN, and FP.
Finally, note that Table 4.1e refers to cross validation classification results. An
­additional step in DA is a cross-validation analysis to evaluate the predictive accuracy of
the DA equation. The cross-validation procedure involves dividing the sample into two
parts: (1) the analysis sample used for estimating the discriminant function(s) or logistic
regression model and (2) the holdout sample used to validate the results. The purpose of
cross validation is to ensure that overfitting of the discriminant function has not occurred
by conducting a repeat analysis on a separate independent sample. The term overfitting
refers to the situation in which the solution from an analysis fits the sample so well that it is unlikely to replicate in the population. To check whether overfitting is a problem, cross
validation is often conducted using an independent random sample.

4.3 Multiple-Group Discriminant Analysis

As mentioned previously, DA can be extended to the case where the outcome includes
multiple categories (i.e., MDA). To illustrate an MDA, the following SPSS syntax pro-
duces results displayed in Tables 4.3a–4.3e. The MDA analysis is performed by includ-
ing the “/GROUPS=depression(1 3)” line in the syntax. The only difference in the syntax

Table 4.3a.  Eigenvalue and Overall Test of Significance for the Discriminant
Analysis
Wilks’ Lambda
Test of Function(s) Wilks’ Lambda Chi-square df Sig.
1 through 2 .446 804.727 6 .000
2 1.000 .271 2 .873

Table 4.3b.  Canonical Functions and Structure Matrix


Standardized Canonical Discriminant Function
Coefficients
Function
1 2
sum of fluid intelligence tests 1 - 3 .004 .808
sum of crystallized intelligence tests 1 - 4 .977 -.488
sum of short term memory tests 1 - 3 .049 .461
Structure Matrix
Function
1 2
sum of crystallized intelligence .999* -.035
tests 1 - 4
sum of fluid intelligence tests .339 .862*
1-3
sum of short term memory .463 .623*
tests 1 - 3

Table 4.3c.  Discriminant Function Coefficients


Canonical Discriminant Function Coefficients
Function
1 2
sum of fluid intelligence tests .000 .076
1-3
sum of crystallized intelligence .065 -.033
tests 1 - 4
sum of short term memory .008 .075
tests 1 - 3
(Constant) -5.565 -2.054
Unstandardized coefficients

Functions at Group Centroids


Function
level of depression 1 2
low .983 .006
moderate -.927 -.013
severe -3.023 .068

Table 4.3d.  Classification Statistics


Prior Probabilities for Groups
Cases Used in Analysis
level of depression Prior Unweighted Weighted
low .333 528 528.000
moderate .333 433 433.000
severe .333 39 39.000
Total 1.000 1000 1000.000

Table 4.3e.  Classification Table


Classification Function Coefficients
level of depression
low moderate severe
sum of fluid intelligence tests 1 - 3 .047 .045 .050
sum of crystallized intelligence tests 1 - 4 .336 .212 .073
sum of short term memory tests 1 - 3 .475 .458 .447
(Constant) -25.767 -15.044 -7.690
Fisher’s linear discriminant functions

between the MDA and the two-group DA is that the depression group now has three
categories. The interpretation is much the same as that in the two-group classification
example output tables, except that there are now two discriminant functions (see Table
4.3b). An additional feature of the MDA is the discriminant function plot of the group
centroids to aid interpretation of the analysis (Figure 4.3).

SPSS syntax for multiple-group discriminant analysis

DISCRIMINANT
/GROUPS=depression(1 3)
/VARIABLES=fi_tot cri_tot stm_tot
/ANALYSIS ALL
/PRIORS EQUAL
/STATISTICS=MEAN STDDEV UNIVF BOXM COEFF RAW TABLE CROSSVALID
/PLOT=COMBINED MAP
/CLASSIFY=NONMISSING POOLED.

Figure 4.3.  Discriminant function plot with group centroids.



4.4 Logistic Regression

In DA, the dependent (criterion) variable is dichotomous (nonmetric). Group membership is predicted by estimating discriminant scores based on a z-score metric, then evalu-
ating the discriminant score against an established cutoff point. Because the dependent
variable is dichotomous, cut scores must be established in order to assign examinees to a
group (e.g., passing/not passing, diseased/not diseased, employed/not employed). Proper
use of DA described up to this point requires that the data meet the assumptions of the
general linear model (e.g., as was the case for multiple linear regression). However, some-
times in predictive studies, our data do not meet the requisite assumptions of the linear
model. In this case, logistic regression serves as a useful alternative.
Logistic regression is a nonlinear regression technique that estimates the probability
of an event occurring. In predictive validity studies, the event is the classification of exam-
inees into a category of group membership. Probability estimates are bounded or limited
to a range of between 0 and 1. Therefore, the predicted values of group classification
must also fall within this range. Mathematically, logistic regression assumes a relationship
between the independent (predictor) and dependent (criterion) variable that resembles an
S-shaped or logistic curve (see Figure 4.4). At very low levels of the independent or predictor variable (X-axis), the probability approaches zero. Also, as the independent or predic-
tor variable increases, the probability increases in a nonlinear fashion. Notice in Figure 4.4
that as the independent variable (displayed on the X-axis) approaches the uppermost or
high end of the curve, the probability approaches one (on the Y-axis).
Logistic regression is a natural model for predicting group membership that is based
on a dichotomous outcome because it is not limited by the assumptions of the MLR model.
For example, dichotomous or binary variables (i.e., possible outcomes of only 0 or 1) fol-
low the binomial distribution (see the Appendix for a review) rather than the standard
normal distribution with continuous variables. The error distribution for dichotomous
variables is not normally distributed; and normality of errors is an assumption in MLR.

[Figure 4.4 appears here: an S-shaped (logistic) curve plotting the probability of the event (the dependent variable, ranging from 0 to 1.0) against the level of the independent variable (low to high).]

Figure 4.4.  Logistic curve.



Also, the variance of the binomial distribution is not constant across the score scale (i.e.,
the homogeneity of variance assumption in MLR is violated when using a dichotomous
variable). Another useful feature of logistic regression is that the predictor variables can
be ordinal, interval, or a mixed level of measurement.
Recall that in MLR the method of least squares estimates the regression coefficients by minimizing the sum of squared differences between observed and predicted criterion scores (i.e., sums of squares are the fundamental elements for deriving parameter estimates). Estimating the model coefficients in logistic regression involves using
the maximum likelihood method (see the Appendix and Chapter 10 on item response
theory in this text). The method of maximum likelihood estimation is iterative, mean-
ing that the algorithm moves through a process whereby parameter estimates are refined
or improved up to a certain point where any further improvement is negligible. The
maximum likelihood estimation process results in regression parameter estimates that
are most likely (i.e., maximally likely) to result based on the observed data. The result
of maximum likelihood estimation is a likelihood value. Also, when using maximum
likelihood estimation, we evaluate the fit of the regression model to the data. Figures 4.5a
and 4.5b illustrate (1) the situation where the data to model fit is good and (2) a poor
model to data fit using logistic regression.
In conducting a logistic regression, we need to know whether an event has occurred
(e.g., matriculate or not from one grade to another in an educational setting or clinically

[Figure 4.5a appears here: a steep logistic curve in which the probability of the event (Y-axis, 0 to 1.0) rises sharply near the .50 cut point, so that cases with predicted probabilities below .50 (assigned 0, not matriculating) and above .50 (assigned 1, matriculating) are cleanly separated along the independent variable (X-axis).]

Figure 4.5a.  Well-defined relationship.



[Figure 4.5b appears here: a flatter logistic curve in which many cases fall near the .50 cut point, producing a region of misclassification for predicted values of 1 (matriculating) and a region of misclassification for predicted values of 0 (not matriculating).]

Figure 4.5b.  Poorly defined relationship.

depressed or not clinically depressed in a psychological setting). Armed with this knowl-
edge, we can use this information as our dependent or criterion variable. Based on knowl-
edge of the outcome, the logistic regression procedure estimates the probability that an
event will or will not occur. If the probability is greater than .50, then the prediction is yes,
otherwise no. The logistic transformation is applied to the dichotomous dependent vari-
able and produces logistic regression coefficients according to Equation 4.2a.
To illustrate application of Equation 4.2a, as before in our discriminant analysis
example, we use a score of 30 (fluid intelligence total score), 48 (crystallized intelligence
total score), and 26 (short-term memory total score) for a single examinee in Equation
4.2b to predict successful matriculation from the 10th to 11th grade. The criterion vari-
able is labeled “matriculate” in the GfGc dataset. Figure 4.6 illustrates the location of the
examinee in relation to the logistic regression model.
The following syntax is used to conduct the logistic regression analysis. Tables 4.4a
through 4.4d provide the output from the analysis.

SPSS logistic regression syntax

LOGISTIC REGRESSION VARIABLES matriculate
/METHOD=ENTER cri_tot fi_tot stm_tot
/SAVE=PRED PGROUP COOK LRESID ZRESID
/CLASSPLOT
/PRINT=GOODFIT CI(95)
/CRITERIA=PIN(0.05) POUT(0.10) ITERATE(20) CUT(0.5).

Equation 4.2a. Probability of an event occurring in logistic regression

\text{Prob(event)} = \hat{Y}_i = \frac{e^{B_0 + B_1X_1 + \cdots + B_mX_m}}{1 + e^{B_0 + B_1X_1 + \cdots + B_mX_m}}

• Prob(event) = probability of an event occurring (e.g., 1 = successful matriculation).
• Ŷ_i = predicted probability, used to assign the outcome category (1 or 0) for an examinee.
• e = base of the natural logarithm (approximately 2.718).
• B_0 = intercept in the logistic equation.
• B_1X_1 = regression coefficient for predictor variable 1 multiplied by an examinee's score on that predictor.
• B_mX_m = the corresponding term for predictors up to m, the total number of predictors.

Equation 4.2b. Probability of matriculation in logistic regression

\text{Prob(event)} = \hat{Y}_i = \frac{e^{-18.82 + .00(30) + .23(48) + (-.01)(26)}}{1 + e^{-18.82 + .00(30) + .23(48) + (-.01)(26)}}
                 = \frac{e^{-8.04}}{1 + e^{-8.04}} \approx \frac{.0003}{1.0003} \approx .0003
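The calculation in Equation 4.2b can be reproduced directly from the coefficients in Table 4.4d; the short Python check below is illustrative and not part of the SPSS output.

import math

# Coefficients from Table 4.4d (logistic regression, criterion = matriculate).
b0, b_cri, b_fi, b_stm = -18.822, 0.229, 0.000, -0.008
fi, cri, stm = 30, 48, 26

z = b0 + b_cri * cri + b_fi * fi + b_stm * stm     # linear predictor, about -8.04
p = math.exp(z) / (1 + math.exp(z))                # equivalently 1 / (1 + exp(-z))
print(round(z, 2), round(p, 4))                    # approximately -8.04 and 0.0003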

Using the unstandardized coefficients in Table 4.4d and inserting them as illustrated in Equation 4.2b, we see that the result (i.e., the predicted probability that group membership = 1) for an examinee with this set of scores on the predictors is approximately .0003. Because this probability falls well below the .50 cut value, the examinee is predicted not to matriculate, which agrees with the classification obtained from the discriminant analysis. Furthermore, in Table 4.4d, we see that the only statistically significant predictor is the crystallized intelligence total score (cri_tot; p < .001, odds ratio or Exp(B) = 1.257). The Wald test is similar to the t-test in MLR and is calculated as the square of the ratio of B to its standard error, that is, (B/S.E.)². Finally, the odds ratio, labeled Exp(B), for cri_tot is 1.257. To interpret, each one-point increase in the crystallized intelligence total score multiplies the odds of successful matriculation by about 1.26, holding the other predictors constant. An odds ratio of 1.0 indicates that a predictor has no effect on the odds of the outcome. Finally, odds ratios of 2.0 or higher are recommended in terms of practical importance

[Figure 4.6 appears here: the logistic curve from Figure 4.4 with the examinee's predicted probability of matriculating marked relative to the vertical line at probability = .50.]

Figure 4.6.  Location of an examinee based on the logistic regression model. The figure is based on an examinee who scores 30 (fluid intelligence total score), 48 (crystallized intelligence total score), and 26 (short-term memory total score). Using the coefficients in Table 4.4d, this examinee's predicted probability of matriculating is approximately .0003, placing the examinee well to the left of the probability = .50 vertical line and indicating that the student is predicted not to successfully matriculate from 10th to 11th grade.

Table 4.4a.  Overall Model Fit


Model Summary
Cox & Snell R Nagelkerke R
Step -2 Log likelihood Square Square
1 448.726a .608 .811
a. Estimation terminated at iteration number 8 because parameter
estimates changed by less than .001.

Table 4.4b.  Chi-Square Goodness-of-Fit Test


Omnibus Tests of Model Coefficients
Chi-square df Sig.
Step 1 Step 937.469 3 .000
Block 937.469 3 .000
Model 937.469 3 .000

Table 4.4c.  Classification Table


Classification Tablea
Predicted
successfully move from
grade 10th to 11th grade Percentage
Observed no yes Correct
Step successfully move from no 478 17 96.6
1 grade 10th to 11th grade yes 13 492 97.4
Overall Percentage 97.0
a. The cut value is .500.

Table 4.4d.  Tests of Predictors and Odds Ratios


Variables in the Equation
95% C.I.for
EXP(B)
B S.E. Wald df Sig. Exp(B) Lower Upper
Step 1a cri_tot .229 .017 184.591 1 .000 1.257 1.216 1.299
fi_tot .000 .013 .001 1 .979 1.000 .975 1.025
stm_tot -.008 .023 .117 1 .732 .992 .947 1.039
Constant -18.822 1.403 179.923 1 .000 .000
a. Variable(s) entered on step 1: cri_tot, fi_tot, stm_tot. Exp(B) is the odds ratio.

(Tabachnick & Fidell, 2007; Hosmer & Lemeshow, 2000). Using this odds ratio guide-
line, the other two predictor variables are not practically important (and not statistically
significant). Next we turn to the situation where the outcome has more than two catego-
ries, an extension of logistic regression known as multinomial logistic regression.

4.5 Logistic Multiple Discriminant Analysis: Multinomial Logistic Regression

The preceding example addresses the case where the criterion has only two possible
outcomes. The logistic model can be extended to the case where there are three or more
levels in the criterion. To illustrate, we use a criterion variable with three levels or pos-
sible outcomes (e.g., low, moderate, severe depression). The criterion variable is labeled
“depression” in the GfGc dataset. The logistic regression model that is analogous to the
multiple discriminant analysis presented earlier is provided in the SPSS syntax below
(Tables 4.5a–4.5f). Notice that SPSS uses multinomial regression to conduct the analy-
sis where the criterion has more than two levels of the outcome. Tables 4.5a through 4.5f
are interpreted as in the previous section, the only difference being in Table 4.5e and 4.5f,
where the parameter estimates and classification tables now include three levels of the

Table 4.5a.  Overall Model Fit


Model Fitting Information
Model
Fitting
Criteria Likelihood Ratio Tests
-2 Log
Model Likelihood Chi-Square df Sig.
Intercept Only 1648.169
Final (model 885.660 762.509 6 .000
with predictors)

Table 4.5b.  Chi-Square Goodness-of-Fit Test


Goodness-of-Fit
Chi-Square df Sig.
Pearson 73966262.375 1972 .000
Deviance 881.501 1972 1.000
Note. Highly sensitive to sample size, therefore other model-data fit
evaluation should be conducted (see Tabachnick & Fidell, 2007; Hosmer
& Lemeshow, 2000).

Table 4.5c.  Pseudo R-Square


Pseudo R-Square
Cox and Snell .534
Nagelkerke .660
McFadden .461
Note. Although the pseudo R-square is similar
to R-square in MLR, it is not precisely the same.
See Tabachnick and Fidell, 2007; Hosmer and
Lemeshow, 2000, for interpretation.

Table 4.5d.  Likelihood Ratio Tests of Predictors


Likelihood Ratio Tests
Model
Fitting
Criteria Likelihood Ratio Tests
-2 Log
Likelihood
of
Reduced Chi-
Effect Model Square df Sig.
Intercept 1336.039 450.379 2 .000
stm_tot 887.393 1.734 2 .420
fi_tot 885.738 .079 2 .961
cri_tot 1389.027 503.368 2 .000
Notes. The chi-square statistic is the difference in −2 log likelihoods
between the final model and a reduced model. The reduced model is
formed by omitting an effect from the final model. The null hypothesis is that
all parameters of that effect are 0.

Table 4.5e.  Parameter Estimates


Parameter Estimates
95% Confidence Interval for
Exp (B)
Level of depressiona B Std. Error Wald df Sig. Exp(B) Lower Bound Upper Bound
low Intercept –15.017 259.471 .003 1 .954
stm_tot .033 10.877 .000 1 .998 1.034 5.706E-10 1.874E9
fi_tot –0.007 5.226 .000 1 .999 .993 3.541E-5 27875.811
cri_tot .262 4.467 .003 1 .953 1.299 .000 8248.046
moderate Intercept –5.104 222.487 .001 1 .982
stm_tot .010 10.363 .000 1 .999 1.010 1.525E-9 6.696E8
fi_tot –.007 4.907 .000 1 .999 .993 6.603E-5 14934.614
cri_tot .148 4.188 .001 1 .972 1.160 .000 4261.056
a. The reference category is: severe.

Table 4.5f.  Classification Table


Classification
Predicted
Observed Low Moderate Severe Percent Correct
low 504 23 1 95.5%
moderate 44 388 1 89.6%
severe 3 15 21 53.8%
Overall Percentage 55.1% 42.6% 2.3% 91.3%

outcome. The syntax below is used to conduct a multinomial logistic regression as an alternative to MDA. For a comprehensive yet understandable treatment on multinomial and ordinal regression, see Hosmer and Lemeshow (2000, pp. 260–308).

SPSS multinomial logistic regression syntax

NOMREG depression (BASE=LAST ORDER=ASCENDING) WITH stm_tot fi_tot cri_tot
/CRITERIA CIN(95) DELTA(0) MXITER(100) MXSTEP(5) CHKSEP(20)
LCONVERGE(0) PCONVERGE(0.000001) SINGULAR(0.00000001)
/MODEL
/STEPWISE=PIN(.05) POUT(0.1) MINEFFECT(0) RULE(SINGLE)
ENTRYMETHOD(LR) REMOVALMETHOD(LR)
/INTERCEPT=INCLUDE
/PRINT=ASSOCIATION CLASSTABLE FIT PARAMETER SUMMARY LRT CPS
STEP MFI
/SCALE=PEARSON
/SAVE ESTPROB PREDCAT PCPROB ACPROB.
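To see how the parameter estimates in Table 4.5e translate into predicted category probabilities, the sketch below computes the multinomial-logit probabilities by hand in Python. It is illustrative only: “severe” is the reference category, and the examinee's scores (stm = 26, fi = 30, cri = 48) are the same set used in the earlier examples.

import math

# Coefficients from Table 4.5e (reference category = severe).
coef = {
    "low":      {"intercept": -15.017, "stm": 0.033, "fi": -0.007, "cri": 0.262},
    "moderate": {"intercept":  -5.104, "stm": 0.010, "fi": -0.007, "cri": 0.148},
}
scores = {"stm": 26, "fi": 30, "cri": 48}

def linear_predictor(c):
    return c["intercept"] + sum(c[v] * scores[v] for v in scores)

exp_z = {g: math.exp(linear_predictor(c)) for g, c in coef.items()}
denom = 1 + sum(exp_z.values())          # the reference category contributes exp(0) = 1
probs = {g: e / denom for g, e in exp_z.items()}
probs["severe"] = 1 / denom
print({g: round(p, 3) for g, p in probs.items()})
# approximately {'low': 0.019, 'moderate': 0.869, 'severe': 0.112} for these scores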

4.6 Model Fit in Logistic Regression

The –2 log likelihood statistic is the global model-fit index used in evaluating the adequacy of the logistic regression model fit to the data. The –2 log likelihood is –2 times the sum of the log probabilities that the model assigns to each examinee's or case's observed outcome (Tabachnick & Fidell, 2007). A perfect model–data fit yields a –2 log likelihood statistic of zero; therefore, the lower the number, the better the model–data fit. The chi-square statistic tests the difference between the intercept-only model (i.e., in SPSS, the “constant”-only model) and the model with one or more predictors included. In our example, the chi-square is significant (p < .001; see Table 4.4b in the previous section), meaning that our three-predictor model fits better than the intercept-only model; the –2 log likelihood for the fitted model is 448.726 (see Table 4.4a). As
in MLR, the decision regarding the method for entry of the predictors into the equation
depends on the goal of the study. Variable-entry options include enter or direct method
(all predictors enter the equation simultaneously), stepwise, forward, and backward selec-
tion. For guidance regarding the decision about using a particular variable-entry method,
see Tabachnick and Fidell (2007, pp. 454–456) or Hosmer and Lemeshow (2000). The Cox and Snell R² and Nagelkerke R² represent the proportion of variance accounted for
in the dependent variable by the predictors. For a comparison and interpretation of the
R2 statistics produced in logistic regression versus MLR, see Tabachnick and Fidell (2007,
pp. 460–461). As in MLR, larger values of R2 are desirable and reflect a better regression
model. Collectively, upon review of the results of the logistic regression analysis using
the same data as in the discriminant analysis example, we see a high degree of agreement.
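The link between the –2 log likelihood and the model chi-square can be illustrated with the binary matriculation example. In the Python sketch below (illustrative only), the intercept-only –2 log likelihood is computed from the observed split of 495 nonmatriculating and 505 matriculating examinees; subtracting the fitted model's –2 log likelihood (Table 4.4a) recovers the omnibus chi-square reported in Table 4.4b.

import math

# Intercept-only (null) -2 log likelihood for the binary matriculation outcome.
n_no, n_yes = 495, 505
n = n_no + n_yes
neg2ll_null = -2 * (n_no * math.log(n_no / n) + n_yes * math.log(n_yes / n))

neg2ll_model = 448.726                       # Table 4.4a
chi_square = neg2ll_null - neg2ll_model
print(round(neg2ll_null, 1), round(chi_square, 1))   # about 1386.2 and 937.5 (cf. 937.469 in Table 4.4b)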
We next turn to a type of validity evidence that depends on the information con-
tained in the items comprising a test—content validity. Specifically, the items comprising
a test reflect a representative sample of a universe of information in which the investiga-
tor is interested.

4.7 Content Validity

Content validity provides a framework for connecting the generation of criterion scores with the use and interpretation of such scores (Gulliksen, 1950a).
Cronbach and Meehl (1955) defined content validity as a model that uses test items to
reflect a representative sample of a universe of information in which the investigator is inter-
ested. Additionally, content validity is ordinarily established deductively by defining a
universe of items and sampling systematically within this universe to establish the test
(Cronbach & Meehl, 1955, p. 281). The rationale underlying the content validity model
is that a sample of responses to test items or task performances in some area of activity
represents an estimate of an overall level of knowledge or skill in a task-related activity.
The central idea that justifies the content validity approach is that based on a sample of
tasks that measure traits (e.g., represented as test items, performance ratings, or attitu-
dinal measures), it is legitimate to take the observed performance or scores as an overall

estimate of performance in the domain. The previous statement holds if (1) the observed
scores are considered as being a representative sample from the domain, (2) the per-
formances are evaluated appropriately and fairly, and (3) the sample is large enough to
control for sampling error (Kane, 2006; Guion, 1977).

4.8 Limitations of the Content Validity Model

The content validity model of validation has been criticized on the grounds that it is
subjective and lends itself to confirmatory bias. The criticism of subjectivity stems from
the fact that judgments are made regarding the relevance and representativeness of
tasks to be included on a test (see Chapter 6 for a review of these issues). One attempt
to address the problem of subjectivity in the content validity model involves estimating
the content validity ratio (CVR; Lawshe, 1975). The CVR quantifies content valid-
ity during the test development process by statistically analyzing the performance of
expert judgments regarding how adequately a test or instrument samples behavior
from a universe of behavior it was designed to sample (Cohen & Swerdlik, 2010,
p. 173).
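
As a hedged illustration of how the CVR is computed (Lawshe, 1975): if N experts rate an item and n_e of them rate it "essential," then CVR = (n_e − N/2)/(N/2), which ranges from –1 to +1 and equals 0 when exactly half of the panel rates the item essential. The short Python sketch below applies the formula to a hypothetical panel.

def content_validity_ratio(n_essential, n_experts):
    """Lawshe's CVR for a single item; the rating counts used here are hypothetical."""
    half = n_experts / 2
    return (n_essential - half) / half

# Hypothetical panel of 10 experts, 8 of whom rate the item "essential."
print(content_validity_ratio(8, 10))   # 0.60
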
The issue of confirmatory bias in the content validity model stems from the fact that
the process or exercise one goes through to establish evidence for content validity is
driven by a priori ideas about what the content of the test item or tasks should be. To
minimize confirmatory bias, multiple subject matter or content experts are used along
with rating scales to reduce subjectivity in the content validity model. This information
can be used to derive the CVR. Used in isolation, the content validity model is par-
ticularly challenged when applied to cognitive ability or other psychological processes
that require hypothesis testing. Based on the challenges identified, the role content
validity plays in relation to the three components of validity is “to provide support for
the domain relevance and representativeness of the test or instrument” (Messick, 1989,
p. 17). Next, we turn to arguably the most comprehensive explanation of validity—
construct validity.

4.9 Construct Validity

Although criterion and content validity are important components of validity, neither one
provides a way to address the measurement of “complex, multifaceted and theory-based
attributes such as intelligence, personality, leadership,” to name a few examples (Kane,
2006, p. 20). In 1955, Cronbach and Meehl introduced an alternative to the criterion
and content approaches to validity that allowed for the situation where a test purports to
measure an attribute that “is not operationally defined and for which there is no adequate
criterion” (p. 282). A particularly important point Cronbach and Meehl argued was that
even if a test was initially validated using criterion or content evidence, developing a
deeper understanding of the constructs or processes accounting for test performance


requires consideration of construct-related evidence. Cronbach (1971) described the
need for construct validation in test development situations when “there is no criterion
to predict or domain of content to sample and the purpose of the test involves internal
process (e.g., unobservable attributes such as anxiety or intelligence)” (p. 451).
Cronbach was likely the first to argue for the integration of many types of validity
evidence. For example, he proposed that by examining the components of criterion, con-
tent, and construct validity in unison, the result of the exercise yields a comprehensive
and integrated approach to validation. Recall that the most important outcome of the
validation process is the interpretative argument we are able to make regarding the pro-
posed use and interpretation of test scores. To this end, the unified approach to validity and
validation has been advanced since the mid- to late 20th century. Since that time, scholar
members of the American Educational Research Association, American Psychological
Association, and National Council on Measurement in Education have initiated collab-
orative efforts (e.g., the Standards for Educational and Psychological Testing; AERA, APA, &
NCME, 1985, 1999, 2014).
The construct validity model gained acceptance in the early 1980s based on the ideas
previously presented and supported by scholars such as Anastasi (1986) and Messick
(1988, 1989). Messick (1988, p. 42; 1995) provides a unifying perspective on inter-
preting validity evidence that cross-classifies test interpretation and test use against the
evidential and consequential bases of testing, all in relation to construct validity (Figure
4.7).
In the following section I provide an overview of several approaches and procedures for
establishing evidence of construct validity.

4.10 Establishing Evidence of Construct Validity

Given that construct validity is complex and multifaceted, establishing evidence of its
existence requires a comprehensive approach. This section presents four types of studies
useful for establishing evidence of construct validity: (1) correlational studies, (2) group
difference studies, (3) factor-analytic studies, and (4) multitrait–multimethod (MTMM)

                      Test Interpretation     Test Use
Evidential Basis      Construct Validity      Construct Validity + Relevance/Utility
Consequential Basis   Value Implications      Social Consequences

Figure 4.7. Messick’s four facets of validity. From Messick (1988, p. 42). Copyright 1988 by
Taylor and Francis. Republished with permission of Taylor and Francis.
studies. The section ends with an example that incorporates an application of the various
components of validity.
What might a comprehensive and rigorous construct validation study look like?
Benson (1988) provides guidelines for conducting a rigorous, research-based construct
validation program of study. Benson’s guidelines (Table 4.6) propose three main compo-
nents: (1) a substantive stage, (2) a structural stage, and (3) an external stage. Finally,
Benson’s guidelines align with Messick’s (1995) unified conception of construct validity.
To illustrate how a researcher can apply the information in Table 4.6 to develop
a comprehensive validity argument, consider the following scenario. Suppose that aca-
demic achievement (labeled as X1; measured as reading comprehension) correlates .60
with lexical knowledge (Y) as one component of crystallized knowledge (i.e., in the GfGc
dataset). The evaluation of predictive validity is straightforward and proceeds by presen-
tation of the correlation coefficient and an exposition of the research design of the study.
The astute person will question whether there is another explanation for the correla-
tion of .60 between the crystallized intelligence subtest total score on lexical (i.e., word)
knowledge (Y) and reading comprehension (X1). This is a reasonable question since no
interpretation has occurred beyond presentation of the correlation (validity) coefficient.
A response to this question requires the researcher to identify and explain what additional
types of evidence are available to bolster their argument that crystallized intelligence is
related to an examinee’s academic achievement as measured by reading comprehension
ability. The explicative step beyond merely reporting the validity coefficient becomes
necessary when arguments are advanced that propose crystallized intelligence measures
academic achievement in a general or holistic sense. For example, the reading compre-
hension test (the proxy for academic achievement) may be measuring only an examinee’s
strength of vocabulary.
To this end, addressing alternative explanations involves inquiry into other kinds
of validity evidence (e.g., evidence provided from published validity-based studies). For
example, consider two other tests from the GfGc dataset: (1) language development and
(2) communication ability. Suppose that after conducting a correlation-based validity
study, we find that language development correlates .65 with reading comprehension and
communication ability correlates .40 with the comprehension test. Further suppose that
the mean reading comprehension score decreases for examinees who fail to produce pass-
ing scores on writing assignments in the classroom setting; and that writing assignments
involve correct application of the English language (call this measure X2). Also, say that
mean reading comprehension score increases for examinees on a measure of commu-
nication ability that was developed as an indicator of propensity to violence in schools
(call this measure X3). Under this scenario, a negative correlation between X1 (academic
achievement measured by reading comprehension) and X2 (in-class writing assignments)
eliminates it as a rival explanation for X1 (i.e., as a legitimate explanation for reading
achievement). Also, suppose that a negative correlation between X3 (number of violent
incidences by students on campus) and X1 (academic achievement measured by reading
comprehension) provides an additional aspect that deserves an explanation relative to
the word knowledge component of crystallized intelligence (Y).
Table 4.6.  Components of Construct Validation


Substantive stage/component
  Purpose: Define the theoretical and empirical domains of intelligence.
  Question asked: How should intelligence be defined and operationally measured?
  Methods and concepts: Theory development and validation; generate definitions and scale
    development; content validation; evaluate construct underrepresentation and construct
    irrelevancy.
  Characteristics of strong validation programs: A strong psychological theory plays a
    prominent role; theory provides a well-specified and bounded domain of constructs; the
    empirical domain includes measures of all potential constructs; the empirical domain
    includes measures that only contain reliable variance related to the theoretical construct
    (i.e., construct relevance).

Structural stage/component
  Purpose: Examine the internal relations among the measures used to operationalize the
    theoretical construct domain (i.e., intelligence).
  Question asked: Do the observed measures behave in a manner consistent with the
    theoretical domain definition of intelligence?
  Methods and concepts: Internal domain studies; item/subscale and item intercorrelations;
    exploratory/confirmatory factor analysis; item response theory (IRT); multitrait–multimethod
    matrix; generalizability theory.
  Characteristics of strong validation programs: Moderate item internal consistency;
    measures covary in a manner consistent with the intended theoretical structure; factors
    reflect trait rather than method variance; items/measures are representative of the
    empirical domain; items fit the theoretical structure; the theoretical/empirical model is
    deemed plausible (especially when compared against other competing models) based on
    substantive and statistical criteria.

External stage/component
  Purpose: Examine the external relations among the focal construct (i.e., intelligence) and
    other constructs and/or subject characteristics.
  Question asked: Do the focal constructs and observed measures fit within a network of
    expected construct relations (i.e., the nomological network)?
  Methods and concepts: Group differentiation; structural equation modeling; correlation of
    observed measures with other measures; multitrait–multimethod matrix.
  Characteristics of strong validation programs: Focal constructs vary in theorized ways
    with other measures; measures of the constructs differentiate existing groups that are
    known to differ on the constructs; measures of focal constructs correlate with other
    validated measures of the same constructs; theory-based hypotheses are supported,
    particularly when compared to rival hypotheses.
Note. Based on Benson (1988).
In these examples, the correlation evidence exhibited in the two additional valid-
ity studies serves as evidence for eliminating measures X2 and X3 in the current study of
crystallized intelligence and academic achievement.

4.11 Correlational Evidence of Construct Validity

One way to establish construct validity evidence is to conduct a correlational study with
two goals in mind. The first goal is closely related to content validity and involves evalu-
ating the existence of item homogeneity (i.e., the items on the test tap a common trait or
attribute) for a collection of test items. If item homogeneity exists, then we have evidence
of a homogeneous scale. The second goal involves evaluating the relationship between
an existing criterion and the construct (represented by a collection of test items). From
the perspective of test users, the purpose of these approaches is to allow for the evaluation
of the quantity and quality of evidence relative to how scores on the test will be used. The
quantity and quality of evidence are evaluated by examining the following criteria (a brief computational sketch follows the list):

1. The size of the correlation between each test item under study and the total test
score (e.g., the point–biserial correlation represents the association between each
item and the total score on the test—see Chapter 6 and the Appendix for a review).
2. The size of the correlation between the test under study and the criterion (for
criterion-related evidence).
3. Calculation of the proportion of variance (i.e., the correlation coefficient squared)
accounted for by the relationship between the test and the criterion.
4. Interpretation of the criterion validity coefficient in light of sampling error (e.g.,
the size and composition of the sample used to derive the correlation coefficients).
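
As a hedged illustration of criteria 1–3, the Python sketch below uses hypothetical dichotomous item responses and an invented external criterion: it computes each item's point–biserial correlation with the total score, the correlation between the total score and the criterion, and the squared correlation (the proportion of variance accounted for). Criterion 4—interpretation in light of sampling error—depends on the size and composition of the sample, which here is simply simulated.

import numpy as np

rng = np.random.default_rng(42)
items = (rng.random((250, 8)) < 0.6).astype(int)          # hypothetical 0/1 responses: 250 examinees x 8 items
criterion = items.sum(axis=1) + rng.normal(0, 1.5, 250)   # hypothetical external criterion scores

total = items.sum(axis=1)

# Criterion 1: point-biserial correlation of each item with the total test score.
item_total = [np.corrcoef(items[:, j], total)[0, 1] for j in range(items.shape[1])]

# Criteria 2 and 3: test-criterion correlation and the proportion of variance it accounts for.
r_test_criterion = np.corrcoef(total, criterion)[0, 1]
print(np.round(item_total, 2))
print(round(r_test_criterion, 2), round(r_test_criterion ** 2, 2))
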

Ensuring that item homogeneity exists is an important first step in evaluating a test
for construct validity. However, when considered alone, it provides weak evidence. For
example, you may find through item analysis results from a pilot study that the items
appear to be appropriately related from a statistical point of view. However, relying on
item homogeneity in terms of the content of the items and the correlational evidence
between the items and the total score on the test can be misleading (e.g., the items may be
relatively inaccurate in terms of what the test is actually supposed to measure). Therefore,
a multifaceted approach to ensuring that test items accurately tap a construct is essential
(e.g., providing content plus construct validity evidence in a way that establishes a com-
plete argument for score validity; see Kline, 1986). A shortcoming of the correlational
approach to establishing construct validity evidence lies in the lack of uniformly accepted
criteria for what the size of the coefficient should be in order to provide adequate associa-
tional evidence. Also, the results of a correlational study must be interpreted in light of
previous research. For example, the range of correlation coefficients and proportions of
variance accounted for from previous studies should be provided to place any correlation
study in perspective.

4.12 Group Differentiation Studies of Construct Validity

Often, researchers are interested in how different groups of examinees perform relative
to a particular construct. Investigating group differences involves evaluating how criterion
scores differ between groups of examinees who (1) differ on some sociodemographic
variable or (2) received some treatment expected to affect their scores (e.g., in an
experimental research study). Validity studies of group differences
posit hypothesized relationships in a particular direction (e.g., scores are expected to be
higher or lower for one of the groups in the validity study). If differences are not found,
one must explore the reasons for this outcome. For example, the lack of differences
between groups may be due to (1) inadequacy of the test or instrument relative to the
measurement of the construct of interest, (2) failure of some aspect of the research design
(e.g., the treatment protocol, sampling frame, or extraneous unaccounted for variables), or
(3) a flawed theory underlying the construct.
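
As a hedged illustration of a group differentiation study, the sketch below compares two hypothetical groups of examinees with a directional independent-samples t test and a standardized mean difference (Cohen's d); the scores are simulated and SciPy is assumed.

import numpy as np
from scipy import stats

rng = np.random.default_rng(7)
treatment = rng.normal(52, 10, 80)   # hypothetical scores for examinees who received the treatment
control = rng.normal(48, 10, 80)     # hypothetical comparison group

# Directional hypothesis: treatment scores are expected to be higher than control scores.
t_stat, p_value = stats.ttest_ind(treatment, control, alternative="greater")

# Standardized mean difference (Cohen's d with a pooled standard deviation, equal group sizes).
pooled_sd = np.sqrt((treatment.var(ddof=1) + control.var(ddof=1)) / 2)
d = (treatment.mean() - control.mean()) / pooled_sd
print(round(t_stat, 2), round(p_value, 4), round(d, 2))
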

4.13 Factor Analysis and Construct Validity

Factor analysis plays an important role in establishing evidence for construct validity. This
section presents only a brief overview to illustrate how factor analysis is used to aid in
construct validation studies. Chapter 9 provides a comprehensive foundation on the topic.
Factor analysis is a variable reduction technique with the goal of identifying the minimum
number of factors required to account for the intercorrelations among (1) a battery of items
comprising a single test (e.g., 25 items measuring the vocabulary component of verbal
intelligence of crystallized intelligence) or (2) a battery of tests theoretically representing
an underlying construct (e.g., the four subtests measuring crystallized intelligence in the
GfGc dataset). In this way, factor analysis is a variable reduction technique that takes a large
number of measured variables (e.g., items on tests or total scores on subtests) and reduces
them to one or more factors representing hypothetical unobservable constructs.
In psychometrics, factor analysis is used in either an exploratory or a confirmatory
mode. In the exploratory mode, the goal is to identify a set of factors from a set of test items
(or subtest total scores) designed to measure certain constructs manifested as examinee
attributes or traits. In exploratory factor analysis (EFA), no theory is posited ahead of
time (a priori); instead, the researcher conducts a factor analysis using responses to a large
set of test items (or subtests) designed to measure a set of underlying constructs (e.g., attri-
butes or traits of examinees manifested by their responses to test items). Exploratory factor
analysis is sometimes used as an analytic tool in the process of theory generation (e.g., in
the substantive and structural stages in Table 4.6 during the development of an instrument
targeted to measure a construct where little or no previous quantitative evidence exists).
Alternatively, confirmatory factor analysis (CFA) is a theory-confirming technique


because one seeks to confirm a specific factor structure using the covariance (or correla-
tion) matrix generated from a sample of responses to test items (or subtest total scores).
For example, CFA might be used in the structural and external stages of test development
in an effort to confirm that scores on a test or instrument are functioning as expected
according to a particular theory (e.g., see Table 4.6). Formally, when using CFA one posits
the existence of a construct or set of constructs that accounts for the covariation among
the original set of variables (e.g., test items). The factor-analytic approach to establish-
ing evidence of construct validity is based on statistically evaluating a correlation (or
variance–­covariance) matrix based on a set of measurements (e.g., responses to test items)
from a sample of examinees.
CFA is a particularly useful technique in construct validation because it provides a
powerful framework for confirming or disconfirming a theory specific to what a test is
measuring. For example, a researcher may conduct a construct validation study, with the
objective being to test or evaluate a theory about the number and type of constructs (i.e.,
factors) that account for the intercorrelations among the variables (i.e., test items and
subtest total scores) being studied. For example, Price, Tulsky, Millis, and Weiss (2002)
examined the factor structure of the Wechsler Memory Scale—Third Edition (WMS-III;
Wechsler, 1997a) relative to the number of factors the test measured and sought to deter-
mine whether the factors were correlated. In the study, CFA models consisting of two,
three, or four factors (i.e., constructs) were rigorously evaluated to determine how many
factors optimally represented the underlying structure of the test. Additionally, the WMS-
III includes immediate and delayed memory components that created an additional chal-
lenge to the CFA analysis. For example, the WMS-III theory was evaluated for its factor
structure according to (1) the number of factors, (2) the items and subtests that composed
each factor, and (3) the immediate and delayed components of memory.
In test development, conducting a factor-analytic study involves administering a bat-
tery of tests to a representative sample of several hundred examinees. A general guideline
for the required sample size in factor-analytic studies is a minimum of 10 to 15 examinees
per analytic unit (e.g., see Chapter 9). The same guidelines apply to item-level data (e.g.,
one may want to factor-analyze a single test composed of 50 multiple-choice questions to
determine if the items represent a single factor). Therefore, for the example data used in
this book, the minimum sample size is 100 to 150 (i.e., 10 tests × 10 or 15 examinees per
test or subtest). If an item-level factor analysis is to be conducted, the same sample size
guidelines apply. Importantly, the sample size question is also to be considered in light of
the psychometric integrity of the tests used in the factor analysis (e.g., tests with highly
reliable scores allow one to use the lower end of the sample size recommendation).
Next, we review the role factor analysis plays in producing evidence for the con-
struct validity of a test using the GfGc intelligence test data. In our example, 10 tests
have been administered to the GfGc sample of 1,000 examinees. The first step in factor
analysis is the computation of the correlations (or covariance) between scores on the
45 possible pairs of tests (the 45 pairs are derived based on the formula N(N − 1)/2 =
10(9)/2 = 45). The computations are internal to factor-analysis routines incorporated in
statistical programs such as SPSS or SAS. Alternatively, one may use the variance–covariance
matrix when conducting factor analysis (e.g., when using structural equation modeling
[SEM], also known as covariance structure modeling). Using SEM to conduct factor analy-
sis requires using programs such as Mplus, LISREL, SPSS-AMOS, EQS, and SAS PROC
CALIS (to name only a few). Returning to our example, after running the factor-analysis
program, a table of factor loadings (Table 4.7) is produced, aiding in interpreting the
factorial composition of the battery of tests. A standardized factor loading is scaled on a
correlation metric (ranging between –1.0 and +1.0) and represents the size and strength
of an individual test on a factor. Below is the SPSS syntax that produces the factor load-
ings in Table 4.7.

SPSS syntax producing the loadings provided in Table 4.7

FACTOR
/VARIABLES stm1_tot stm3_tot stm2_tot cri1_tot cri2_tot cri3_
tot cri4_tot fi1_tot fi2_tot fi3_tot
/MISSING LISTWISE
/ANALYSIS stm1_tot stm3_tot stm2_tot cri1_tot cri2_tot cri3_tot
cri4_tot fi1_tot fi2_tot fi3_tot
/PRINT UNIVARIATE CORRELATION SIG KMO EXTRACTION ROTATION
/PLOT EIGEN ROTATION
/CRITERIA FACTORS(3) ITERATE(25)
/EXTRACTION PAF
/CRITERIA ITERATE(25)
/ROTATION PROMAX(4)
/METHOD=CORRELATION.

Table 4.7.  Factor Loadings for the 10 Subtests Comprising the GfGc Data
Factor loading
Test I II III
Gc—Vocabulary .87 .49 .44
Gc—Knowledge .83 .48 .43
Gc—Abstract Reasoning .83 .56 .62
Gc—Conceptual Reasoning .84 .45 .43
Gf—Graphic Orientation .51 .56 .82
Gf—Graphic Identification .53 .67 .80
Gf—Inductive & Deductive Reasoning .06 .16 .26
Stm—Short-term Memory—visual clues .69 .68 .54
Stm—Short-term Memory—auditory & visual .44 .78 .53
Stm—Short-term Memory—math reasoning .50 .80 .68
Note. Loadings are from the structure matrix produced from a principal axis factor analysis with promax (correlated
factors) rotation. In a principal axis factor analysis with promax (correlated factors), only elements of a structure
matrix may be interpreted as correlations with oblique (correlated) factors. See Chapter 9 on factor analysis for
details.
Table 4.8.  Factor Correlations from SPSS Factor Analysis in Table 7.15
Factor Correlation Matrix
Factor   Crystallized intelligence   Short-term memory   Fluid intelligence
1 1.000 .607 .590
2 .607 1.000 .724
3 .590 .724 1.000
Extraction Method: Principal Axis Factoring.
Rotation Method: Promax with Kaiser Normalization.

To interpret the results of our example factor analysis, examine the pattern of loadings
relative to each subtest comprising the total scores for examinees on crystallized
intelligence, fluid intelligence, and short-term memory. In Table 4.7, we see that the
crystallized intelligence subtests group together as factor I, because the highest factor
loadings across columns I–III fall in column I (.87, .83, .83, .84). No other subtest
(Gf or Stm) exhibits a higher loading than those displayed in
column I. The same scenario exists when you examine the size of the loadings for the Gf
and Stm subtests. In summary, the subtests representing the Gc, Gf, and Stm composite
(i.e., total scores) factor-analyze in line with GfGc theory, resulting in factor-analytic
evidence for the constructs of each type of intelligence. Also, produced in the results of
the factor analysis is the correlation between the three factors (Table 4.8). The correla-
tion coefficients between the composites are .61 between crystallized intelligence and
short-term memory; .59 between crystallized intelligence and fluid intelligence; and .72
between fluid intelligence and short-term memory. As expected from GfGc theory, these
factors are related.
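
For readers working outside SPSS, the sketch below shows the same two steps in Python under stated assumptions: it assumes the pandas package, a hypothetical data file holding the 10 GfGc subtest totals (using the variable names from the SPSS syntax above), and the third-party factor_analyzer package for extraction with a promax rotation. It computes the correlations among the 45 unique pairs of subtests and then reports the loadings and factor correlations for comparison with Tables 4.7 and 4.8.

import pandas as pd
from factor_analyzer import FactorAnalyzer

# Hypothetical file containing the 10 GfGc subtest total scores.
cols = ["stm1_tot", "stm3_tot", "stm2_tot", "cri1_tot", "cri2_tot",
        "cri3_tot", "cri4_tot", "fi1_tot", "fi2_tot", "fi3_tot"]
data = pd.read_csv("gfgc_subtest_totals.csv")[cols]

# Step 1: correlations among the 10(9)/2 = 45 unique pairs of subtests.
corr_matrix = data.corr()
n_pairs = len(cols) * (len(cols) - 1) // 2   # 45

# Step 2: three factors with an oblique (promax) rotation; the package's default
# extraction (minimum residual) approximates principal axis factoring.
fa = FactorAnalyzer(n_factors=3, rotation="promax")
fa.fit(data)

print(n_pairs)
print(fa.loadings_)   # pattern loadings (note: Table 4.7 reports the structure matrix)
print(fa.phi_)        # factor correlations (compare with Table 4.8)
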

4.14 Multitrait–Multimethod Studies

Campbell and Fiske (1959) introduced a comprehensive technique for evaluating the
adequacy of tests as measures of constructs called multitrait–multimethod (MTMM).
The MTMM technique includes evaluation of construct validity while simultaneously
considering examinee traits and different methods for measuring traits. To review, a trait
is defined as “a relatively stable characteristic of a person . . . which is manifested
to some degree when relevant, despite considerable variation in the range of settings
and circumstances” (Messick, 1989, p. 15). Furthermore, interpretation of traits also
implies that a latent attribute or attributes accounts for the consistency in observed
patterns of score performance. For example, MTMM analysis is used during the struc-
tural and external stages of the validation process (e.g., see Table 4.6) in an effort to
evaluate (1) the relationship between the same construct and the same measurement
method (e.g., via the reliabilities along the diagonal in Table 4.9b); (2) the relationship
between the same construct using different methods of measurement (i.e., convergent
validity evidence—monotrait–heteromethod); and (3) the relationship between dif-
ferent constructs using different methods of measurement (i.e., discriminant validity
evidence—heterotrait–heteromethod coefficient). An example of an application of
the MTMM is provided in Tables 4.9a and 4.9b. Table 4.9a is the general layout of a
MTMM matrix. Table 4.9b includes the traits as identified in the GfGc dataset used
throughout this book. Note the interpretation of the coefficients in the body of the table(s)
below each table.

Table 4.9a.  Multitrait–Multimethod Matrix


Method 1 Method 2 Method 3
Traits A1 B1 C1 A2 B2 C2 A3 B3 C3
Method 1 A1 (88)
B1 50 (88)
C1 36 38 (78)
Method 2 A2 58 22 08 (90)
B2 22 60 10 66 (92)
C2 12 11 46 60 58 (86)
Method 3 A3 56 20 11 68 42 36 (94)
B3 22 58 12 44 66 34 67 (92)
C3 11 12 42 34 32 58 58 60 (85)
Note. Numbers in body of table are correlation coefficients except those in parentheses which are reliabilities. Letters
A, B, and C refer to traits; subscripts refer to methods.
  = discriminant validity (different traits measured by same methods—should be lowest of all).
Bold = convergent validity (same trait measured by different methods—should be strong and positive).
  = discriminant validity (different traits measured by different methods—should be lowest of all).
() = reliability coefficients for each test.

Table 4.9b.  Application MTMM Matrix


Method 1 Method 2 Method 3
  Traits A1 B1 C1 A2 B2 C2 A3 B3 C3
Multiple- Crystallized IQ (A1) (88)
choice Working Memory (B1) 50 (88)
(Method 1) Mathematics Achievement (C1) 36 38 (78)
Incomplete Crystallized IQ (A2) 58 22 8 (90)
sentence Working Memory (B2) 22 60 10 66 (92)
(Method 2) Mathematics Achievement (C2) 12 11 46 60 58 (86)
Vignette/­ Crystallized IQ (A3) 56 20 11 68 42 36 (94)
scenario item Working Memory (B3) 22 58 12 44 66 34 67 (92)
set (Method 3) Mathematics Achievement (C3) 11 12 42 34 32 58 58 60 (85)
Note. Numbers in body of table are correlation coefficients except those in parentheses which are reliabilities.
  = discriminant validity (different traits measured by same methods—should be lowest of all).
Bold = convergent validity (same trait measured by different methods—should be strong and positive).
  = discriminant validity (different traits measured by different methods—should be lowest of all).
() = reliability coefficients for each test.
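
As a hedged illustration of reading Table 4.9b programmatically, the sketch below enters the tabled coefficients into a small matrix (reliabilities on the diagonal) and pulls out the convergent coefficients for crystallized IQ (same trait measured by different methods) and one discriminant, heterotrait–heteromethod coefficient for comparison.

import numpy as np

# Correlations from Table 4.9b; diagonal entries are the reliabilities shown in parentheses.
labels = ["A1", "B1", "C1", "A2", "B2", "C2", "A3", "B3", "C3"]   # traits A-C, methods 1-3
r = np.array([
    [.88, .50, .36, .58, .22, .12, .56, .22, .11],
    [.50, .88, .38, .22, .60, .11, .20, .58, .12],
    [.36, .38, .78, .08, .10, .46, .11, .12, .42],
    [.58, .22, .08, .90, .66, .60, .68, .44, .34],
    [.22, .60, .10, .66, .92, .58, .42, .66, .32],
    [.12, .11, .46, .60, .58, .86, .36, .34, .58],
    [.56, .20, .11, .68, .42, .36, .94, .67, .58],
    [.22, .58, .12, .44, .66, .34, .67, .92, .60],
    [.11, .12, .42, .34, .32, .58, .58, .60, .85],
])

idx = {name: i for i, name in enumerate(labels)}

# Convergent validity: same trait (A = crystallized IQ), different methods.
convergent_A = [r[idx["A1"], idx["A2"]], r[idx["A1"], idx["A3"]], r[idx["A2"], idx["A3"]]]
print(convergent_A)   # [0.58, 0.56, 0.68]

# Discriminant (heterotrait-heteromethod) example: trait A by method 1 vs. trait B by method 2.
print(r[idx["A1"], idx["B2"]])   # 0.22 -- noticeably lower than the convergent values
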
4.15 Generalizability Theory and Construct Validity

Generalizability theory (Cronbach, Gleser, Nanda, & Rajaratnam, 1972) provides
another way to systematically study construct validity. Generalizability theory is covered in
detail in Chapter 8 of this book. In generalizability theory, the analysis of variance is used to
study the variance components (i.e., the errors of measurement due to specific sources) attrib-
utable to examinees’ scores and the method of testing (Kane, 1982). Importantly, generaliz-
ability theory is not simply the act of using analysis of variance and calling it generalizability
theory. As Brennan (1983) notes, “there are substantial terminology differences, emphasis
and scope and the types of designs that predominate” (p. 2). Readers should refer to Chapter
8 to review the foundations of generalizability theory to understand the advantages it pro-
vides in conducting validation studies. Referring to Table 4.6, we see that a generalizability
theory validity study falls into the structural stage of the construct validation process.
In generalizability theory, the score obtained for each person is considered a ran-
dom sample from a universe of all possible scores that could have been obtained (Brennan,
1983, pp. 63–68). The universe in generalizability theory typically includes multiple
dimensions known as facets. In the study of score-based validity evidence, a facet in
generalizability theory can be represented as different measurement methods. In design-
ing a generalizability theory validation study, a researcher must consider (1) the theory
specific to the construct and (2) the universe to which score inferences are to be made.
To illustrate how generalizability theory works with our example intelligence test data,
we focus on the crystallized intelligence test lexical (word) knowledge. It is possible to
measure lexical knowledge using a variety of item formats. For example, in Table 4.9b,
we see that three types of item formats are used (multiple-choice, incomplete sentence,
and vignette). These item formats might be the focus of a generalizability-based validity
study where the question of interest is, “How generalizable are the results over differ-
ent item formats?” To answer this question, a researcher can design a G-study (i.e., a
generalizability study). To conduct a G-study focusing on the impact of item format, all
examinees are tested using the different item formats (i.e., every examinee is exposed to
every item format). Also, within the context of a G theory study, we assume that the item
formats are a random sample of all possible item formats from a hypothetical universe. In
this scenario, item format is a random facet within the G-study. The goal in our example
is to estimate the generalizability coefficient within the G-study framework. Equation 4.3
(Kane, 1982) provides the appropriate coefficient for the item format random facet.
The validity coefficient in Equation 4.3 is interpreted as the average convergent coef-
ficient based on randomly choosing different methods for measuring the same trait from a
universe of possible methods (Kane, 1982). The astute reader may recognize the fact that
Equation 4.3 may also be used to estimate score reliability, a topic covered in Chapter 7.
However, the difference between interpreting Equation 4.3 as a validity coefficient versus
a reliability coefficient involves the assumptions applied. To interpret Equation 4.3 as a
reliability coefficient, the item format facet must be fixed. In this way, the reliability coef-
ficient is not based on randomly chosen methods but only represents score reliability spe-
cific to the methods included in the G-study design. For example, a researcher may want
Equation 4.3. Generalizability coefficient for an item format facet

   ρ² = ρ²_P / (ρ²_P + ρ²_PF + σ²_E)

• ρ² = generalizability coefficient.
• ρ²_P = variance due to persons.
• ρ²_PF = variance due to person by item format interaction.
• σ²_E = remaining unaccounted-for sources of error variance in scores.

to study the impact of different test forms (e.g., an evaluation of parallel test forms) using
the same item format. In this case, the study focuses on how score reliability changes
relative to the different test forms but with the item format fixed to only one type. As
you see from this brief overview, generalizability theory provides a comprehensive way to
incorporate validity and reliability of test scores into validation studies.
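
As a hedged numerical illustration of Equation 4.3, the sketch below plugs invented variance-component estimates (which in practice would come from the analysis of variance in a persons × item-format G-study) into the generalizability coefficient.

# Hypothetical variance components from a G-study with an item-format facet.
var_persons = 12.0          # variance due to persons
var_person_format = 3.0     # person-by-item-format interaction variance
var_error = 5.0             # remaining unaccounted-for error variance

g_coefficient = var_persons / (var_persons + var_person_format + var_error)
print(round(g_coefficient, 3))   # 0.6 -- proportion of observed-score variance attributable to persons
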
Although there are several approaches to establishing evidence of construct validity
of a test and the scores it yields, the driving factor for selecting a technique depends on
the intended use of the test and any inferences to be drawn from the test scores. When
developing a test, researchers should therefore be sensitive regarding what type of evi-
dence is most useful for supporting the inferences to be made from the resulting scores.

4.16 Summary and Conclusions

This chapter extended the information presented in Chapter 3 on validity and the valida-
tion process. The information in this chapter focused on techniques for estimating and
interpreting content and construct validity. Establishing the validity evidence of tests and
test scores was presented as an integration of the three components criterion, content,
and construct validity. This idea aligns with Messick’s conceptualization of construct vali-
dation as a unified process. The four guidelines for establishing evidence for the validity
of test scores are: (1) evidence based on test response processes, (2) evidence based on
the internal structure of the test, (3) evidence based on relations with other variables, and
(4) evidence based on the consequences of testing. Content validity was introduced, and
examples were provided regarding the role it plays in the broader context of the validity
evidence. Construct validity was introduced as the unifying component of validity. Four
types of construct validation studies were introduced and examples provided. Ideally, the
information provided in Chapters 3 and 4 provide you with a comprehensive perspective
on validity as it relates to psychometric methods and research in the behavioral sciences
in general.
Key Terms and Definitions


Base rate. An index, usually expressed as a proportion, of the extent to which a par-
ticular trait, behavior, characteristic, or attribute exists in a population (Cohen &
Swerdlik, 2010, p. 189).
Classification table. Table assessing the predictive ability of the discriminant function(s)
or logistic regression. Created by crosstabulating actual group membership with
predicted group membership, this matrix consists of numbers on the diagonal rep-
resenting correct classifications and off-diagonal numbers representing incorrect clas-
sifications (Hair et al., 1998, p. 241).
Confirmatory factor analysis. A technique used to test (confirm) a prespecified relation-
ship or model representing a construct or multiple constructs; the opposite of explor-
atory factor analysis.
Construct validity. An evidence-based judgment about the appropriateness of inferences
drawn from test scores regarding individual standings on a variable defined as a
construct.
Content validity. An evidence-based judgment regarding how adequately a test or other
measurement instrument samples behavior representative of the universe of behavior
it was designed to sample.
Content validity ratio. Method for quantifying content validity during the test develop-
ment process that uses expert judgments regarding how adequately a test or instrument
samples behavior from a universe of behavior it was designed to sample (Cohen &
Swerdlik, 2010; Lawshe, 1975).
Cross validation. Procedure of dividing a sample into two parts: the analysis sample
used to estimate the discriminant function(s) or logistic regression model, and the
holdout sample used to validate the results (Hair et al., 1998, p. 241).
Decision theory. A theoretical view that proposes that the goal of psychological measure-
ment and testing is to make decisions (e.g., about employment, educational achieve-
ment, or diagnosis).
Descriptive discriminant analysis. Used to describe differences among groups after a multi-
variate analysis of variance (MANOVA) is conducted (Huberty, 1994, pp. 25–30).
Discriminant analysis. A widely used method for predicting a categorical outcome such
as group membership consisting of two or more categories (e.g., medical diagnosis,
occupation type, or college major; Glass and Hopkins, 1996, p. 184).
Discriminant function. A variate of the independent variables selected for their discrimi-
natory power used in the prediction of group membership. The predicted value of
the discriminant function is the discriminant z-score, which is calculated for each
examinee (or object such as an organization) in the analysis (Hair et al., 1998,
p. 241).
Discriminant z-score. Score defined by the discriminant function for each examinee (or
object) in the analysis expressed in standardized units (i.e., z-score units; Hair et al.,
1998, p. 241).
Eigenvalue. A value representing the amount of variance contained in a correlation


matrix so that the sum of the eigenvalues equals the number of variables. Also known
as a characteristic or latent root.
Exploratory factor analysis. A technique used for (1) identifying the underlying structure
of a set of variables and (2) variable reduction; in the analysis EFA uses the variance–
covariance matrix or the correlation matrix where variables (or test items) are the
elements in the matrix.
Factor loading. The correlation between each variable (e.g., a test item or total test
score) and the factor.
False negative. Examinees predicted not to be successful but actually are successful.
False positive. Examinees predicted to be successful but are actually not successful.

Generalizability theory. An extension of classical test theory–based measurement and


reliability where multiple sources of measurement error are distinguished by using
complex analysis of variance designs; the question of score reliability centers on
the question of the accuracy of generalization from an observed score to a universe
score; a universe score is the mean score for an examinee over all conditions in the
universe of generalization (Brennan, 1983, p. 4; Pedhazur & Schmelkin, 1991;
Cronbach et al., 1972).
Heterotrait–heteromethod. A multitrait–multimethod analysis in which multiple traits are
assessed and multiple methods are simultaneously evaluated.
Heterotrait–monomethod. A multitrait–multimethod analysis in which multiple traits are
assessed and a single method is simultaneously evaluated.
Homogeneous scale. A test or other instrument that is comprised of items that represent
information common to the trait or attribute being measured.
Item homogeneity. Qualitative and quantitative evidence that items on the test or instru-
ment tap a common trait or attribute.
Least squares. Estimation procedure used in simple or multiple linear regression whereby
the regression coefficients are estimated so as to minimize the total sum of squared
residuals (Hair et al., 1998, p. 144).
Likelihood value. A measure used in logistic regression and logit analysis to represent
the degree or lack of predictive fit; similar to the sum of squared error in multiple
linear regression (Hair et al., 1998, p. 242).
Logistic curve. An S-shaped curve formed by the logit transformation that represents the
probability of an event (Hair et al., 1998, p. 242).
Logistic regression. A type of regression where the dependent or criterion variable is
dichotomous or nonmetric. Due to the dichotomous criterion variable, the regression
of the criterion on the predictor(s) is nonlinear.
Maximum likelihood method. An alternative to least-squares estimation; an estima-
tion method that iteratively improves parameter estimates to minimize a specified fit
function (Hair et al., 1998, p. 581).
Multinomial regression. Type of regression where the dependent variable is not restricted
to only two categories.
Multiple discriminant analysis. Technique used to describe differences among multiple
groups after a multivariate analysis of variance (MANOVA) is conducted. MDA is
applicable to descriptive discriminant analysis and predictive discriminant analysis
(Huberty, 1994, pp. 25–30).
Multitrait–multimethod. An analytic method that includes evaluation of construct valid-
ity relative to multiple examinee traits in relation to multiple (different) methods for
measuring such traits (Campbell & Fiske, 1959).
Multivariate analysis of variance. Technique used to assess group differences across
multiple dependent variables on a continuous scale or metric level of measurement
(Hair et al., 1998, p. 327).
Odds ratio. The ratio of the probability of an event occurring to the probability of an
event not occurring; the dependent variable in logistic regression.
Predictive discriminant analysis. Technique used to predict the classification of subjects
or examinees into groups based on a combination of predictor variables or measures
(Huberty, 1994, pp. 25–30).
Predictive efficiency. A summary of the accuracy of predicted versus actual performance
of examinees based on using discriminant analysis or other regression techniques.
Selection ratio. The proportion of examinees selected based on their scores on the crite-
rion being above an established cutoff.
Structural equation modeling. A multivariate technique that combines multiple regression
(examining dependence relationships) and factor analysis (representing unmeasured
concepts or factors comprised of multiple items) to estimate a series of interdependent
relationships simultaneously (Hair et al., 1998, p. 583).
Success ratio. The ratio of valid positives to all examinees who are successful on the
criterion.
Sum of squares and cross–products matrices. A row-by-column matrix where the diagonal
elements are sums of squares and the off-diagonal elements are cross-products.

Trait. A relatively stable characteristic of a person which is manifested to some degree


when relevant, despite considerable variation in the range of settings and circum-
stances (Messick, 1989, p. 15).
Valid negatives. Those examinees who were predicted to be unsuccessful and actually
were.
Valid positives. Those examinees who were predicted to be successful and actually were.

Variate. Linear combination that represents the weighted sum of two or more independent
or predictor variables that comprise the discriminant function.
5

Scaling

This chapter introduces scaling and the process of developing scaling models. As a foun-
dation to modern psychometrics, three types of scaling approaches are presented along
with their application. The relationship between scaling and psychometrics is provided.
Finally, commonly encountered data layout structures are presented.

5.1 Introduction

In Chapters 3 and 4 establishing validity evidence for scores obtained from tests was
described as a process incorporating multiple forms of evidence (e.g., through criterion,
content, and construct components—with construct validity representing a framework that
is informed by criterion and content elements). In this chapter, scaling and scaling mod-
els are introduced as essential elements to the measurement and data acquisition pro-
cess. The psychological and behavioral sciences afford many interesting and challenging
opportunities to formulate and measure constructs. In fact, the myriad possibilities often
overwhelm researchers. Recall from Chapter 1 that the primary goal of psychological
measurement is to describe the psychological attributes of individuals and the differences
among them. Describing psychological attributes involves some form of measurement or
classification scheme. Measurement is broadly concerned with the methods used to pro-
vide quantitative descriptions of the extent to which persons possess or exhibit certain
attributes. The development of a scaling model that provides accurate and reliable acqui-
sition of numerical data is essential to this process.
The goal of this chapter is to provide clarity and structure for researchers as they
develop and use scaling models. The first section in this chapter introduces scaling as
a process, provides a short history, and highlights its importance. The second section
constitutes the majority of the chapter; it introduces three types of scaling models and

provides guidance on when and how to use them. The chapter closes with a brief discus-
sion of the type of data structures commonly encountered in psychometrics.
Scaling is the process of measuring objects or subjects in a way that maximizes preci-
sion, objectivity, and communication. When selecting a scaling method, order and equal-
ity of scale units are desirable properties. For example, the Fahrenheit thermometer is a
linear scale that includes a tangible graphic component—the glass tube containing mer-
cury sensitive to temperature change. Alternatively, measuring and comparing aspects of
human perception requires assigning or designating psychological objects (e.g., words,
sentences, names, and pictures), then locating individuals on a unidimensional linear
scale or multidimensional map. Psychological objects are often presented to respondents
in the form of a sentence or statement, and persons are required to rank objects in terms
of similarity, order, or preference. In Chapter 1, the development of an effective scaling
protocol was emphasized as an essential step in ensuring the precision, objectivity, and
effective communication of the scores obtained from the scale or instrument.
A scaling model provides an operational or relational framework for assigning num-
bers to objects, thereby facilitating the transformation from qualitative constructs into
measurable metrics. Scaling is the process of using the measurement model to produce
numerical representations of the objects or attributes being measured. The scaling pro-
cess includes a visual interpretation in the form of a unidimensional scale or multi-
dimensional map. For scaling to be effective, the researcher needs to utilize a process
known as explication. This process involves conceptualizing and articulating a new or
undefined concept based on identifying meaningful relations among objects or variables.
Related to explication, Torgerson (1958, pp. 2–15) cites three interrelated issues essential
to the scaling process:

1. Clearly defining the theoretical approach to the scaling problem—including the


formulation of how variables or objects are constructed or measured.
2. Selecting an optimal research design for acquiring the data for subsequent use in
the scaling model.
3. Selecting an appropriate analytic technique for the analysis of data.

Notice that in applied psychometric work these three points provide a unified
approach to measurement, scaling, research design, and analysis. Attention to these issues
is crucial because the accuracy of the results obtained from the scaling process affects score
interpretation. For example, lack of careful attention to the first point directly affects score
interpretation and ultimately the validation process as discussed in Chapters 3 and 4.

5.2 A Brief History of Scaling

History provides important insights regarding how scaling has proven integral to the
evolution of psychological measurement. Such a perspective is useful for providing a
foundation and frame of reference for work in this area. As a precursor to modern psycho-
metrics, Stanley Smith Stevens’s chapter “Mathematics, Measurement, and Psychophys-
ics” in the Handbook of Experimental Psychology (Stevens, 1951b) provides an extensive
and unified treatment of psychological scaling. Stevens’s seminal work provided a cogent
foundation for the emerging discipline of psychological measurement, today known as
psychometrics (i.e., mind or mental measurement). The term psychometrics (i.e., mind
measuring) is based on the relationship between ϕ (i.e., the magnitude of the stimulus)
and Ψ (i.e., the probability that a subject detects or senses the stimulus, as in Figure 5.1).
Figure 5.1 displays an absolute threshold measured by the method of constant stimuli
for a series of nine stimulus intensities. Stimulus intensity is plotted on the X-axis. In Fig-
ure 5.1, an absolute threshold intensity of 9.5 corresponds to the proportion of trials yield-
ing a “yes, I sense the stimulus” response 50% (i.e., probability of .50) of the time. That is,
to arrive at a proportion of “yes” responses occurring 50% of the time, cross-reference the
Y-axis with the X-axis and you see that a stimulus intensity of 9.5 corresponds to a prob-
ability on the Y-axis of 50%. Figure 5.2 illustrates the relationship between the psychomet-
ric function in Figure 5.1 and the normal curve (i.e., the standard normal distribution).
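
As a hedged sketch of the idea behind Figures 5.1 and 5.2, the Python code below models the psychometric function as a cumulative normal (ogive) curve and recovers the absolute threshold as the intensity at which the probability of a "yes" response equals .50; the threshold of 9.5 matches the example above, the spread value is invented, and SciPy is assumed.

import numpy as np
from scipy.stats import norm

threshold, spread = 9.5, 2.0   # absolute threshold from the example above; spread is hypothetical

def p_yes(intensity):
    """Probability of a 'yes' response modeled as a cumulative normal function of intensity."""
    return norm.cdf(intensity, loc=threshold, scale=spread)

intensities = np.arange(0, 16)
print(np.round(p_yes(intensities), 2))

# The absolute threshold is the intensity at which the probability of a 'yes' response equals .50.
print(norm.ppf(0.5, loc=threshold, scale=spread))   # 9.5
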
Stevens focused on the interconnectivity among science, mathematics, and psycho-
physics in modeling empirical (observable) events and relations using mathematical sym-
bols and rules in conjunction with well-conceived scales. Stevens’s work provided much of
the foundation for modern psychometrics and was based on the idea that “when descrip-
tion gives way to measurement, calculation replaces debate” (Stevens, 1951b, p. 1).
Psychometric methods have evolved substantially since Stevens’s time and now
include an expanded philosophical ideology that has moved far beyond classic psycho-
physics (i.e., the mathematical relationship between an observable physical stimulus and
a psychological response). In fact, psychometric methods now consist of a broad array
of powerful scaling, modeling, and analytic approaches that facilitate the investigation

[Figure 5.1 plots the probability of a “yes” response (Ψ, Y-axis, 0 to 1.0) against stimulus intensity (X-axis, 0 to 15), with the absolute threshold, ϕ, marked at the intensity that yields Ψ = .50.]

Figure 5.1.  Psychometric function.


[Figure 5.2 shows the probability of a “yes” response (Ψ, Y-axis) plotted against the absolute threshold, Ф (X-axis), expressed in both raw-score (10–70) and z-score (–3 to +3) units, relating the psychometric function to the normal curve.]

Figure 5.2.  Relationship between the psychometric function and the normal curve.

of problems in psychology, sociology, business, biology and education, to name a few.


An important evolutionary shift in the practice and focus of psychometric methods
came with the incorporation of a connection with the philosophy of science in relation
to research. Perhaps this shift emerged based on a society composed of individuals who
asked for well-informed answers to increasingly complex social, behavioral, and bio-
logical problems—whether independently or synergistically. In any event, this change
reminds us to remain philosophically mindful and analytically precise without becoming
lost in mathematical and statistical complexity.

5.3 Psychophysical versus Psychological Scaling

Formally, the term scaling is the process of measuring stimuli by way of a mathemati-
cal representation of the stimulus–response curve (Birnbaum, 1998; Guilford, 1954;
Torgerson, 1958). Once the transformation from qualitative constructs into measur-
able metrics is accomplished, developing a mathematical representation of the rela-
tionship between a stimulus and response is a crucial step, allowing measurements to
be used to answer research questions. Here the term stimulus broadly means (1) the
ranking of preference, (2) the degree of agreement or disagreement on an attitudinal
scale, or (3) a yes/no or ordered categorical response to a test item representing a con-
struct such as achievement or ability. In psychophysical scaling models, the goal is
to locate stimuli along a continuum, with the stimuli, not persons, being mapped onto
a continuum. For example, a stimulus is often directly measurable, with the response
being the sensory-based perception in either an absolute or a relative sense (e.g., reac-
tion time). Examples where psychophysical scaling models are useful include studies
of human sensory factors such as acoustics, vision, pain, smell, and neurophysiology.
Conversely, when people are the focus of scaling, the term psychological scaling is
appropriate. Psychological scaling models where people are the focus are classified
as response-centered (see Table 5.1). Some examples of how psychological scaling
occurs in measurement include tests or instruments used to measure a person’s ability,
achievement, level of anxiety or depression, mood, attitude, or personality. Next we
turn to a discussion of why scaling models are important to psychometrics specifically
and research in general.

Table 5.1.  Three Approaches to Scaling


Method: Stimulus-centered
  Purpose: To locate items or stimuli on a continuum.
  Examples: Focuses on responses to physical stimuli in relation to the stimuli themselves.
    Class of research is psychophysics, with problems associated with detecting physical
    stimuli such as tone, visual acuity, brightness, or other sensory perception.
  Formal test of model and level of measurement (e.g., ordinal or interval): Yes

Method: Response-centered
  Purpose: Response data are used to scale or locate subjects along a psychological continuum.
  Examples: Response data are used to scale subjects along a psychological continuum, while
    simultaneously subjects are also scaled according to the strength of the psychological trait
    they possess. Example scaling techniques include Guttman scaling, unidimensional and
    multidimensional unfolding, item response theory, latent class analysis, and mixture models.
  Formal test of model and level of measurement (e.g., ordinal or interval): Yes

Method: Subject-centered
  Purpose: To scale subjects only.
  Examples: Tests of achievement or ability or other psychological constructs where, for
    example, a subject responds to an item or statement indicating the presence or absence of
    a trait or attribute. Attitude scaling that includes a subject responding to a statement
    indicating the level of agreement as in a Likert scale.
  Formal test of model and level of measurement (e.g., ordinal or interval): No
5.4 Why Scaling Models Are Important

Developing an effective scaling model is essential for the measurement and acquisition
of data. For a scaling model to be effective, accuracy, precision of measurement, and
objectivity are essential elements. A scaling model provides a framework for acquiring
scores (or numerical categories) on a construct acquired from a series of individuals,
objects, or events. Scaling models are developed based on (1) the type of measurement
(e.g., composites consisting of the sum of two or more variables, an index derived as a
linear sum of item-level responses, or fundamental, meaning that values exhibit properties
of the real number system) and (2) the type of scale (i.e., nominal, ordinal, interval, or
ratio). Scaling methods that produce models are categorized as stimulus-, response-, or
subject-centered (Torgerson, 1958, p. 46; Crocker & Algina, 1986, pp. 49–50). Table 5.1
provides an overview of each type of scaling approach.

5.5 Types of Scaling Models

The process of developing a scaling model begins with a conceptual plan that produces
measurements of a desired type. This section presents three types of scaling models—
stimulus-centered, response-centered, and subject-centered (Nunnally & Bernstein,
1994; Torgerson, 1958, p. 46)—relative to the type of measurements they produce. Two
of the models, response-centered and stimulus-centered, provide a statistical framework
for testing the scale properties (e.g., if the scale actually conforms to the ordinal, interval,
or ratio level of measurement) based on the scores obtained from the model. Alterna-
tively, in the subject-centered approach, scores are derived by summing the number of
correct responses (e.g., in the case of a test of cognitive ability or educational achieve-
ment) or by averaging scores on attitudinal instruments (e.g., Likert-type scales). In the
subject-centered approach, test scores are composed of linear sums of items (producing
a total score for a set of items) and are assumed to exhibit properties of order and equal
intervals (e.g., see Chapter 2 for a review of the properties of measurement and associ-
ated levels of measurement). At this juncture, you may ask whether you should analyze
subject-centered data using ordinal- or interval-based techniques. The position offered
here is the same as the one Frederic Lord and Melvin Novick (1968, p. 22) provided:

If scores provide more useful information for placement or prediction when they are treated
as interval data, they should be used as such. On the other hand, if treating the scores as
interval-level measurements actually does not improve, or lessens their usefulness, only the
rank order information obtained from this scale should be used.

Finally, an important point to remember in the decision between treating scores as
interval or ordinal level is that the distributional assumptions of the score values must be
evaluated prior to applying any statistical technique. Without such an evaluation of dis-
tributional assumptions, you will not know whether or not you are applying parametric
statistical (e.g., normal distribution theory-based) models to nonparametric data struc-
tures (i.e., violating assumptions of the parametric statistical model and associated tests).

5.6 Stimulus-Centered Scaling

The stimulus-centered approach to scaling is grounded in the psychophysical measure-
ment and scaling tradition (see Figures 5.1 and 5.2). Formally, the study of psychophys-
ics preceded psychometrics and today remains an important field of study in its own
right. Psychophysics is the study of dimensions of physical stimuli (usually, intensity
of sound, light, sensation, etc.) and the related response to such stimuli known as sen-
sory perception or sensation. Bruce, Green, and Georgeson (1996, p. 6) describe psycho­
physics as “the analysis of perceptual processes accomplished by studying the effect on
a subject’s experience or behavior of systematically varying the properties of a stimulus
along one or more physical dimensions.”
The psychophysical methods provided useful mathematical models for determining
thresholds along a continuous response curve over a direct physical dimension (Figure
5.1). One example of a stimulus is the frequency of auditory sound, with the percep-
tion of frequency noted by a sound’s pitch. The physical dimensions are represented by
φ and the associated sensations by ψ. Although it is qualitatively feasible for a person to
rank-order sound pitch, psychophysics focuses on expressing the relationship between ψ
and φ in a psychometrically rigorous and objective manner (e.g., on an interval or ratio
level of measurement). Specifically, psychophysical methods answer two primary ques-
tions: (1) What is the minimal amount of a stimulus (i.e., intensity) for an event to be
perceived by a person (an absolute judgment question), and (2) how different must two
stimuli be in order for a person to be able to detect a difference (a relative comparison
question)? Therefore, a concept central to psychophysics is the determination of a sen-
sory threshold.
In the early 19th century, E. H. Weber (1795–1878) and G. T. Fechner (1801–1887)
investigated sensitivity limits in human sensory organs using principles of measurement
from physics along with well-trained observers (Nunnally & Bernstein, 1994). Weber
defined an absolute threshold as the smallest amount of stimulus necessary to produce
a sensation (auditory, visual, or tactile) on the part of a subject. When a stimulus above
threshold is provided to a subject, an associated amount of intensity change (i.e., either
above or below threshold) is necessary before a sensory differential is detectable. This
critical amount of intensity change is known as a just noticeable difference, or JND,
and the difference limen (DL) is the amount of change in a stimulus required to produce
a JND. As an example of an application of the absolute threshold, consider the case of
an audiologist testing a person’s hearing. In such a test, the audiologist is interested in
determining the degree of hearing impairment or loss in relation to established normative
information based on an established absolute threshold.
As one studies the foundations of psychophysics, a common ground emerges in rela-
tion to psychometrics. Most apparent to this common ground is the relationship between
some form of stimulus and response. Mosier (1940) suggested that the theorems of psy-
chophysics could be applied to psychometrics by means of transposing postulates and
definitions in a logical and meaningful way. For example, researchers in psychophysics
model the response condition as an indicator of a person’s invariant (i.e., unchanging)
attribute. Response conditions stem from sensory perception of a visual, auditory, or
tactile stimulus. These person-specific attributes vary in response to the stimulus but are
invariant or unchanging from person to person (Stevens, 1951a). Conversely, psychome-
tricians treat the response condition as indicative of an attribute that varies from person
to person (e.g., knowledge on an ability or achievement test). However, the critical con-
nection between psychophysics and psychometrics is the stimulus–response relationship in the
measurement of perceptions, sensation, preferences, judgments, or attributes of the persons
responding.
To summarize, the main difference between psychophysics and psychometrics is
in the manner each mathematically models the invariance condition. In psychophysics,
the attribute varies within persons for the stimulus presented but is invariant from per-
son to person. In psychometrics, responses are allowed to vary from person to person,
but the attribute is invariant in the population of persons. In the mid- to late 20th cen-
tury, psychometrics incorporated the fundamental principles of classic psychophysics
to develop person- or subject-oriented, response-based measurement models known as
item response theory or latent trait theory, which involves studying unobserved attri-
butes (see Chapter 10).

5.7 Thurstone’s Law of Comparative Judgment

Louis Leon Thurstone (1887–1955) developed a theory for the discriminate modeling of
attitudes by which it is possible to construct a psychological scale. Thurstone’s law of
comparative judgment (1927) provided an important link between normal distribution
(Gaussian or cumulative normal density function) statistical theory and the psychophysi-
cal modeling tradition by defining a discriminal process as a reaction that correlates
with the intensity of a stimulus on an interval scale. Thurstone’s law uses the variability
of judgments to obtain a unit of measurement and assumes that the errors of observations
are normally distributed. The assumption of normality of errors allows for application
of parametric statistical methods to scaling psychological attributes. Although the com-
parative judgment model was formulated for use on preferential or paired comparison
data, it is applicable to any ordinal scaling problem. Thurstone’s method is different from
methods previously introduced in that it is falsifiable, meaning that the results are able to
be subjected to a statistical test of model-data fit. For example, responses by subjects to
stimuli must behave in a certain way (i.e., response patterns are expected to conform to a
particular pattern); otherwise the model will not “fit” the data. Application of Thurstone’s
law of comparative judgment requires that equally often noticed differences in stimuli by
persons are in fact equal. The law is provided in Equation 5.1.

Equation 5.1. Thurstone's law of comparative judgment

S₁ – S₂ = x₁₂ √(σ₁² + σ₂² − 2rσ₁σ₂)

• S₁ – S₂ = linear distance between two points on a psychological continuum.
• x₁₂ = standard deviation of the observed proportion, P(R₁ > R₂), of judgments.
• σ₁ = relative discriminal dispersion of stimulus 1.
• σ₂ = relative discriminal dispersion of stimulus 2.
• r = correlation between the two discriminal deviations involved in the judgments.

Because it is not always possible to obtain the information requisite to applying
Equation 5.1, the following assumptions are usually made (Bock & Jones, 1968). First,
the correlations between discriminal dispersions are zero, and second, the observations
are statistically independent. Third, the standard deviations of all of the discriminal dis-
persions are equal. Under these conditions, Equation 5.1 reduces to Equation 5.2.
Equation 5.2. Thurstone's law of comparative judgment

S₁ – S₂ = z_jk · s · √2

• S₁ – S₂ = linear distance between two points on a psychological continuum.
• s = standard deviations of the discriminal dispersions of stimuli 1 and 2.
• z_jk = normal curve ordinate of linear distance between two points on a psychological continuum.

Given Equation 5.2, we can derive the proportion of times one stimulus is preferred
over another by applying the discrepancy between proportional areas under the normal
distribution. Also, important to Equation 5.2 is the fact that the magnitude of the stimu-
lus is not present. Therefore, the law is independent of the magnitude of a stimulus, thereby
allowing for a natural framework for measuring psychological attributes on a latent con-
tinuum. In Thurstone’s law, the process of response or discrimination functions indepen-
dently of stimulus magnitudes; therefore, there is no objective criterion for the accuracy
of each judgment. For example, the judgments are not proportions of correct judgments;
rather, they represent a choice between two stimuli. For an applied example of an appli-
cation of Thurstone’s equal-interval approach to measuring attitudes, see Gable and
Wolfe (1993, pp. 42–49). The exposition includes example item generation and selec-
tion through locating persons on a response continuum using Thurstone's equal-interval
approach.
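
Although such computations are normally handled by software, the core logic of the simplified model in Equation 5.2 can be sketched briefly. The following Python fragment uses hypothetical preference proportions (not data from this book) and the conventional simplification in which each stimulus's scale value is taken as the mean of its column of unit-normal deviates; it is an illustrative sketch, not a complete Thurstonian analysis.

from statistics import NormalDist

# Hypothetical proportions p[j][k]: proportion of judges who prefer stimulus k over stimulus j.
labels = ["A", "B", "C"]
p = [
    [0.50, 0.65, 0.80],
    [0.35, 0.50, 0.70],
    [0.20, 0.30, 0.50],
]

# Convert each proportion to a unit-normal deviate z_jk via the inverse cumulative normal.
z = [[NormalDist().inv_cdf(p[j][k]) for k in range(len(labels))] for j in range(len(labels))]

# Under the simplifying assumptions, a stimulus's scale value is the mean of its column of
# deviates; differences among these values estimate distances on the latent continuum.
for k, label in enumerate(labels):
    scale_value = sum(z[j][k] for j in range(len(labels))) / len(labels)
    print(f"Stimulus {label}: scale value = {scale_value:.3f}")

With these illustrative proportions, stimulus C receives the highest scale value and stimulus A the lowest, mirroring the dominance pattern in the proportions.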

5.8 Response-Centered Scaling

Response-centered approaches to scaling focus on locating subjects on a psychological
continuum based on their responses to objects (words, sentences, pictures, tones, etc.).
The psychological continuum may be unidimensional (i.e., measuring a single construct)
or multidimensional (i.e., measuring more than one construct). Response-centered scal-
ing models include those that focus on (1) ranking or order, (2) categorical ratings based
on choices (e.g., a personal point of view), (3) judgments of similarities between objects
(e.g., objective ratings of the degree to which a person likes or dislikes something), and
(4) clustering objects or subjects. Examples of response-centered scaling approaches
include judgment- or choice-based measurement of attitudes, opinions, preferences,
knowledge, ability, and interests (Birnbaum, 1998). The following sections present each
approach with an example.

5.9 Scaling Models Involving Order

Ordinal scaling approaches involve rank-ordering objects or people from highest to low-
est (e.g., on a measure of preference or on how similar pairs of objects are). The rank-
ordering approach to scaling provides data in the form of dominance. For example, in
preference scaling, a particular stimulus dominates over another for a respondent (i.e.,
a person prefers one thing over another). Therefore, in the rank-ordering approach,
dominance relative to one stimulus over another is dictated by greater than or less than
inequalities based on rank-order values. Rank-order approaches to scaling are ordinal in
nature, and two commonly used methods are (1) paired comparisons and (2) direct
rankings. The method of paired comparisons (Tables 5.2a and 5.2b) involves counting
the votes or judgments for each pair of objects by a group of respondents. For example,
objects may be statements that subjects respond to. Alternatively, subjects may rank-
order pairs of objects by their similarities. To illustrate, in Table 5.2a pairs of depres-
sion medications are presented and subjects are asked to rank-order the pairs from most
to least similar in terms of their effectiveness based on their experience. The asterisk
denotes the respondent’s preferred drug. The votes or judgments are inversely related to a

Table 5.2.  Paired Comparisons and Preference

Table 5.2a
Drug pair                 Similarity by therapist rank
Prozac*—Paxil             5
Prozac—Cymbalta*          6
Prozac*—Zoloft            4
Paxil—Cymbalta*           3
Paxil—Zoloft*             1
Cymbalta*—Zoloft          2

Table 5.2b
              Similarity matrix
              Prozac   Paxil   Zoloft   Cymbalta
Prozac
Paxil           5
Zoloft          4        1
Cymbalta        6        3        2

ranking; for example, the category or statement receiving the highest vote count receives
the highest ranking (in traditional scaling methods a value of 1 is highest). The rankings
are then compiled into a similarity matrix as shown in Table 5.2b.
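
For readers who want to see the bookkeeping step made explicit, the brief Python sketch below compiles the ranked pairs of Table 5.2a into the lower-triangle similarity matrix of Table 5.2b. The dictionary-based representation is simply one convenient choice, not a required format.

# Ranked similarity of drug pairs, as in Table 5.2a (1 = most similar).
pair_ranks = {
    ("Prozac", "Paxil"): 5,
    ("Prozac", "Cymbalta"): 6,
    ("Prozac", "Zoloft"): 4,
    ("Paxil", "Cymbalta"): 3,
    ("Paxil", "Zoloft"): 1,
    ("Cymbalta", "Zoloft"): 2,
}

drugs = ["Prozac", "Paxil", "Zoloft", "Cymbalta"]

# Build a lower-triangle similarity matrix keyed by (row, column), as in Table 5.2b.
matrix = {}
for (a, b), rank in pair_ranks.items():
    i, j = sorted((drugs.index(a), drugs.index(b)))
    matrix[(drugs[j], drugs[i])] = rank   # row is the later-listed drug, column the earlier one

# Print the lower triangle, which reproduces Table 5.2b.
for row in drugs:
    cells = [str(matrix.get((row, col), "")) for col in drugs if drugs.index(col) < drugs.index(row)]
    print(f"{row:10s} " + " ".join(cells))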
Direct ranking involves providing a group of people a set of objects or stimuli (e.g.,
pictures, names of well-known people, professional titles, words), and having the people
rank-order the objects in terms of some property (Tables 5.3a and 5.3b). The property
may be attractiveness, reputation, prestige, pay scale, or complexity. Table 5.4 extends
direct ranking to rating the similarity between pairs of words. For extensive details on
a variety of scaling approaches specific to order, ranking, and clustering, see Guilford
(1954) and Lattin, Carroll, and Green (2003). Additionally, Dunn-Rankin, Knezek, Wal-
lace, and Zhang (2004) provide excellent applied examples and associated computer pro-
grams for conducting a variety of rank-based scaling model analyses.
Next we turn to an important scaling model, the Guttman scaling model, whose
focus is on locating subjects along a continuum based on the strength of their response
to a stimulus (e.g., a test item). This model is one of the first to appear in psychological
measurement.

5.10 Guttman Scaling

One important use of response-centered scaling models is to locate subjects on a uni-
dimensional psychological continuum in relation to the strength and pattern of their

Table 5.3.  Rank-Ordering Pairs

Table 5.3a
Pairs Similarity rank
sister–brother 1
sister–niece 3
sister–nephew 6
brother–niece 5
brother–nephew 4
nephew–niece 2

Table 5.3b
  Similarity matrix
  Sister Brother Niece Nephew
Sister
Brother 1
Niece 3 5
Nephew 6 4 2  

Table 5.4.  Word Similarity Data

Meaning similarity ratings (0 = least similar, 6 = most similar)
Word pair       Rating
eye–bye           0
eye–sight         6
eye–site          3
bye–sight         4
bye–site          1
sight–site        0

Word similarity matrix
           eye   bye   sight   site
eye
bye         0
sight       6     4
site        3     1      0

responses to items. In turn, items are scaled based on the amount or magnitude of the
trait manifested in persons. The Guttman (1941) scaling model (see also Aiken, 2002) was one
of the first approaches that provided a unified response-scaling framework. In Guttman's
technique, statements (e.g., test items or attitudinal statements) are worded in a way that
once a person responds at one level of strength or magnitude of the attribute, the person
should (1) agree with attitude statements weaker in magnitude or (2) correctly answer
test items that are easier. Based on these assumptions, Guttman proposed the method of
scalogram analysis (Aiken, 2002, p. 36) for evaluating the underlying dimensionality of a
set of items comprising a cognitive test or attitudinal instrument. For example, the unidi-
mensionality and efficacy of a set of items can be evaluated based on a comparison of the
expected to actual response patterns to test items for a sample of subjects.

A result of applying the Guttman scaling approach is that persons are placed or
located in perfect order in relation to the strength of their responses. In practice, pat-
terns of responses that portray perfect Guttman scales are rare. For this reason, the
Guttman approach also provides an equation to derive the error of reproducibility
based on expected versus actual item response patterns obtained from a sample of
persons (i.e., a test of fit of the model based on the responses). Overall, the frame-
work underlying the Guttman approach is useful in the test or instrument develop-
ment process where person response profiles of attitude, ability, or achievement are of
interest relative to developing items that measure attributes of progressively increasing
degree or difficulty. For detailed information on Guttman scaling, see Guttman (1944),
Torgerson (1958), and Aiken (2002).
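
The comparison of expected and observed response patterns that underlies scalogram analysis can also be illustrated with a short sketch. The Python fragment below uses hypothetical 0/1 responses to four items ordered from easiest to hardest, counts deviations from the ideal pattern implied by each person's total score, and reports a coefficient of reproducibility computed as 1 minus the proportion of deviant responses; this is one common way of counting errors rather than the only defensible one.

# Hypothetical 0/1 responses to four items ordered from easiest (left) to hardest (right).
responses = [
    [1, 1, 1, 0],   # consistent with a perfect Guttman pattern
    [1, 1, 0, 0],   # consistent
    [1, 0, 1, 0],   # deviates from the expected pattern
    [1, 1, 1, 1],   # consistent
]

n_items = len(responses[0])
errors = 0
for pattern in responses:
    total = sum(pattern)
    # Expected (ideal) pattern: a person with a total of t answers the t easiest items correctly.
    expected = [1] * total + [0] * (n_items - total)
    errors += sum(1 for obs, exp in zip(pattern, expected) if obs != exp)

reproducibility = 1 - errors / (len(responses) * n_items)
print(f"Errors: {errors}, coefficient of reproducibility: {reproducibility:.3f}")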

5.11 The Unfolding Technique

One of the most widely accepted models for scaling subjects (i.e., people) and items (i.e.,
stimuli) on object preference or similarity is the unidimensional unfolding technique
(Coombs, 1964). Unfolding was developed to study people’s preferential choice (i.e.,
behavior). Central to the technique is the focus on analysis of order relations in data that
account for as much information as possible. The order relations in unfolding techniques
are analyzed based on distances. By quantifying distances rigorously, interval level of
measurement is attained for nonmetric-type data. This approach differs from the scaling
of test scores based on an underlying continuous construct or trait (e.g., in intelligence
or achievement testing). The term preference refers to the manner in which persons prefer
one set of objects over another set modeled as an order relation on the relative proximity
of two points to the ideal point.
Unfolding is based on representational measurement, which is a two-way process
“defined by (1) some property of things being measured and (2) some property of the
measurement scale” (Dawes, 1972, p. 11). The goal of unfolding is to obtain an interval
scale from ordinal relations among objects. Unfolding theory is a scaling theory designed
to construct a space with two sets of points, one for persons and one for the set of
objects of choice. By doing so, unfolding uses all of the possible data in rank-order tech-
niques. To this end, the unfolding model is the most sophisticated approach to scaling
preference data. The “things” being measured in the unfolding model are objects and
may be physical in nature such as (1) an image or picture, a weight, actions or services,
or (2) they may be sensory perceptions such as smell or taste, or (3) psychological, such
as word meaning, or mathematical concepts. The “property of the scale” is distance
or location along a straight line. Taken together, a two-way correspondence model is
established: (1) the property of the things being measured (the empirical part) and (2)
the measurement scale (the formal relational system part). Based on such two-way cor-
respondence, the unfolding model qualifies as a formal measurement model residing
somewhere between an ordinal and interval level of measurement by Stevens’s (1951a)
classification system.

The unidimensional unfolding technique involves the representation of persons
(labeled as I) and stimuli or objects (labeled as J) in a single dimension represented on a
number line. In psychological or educational measurement, data are sometimes acquired
based on respondents providing global responses to statements such as (1) concept A is
more similar to concept B than to C or (2) rate the similarity of word meanings A and
B on a 10-point scale. As Dawes (1972) states, “unfolding provides a way to represent
people and stimuli jointly in space such that the relative distances between the points
reflect the psychological proximity of the stimuli to the people or their ideals” (p. 61).
The first step in conducting an unfolding analysis is to rank-order stimuli or objects
on the dimension by finding the two extreme I-scale response patterns for persons that
are mirror (inverse) images of each other (e.g., see Table 5.5). By using this informa-
tion, the endpoints of the dimension are established. The order of the person’s I-scales is
identical to his or her J-scales. Once the order of the stimuli on the J-scale is determined,
unfolding is possible by linking persons and midpoints of stimuli. Figure 5.3 illustrates
these concepts using items 1–4 in Table 5.5.
To illustrate an application of the unidimensional unfolding technique, consider the
following scenario. A respondent in a group of subjects is asked to rank-order a set of
statements regarding the minimum annual salary college graduates entering the work-
force should earn based on selected degree type. Table 5.6 provides responses from 400
college seniors, and we want to answer the question: Which two pairs of stimuli are
closer on the psychological continuum for person “X”? Note that the statements provide
four types of degrees (i.e., the objects or stimuli).
Plainly speaking, which pairs of responses do the respondents perceive as closer: Business
and English majors or Education and Biology majors?
The items are located on the joint or J-scale (horizontal axis on a Cartesian graph),
with six possible midpoints (noted as AB, AC, AD, BC, BD, and CD) and seven total
regions (see [b] and [c] in Figure 5.3). The person response patterns are located on the
I-scale (vertical axis) after the action of “folding” occurs ([d] in Figure 5.3).
Notice that in (c) in Figure 5.3, there are seven possible preference response patterns
(i.e., the numbers in parentheses) for a person, depending on where the person is located

Table 5.5.  Salary Response Statements


1. Students with a baccalaureate degree in biology should earn $60,000 annually their first year of employment.
2. Students with a baccalaureate degree in a business field should earn $60,000 annually their first year of employment.
3. Liberal arts students with a baccalaureate degree in English should earn $60,000 annually their first year of employment.
4. Students with a baccalaureate degree in education should earn $60,000 annually their first year of employment.
Note. In Figure 5.3a–d, business majors are labeled as A; English (liberal arts) majors are labeled as B; education majors are labeled as C; and biology majors are labeled as D.

Figure 5.3.  Unfolding technique with salary response data: (a) location of the four stimuli A, B, C, and D on the J-scale; (b) letter pairs (AB, AC, AD, BC, BD, CD) indicating the location of the midpoints of the stimuli on the J-scale; (c) location of subject X on the J-scale, which the midpoints divide into seven regions; (d) folding the J-scale at point (person) X to form the I-scale (I-axis).

Table 5.6.  Salary Response Data from 400 College Seniors

Rank order of degree types                    Frequency   Pattern
Business–English–Education–Biology               150       ABCD
English–Education–Biology–Business                91       BCDA
Education–Biology–Business–English                64       CDAB
Biology–Business–English–Education                37       DABC
Biology–Education–English–Business                21       DCBA
Biology–English–Business–Education                18       DBAC
English–Biology–Education–Business                10       BDCA
English–Business–Education–Biology                 6       BACD
Business–Education–English–Biology                 2       ACBD
Business–Biology–Education–English                 1       ADCB
Note. N = 400. A, business; B, English; C, education; D, biology. There can be only seven
regions in Figure 5.3 because the unique response patterns are ABCD, BACD, BCAD,
BCDA, CBDA, CDBA, and DCBA.

on the J-scale. For example, in (d) in Figure 5.3, when the J-scale is “folded” up into an
“I”-axis (called the individual scale), we see the response pattern and relational proxim-
ity for person (X) located in region two of the J-scale. After folding the J-scale, the “I”
scale represents the final rank order of person X. This result is interpreted as the relative
strength of preference expressed for a particular object or pair of objects. Each person
mapped onto an unfolding model will have a location on the J-scale and will therefore
have a corresponding I-scale that provides a rank order. Finally, when there are more
than four objects and more than a single dimension is present (as is sometimes the case),
the unidimensional unfolding model has been extended to the multidimensional case
by Bennett and Hayes (1960) and Lattin et al. (2003). Readers interested in the multidi-
mensional approach to unfolding and extensions to nonmetric measurement and metric
multidimensional scaling (MDS) are referred to Lattin et al. (2003) for applied examples.
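
The folding operation itself is straightforward to express in code: once locations on the J-scale are assumed for the stimuli and for a person's ideal point, that person's I-scale is the set of stimuli ranked by distance from the ideal point. The Python sketch below uses illustrative coordinates (they are not estimated from Table 5.6) chosen so that the midpoints fall in the order AB, AC, AD, BC, BD, CD shown in Figure 5.3; a person located in region two then yields the I-scale pattern BACD, one of the seven admissible patterns listed in the note to Table 5.6.

# Hypothetical J-scale locations for the four stimuli of Tables 5.5 and 5.6
# (A = business, B = English, C = education, D = biology); illustrative values only.
j_scale = {"A": 1.0, "B": 4.0, "C": 6.0, "D": 8.0}

# Ideal point for a person located in region two (between the AB and AC midpoints).
ideal_point_x = 3.0

# Folding the J-scale at the ideal point: rank the stimuli by distance from the person.
i_scale = sorted(j_scale, key=lambda s: abs(j_scale[s] - ideal_point_x))
print("I-scale (preference order) for person X:", "".join(i_scale))   # prints BACD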

5.12 Subject-Centered Scaling

The subject-centered approach to scaling is based on index measurement (Crocker &
Algina, 1986; Dawes, 1972, p. 14). The focus of index measurement is on the property of
the attribute being measured, resulting in a numerical index or scale score. Two examples are
provided here to illustrate the meaning of index measurement. The first example aligns
with the data used throughout this book and is for a test of ability (i.e., intelligence),
and the second is for an attitude scale. Figure 5.4 (introduced in Chapter 1) provides
an example of how the general theory (GfGc) of intelligence is conceptually represented
and subsequently mapped onto measurable space. In Figure 5.4, a scale or test consists
of the sum of a set of items (e.g., any one of the 10 tests in the figure) that measures an
underlying psychological continuum and the location of a person, within a sample, rela-
tive to his or her response. This type of model is known as a normative or cumulative
scaling model.
Examples of constructs in education that hypothetically exhibit a continuous under-
lying continuum include reading or mathematics achievement. Examples in psychol-
ogy and the behavioral sciences include depression, memory, intelligence, anxiety, and
mood. Figure 5.5 illustrates a hypothetical normative scale created from the conceptual
framework in Figure 5.4 intended to produce meaningful measurement on an underlying
attribute of crystallized intelligence for five people. In this figure, the relative position of
each person (e.g., P1 through P5) is indexed on the straight line representing a person’s
level of intelligence.
With regard to the ability or intelligence score example, referring to Figure 5.5,
assume that the total score for test 1 of crystallized intelligence is composed of the sum
of 25 items. A person’s sum score for the 25-item test provides an index or scale score for
that person. Also, in the present example, the index can be aligned with the percentile
point in the normal distribution (or any other type of distribution) based on a group or
sample of subjects given the interval level of measurement.
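
To make the indexing concrete, the short Python sketch below sums one person's hypothetical item scores (0, 1, or 2 points per item) on the 25-item test to obtain the scale score and then locates that score as an approximate percentile in a small, invented norm group, assuming the normal distribution mentioned above. Both the item responses and the norm scores are illustrative; they are not taken from the data set used in this book.

from statistics import NormalDist, mean, stdev

# Hypothetical item scores (0, 1, or 2 points) for one person on the 25-item
# crystallized intelligence test 1; these values are illustrative only.
person_items = [2, 1, 2, 0, 1, 2, 2, 1, 0, 2, 1, 2, 2, 1, 0, 2, 2, 1, 1, 2, 0, 1, 2, 2, 1]
scale_score = sum(person_items)   # the subject-centered index (scale) score

# Hypothetical norm-group scale scores used to locate the person in a distribution.
norm_scores = [20, 24, 26, 28, 29, 30, 31, 32, 33, 34, 35, 36, 38, 40, 44]
percentile = NormalDist(mean(norm_scores), stdev(norm_scores)).cdf(scale_score) * 100

print(f"Scale score: {scale_score}; approximate percentile: {percentile:.1f}")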

Figure 5.4.  General theory of intelligence. General Intelligence (G) comprises Fluid Intelligence (Gf), measured by fluid intelligence tests 1–3 (10, 20, and 20 items); Crystallized Intelligence (Gc), measured by crystallized intelligence tests 1–4 (25, 25, 15, and 15 items); and Short-Term Memory (Stm), measured by short-term memory tests 1–3 (20, 10, and 15 items).



Figure 5.5.  Scaling model for an attribute such as intelligence: five persons (P1–P5) indexed on a continuum ranging from lower to higher intelligence.

A second example of subject-centered scaling is in the area of attitude measure-
ment where the level of measurement is ordinal or ordered categorical. For example, a
researcher may want to measure a person’s attitude toward his or her political views on
a controversial topic (e.g., environment, abortion, immigration). To collect numerical
information resulting in index scores on attitudinal measurements, ordered categorical
scaling methods are used. Within this classification of scaling methods, summated
rating scales are frequently used in a variety of disciplines. Researchers using these meth-
ods ask subjects to respond to statements by marking their degree of positive affect based
on reading items consisting of symbols, statements, or words. Figure 5.6 displays an
example item from the Morally Debatable Behavior Scale—Revised (MDBS-R; Cohen &
Swerdlik, 2010, p. 239; Katz, Santman, & Lonero, 1994), a summated rating scale cre-
ated to measure opinions on moral issues. The purpose of the MDBS-R is to tap a per-
son’s strength of convictions on specific moral issues that elicit widely differing opinions.
Researchers may also use the MDBS-R to examine individual differences based on the sam-
ple responses from a group of participants. For example, the total score for the MDBS-R
is calculated for each person by summing responses to all items on the instrument for
all persons responding. Each person receives a total score that is indicative of that indi-
vidual’s overall attitude or opinion regarding the content of the items. Differences in these
individuals’ opinions are then examined using analytic techniques designed to detect dif-
ferences between groups of like-minded persons.
Another form of summated rating scale used primarily for scaling attitudes is the
Likert scale (Likert, 1932). Figure 5.7 displays a Likert scale designed to measure the
level of agreement regarding the use of intelligence tests in psychological assessment. We
see from this figure that ordered categorical scores are produced from respondents on
the Likert scale. These scores are mapped onto an underlying bipolar continuum ranging
from strongly disagreeing with the statement to strongly agreeing. Also, we see a neutral
point on the scale providing an undecided response option.

Figure 5.6.  Summated rating scale item: "Cheating on taxes if you have a chance is:" rated on a 10-point scale anchored by "never justified" (1) and "always justified" (10).



Figure 5.7.  A Likert-type item for the measurement of attitude toward the use of intelligence tests: "Intelligence tests are an essential component of psychological assessment," rated 1 (strongly disagree), 2 (moderately disagree), 3 (undecided), 4 (moderately agree), or 5 (strongly agree).

Figure 5.8 illustrates the semantic differential scale (Osgood, Tannenbaum, & Suci,
1957), another example of an ordered categorical scale. It measures a
person's reaction to words and/or concepts by eliciting ratings on bipolar scales defined
with contrasting adjectives at each end (Heise, 1970). According to Heise, “Usually, the
position on the scale marked 0 is labeled ‘neutral,’ the 1 positions are labeled ‘slightly,’ the
2 positions ‘quite,’ and the 3 positions ‘extremely’” (p. 235).
Yet another type of ordered categorical scale is the behavior rating scale. Figure 5.9
illustrates a behavior rating scale item that measures the frequency of a student's
participation in class; the behavior being measured is "class participation." After
acquiring data from a sample of students on such a scale, we can evaluate individual
differences among students according to their participation behavior in class.
Ideally, items that comprise ordered categorical, summated rating, or Likert-type
scales have been developed systematically by first ensuring that objective ratings of

Figure 5.8.  A semantic differential scale for the measurement of attitude toward intelligence tests: the concept "Intelligence tests" is rated on seven-point bipolar scales anchored by fun–work, easy–hard, and good–bad.

Figure 5.9.  A behavior rating scale for the measurement of student participation in class: "Student offers own opinions in class," rated 5 (always), 4 (frequently), 3 (occasionally), 2 (seldom), or 1 (never).

similarity, order, and/or value exist for the set of items relative to the construct or attribute
being measured. Second, the unidimensionality of the set of items should be examined
to verify if the items actually measure a single underlying dimension (e.g., see Chapter
9). The step of verifying the dimensionality of a set of items usually occurs during some
form of pilot or tryout study. If a set of items exhibits multidimensionality (e.g., it taps
two dimensions rather than one), the analytic approach must provide for the multidi-
mensional nature of the scale. The topic of dimensionality and its implications for scale
analysis and interpretation will be covered in detail in Chapter 9 on factor analysis and in
Chapter 10 on item response theory. Finally, although the assumption of equal intervals
(i.e., widths between numbers on an ordinal scale) is often made in practice, this assump-
tion often cannot be substantiated from the perspective of fundamental measurement.
Given this apparent quandary, the question regarding how one should treat scores based
on index measurement—at an interval or ordinal level—often arises. Lord and Novick
(1968) provide an answer to this question by stating that one should treat scores acquired
from index-type measurement as interval level:

If scores provide more useful information for placement or prediction when they are treated
as interval data, they should be used as such. On the other hand, if treating the scores as
interval-level measurements actually does not improve, or lessen their usefulness, only the
rank order information obtained from this scale should be used. (p. 22)

Summated rating scales and Likert-type scales are not grounded in a formal mea-
surement model, so statistical testing of the scale properties of the index scores is not
possible (Torgerson, 1958). However, in using summated rating and Likert scaling pro-
cedures, the scaling model yields scores that are assumed to exhibit properties of order
and approximately equal units. Specifically, the following assumptions are applied: (1)
category intervals are approximately equal in length, (2) category labels are preset sub-
jectively, and (3) the judgment phase, usually conducted during item or object develop-
ment as a precursor to the final scale, is replaced by an item analysis performed on the
responses acquired from a sample of subjects. Therefore, Likert scaling combines the
steps of judgment scaling and preference scaling into a single step within an item
analysis. Importantly, such assumptions should be evaluated based on the distribu-
tional properties of the actual data. After assumptions are examined and substantiated,
subject-centered scaling models often provide useful scores for a variety of psychologi-
cal and educational measurement problems.
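
As a small illustration of the item analysis referred to above, the Python sketch below computes corrected item-total correlations (each item correlated with the sum of the remaining items) for a handful of hypothetical five-point responses. An item that correlates weakly or negatively with the rest of the set, such as the fourth item in this invented example, would be flagged for review or reverse scoring; the data and the informal decision rule are illustrative only.

from math import sqrt

def pearson(x, y):
    # Pearson correlation between two equal-length lists of scores.
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sqrt(sum((a - mx) ** 2 for a in x))
    sy = sqrt(sum((b - my) ** 2 for b in y))
    return sxy / (sx * sy)

# Hypothetical responses of six people to four Likert-type items (1-5 scale).
data = [
    [4, 5, 4, 2],
    [3, 4, 3, 3],
    [5, 5, 4, 1],
    [2, 2, 3, 4],
    [1, 2, 1, 5],
    [4, 3, 4, 2],
]

for j in range(len(data[0])):
    item = [row[j] for row in data]
    rest = [sum(row) - row[j] for row in data]   # total score excluding item j
    print(f"Item {j + 1}: corrected item-total r = {pearson(item, rest):.2f}")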

5.13 Data Organization and Missing Data

Organizing data in a way that is useful for analysis is fundamental to psychometric meth-
ods. In fact, without the proper organizational structure, any analysis of data will be
unsuccessful. This section presents several data structures that are commonly encoun-
tered and concludes with some remarks and guidance on handling missing data.

Table 5.7.  Two-Way (Two-Dimensional) Raw Data Matrix


  Items/stimuli (variables) k
Objects (subjects) i 1 2 3 4 ... j k
1 x11 x12 x13 x14 ... x1j x1k
2 x21 x22 x23 x24 ... x2j x2k
3 x31 x32 x33 x34 ... x3j x3k
4 x41 x42 x43 x44 ... x4j x4k
... ... ... ... ... ... ... ...
i xi1 xi2 xi3 xi4 ... xij xik
n xn1 xn2 xn3 xn4 ... xnj xnk

The most basic data matrix consists of N persons/subjects (in the rows) by k stimuli/​
items (in the columns). This two-way data matrix is illustrated in Table 5.7. The entire
matrix is represented symbolically using an uppercase bold letter X. The data (i.e., sca-
lar) and information may take the form of 1 or 0 (correct/incorrect), ordinal, multiple
categorical (unordered), or interval on a continuous scale of, say, 1 to 100. The first
subscript denotes the row (i.e., the subject, person, or object being measured) and the
second subscript the column (e.g., an exam or questionnaire item or variable); that is, xij,
denotes the response of subject i to item j. Scalars are integers, and each scalar in a matrix
(rows × columns) is an element (Table 5.7).
A more complex data arrangement is the two-dimensional matrix with repeated mea-
surement occasions (time points) (Table 5.8). Still another data matrix commonly encoun-
tered in psychometrics is a three-dimensional array. Matrices of this type are encountered
in the scaling and analysis of preferences where multiple subjects are measured on mul-
tiple attributes (e.g., preferences or attitudes) and multiple objects (e.g., products or ser-
vices). Using the field of market research as an example, when a company manufactures a
product or offers a service in a for-profit mode, we find that it is essential that the company

Table 5.8.  k-Way (Three-Dimensional) Raw Data Matrix


                                                         Items/stimuli (variables) k
Persons (subjects) j    Time (order of measurement) i    1    2    3    ...    j    k
1 1 x111 x112 x113 ... x1ij x1ik
1 2 x121 x122 x123 ... x1ij x1ik
1 3 x131 x132 x133 ... x1ij x1ik
2 1 x211 x212 x213 ... x2ij x2ik
2 2 x221 x222 x223 ... x2ij x2ik
2 3 x231 x232 x233 ... x2ij x2ik
3 1 x311 x312 x313 ... x3ij x3ik
3 2 x321 x322 x323 ... x3ij x3ik
3 3 x331 x332 x333 ... x3ij x3ik
n i xni1 xni2 xni3 ... xnij xnik

evaluate its marketing effectiveness related to its product or service. Such research informs
the research and development process, so that the company remains financially solvent. To
effectively answer the research questions and goals, some combination of two- and three-
dimensional matrices may be required for a thorough analysis. Usually, the type of data
matrix is multivariate and involves people’s or subjects’ judgment of multiple attributes
of the product or service in question. Such matrices include multiple dependent variables
and repeated measurements (e.g., ratings or responses on an attitude scale) on the part of
subjects who are acting as observers or judges.
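
The two-way and three-dimensional arrangements described above correspond directly to array structures in most software. The brief sketch below (in Python with the widely used NumPy library and invented scores) builds a persons-by-items matrix X, retrieves an element xij, and then sets up a persons-by-occasions-by-items array of the kind suggested by Table 5.8.

import numpy as np

# Two-way matrix X: N = 3 persons (rows) by k = 4 items (columns), scored 0/1.
X = np.array([
    [1, 0, 1, 1],
    [0, 0, 1, 0],
    [1, 1, 1, 0],
])
print("x_23 =", X[1, 2])   # response of subject 2 to item 3 (zero-based indexing)

# Three-dimensional array: persons x measurement occasions x items.
X3 = np.zeros((3, 3, 4), dtype=int)
X3[:, 0, :] = X            # store the two-way matrix as the first measurement occasion
print("Shape (persons, occasions, items):", X3.shape)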

5.14 Incomplete and Missing Data

Incomplete data pose unique problems for researchers at the levels of measurement, research
design, and statistical analysis. Regardless of the reason for the incomplete data matrix,
researchers have multiple decision points to consider regarding how to proceed properly.
The missing data topic is complex and beyond the scope of this text. Excellent information
and guidance on the topic are available in Enders (2011) and Peters and Enders (2002).

5.15 Summary and Conclusions

This chapter began with connecting ideas from Chapters 3 and 4 on validity and the
validation process to the role of scaling and developing scaling models. Also, we were
reminded that essential to any analytic process is ensuring the precision, objectivity, and
effective communication of the scores acquired during the course of instrument devel-
opment or use. The development of a scaling model that provides accurate and reliable
acquisition of numerical data is essential to this process. The goal of this chapter has been
to provide clarity and structure to aid researchers in developing and using scaling models
in their research. To gain perspective, a short history of scaling was provided. The chap-
ter focused on three types of scaling models, stimulus-, subject-, and response-centered.
Next, guidance on when and how to use these models was provided along with examples.
The chapter closed with a brief discussion of the type of data structures or matrices com-
monly encountered in psychometrics and a brief mention of the problem of missing data.

Key Terms and Definitions


Absolute threshold. Defined by Ernst Weber as the smallest amount of stimulus neces-
sary to produce a sensation.
Cumulative scaling model. A scale or test consisting of the sum of a set of items that
measures an underlying psychological continuum and the location of a person, within
a sample, relative to their response.
Data matrix. A two-way matrix that consists of N persons/subjects (in the rows) by k
stimuli/items (in the columns).

Difference limen. The amount of change in a stimulus required to produce a just notice-
able difference.
Direct rankings. Involve providing a group of people a set of objects or stimuli (e.g., pic-
tures, names of well-known people, professional titles, words) and having the people
rank-order the objects in terms of some property.
Discriminal process. A reaction that correlates with the intensity of a stimulus on an
interval scale.
Element. A scalar in a row-by-column matrix.

Error of reproducibility. An equation to test the Guttman scaling model assumptions that
is based on expected versus actual item response patterns obtained from a sample
of persons.
Index measurement. Measurement that focuses on the property of the attribute being
measured, resulting in a numerical index or scale score.
Item response theory. A theory in which fundamental principles of classic psychophysics
were used to develop person-oriented, response-based measurement.
Judgment scaling. Scaling that produces absolute responses to test items such as yes/
no or correct/incorrect.
Just noticeable difference. The critical amount of intensity change when a stimulus above
or below a threshold is provided to a subject that produces an absolute threshold.
Multidimensional map. Map used in multidimensional scaling to graphically depict
responses in three-dimensional space.
Nonmetric measurement. Categorical data having no inherent order that are used in
unidimensional and multidimensional scaling.
Paired comparisons. Involve counting the votes or judgments for each pair of objects by a
group of respondents. For example, objects may be statements that subjects respond to.
Alternatively, subjects may rank-order pairs of objects by their similarities.
Person response profiles. Used when, for example, the measurement of attitude, ability,
or achievement is of interest relative to developing items that measure attributes of
progressively increasing degree or difficulty.
Preference scaling. Scaling that involves the relative comparison of two or more attri-
butes such as attitudes, interests, and values.
Psychological objects. Words, sentences, names, pictures, and the like that are used to
locate individuals on a unidimensional linear scale or multidimensional map.
Psychological scaling. The case in which people are the objects of scaling, such as
where tests are developed to measure a person’s level of achievement or ability.
Psychometrics. A mind-measuring function based on the relationship between φ (i.e., the
magnitude of the stimulus) and ψ (i.e., the probability that a subject detects or senses
the stimuli).
Psychophysical scaling. Stimulus is directly measurable, with the response being the
sensory perception in either an absolute or relative sense.

Psychophysics. The study of dimensions of physical stimuli (usually intensity) and the
related response to such stimuli known as sensory perception or sensation.
Response-centered scaling. Response data are used to scale subjects along a psycholog-
ical continuum while simultaneously subjects are also scaled according to the strength
of the psychological trait they possess. Examples of scaling techniques include
Guttman scaling, unidimensional and multidimensional unfolding, item response theory,
latent class analysis, and mixture models.
Scaling. The process by which a measuring device is designed and calibrated and the
manner by which numerical values are assigned to different amounts of a trait or
attribute.
Scaling model. Scaling that begins with a conceptual plan that produces measurements
of a desired type. Scaling models are then created by mapping a conceptual frame-
work onto a numerical scale.
Sensory threshold. A critical point along a continuous response curve over a direct physical
dimension, where the focus of this relationship is often the production of scales of human
experience based on exposure to various physical or sensory stimuli.
Stimulus-centered scaling. Scaling that focuses on responses to physical stimuli in rela-
tion to the stimuli themselves. The class of research is psychophysics with problems
associated with detecting physical stimuli such as tone, visual acuity, brightness, or
other sensory perception.
Subject-centered scaling. Tests of achievement or ability or other psychological con-
structs where, for example, a subject responds to an item or statement indicating
the presence or absence of a trait or attribute. Attitude scaling includes a subject
responding to a statement indicating the level of agreement, as in a Likert scale.
Thurstone’s law of comparative judgment. A discriminal process is defined as a reac-
tion that correlates with the intensity of a stimulus on an interval scale and uses the
variability of judgments to obtain a unit of measurement and assumes the phi-gamma
hypothesis (i.e., normally distributed errors of observations).
Unidimensional scale. A set of items or stimuli that represent a single underlying con-
struct or latent dimension.
Unidimensional unfolding technique. A technique involving the representation of per-
sons (labeled as i) and stimuli or objects (labeled as j) in a single dimension repre-
sented on a number line. In psychological or educational measurement, data are
sometimes acquired based on respondents providing global responses to statements
such as (1) concept A is more similar to concept B than to C, or (2) rate the similarity
of word meanings A and B on a 10-point scale. Unfolding provides a way to repre-
sent people and stimuli jointly in space such that the relative distances between the
points reflect the psychological proximity of the stimuli to the people or their ideals in
a single dimension.
6

Test Development

This chapter provides foundational information on test and instrument development, item
analysis, and standard setting. The focus of this chapter is on presenting a framework and
process that, when applied, produces psychometrically sound tests, scales, and instruments.

6.1 Introduction

Developing psychometrically sound tests or instruments requires attention to a variety
of complex information and numerous details. When tests or instruments are developed
effectively, they exhibit sufficient reliability and validity evidence to support the proposed
uses of resulting scores. To achieve this goal, a systematic and well-conceived approach is
required. This chapter covers three major areas of the test and instrument development
process: test construction, item analysis, and standard setting. The first section on test
construction begins by providing a set of guidelines that is useful for the types of scaling
approaches introduced in Chapter 5. The information on test and instrument construc-
tion provided here is aimed at guiding the effective production of tests and instruments
that maximize differences between persons (i.e., interindividual differences). The sec-
ond section of this chapter provides the details of item analysis with applied examples.
The third section describes the various approaches to standard setting and how they are
applied.
Chapter 5 presented three types of scaling approaches: (1) stimulus-centered, (2)
response-centered, and (3) subject-centered. In this chapter, we focus primarily on
subject-centered measurement where the goal of measurement is to locate or “index” a
person at some point on a psychological continuum (e.g., for constructs such as intelli-
gence or achievement). The test development process presented in this chapter therefore
focuses on maximizing differences between persons specific to a construct.


Identifying and defining the construct or constructs to be measured by a test is a
critical first step. A construct originates from a set of ideas resulting from various forms
of human knowledge acquisition and perception. Synthesis of these ideas forms mental
impressions. Delineating a construct in the test development process is enhanced by link-
ing the ideas or mental perceptions to a theory (e.g., as the theory of general intelligence
used throughout this book). Because psychological constructs are not directly observ-
able, researchers are tasked with developing a framework that links a construct to a set of
observable qualities, attributes, or behaviors.
The information presented in this chapter primarily focuses on tests of ability and
to a lesser degree on tests of achievement. As a point of comparison, tests of educa-
tional achievement emphasize what an examinee knows and can do at some point in time
and are usually developed primarily through establishing content evidence for validity
of scores. Alternatively, tests of ability or intelligence stress what examinees can do in
the future and are primarily developed by establishing construct evidence for validity of
scores. In either case, this chapter provides sufficiently general, yet effective, guidance
for test development. This information is an essential component for reporting compre-
hensive validity evidence as recommended in the Standards for Educational and Psycho-
logical Testing published by the American Educational Research Association (AERA), the
American Psychological Association (APA), and the National Council on Measurement
in Education (NCME) (1999).

6.2 Guidelines for Test and Instrument Development

The following guidelines describe the major components and technical considerations
for effective test and/or instrument construction (Figure 6.1). In addition to providing a
coherent approach, application of the following framework provides evidence for argu-
ments regarding the adequacy of validity evidence relative to the purported use of scores
obtained from using tests and/or measurement instruments.

Guideline 1: Articulate a Philosophical Foundation for the Test


The philosophical foundation of a test or instrument should provide a logical and mean-
ingful link between what the test purports to measure and a related body of material. A
body of material providing a meaningful philosophical link is referred to as a domain
of content (Nunnally & Bernstein, 1994, p. 295). The next step is to link domain con-
tent with domain-related criteria. In the example used throughout this book, this is
accomplished by including specific guidelines in a comprehensive document that maps
the attributes of interest to cognitive skills or tasks. Such cognitive skills or tasks serve
as the criterion for the domain of interest. Recall that three components of the the-
ory of generalized intelligence we are using for examples in this book are fluid (Gf),
crystallized (Gc), and short-term memory (Gsm). Using the G theory example, we see
that an important step in the process of test development is to link the philosophical

Figure 6.1.  Test and/or instrument development process: articulate a philosophical or theoretical foundation for the test or instrument; identify the purpose of the test or instrument; select the behaviors or attributes reflective of the construct; identify the testing audience or population; define or delineate the content that the items will include; write the test or instrument items; develop test administration procedures; conduct a pilot test with a representative sample and conduct item analyses and factor analysis; revise the test or instrument and conduct validation studies; and develop norms or other standard scores and the technical manual. A dashed line in the figure represents steps in the process that may require multiple iterations.

and theoretical foundation of G theory with representative attributes of general intel-
ligence as articulated by the theory. Recall that the philosophy of the theory of general
intelligence is expressed as a factor-analytic-based model of cognitive ability consist-
ing of an overall, general level of intelligence with factors (i.e., subtests) that measure
specific aspects of each of the major components of the theoretical model (e.g., fluid
intelligence, crystallized intelligence, short-term memory). The conceptual model of the
general theory of intelligence used in this book is displayed in Figure 6.2 (introduced
in Chapter 1).
The scores obtained from the model of general intelligence through the test as the
measuring instrument must link to the theoretical aspect of intelligence theory in order
to exhibit adequate evidence of score validity (i.e., accuracy). As detailed in Chapters 3
and 4, the various types of evidence that support the validity of score inferences include

Figure 6.2.  The GfGc theory of intelligence. General Intelligence (G) comprises Fluid Intelligence (Gf), measured by fluid intelligence tests 1–3 (items 1–10, 1–20, and 1–20); Crystallized Intelligence (Gc), measured by crystallized intelligence tests 1–4 (items 1–25, 1–25, 1–15, and 1–15); and Short-Term Memory (Stm), measured by short-term memory tests 1–3 (items 1–20, 1–10, and 1–15).

construct-related, content-related, and criterion-related information. Each of these sources
of evidence will be introduced as this chapter evolves.

Guideline 2: Identify the Purpose of the Test


The literature on test development contains numerous approaches to identifying the
purpose(s) of a test. Delineating these purposes is essential for several reasons. For exam-
ple, consider placement into a college algebra course. If skill prerequisites are not clearly
identified and defined, students may be allowed to enroll in the course only to experience
failure. Another scenario is related to test scores used as key criteria for college entrance.
If the domains of knowledge and skills are not clearly identified in a way that operational-
izes student success, then students may have a poor academic experience or possibly not
return beyond their freshman year. Table 6.1 provides a coherent guide for identifying
the purpose of a test.
Returning to the example used throughout this book, we measure three compo-
nents of general intelligence: fluid intelligence, crystallized intelligence, and short-term
Test Development  169

Table 6.1.  Test Purposes, Context, Type, and Inferences


Purpose Setting Type Conclusion or inference
Person-level
Diagnostic • Remediation and NR/CR • Strengths and weaknesses in ability
improvement or knowledge across various content
domains expressed in relative terms
Classification • Education or clinical- CR • Knowledge proficiency relative to
based interventions content standards
• Identification leading to NR • Targeted treatment based on diagnosis
treatment
• Licensure/certification CR • Acquired knowledge relative to estab-
lished standards of safe and effective
professional practice
Selection • College admission NR/CR • Knowledge for success in higher
education
• Career direction NR/CR • Predicted career success based on
knowledge, skill, or ability
Progress • Matriculation CR • Longitudinal knowledge gain or
change relative to established
­standard (e.g., curriculum)
• Course-end CR • Knowledge acquired after course
completion
• Grade promotion NR/CR • Level of knowledge upon comple-
tion of grade as prescribed as level of
educational attainment
• Growth over time NR/CR • Level of knowledge upon comple-
tion of grade as prescribed as level of
educational attainment
Placement • Course placement CR • Prerequisite knowledge needed to
enter a specific course so person is
ready or able to learn
• Counseling NR/CR • Prerequisite knowledge needed to
enter a specific course so person is
ready or able to learn

Group/class-level
Modification of • Pretest at outset of course CR • Informs instructional plan using stu-
instruction dent achievement scores
Instructional • Posttest at end of course CR • Knowledge required for standard of
value or success acceptable course attainment
• Critical review and CR • Within and between comparison of
evaluation of course for course domain to courses in other
improvement schools
Program value • Evaluation of progress CR • Educational achievement over time
across courses in subject- relative to established expectations of
matter area improvement or progress
Note. CR, criterion-referenced; NR, norm-referenced.

memory (Figure 6.2). Table 6.2 (introduced in Chapter 1) provides a review of the con-
structs and associated subtests for our GfGc example.
Our next task is to specify how much emphasis (weight) to place on each subtest
within the total test structure. To accomplish this task, we use a test blueprint, also known
as a table of specifications. Table 6.3 provides an example based on Figure 6.2. Note in
Table 6.3 the two-way framework for specifying how the individual components work in
unison in relation to the total test.
In Table 6.3, each of the subtests within these components of intelligence is clearly
identified, weighted by influence, and aligned with a cognitive skill level as articulated by
Bloom’s taxonomy (Bloom, Engelhart, Furst, Hill, & Krathwohl, 1956), Ebel and Frisbie’s
relevance guidelines (1991, p. 53), and Gagné and Driscoll’s (1988) learning outcomes
framework. For a comparison of the three frameworks, see Table 6.4.
Millman and Greene (1989, p. 309) provide one other approach to establishing a
clear purpose for a test that focuses on the testing endeavor as a process. Millman and
Greene’s approach includes consideration of the type of inference to be made (e.g., indi-
vidual attainment, mastery, or achievement) cross-referenced by the domain to which
score inferences are to be made.
Reviewing Table 6.1 reveals the myriad options and various decisions to be con-
sidered in test development. For example, test scores can be used to compare exam-
inees or persons to each other (e.g., a normative test) or to indicate a particular level
of achievement (e.g., on a criterion-based test). With regard to the test of intelligence

Table 6.2.  Constructs Measured by the General Theory of Intelligence Used


in This Text
                                     Name of subtest                  Number of items    Scoring
Fluid intelligence (Gf)
Quantitative reasoning—sequential Fluid intelligence test 1 10 0/1/2
Quantitative reasoning—abstract Fluid intelligence test 2 20 0/1
Quantitative reasoning—induction and
deduction Fluid intelligence test 3 20 0/1

Crystallized intelligence (Gc)


Language development Crystallized intelligence test 1 25 0/1/2
Lexical knowledge Crystallized intelligence test 2 25 0/1
Listening ability Crystallized intelligence test 3 15 0/1/2
Communication ability Crystallized intelligence test 4 15 0/1/2

Short-term memory (Gsm)


Recall memory Short-term memory test 1 20 0/1/2
Auditory learning Short-term memory test 2 10 0/1/2
Arithmetic Short-term memory test 3 15 0/1
Note. Scaling key: 0 = no points awarded; 1 = 1 point awarded; 2 = 2 points awarded. Sample size is N = 1,000.

Table 6.3.  Test Blueprint for Cognitive Skill Specifications


                                                       Content-level weight
                            Items    Content weight    Comprehension    Application
I. Crystallized intelligence
a. Language development 25 12% 12% 0%
b. Lexical knowledge 25 11% 5.5% 5.5%
c. Listening ability 15 12% 12% 0%
d. Communication ability 15 11% 5.5% 5.5%
Section total 80 46% 35% 11%

II. Fluid intelligence


a. Quantitative
reasoning—sequential 10 13% 6.5% 6.5%
b. Quantitative
reasoning—abstract 20 8% 3% 5%
c. Quantitative
reasoning—­induction/
deduction 20 8% 4% 4%
Section total 50 29% 13.50% 15.50%

III. Short-term memory


a. Recall memory 20 8% 0% 8%
b. Auditory learning 10 7% 3.5% 3.5%
c. Arithmetic 15 10% 4% 6%
Section total 45 25% 7.5% 17.5%
       
Test total   100% 56% 44%
Note. Only the comprehension and application levels of Bloom’s taxonomy are used here because of their appropriate-
ness for these items and subtests.

used throughout this book, scores are often used in a normative sense where a person is
indexed at a certain level of intelligence relative to scores that have been developed based
on a representative sample (i.e., established norms). This type of score information is
also used in diagnosis and/or placement in educational settings. Because placement and
selection are activities that have a high impact on people’s lives, careful consideration is
crucial to prevent misclassification. For example, consider a child who is incorrectly clas-
sified as being learning disabled based on his or her test score. The implication of such
a misclassification results in the child being placed in a class that is at an inappropriate
educational level.
Another (perhaps extreme) example in the domain of intelligence is the case where
an incarcerated adult may be incorrectly classified in a way that requires him or her to
remain on death row. If during the process of test development, inadequate attention is
paid to what criteria are important for accurate classification or selection, a person or
persons might be placed in an incorrect educational setting or be required to serve in a
capacity that is unfitting for their actual cognitive ability.

Table 6.4.  Comparison of Classification Systems of Bloom,


Ebel, and Gagné
Gagné’s learning outcomes
Bloom’s taxonomy Ebel’s relevance guide framework
A. Knowledge • Terminology • Verbal information
• Factual information

B. Comprehension • Explanation • Intellectual skills


• Cognitive strategies
C. Application • Calculation
• Prediction

D. Analysis

E. Synthesis

F. Evaluation • Recommended action


• Evaluation
G. • Attitudes

H. • Motor skills
Note. From Ebel and Frisbie (1991, p. 53). Reprinted with permission from the authors.

Finally, careful selection of the domains that test scores are intended to be linked
to will minimize the risk of inappropriate inferences and maximize the appropriateness
of score inferences and use (an important issue in the process of test validation; see
Chapters 3 and 4).
The criterion score approach to using test scores is perhaps best exemplified in high-
stakes educational testing. For example, students must earn a certain score in order to qual-
ify as “passing,” resulting in their matriculating to the next grade level. This is an example
of using criterion-based test scores for absolute decisions. Notice that a student’s perfor-
mance is not compared to his or her peers, but is viewed against a standard or criterion.

Guideline 3: Select the Attributes Reflective of the Construct


Perhaps one of the most challenging aspects of the test or instrument construction process
is related to identifying the attributes that accurately represent the construct the test is
targeted to measure. The following points are essential to the process of ensuring that the
attributes are accurately linked to the target construct. Although the points that follow are
aligned with the theory of intelligence and data used in this book, much of the information
is applicable to other types of test and instrument development (e.g., achievement tests,
personality, attitude scales, or other instruments with a clearly articulated construct).
Subject-matter experts play an important role in ensuring that the attributes that the
test purports to measure are in fact the ones being measured. Experts such as practicing

clinical adult and/or school psychologists, licensed professional counselors, and oth-
ers in psychiatry provide invaluable expert judgment based on their first-hand experi-
ence with the construct a test purports to be measuring. The actual manner of collecting
information from psychologists may involve numerous iterations of personal interviews,
group meetings to ensure adequate content coverage, or written survey instruments. The
input gleaned from subject-matter experts is an essential part of test development and the
validity evidence related to the scores as they are used following publication of the test.
The process of interviewing key constituents is iterative and involves a cyclical approach
(with continuous feedback among constituents) until no new information is garnered
regarding the construct of interest. Closely related to expert judgment is a comprehensive
review of the related literature; subject-matter experts make an important contribution
by providing their expertise on literature reviews.
Content analysis is sometimes used to generate categorical subject or topic areas.
Applying content analysis involves a brainstorming session in which questions are posed
to subject-matter experts and others who will ultimately be using the test with actual
examinees. The responses to the open-ended questions are used to identify and then cat-
egorize subjects or topics. Once the topic areas are generated, they are used to guide the
test blueprint (e.g., see Table 6.3).
Another approach to identifying attributes relative to a construct is to acquire infor-
mation based on direct observations. For example, through direct observations con-
ducted by actively practicing clinical or school psychologists, professional counselors
and licensed behavioral therapists often provide a way to identify critical behaviors or
incidents specific to the construct of interest. In this approach, extreme behaviors can be
identified, offering valuable information at extreme ends of the underlying psychologi-
cal continuum that can then be used to develop the score range to be included on the
distribution of scores or normative information. Finally, instructional objectives serve an
important role in test development because they specify the behaviors that students are
expected to exhibit upon completion of a course of instruction. To this end, instructional
objectives link course content to observable measurable behaviors.

Guideline 4: Identify the Examinee Population


Sampling is defined as the selection of elements, following prescribed rules from a
defined population. In test development, the sample elements are the examinees or per-
sons taking the test (or responding to items on an instrument). A sampling protocol is
used primarily to allow researchers to generalize or make inferences about the popula-
tion of interest in a way that avoids acquiring data from the entire population. Selecting
a sample of persons specific to how the test will be used involves collecting data from
a large sample representative of the population for which the scores are intended. In
preparing to acquire the sample, the following criteria are relevant: (1) who the sample
should consist of; (2) how credible or accurate this group is relative to the purpose of the
test; (3) what obstacles are likely to be encountered in acquiring the sample; and (4) how
these obstacles might be avoided or addressed.

To acquire the sample, some form of sampling technique is required. There are two
general approaches to sampling—nonprobability (nonrandom) and probability (ran-
dom). In nonprobability sampling, there is no probability associated with sampling a
person or unit. Therefore, no estimation of sampling error is possible. Conversely, prob-
ability samples are those that every element (i.e., person) has a nonzero chance of select-
ing and the elements are selected through a random process; each element (person) must
have at least some chance of selection although the chance is not required to be equal. By
instituting these two requirements, values for an entire population can be estimated with
a known margin of error. Two other types of sampling techniques (one nonprobability
and the other probability) are (1) proportionally stratified and (2) stratified random
sampling. In proportionally stratified sampling, subgroups within a defined population
are identified as differing on a characteristic relevant to a researcher or test developer’s
goal. Using a proportionally stratified sampling approach helps account for these char-
acteristics that differ among population constituents, thereby preventing systematic bias
in the resulting test scores. Using the stratified random sampling approach gives every
member in the strata of interest (e.g., the demographic characteristics) a proportionally
equal chance of being selected in the sampling process. The explicit details of conducting
the various approaches of random and nonrandom sampling protocols are not presented
here. Readers are referred to excellent resources such as Levy and Lemeshow (1991) and
Shadish, Cook, and Campbell (2002) to help develop an appropriate sampling strategy
tailored to the goal(s) of their work.
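
To make the proportional stratification idea concrete, the short Python sketch below draws a random sample whose strata appear in the same proportions as in the sampling frame. The data frame, stratum labels, and sample size are hypothetical illustrations only; they are not part of any particular sampling protocol discussed above.

import pandas as pd

def proportional_stratified_sample(frame: pd.DataFrame, stratum_col: str,
                                   n_total: int, seed: int = 42) -> pd.DataFrame:
    """Draw a random sample whose strata mirror their shares of the frame
    (a minimal sketch, not a full sampling protocol)."""
    pieces = []
    for stratum, group in frame.groupby(stratum_col):
        # Stratum sample size proportional to the stratum's share of the frame
        n_stratum = round(n_total * len(group) / len(frame))
        pieces.append(group.sample(n=n_stratum, random_state=seed))
    return pd.concat(pieces)

# Hypothetical sampling frame of examinees with a region stratifier
frame = pd.DataFrame({
    "examinee_id": range(10000),
    "region": ["Northeast"] * 2000 + ["South"] * 4000 +
              ["Midwest"] * 2200 + ["West"] * 1800,
})
sample = proportional_stratified_sample(frame, "region", n_total=1000)
print(sample["region"].value_counts())  # 200 / 400 / 220 / 180, matching the frame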

Guideline 5: Delineate the Content of the Items


Prior to defining the content of the test items, the construct must be selected and opera-
tionalized in a way that will serve as the basis for the test or instrument. Recall from
Guideline 3 above that deciding on the construct is usually based on review of related lit-
erature, along with consultation with subject-matter experts. Once a decision is reached
regarding what the construct will be for the test, a concise definition of the construct
should be written. Using this definition, one can write the item content with precision
and clarity. Defining the content to be sampled by the test or instrument is likely the most
important core exercise in test and/or instrument development if any valid score-based
inferences are to be made. No other component of test or instrument construction is as
important as identifying and articulating the domain of content to be sampled. Methods
for defining content vary depending on the purpose of the test or instrument, regardless
of whether the test is normative or criterion-referenced, consequences resulting from
uses of test scores, and the amount of defensibility required for any decisions resulting
from test scores. However, the primary goal at this stage of the test development process
is operationalizing the construct in a way that behaviors are observable and measure-
able. Table 6.2 provides the constructs measured by the test of general intelligence used
throughout this book, and Table 6.3 provides an example table of specifications that
details the level of taxonomy and influence (weight) of each subtest relative to the total
test.

Guideline 6: Write the Test Items


Developing test items to measure a construct involves several considerations. First is the
selection of an item format that is appropriate for the measurement task and is effective
for examinees. Second, persons must be selected and/or trained regarding the techniques
of effective item writing. Third is the task of generating or writing the items. Fourth, the
process of item writing must be monitored for quality assurance.
Item formats come in a variety of flavors. The driving factor in the selection of a par-
ticular format is based on which format is perceived as being most likely to yield the best
(i.e., most accurate) response from an examinee. The following information provides an
overview of various item formats and when each is to be used. The two major types of test
items are objective and subject-generated responses. This chapter presents information
on objective item formats because such items are congruent with the goal of objective
measurement, which in turn affects score precision.
As stated in Chapter 2, objective measurement is an important goal in psychometric
methods. There are several types of objective test item formats, depending on the goal(s)
of the test being developed. Examples include multiple-choice, pictorial item sets, alter-
nate choice (i.e., an item that requires only one choice out of two alternatives such as
true-false), word analogies, numerical problems, short-answer items, and matching items
(e.g., see Table 6.5). Common to these item types is their inherent objectivity and mini-
mal subjectivity in scoring.
The multiple-choice test item has long been the most highly regarded and widely used
type of objective test item. This format is highly versatile and effective for discriminating
persons exhibiting high and low levels of ability or achievement (Haladyna, 2004). Critics
of multiple-choice items cite their weakness in measuring higher-order cognitive skills.
The argument offered to support this claim lies in the idea that because the test item has
provided important information by way of the response alternatives, the item is inher-
ently flawed. To avoid this inherent flaw, multiple-choice items must be constructed in a
way that prevents critical clues to examinees regarding how they answer the item. Once
the multiple-choice item response alternatives have been carefully reviewed and edited for
possible correct-answer clues, the strength of the multiple-choice format item is that it
requires examinees to use original thought, creative thinking, and abstract reasoning to
select among reasonable response alternatives. In ability and achievement testing, multiple-
choice and other objective item types have proven effective for measuring achievement or
ability specific to knowledge, comprehension, ability to solve problems, make predictions,
and judgment. In fact, any element of ability or understanding can be measured by multiple-
choice test items if thoughtfully constructed. An important characteristic of objective test
items is that the response options should appear reasonable or plausible to examinees who
do not have adequate knowledge or skill related to the item content.

Recommendations for Writing Objective Items


Numerous test item types are subsumed under the general heading of the multiple-choice
format. There are a variety of item formats because certain formats are more effective in

eliciting responses specific to content or ability than others. A detailed presentation of the
numerous types of item formats is beyond the scope of this book. For a summary of the
types of multiple-choice item formats available and when they are appropriate for use, see
Haladyna (2004, p. 96). Table 6.5 provides Haladyna’s recommendations.
Haladyna (2004, p. 99) provides a general set of item-writing guidelines aided
by an extensive discussion with 31 guidelines. The guidelines are grouped according
to (1) content guidelines, (2) style and format concerns, (3) writing item stems, and
(4) writing choice options. Some important points highlighted by Haladyna include
the following:

1. Items should measure a single important content as specified in the test specifi-
cations or blueprint.
2. Each test item should measure a clearly defined cognitive process.
3. Trivial content should be avoided.
4. Items should be formatted (i.e., style considerations) in a way that is not distract-
ing for examinees.
5. Reading comprehension level should be matched to the examinee population.
6. Correct grammar is essential.
7. The primary idea of a question should be positioned on the stem rather than in
the options.
8. Item content must not be offensive or culturally biased.

Table 6.5.  Multiple-Choice Item Formats and Type of Content Measured


Format                      Knowledge    Cognitive skills    Ability    Item format (intelligence theory example data in this book)
Conventional
  multiple-choice X X X
Alternate choice X X
Matching X X
Extended matching X X X
True–false X X
Complex
  multiple-choice X X
Multiple true–false X X
Pictorial item set X X X X
Problem-solving item set X X
Vignette or scenario
  item set X X
Interlinear item set   X  
Note. Adapted from Haladyna (2004, p. 96). Copyright 2004 by Lawrence Erlbaum Associates. Adapted by permission.

The following items provide examples from the fluid and crystallized intelligence
subtests used throughout this book.

Fluid intelligence quantitative reasoning subtest item example


Administration instructions: For this test, you will be asked to perform calculations to decide
on your answer. Please tell me the answer to the following question:

A sweater that normally sells for 90 dollars is reduced by 20% during a sale. What is the sale
price of the sweater?
A. 71 dollars
B. 75 dollars
C. 72 dollars
D. 76 dollars

Scoring rule: 1 point awarded for correct response, 0 points awarded for incorrect response.
Time limit is 30 seconds on this item.

Crystallized intelligence language ability subtest item example


Administration instructions: For this test, you will be asked to state the meanings of words.
Please tell me the meaning of the following word: DELINEATE.

Scoring rule: To earn 2 points, the following answer options are acceptable: (a) to describe, (b) to
outline, (c) to explain in detail. To earn 1 point, the following answer options are acceptable: (a) to explain with
accuracy, (b) to mark, (c) portray, (d) to characterize. The criteria for earning 0 points include the
following answer options: (a) ambiguous, (b) to be vague, (c) nonsense, (d) to portray.

Note that the scoring rule produces a polytomous score of 0, 1, or 2 points for an exam-
inee, yielding an ordinal level of measurement (i.e., on the crystallized intelligence exam-
ple item). Also, in tests of cognitive ability, scoring rules are often more complex than the
preceding example. For example, there are additional scoring rule components: (a) dis-
continue rules specific to how many items an examinee fails to answer correctly in a row
(e.g., the examiner stops the test if the examinee earns 0 points on 5 items in a row), and
(b) reverse rules (e.g., a procedure for reversing the sequence of previously completed
items administered if an examinee earns a low score such as 0 or 1 on certain items that
subject-matter experts have deemed that the examinee should earn maximum points).
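
To make the idea of a discontinue rule concrete, the following minimal Python sketch stops scoring after a run of consecutive 0-point responses. The response string and the run length of five are hypothetical, and operational scoring software for a published test would be considerably more elaborate.

from typing import List

def apply_discontinue_rule(item_scores: List[int], run_length: int = 5) -> int:
    """Sum item scores, stopping once `run_length` consecutive 0s are observed
    (a minimal sketch of a discontinue rule, not any published test's rule)."""
    total, consecutive_zeros = 0, 0
    for score in item_scores:
        if score == 0:
            consecutive_zeros += 1
            if consecutive_zeros == run_length:
                break  # discontinue: remaining items are not administered or scored
        else:
            consecutive_zeros = 0
        total += score
    return total

# Hypothetical polytomous (0/1/2) response string for one examinee
print(apply_discontinue_rule([2, 2, 1, 0, 2, 0, 0, 0, 0, 0, 2, 1]))  # -> 7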

Short-term immediate memory subtest item example


Administration instructions: For this test, a series of numbers is presented to you. Your task
is to repeat the numbers immediately after they are presented in the same order. Next, if you
successfully complete the previous question, a more difficult question of the same format will
be given to you.

Item: 3-6-7-11-13-17-18

Scoring rule: 1 point awarded for correct response, 0 points awarded for incorrect response. To
earn 1 point, the series of numbers must be repeated in exact sequence.

Writing Items for Measuring Attitudes or Personality


The construct of attitude has played an important role in social psychology for some time.
Techniques for measuring and scaling attitude have received a great deal of attention over
the past half-century (Kerlinger & Lee, 2000). Common item formats for measuring atti-
tudes, interests, and personality include Likert-type (Figure 6.3a, introduced in Chapter
5), bipolar adjective lists (e.g., Figure 6.3b [the semantic differential scale], introduced
in Chapter 5), the summated rating scale (Figure 6.3c, introduced in Chapter 5), and
agree–disagree (Figure 6.3a) type items (Gable & Wolfe, 1993; Kerlinger & Lee, 2000).
Gable and Wolfe (1993, pp. 40–60) provide comprehensive coverage regard-
ing the technical aspects of developing well-crafted items that measure attitude.

Intelligence tests are an essential component of psychological assessment.

      1               2              3              4               5
  strongly        moderately     undecided     moderately       strongly
  disagree         disagree                      agree            agree

Figure 6.3a.  Likert-type item with agreement response format.

Intelligence tests

fun: _____: _____: _____: _____: _____: _____: _____: work

easy: _____: _____: _____: _____: _____: _____: _____: hard

good: _____: _____: _____: _____: _____: _____: _____: bad

Figure 6.3b.  Semantic differential item.

Cheating on taxes if you have a chance is:

    1     2     3     4     5     6     7     8     9     10
  never                                                  always
justified                                              justified

Figure 6.3c.  Summated rating scale item.



The following list includes important considerations when writing items to measure
attitude:

1. Avoiding statements written in the past tense.


2. Constructing statements that include a single thought, selecting statements that
cover the range of the scale.
3. Avoiding the use of double-negative wording.
4. Constructing statements that reflect simple sentence structure.
5. Avoiding use of words with absolute connotation such as only or just.
6. Avoiding statements that are likely to be endorsed by all respondents.
7. Avoiding statements that have multiple interpretations.
8. Avoiding statements that include absolute terms such as always or none.
9. Keeping language simple, clear, and direct.
10. Keeping statements under 20 words.

Guideline 7: Develop the Test Administration Procedures


Test administration procedures include (1) available time, (2) mode of delivery—group
or individual, and (3) delivery platform (computer or paper/pencil). Establishing an
appropriate timeframe for examinees to take the test is critical, and several factors are
of concern. First, the purpose and length of the test are to be optimized to ensure accu-
racy (i.e., validity) of scores. For example, given the purpose of the test, what is the
minimum number of items that can be administered while adequately measuring the
target ability? Age of the examinee(s) is also an important factor, with younger exam-
inees requiring shorter administration time periods. Examinee fatigue is yet another
factor to be considered. The examinee’s fatigue is affected by the type of item format
used, time of day the test is administered, and whether the test is delivered by computer
or paper/pencil.

Guideline 8: Conduct the Pilot Test with a Representative Sample


Pilot (a.k.a. tryout) test administrations serve as an excellent opportunity for researchers to
acquire information from examinees regarding their behavior during test taking. The two
main objectives of the pilot testing phase are obtaining statistical information on the items
and obtaining comments and suggestions from the examinees after taking the test under
actual conditions or circumstances. Often the comments from examinees are extremely
useful in refining test items or administrative procedures. Certain examinee behaviors dur-
ing the testing experience may be indicative of problems with certain items. For example,
behavior such as repeated changing of answers or lengthy pauses on items may indicate a
problematic item or items. Examination of descriptive statistics through conduct of an item
analysis (next section) also provides important information regarding how examinees are

collectively responding to an item. Taken together, the item analysis and examinee feed-
back are the two most useful activities that should occur during the pilot test.

Guideline 9: Conduct the Item and Factor Analyses


Item analyses involve a collection of statistical techniques that provide a basis for select-
ing the best items. Conducting an item analysis allows a way for researchers to detect
items that (1) are ambiguous, (2) are incorrectly keyed or scored, (3) are too easy or
too hard, and (4) do not discriminate well. The objectives of the test drive the criteria
for which the element of an item analysis is considered most important. For example,
a researcher may want to create a test using items that will maximize its internal con-
sistency. However, another researcher may want to select items that maximize the test’s
criterion-related validity (e.g., in occupational or placement testing). Factor analysis is
a statistical technique that provides a rigorous approach for confirming whether the set
of test items comprising a test functions in a way that is congruent with the underlying
theory of the test (e.g., the GfGc theory of intelligence in the examples used in this book).

Guideline 10: Develop Norms or Interpretative Scores


In many, if not most, testing situations, a frequent practice is to provide normative mean-
ing to the definition of a scale that produces scores (Angoff, 1984, p. 39). Normative
scores (a.k.a. norms) are descriptive statistics that enable comparisons of a particular
score with scores earned by other members of a well-defined group (see Chapter 11). The
well-defined group is based on specific criteria that reflect the target population for which
the test will be used. Norms are linked to percentile ranks within a score distribution,
making it possible to identify an examinee’s relative standing to others in a normative
population. Identifying the relative location of an examinee’s score offers a way to make
interpretative statements. For example, using the example data in this book, an examinee
with a fluid intelligence scale score of 115 is at the 84th percentile. Yet, the same examinee
is located at the 50th percentile on fluid intelligence by exhibiting a scale score of 100.
Generally, norms are used in two ways. First, norms are used to enable classifications of
examinees or persons into categories such as acceptable or desired. Second, norms are often
used to classify an examinee or person according to a standard or clinical ideal (Angoff, 1984).
As an example of when norms are used for classification, consider the term body mass index
(BMI) and how it is used to evaluate obesity level. A person is determined to be clinically
obese if he or she exceeds a certain range on the body mass index (BMI) normative table.
The BMI normative table was developed independently, based on a representative sample
of persons in the United States. For a second example of how norms are used, consider the
example where an examinee takes one of the subtests on fluid intelligence used in this book.
The interpretation of norms in this case is statistical because the examinee’s score-based per-
formance is classified as being high or low in relation to a defined population of persons.
The preceding explanation is typically how norms are used in education and psychology.
Importantly, using norms properly involves clearly understanding (and not confusing)
norms that represent standards to be achieved with norms that describe achievement
as it exists (e.g., in educational settings). To this end, the technical manual for a test
should clearly articulate how norms are aligned with the purpose of the test and are used.
Angoff (1984, p. 41) provides the following guidelines in developing norms and
subsequently providing information for inclusion in the technical manual regarding their
use. First, the attribute being measured must allow for examinees to be ordered along a
continuum and measured on at least an ordinal scale. Second, the test must include an
operational definition of the attribute under consideration such that other tests mea-
suring the same attribute will yield similar ordering of examinees. Third, the test must
provide an evaluation of the same construct throughout the range of scores. Fourth, the
group on which descriptive statistics are based should be appropriate to the test and for
the purpose for which the test was designed. Fifth, data should be made available for as
many distinct norm populations with which it is useful for an examinee or a group to be
compared (Angoff, 1984, p. 41). Finally, several types of norms may be derived based on
the purpose of the test, including (1) national norms, (2) local or state norms, (3) norms
by age or grade level, (4) age and grade-equivalent norms, (5) item norms, (6) school
or organization-level norms, (7) user-defined norms, and (8) special study norms. For
details on each of these types of norms, see Angoff (1984).

Guideline 11: Write the Technical Manual and Associated Documentation


Developing the technical manual is a concluding activity in the test development process
and provides comprehensive documentation of all the processes and procedures used to
develop the test. The process of writing the technical manual forces one to thoroughly revisit
and evaluate all of the procedures used in developing a test. Furthermore, technical manu-
als aid in external evaluations of any test by independent researchers. The technical manual
provides a summary source for all of the psychometric and validity evidence for a test and
should be written in enough detail to allow the reader to form a clear judgment about the
rigor and adequacy of the procedures used in each step. The technical manual includes the
systematic documentation of all important components of the test development process.
Minimally, the components of the technical manual should include (1) a synopsis of the
test development process, including the purpose, psychometric foundation, and intended
use of scores; (2) administration procedures, including any required training for adminis-
trators or users of the test; (3) scaling (e.g., classical or item response theory) and scoring
procedures; and (4) normative information and score reporting (i.e., norms tables and sup-
plemental analyses such as validity studies, reliability studies, and factor-analytic results).
Section 6.2 has provided guidelines and technical considerations for effective test
and/or instrument construction. These guidelines offer a systematic and comprehensive
approach to test development. In addition to providing a set of principles to follow, the
information provides evidence for arguments regarding the validity of the use of scores
obtained. The next section provides detailed information on item analysis, the process of
examining the statistical properties of items in a test or instrument based on responses
obtained from a pilot or tryout sample of examinees.

6.3 Item Analysis

In test construction, the goal is to produce a test or instrument that exhibits adequate evi-
dence of score reliability and validity relative to its intended uses. Several item and total
test statistics are derived to guide the selection of the final set of items that will comprise
the final version of the test or instrument. Key statistics that are derived in evaluating
test items specifically include item-level statistics (e.g., proportion correct, item valid-
ity, and discrimination) and total test score parameters such as mean proportion correct
and variance. Item analysis of attitudinal or personality instruments includes many but
not necessarily all of the indexes provided here. The decision about which item analysis
indexes are appropriate is dictated by the purpose of the test and how the scores will be
used. Table 6.6a illustrates item-level statistics for crystallized intelligence test 2 (measur-
ing lexical knowledge) based on 25 items scored on a 0 (incorrect) and 1 (correct) metric
for the total sample of N = 1000 examinees. Item analyses are presented next based on
the SPSS syntax below.

SPSS syntax for generating Tables 6.6a–d

RELIABILITY
/VARIABLES=cri2_01 cri2_02 cri2_03 cri2_04 cri2_05 cri2_06
cri2_07 cri2_08 cri2_09 cri2_10 cri2_11 cri2_12 cri2_13
cri2_14 cri2_15 cri2_16 cri2_17 cri2_18 cri2_19 cri2_20 cri2_21
cri2_22 cri2_23 cri2_24 cri2_25
/SCALE(‘ALL VARIABLES’) ALL
/MODEL=ALPHA
/STATISTICS=DESCRIPTIVE SCALE
/SUMMARY=TOTAL MEANS VARIANCE.
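
For readers who prefer open-source tools, a rough Python analogue of the SPSS run above is sketched below. The 0/1 responses are simulated stand-ins for the cri2_01 through cri2_25 items (an assumption made only so the sketch runs on its own); in practice the actual item responses would be loaded in place of the simulated matrix, and only then would the output resemble Tables 6.6a-d.

import numpy as np
import pandas as pd

# Simulated stand-in for the 25 dichotomously scored items cri2_01-cri2_25
rng = np.random.default_rng(0)
ability = rng.normal(size=1000)
difficulty = np.linspace(-2.5, 2.5, 25)
data = (rng.random((1000, 25)) < 1 / (1 + np.exp(difficulty - ability[:, None]))).astype(int)
items = pd.DataFrame(data, columns=[f"cri2_{i:02d}" for i in range(1, 26)])

# Item-level descriptives (analogue of Table 6.6a)
print(pd.DataFrame({"Mean": items.mean(), "SD": items.std(ddof=1), "N": items.count()}))

# Cronbach's alpha for the total scale (analogue of Table 6.6d)
k = items.shape[1]
alpha = (k / (k - 1)) * (1 - items.var(ddof=1).sum() / items.sum(axis=1).var(ddof=1))
print(f"Cronbach's alpha = {alpha:.3f}")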

6.4 Item Difficulty

The difficulty of an item is defined as the proportion of examinees responding correctly


to it. Item difficulty is synonymously referred to as proportion correct, or p.
These values are displayed in the column labeled “Mean” in Table 6.6a. In Table 6.6a,
notice how the items at the beginning are easy (i.e., items have a high proportion of cor-
rect values) and become increasingly difficult throughout the set of 25 items. The range
of proportion correct is 0.0 (i.e., all examinees respond incorrectly to an item) to 1.0
(i.e., all examinees respond correctly to an item). Higher proportion-correct values
are indicative of an easier test. For example, criterion-referenced mastery tests typically
exhibit high values of p (i.e., the distribution of scores is negatively skewed because the
majority of scores cluster or group at the high end of the score range). Item difficulty
plays an important role in item analysis because practically all test score statistics are at
least partially influenced by it.
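
As a minimal illustration of the calculation, the sketch below computes p for each item as the column mean of a 0/1 response matrix; the small matrix is hypothetical.

import numpy as np

# Hypothetical 0/1 response matrix: rows = examinees, columns = items
responses = np.array([
    [1, 1, 0, 1],
    [1, 0, 0, 1],
    [1, 1, 1, 0],
    [0, 1, 0, 1],
])

# Item difficulty (p) is simply the column mean of the 0/1 scores
p = responses.mean(axis=0)
print(p)  # [0.75 0.75 0.25 0.75]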

Table 6.6a–d.  Descriptive Statistics and


Reliability for Crystallized Intelligence Test 2

Table 6.6a.
Mean SD N

crystallized intelligence test 2 item 1 1.00 .071 1000

crystallized intelligence test 2 item 2 .99 .109 1000

crystallized intelligence test 2 item 3 .87 .334 1000

crystallized intelligence test 2 item 4 .81 .391 1000

crystallized intelligence test 2 item 5 .73 .446 1000

crystallized intelligence test 2 item 6 .72 .449 1000

crystallized intelligence test 2 item 7 .83 .379 1000

crystallized intelligence test 2 item 8 .67 .471 1000

crystallized intelligence test 2 item 9 .61 .488 1000

crystallized intelligence test 2 item 10 .58 .494 1000

crystallized intelligence test 2 item 11 .52 .500 1000

crystallized intelligence test 2 item 12 .52 .500 1000

crystallized intelligence test 2 item 13 .52 .500 1000

crystallized intelligence test 2 item 14 .52 .500 1000

crystallized intelligence test 2 item 15 .48 .500 1000

crystallized intelligence test 2 item 16 .44 .497 1000

crystallized intelligence test 2 item 17 .33 .469 1000

crystallized intelligence test 2 item 18 .26 .439 1000

crystallized intelligence test 2 item 19 .24 .428 1000

crystallized intelligence test 2 item 20 .21 .409 1000

crystallized intelligence test 2 item 21 .19 .395 1000

crystallized intelligence test 2 item 22 .16 .370 1000

crystallized intelligence test 2 item 23 .12 .327 1000

crystallized intelligence test 2 item 24 .07 .247 1000

crystallized intelligence test 2 item 25 .03 .171 1000

Table 6.6b.
Summary Item Statistics

Mean Minimum Maximum Range Maximum / Minimum Variance N of Items

Item Means .497 .030 .995 .965 33.167 .082 25

Item Variances .171 .005 .250 .245 50.200 .006 25



Table 6.6c.
Total Scale Statistics

Mean Variance Std. Deviation N of Items

12.43 29.540 5.435 25

Table 6.6d.
Reliability Statistics

Cronbach’s
Alpha Based on
Cronbach’s Alpha Standardized Items N of Items

.891 .878 25

6.5 Item Discrimination

Item discrimination indexes in test construction provide researchers a measure of the


influence a test item exhibits on the total test. Two broad categories of item discrimina-
tion statistics are the D-index, which is derived using information on the lowest and
highest performing examinees, and the correlation-based indexes that capitalize on
the relationship between each item and the total test score for a group of examinees.
The discrimination index (D) measures the magnitude to which test items distinguish
between examinees with the highest and lowest scores on a test. The upper and lower
examinee ability groups (also known as criterion groups) can be constructed in differ-
ent ways, depending on the purpose of the test. Examples of methods for establishing
criterion groups include using the (1) upper and lower halves (i.e., 50%) of a group of
examinees, (2) upper and lower thirds (i.e., 33%) or quarters (25%), and (3) extreme
(upper and lower) (27%) groups.
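
As one concrete illustration (using the extreme-groups option of upper and lower 27%), the Python sketch below computes D for a single item as the difference in proportion correct between the two criterion groups formed on the total score. The simulated response matrix is hypothetical, and the handling of ties at the cut points is deliberately crude.

import numpy as np

def d_index(responses: np.ndarray, item: int, tail: float = 0.27) -> float:
    """Discrimination index D = p(upper group) - p(lower group) for one item.
    Upper and lower groups are the top and bottom `tail` proportion on total
    score (a minimal sketch; ties at the cut points are handled crudely)."""
    total = responses.sum(axis=1)
    order = np.argsort(total)
    n_tail = max(1, int(round(tail * len(total))))
    lower = responses[order[:n_tail], item]
    upper = responses[order[-n_tail:], item]
    return upper.mean() - lower.mean()

# Hypothetical 0/1 response matrix (rows = examinees, columns = items)
rng = np.random.default_rng(1)
ability = rng.normal(size=200)
responses = (rng.random((200, 10)) < 1 / (1 + np.exp(-ability[:, None]))).astype(int)
print(round(d_index(responses, item=0), 2))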
Item D-indexes enjoy a direct relationship with item proportion correct statistics and
provide additional information about how examinees perform across the score range. For
example, an extremely easy item is not useful for discriminating between high- and low-
ability examinees because the majority of scores cluster or group at the high end of the score
distribution (i.e., producing a negatively skewed distribution). For an item to exhibit maxi-
mum or perfect discrimination (i.e., D = 1.0, which requires a proportion correct of .50), all examinees with high ability (i.e.,
examinees in the upper score level of the criterion) will answer an item correctly, whereas
all examinees with lower ability will answer the item incorrectly. However, the previous
statement assumes that no guessing has occurred among the examinees. Another important
point about the relationship between item difficulty and discrimination is that items exhib-
iting high discrimination require some “optimal” level of proportion correct (difficulty).
In turn, item difficulty is directly considered in light of the established purpose of the test.
To this end, item proportion correct and item discrimination are to be considered relative
to one another during test construction. In fact, a test item displaying optimal difficulty
does not ensure a high level of discrimination (Sax, 1989, p. 235). Table 6.7 illustrates the

Table 6.7.  Relationship between Item Difficulty


and Maximum Values of Item Discrimination
Proportion correct values (p) Maximum values of D

1.00 .00
.90 .20
.80 .40
.70 .60
.60 .80
.50 1.00
.40 .80
.30 .60
.20 .40
.10 .20
.00 .00
Note. Assumes that examinees have been divided into upper and lower cri-
terion groups of 50% each. Adapted from Sax (1989, p. 235). Copyright­
1989 by Wadsworth Publishing Company. Adapted by permission.

Table 6.8.  Item Discrimination Index Screening Criteria


Index of
discrimination Item evaluation
.40 and up Very good items
.30 to .39 Reasonably good but possibly subject to improvement
.20 to .29 Marginal items, usually needing improvement
below .19 Poor items, to be rejected or improved by revision
Note. Adapted from Ebel and Frisbie (1991, p. 232). Adapted with permission from the authors.

relationship between item proportion correct and discrimination. Ebel and Frisbie (1991,
p. 232) provide guidelines (Table 6.8) for screening test items based on the D-index.
On objective test items such as multiple choice, guessing is a factor that must be
considered. To establish an optimal proportion correct value that accounts for guessing,
the following information is required: (1) the chance level score based on the number of
response alternatives and (2) the number of items comprising the test. Consider the sce-
nario where the test item format is multiple choice; there are four response alternatives,
and a perfect score on the test is 1.0 (i.e., 100% correct). Equation 6.1 provides a way to
establish the optimal proportion correct value for a test composed of 30 multiple-choice
items with four response alternatives.
In Equation 6.1, the chance score for our multiple-choice items with four response
alternatives is derived as 1.0 (perfect score) divided by 4, resulting in .25 (i.e., a 25%
chance due to guessing). Taking one-half of the difference between a perfect score and
the chance score (1.00 − .25 = .75) yields a value of .375. Next, adding the chance-level value (.25) to .375

Equation 6.1. Derivation of the optimal proportion correct value accounting for guessing

$$\text{chance score} + \frac{\text{perfect score} - \text{chance score}}{2} = .25 + \frac{1.00 - .25}{2} = .25 + \frac{.75}{2} = .625$$

Table 6.9.  Optimal Difficulty Levels for Items Having Different


Number of Options as Determined by Two Different Procedures
Number of response options    Optimal difficulty using Equation 6.1    Optimal difficulty according to Lord
0 .50 .50
2 .75 .85
3 .67 .77
4 .63 .74
5 .60 .69
Note. Adapted from Sax (1989, p. 236). Copyright 1989 by Wadsworth Publishing Company.
Adapted by permission.

yields .625 or ~63%. The interpretation of the result in the previous sentence is that 63%
of the examinees are expected to answer the items on the test correctly. This approach
is less than optimal because it does not account for the differential difficulty of the indi-
vidual items comprising the total test. A revised approach presented by Fred Lord (1952)
accounts for differential difficulty among test items. Table 6.9 provides a comparison of
Lord’s work to the results obtained using Equation 6.1.
Correlation-based indexes of item discrimination are used more often in test con-
struction than the D-index. Correlation-based indexes are useful for test items that are
constructed on at least an ordinal level of measurement (e.g., Likert-type or ordered
categorical response formats) or higher (e.g., interval-level scores such as IQ scores). Foun-
dational to the correlation-based item discrimination indexes is the Pearson correlation
coefficient that estimates the linear relationship between two variables. For item dis-
crimination indexes, the two variables that are correlated include the response scores to
individual items and the total test score.

6.6 Point–Biserial Correlation

The point–biserial correlation is used to estimate the relationship between a test item
scored 1 (correct) or 0 (incorrect) and the total test score. The formula for deriving the
point–biserial correlation is provided in Equation 6.2 (see also the Appendix).

Equation 6.2. Point–biserial correlation coefficient

$$r_{pbis} = \frac{\bar{X}_S - \bar{X}_U}{s_Y}\sqrt{pq}$$

• $\bar{X}_S$ = mean score on the continuous variable for the group that is
successful on the dichotomous variable.
• $\bar{X}_U$ = mean score on the continuous variable for the group that is
unsuccessful on the dichotomous variable.
• $s_Y$ = overall standard deviation of the scores on the continuous
variable.
• $q$ = proportion of individuals in the unsuccessful group,
1 − p.
• $p$ = proportion of individuals in the successful group.

The point–biserial coefficient does not require a normal distribution to underlie the
dichotomous variable or test item. Therefore, it is more broadly applicable than the
biserial coefficient (presented next), which does assume a normal distribution underlying
the two levels of the dichotomous variable. In test
development and/or revision, the point–biserial is useful for examining the contribution
of a test item to the total test score and its impact on the reliability of scores on the total
test. If the total test comprises fewer than 25 items, a correction to the point–biserial cor-
relation is recommended whereby the item under study is removed from calculation of
the coefficient (e.g., see Crocker & Algina, 1986, p. 317). This step removes any spurious
effect that may occur due to including the item under study in the calculation of the total
test score. Table 6.10 provides the point–biserial coefficients (column 6) for the 25-item
crystallized intelligence test 2. The results in Table 6.10 are from the phase I output of
the program (Du Toit, 2003):

POINT BISERIAL AND BISERIAL.BLM - CRYSTALLIZED INTELLIGENCE


TEST 2 ITEMS 1-25
>COMMENTS
>GLOBAL NPARM=2, LOGISTIC, DFNAME=’C:\rpbispoly.DAT’;
>LENGTH NITEMS=25;
>INPUT NTOTAL=25, NGROUPS=1, NIDCHAR=9;
>ITEMS INUMBERS=(1(1)25);
>TEST TNAME=CRIT2;
(9A1,25A1)
>CALIB NQPT=10, CYCLES=15, CRIT=0.005, NEWTON=2, PLOT=1;
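
For readers without access to BILOG-MG, point-biserial coefficients of the kind reported in Table 6.10 can be computed directly from raw 0/1 responses. The sketch below follows Equation 6.2 (algebraically equivalent to the Pearson correlation between the 0/1 item and the total score) and also shows the item-removed correction mentioned above; the simulated response matrix is hypothetical, so its output will not reproduce Table 6.10.

import numpy as np

def point_biserial(item: np.ndarray, total: np.ndarray, corrected: bool = False) -> float:
    """Point-biserial correlation between a 0/1 item and a total score (Equation 6.2).
    With corrected=True the item is removed from the total before computing
    (the adjustment recommended for short tests)."""
    y = total - item if corrected else total
    p = item.mean()                      # proportion answering the item correctly
    q = 1.0 - p
    mean_correct = y[item == 1].mean()
    mean_incorrect = y[item == 0].mean()
    return (mean_correct - mean_incorrect) / y.std(ddof=0) * np.sqrt(p * q)

# Hypothetical demonstration data: 200 examinees, 25 dichotomous items
rng = np.random.default_rng(7)
ability = rng.normal(size=200)
difficulty = np.linspace(-2, 2, 25)
responses = (rng.random((200, 25)) < 1 / (1 + np.exp(difficulty - ability[:, None]))).astype(int)
total = responses.sum(axis=1)
print(round(point_biserial(responses[:, 3], total), 2))
print(round(point_biserial(responses[:, 3], total, corrected=True), 2))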

Table 6.10.  BILOG-MG Point–Biserial and Biserial Coefficients for the 25-Item
Crystallized Intelligence Test 2
Name        N      #right    PCT     LOGIT    Pearson r (pt.–biserial)    Biserial r
ITEM0001 1000 0.00 0.00 99.99 0.00 0.00
ITEM0002 1000 995.00 99.50 –5.29 0.02 0.11
ITEM0003 1000 988.00 98.80 –4.41 0.09 0.30
ITEM0004 1000 872.00 87.20 –1.92 0.31 0.49
ITEM0005 1000 812.00 81.20 –1.46 0.37 0.54
ITEM0006 1000 726.00 72.60 –0.97 0.54 0.72
ITEM0007 1000 720.00 72.00 –0.94 0.57 0.76
ITEM0008 1000 826.00 82.60 –1.56 0.31 0.45
ITEM0009 1000 668.00 66.80 –0.70 0.48 0.62
ITEM0010 1000 611.00 61.10 –0.45 0.52 0.67
ITEM0011 1000 581.00 58.10 –0.33 0.51 0.64
ITEM0012 1000 524.00 52.40 –0.10 0.55 0.69
ITEM0013 1000 522.00 52.20 –0.09 0.67 0.85
ITEM0014 1000 516.00 51.60 –0.06 0.62 0.77
ITEM0015 1000 524.00 52.40 –0.10 0.53 0.67
ITEM0016 1000 482.00 48.20 0.07 0.56 0.71
ITEM0017 1000 444.00 44.40 0.22 0.60 0.76
ITEM0018 1000 327.00 32.70 0.72 0.57 0.74
ITEM0019 1000 261.00 26.10 1.04 0.49 0.66
ITEM0020 1000 241.00 24.10 1.15 0.46 0.64
ITEM0021 1000 212.00 21.20 1.31 0.53 0.75
ITEM0022 1000 193.00 19.30 1.43 0.47 0.68
ITEM0023 1000 164.00 16.40 1.63 0.46 0.69
ITEM0024 1000 122.00 12.20 1.97 0.37 0.59
ITEM0025 1000 65.00 6.50 2.67 0.34 0.65
Note. No point–biserial/biserial coefficient is provided for item 1 because all examinees responded correctly to the
item. LOGIT, logistic scale score based on item response theory; PCT, percent correct.

6.7 Biserial Correlation

The biserial correlation coefficient is used when both variables are on a continuous metric
and are normally distributed but one variable has been artificially reduced to two discrete
categories. For example, the situation may occur where a cutoff score or criterion is used to
separate or classify groups of people on an attribute (e.g., mastery or nonmastery). An unde-
sirable result that occurs when using the Pearson correlation on test scores that have been
dichotomized for purposes of classifying masters and nonmasters (e.g., when using a cutoff
score) is that the correlation estimates and associated standard errors are incorrect owing to
the truncated nature of the dichotomized variable. To address this problem, mathematical
corrections are made for the dichotomization of the one variable, thereby resulting in a correct
Pearson correlation coefficient. Equation 6.3 provides the formula for the biserial correlation.

Equation 6.3. Biserial correlation coefficient

$$r_{bis} = \frac{\bar{X}_S - \bar{X}_U}{s_Y}\cdot\frac{pq}{z}$$

• $\bar{X}_S$ = mean score on the continuous variable for the group that is success-
ful on the dichotomous variable.
• $\bar{X}_U$ = mean score on the continuous variable for the group that is unsuc-
cessful on the dichotomous variable.
• $s_Y$ = overall standard deviation of the scores on the continuous
variable.
• $pq$ = proportion of individuals in the successful group times the pro-
portion of individuals in the unsuccessful group.
• $z$ = ordinate of the standard normal distribution corresponding to p.
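
A minimal Python sketch of Equation 6.3 is shown below, using the standard normal ordinate evaluated at the quantile corresponding to p. The demonstration data are simulated and purely illustrative; the Appendix program mentioned in the text is not reproduced here.

import numpy as np
from statistics import NormalDist

def biserial(item: np.ndarray, total: np.ndarray) -> float:
    """Biserial correlation (Equation 6.3) between an artificially dichotomized
    item (0/1) and a continuous total score."""
    p = item.mean()
    q = 1.0 - p
    ordinate = NormalDist().pdf(NormalDist().inv_cdf(p))  # z in Equation 6.3
    mean_success = total[item == 1].mean()
    mean_fail = total[item == 0].mean()
    return (mean_success - mean_fail) / total.std(ddof=0) * (p * q / ordinate)

# Hypothetical data: a continuous score and an artificially dichotomized item
rng = np.random.default_rng(3)
scores = rng.normal(50, 10, size=500)
item = (scores + rng.normal(0, 8, size=500) > 52).astype(int)
print(round(biserial(item, scores), 2))
# Relationship to the point-biserial: r_bis = r_pbis * sqrt(p*q) / ordinate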

6.8 Phi Coefficient

If a researcher is tasked with constructing a test for mastery decisions (as in the case of a
criterion-referenced test), the phi coefficient (see the Appendix) can be used to estimate
the discriminating power of an item. For example, each item score (0 or 1) can be cor-
related with the test outcome (mastery or nonmastery) using cross tabulation or contin-
gency table techniques, as shown in Figure 6.4.
To illustrate how the table in Figure 6.4 works, if the masters largely answer the item correctly
(the value in cell A is large) and the nonmasters largely answer it incorrectly (the value in
cell D is large), the item discriminates well between the levels of achievement specified.
This interpretation is directly related to false-positive (i.e., the probability that the test
classifies an examinee in the mastery category when he or she is in fact a nonmaster)
and false-negative (i.e., the probability that the test classifies an examinee in
the nonmastery category when he or she is in fact a master) outcomes. (See the Appendix
for more information on using contingency table analysis in the situation of making deci-
sions based on group classification.)

                         Mastery decision
                      Mastery    Nonmastery
Item score    1          A            B
              0          C            D

Figure 6.4.  Cross tabulation of mastery/nonmastery using 2 × 2 frequency table.
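
Working directly from the cell counts in Figure 6.4, the phi coefficient can be computed as (AD − BC) divided by the square root of the product of the four marginal totals. The cell counts in the sketch below are hypothetical.

import math

def phi_coefficient(a: int, b: int, c: int, d: int) -> float:
    """Phi coefficient from the 2x2 table in Figure 6.4:
    a = correct & master, b = correct & nonmaster,
    c = incorrect & master, d = incorrect & nonmaster."""
    numerator = a * d - b * c
    denominator = math.sqrt((a + b) * (c + d) * (a + c) * (b + d))
    return numerator / denominator

# Hypothetical counts: masters mostly answer correctly, nonmasters mostly do not
print(round(phi_coefficient(a=120, b=30, c=25, d=115), 2))  # ~ .62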



6.9 Tetrachoric Correlation

Another correlation coefficient used in item analysis, the tetrachoric correlation, is


presented in the Appendix (including computational source code). The tetrachoric cor-
relation is useful in the test construction process when a researcher wants to create
artificial dichotomies from a variable (item) that is assumed to be normally distrib-
uted (e.g., perhaps from a previously developed theory verified by empirical research).
Use of this correlation coefficient has proven highly useful for factor-analyzing a set of
dichotomously scored test items that are known to be modeling an underlying construct
that is normally distributed in the population. However, calculation of the tetrachoric
correlation is complex, thus requiring statistical programs designed for this purpose.
One may wonder why the phi correlation coefficient (which is easier to calculate and is
more widely accessible) is not used rather than the tetrachoric correlation. The primary
reason is that the phi coefficient suffers from an artificial restriction on its range when
the proportions being compared are unequal. This problem results from the phi coefficient
being a derivative of the Pearson correlation. The tetrachoric correlation does not suffer from this problem and is
therefore the correct coefficient to use when items comprising a test are based on a con-
struct that is normally distributed.
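
Exact maximum likelihood estimation of the tetrachoric correlation is indeed best left to dedicated software, but a rough classical approximation (the so-called cosine-pi formula) can be sketched from the 2 × 2 cell counts to make the idea concrete. The cell counts below are hypothetical, and the approximation should not be mistaken for the estimate such programs produce.

import math

def tetrachoric_approx(a: int, b: int, c: int, d: int) -> float:
    """Cosine-pi approximation to the tetrachoric correlation for a 2x2 table
    with concordant cells a (1,1) and d (0,0) and discordant cells b and c.
    A rough classical approximation, not the maximum likelihood estimate."""
    if b == 0 or c == 0:
        return 1.0  # degenerate table: the approximation tends toward +1
    return math.cos(math.pi / (1.0 + math.sqrt((a * d) / (b * c))))

# Hypothetical cross-tabulation of two dichotomously scored items
print(round(tetrachoric_approx(a=40, b=10, c=15, d=35), 2))  # ~ .71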

6.10 Item Reliability and Validity

One strategy designed to improve test score reliability is to select items for the final test
form based on the item reliability, while also simultaneously considering the item valid-
ity index. Item reliability and validity indexes are a function of the correlation between
each item and the variability (i.e., standard deviation) of each item score for a sample of
examinees. The item reliability index is a statistic designed to provide an indication of
a test’s internal consistency, as reflected at the level of an individual item. For example,
the higher the item reliability index, the higher the reliability of scores on the total test.
When using the item reliability index to evaluate individual test items, the total test score
(that the item under review is a part of) serves as the criterion.
To calculate the item reliability index, one needs two components—the propor-
tion correct for the item and the variance of the item. The variance of a dichotomously
scored test item is the proportion correct times the proportion incorrect ($p_i q_i$). Using
these components, we can derive the item reliability index as $\sqrt{p_i q_i}\, r_{iX}$, where $p_i$ is the
proportion correct for an item, $q_i$ is 1 minus the proportion correct for an item, and $r_{iX}$
is the point–biserial correlation between the item and the total test score. Remember that
taking the square root of the variance ($p_i q_i$) yields the standard deviation, so the item
reliability index is weighted by the variability of an item. This fact is helpful in item
analysis because the greater the variability of an item, the greater influence it will have
on increasing the reliability of test scores. To illustrate, using the values in Table 6.10, the
item reliability index for item number four on crystallized intelligence test 2 (measuring
lexical knowledge—actual usage of a word in the English language) is obtained by multiplying the
standard deviation for item number four (.39) by the item point–biserial correlation
(.31), resulting in an item reliability index of .12 (see the underlined values in Table 6.11).
Alternatively, the item validity index is expressed as $s_i r_{iY}$, where $s_i$ is the standard
deviation of an item and $r_{iY}$ is the correlation between the item and an external criterion
(e.g., an outcome measure on a test of ability, achievement, or short-term memory). The
item validity index is a statistic reflecting the degree to which a test measures what it pur-
ports to measure as reflected at the level of an individual item—in relation to an external
measure (criterion). In item analysis, the higher the item validity index, the higher the
criterion-related validity of scores on the total test. Returning to the crystallized intel-
ligence test 2 (lexical knowledge), consider the case where a researcher is interested in
refining the lexical knowledge subtest in a way that maximizes its criterion validity in
relation to the external criterion of short-term memory. Again, using item number 4 on

Table 6.11.  Item Reliability and Validity Indexes for Crystallized Intelligence Test 2
and the Total Score for Short-Term Memory Tests 1–3
                       Point–biserial                Point–biserial
        Mean    SD     short-term memory tests 1–3   crystallized intelligence test 2   Item reliability index^a   Item validity index^b
item 1 1.00 0.07 0.03 0.00 0.00 0.00
item 2 .99 0.11 0.10 0.02 0.00 0.01
item 3 .87 0.33 0.18 0.09 0.03 0.06
item 4 .81 0.39 0.28 0.31 0.12 0.11
item 5 .73 0.45 0.40 0.37 0.17 0.18
item 6 .72 0.45 0.39 0.54 0.24 0.17
item 7 .83 0.38 0.23 0.57 0.22 0.09
item 8 .67 0.47 0.19 0.31 0.15 0.09
item 9 .61 0.49 0.25 0.48 0.23 0.12
item 10 .58 0.49 0.25 0.52 0.26 0.12
item 11 .52 0.50 0.32 0.51 0.26 0.16
item 12 .52 0.50 0.39 0.55 0.28 0.19
item 13 .52 0.50 0.33 0.67 0.34 0.16
item 14 .52 0.50 0.30 0.62 0.31 0.15
item 15 .48 0.50 0.36 0.53 0.27 0.18
item 16 .44 0.50 0.27 0.56 0.28 0.13
item 17 .33 0.47 0.29 0.60 0.28 0.14
item 18 .26 0.44 0.36 0.57 0.25 0.16
item 19 .24 0.43 0.31 0.49 0.21 0.13
item 20 .21 0.41 0.21 0.46 0.19 0.09
item 21 .19 0.40 0.37 0.53 0.21 0.15
item 22 .16 0.37 0.23 0.47 0.17 0.09
item 23 .12 0.33 0.28 0.46 0.15 0.09
item 24 .07 0.25 0.21 0.37 0.09 0.05
item 25 .03 0.17 0.19 0.34 0.06 0.03
^a Item reliability = the point–biserial correlation multiplied by the item standard deviation.
^b Item validity = the point–biserial correlation defined as the correlation between an item and the criterion score (the
total score for short-term memory), multiplied by the item standard deviation.

the crystallized intelligence test 2, the item validity index is calculated by multiplying the
item standard deviation (.39) by the point–biserial correlation of the item with the short-
term memory total score (i.e., total score expressed as the sum of the three subtests). The
resulting item validity index is .11 (see the underlined values in Table 6.11).
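
Because both indexes are simple products of an item's standard deviation and a correlation, they are easy to reproduce once the item statistics are in hand. The sketch below checks item 4's published values from Tables 6.10 and 6.11.

def item_reliability_index(item_sd: float, r_item_total: float) -> float:
    """Item reliability index: item SD times the item-total point-biserial."""
    return item_sd * r_item_total

def item_validity_index(item_sd: float, r_item_criterion: float) -> float:
    """Item validity index: item SD times the item-criterion correlation."""
    return item_sd * r_item_criterion

# Values reported in Tables 6.10/6.11 for item 4 of crystallized intelligence test 2
print(round(item_reliability_index(0.39, 0.31), 2))  # -> 0.12
print(round(item_validity_index(0.39, 0.28), 2))     # -> 0.11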
Using the item reliability and validity indexes together is helpful in constructing a
test that meets a planned (e.g., in the test blueprint) minimum level of score variance
(and reliability), while also considering criterion-related validity of the test. An important
connection to note is that the total test score variance is expressed as the sum of the item
reliabilities (see Chapter 2 on the variance of a composite). To aid in test construction,
Figure 6.5 is useful because a researcher can inspect the items that exhibit optimal bal-
ance between item reliability and item validity. In this figure, the item reliability indexes
are plotted in relation to the item validity indexes.
For test development purposes, items farthest from the upper left-hand corner of the
graph in Figure 6.5 should be selected first for inclusion on the test. The remaining items
can be included, but their inclusion should be defensible based on the purpose of the test as
articulated in the test specifications (e.g., in consideration of content and construct validity).
The goal of the item analysis section of this chapter was to introduce the statistical
techniques commonly used to evaluate the psychometric contribution items make in
producing a test that exhibits adequate evidence of score reliability and validity relative
to its intended uses. To this end, several item and total test statistics were derived to guide
the selection of the final set of items that will comprise the final version of the test. Key
statistics derived in evaluating test items included item-level statistics such as the mean
and variance of individual items, proportion correct for items, item reliability, item valid-
ity, and item discrimination indexes.

Figure 6.5.  Relationship between item validity and item reliability on crystallized intelligence
test 2.

6.11 Standard Setting

Tests are sometimes used to classify or select persons on the basis of score performance
along some point along the score continuum. The point along the score continuum is
known as the cutoff score, whereas the practice of establishing the cutoff score is known
as standard setting. Establishing a single cutoff score results in the score distribution of
the examinees being divided into two categories. The practice of standard setting com-
bines judgment, psychometric considerations, and the practicality of applying cutoff
scores (AERA, APA, & NCME, 1999). Hambleton and Pitoniak (2006, p. 435) state that
“the word standards can be used in conjunction with (a) the content and skills candi-
dates are viewed as needing to attain, and (b) the scores they need to obtain in order to
demonstrate the relevant knowledge and skills.” Cizek and Bunch (2006) offer further
clarification by stating that

the practice of standards setting should occur early in the test development process so as to
(a) align with the purpose of the test, test items, task formats, (b) when there is opportunity
to identify relevant sources of evidence bearing on the validity of categorical classifications,
(c) when evidence can be systematically gathered and analyzed, and (d) when the standards
can meaningfully influence instruction, examinee preparation and broad understanding of
the criteria or levels of performance they represent. (p. 6)

For example, in educational achievement testing, students are required to attain a certain
level of mastery prior to matriculating to the next grade. In licensure or occupational test-
ing, examinees are required to meet a particular standard prior to a license or certification
being issued.
Central to all methods for establishing cutoff scores for determining a particular
level of mastery is the borderline examinee. A borderline examinee is defined as a hypo-
thetical examinee used by subject-matter experts as a reference point to make a judgment
regarding whether such an examinee of borderline ability or achievement would answer
the test item under review correctly. More recently, the establishment of the No Child
Left Behind program (NCLB, 2001) and the Individuals with Disabilities Education Act (IDEA,
1997) has resulted in multiple score performance categories (e.g., Basic, Proficient, and
Advanced). In this scenario, two cutoff scores are required to partition the score distribu-
tion into three performance categories.
An example of the impact standard setting has in relation to the decision-making
process is perhaps no more profound than when intelligence tests have been used as one
criterion in the decision to execute (or not) a person convicted of murder. For example,
in Atkins v. Virginia (2002), a person on death row was determined to have a full-scale
intelligence score of 59 (classifying him as mentally retarded) on the Wechsler Adult
Intelligence Scale—III (Wechsler, 1997b). Based on the person’s score, the Supreme Court
overturned the sentence by ruling that the execution of mentally retarded persons is
“cruel and unusual” and therefore prohibited by the Eighth Amendment to the United
States Constitution (Cizek & Bunch, 2006, p. 6).

In summary, standard setting is a measurement activity that plays an important
role in making informed decisions about examinee performance. Given the importance
of the decision(s) being made, the role of measurement in standard setting is to provide
accurate and relevant information. Psychometrics plays an important part in standard
setting by ensuring that any classifications or decisions being made are based on high-
quality data. The term high quality means that the data are objective, defensible, and
reproducible. The activity of standard setting in general and establishing cutoff scores
specifically are substantiated by a comprehensive process that incorporates explicit
criteria.

6.12 Standard-Setting Approaches

Numerous schemes have been suggested regarding the classification of standard-setting meth-
ods. Standard-setting methods are classified as norm-referenced or criterion-referenced,
depending on the purpose and type of test being used. The norm-referenced approach is
a method of deriving meaning from test scores by evaluating an examinee’s test score and
comparing it to scores from a group of examinees (Cohen & Swerdlik, 2010, p. 656). For
example, a certification or licensure test may be administered on an annual or quarterly
basis, and the purpose of the test is to ensure that examinees meet a certain standard (i.e.,
a score level such as 80% correct) relative to one another. This approach is appropriate if
the examinee population is stable across time and the test meets the goals (i.e., is prop-
erly aligned with content-based standards of practice) for the certification or licensing
organization or entity.
Alternatively, criterion-referenced methods are absolute in nature because they focus
on deriving meaning from test scores by evaluating an examinee’s score with reference
to a set standard (Cohen & Swerdlik, 2010, p. 644). For example, the method is driven
according to the knowledge and skills an examinee must possess or exhibit in order
to pass a course of study. Central to the criterion-referenced method is the point that
adequate achievement by an examinee is based solely on the examinee and is in no way
relative to how other examinees perform.
Finally, the standards-referenced method, a modified version of criterion-
referenced method, has recently emerged in high-stakes educational achievement testing.
The standards-referenced method is primarily based on the criterion-referenced method
(e.g., examinees must possess a certain level of knowledge and skill prior to matriculating
to the next grade). Normative score information is also created by the testing organization
in charge of test development and scoring for (1) educational accountability purposes
(e.g., NCLB) and (2) statewide reporting of school district performance. The follow-
ing sections introduce four common approaches to establishing cutoff scores. Readers
seeking comprehensive information on the variety of approaches currently available for
specific testing scenarios should refer to Cizek and Bunch (2006) and Zieky, Perie, and
Livingston (2008).

6.13 The Nedelsky Method

The Nedelsky method (Nedelsky, 1954) is one of the first standard-setting methods intro-
duced, and was developed in an educational setting for setting cutoff scores on criterion-​
referenced tests composed of multiple-choice items. However, because the method
focuses on absolute levels of performance, the Nedelsky method is also widely used in
setting standards in the area of certification and licensure testing. A useful aspect of the
method is that subject-matter experts (SMEs) must make judgments about the level of
severity of the incorrect response alternatives—in relation to how an examinee with border-
line passing ability will reason through the answer choices. A subject-matter expert partici-
pating in the standard-setting exercise is asked to examine a question and to eliminate the
wrong answers that an examinee of borderline passing ability would be able to recognize
as wrong. For example, the item below is an example of the type of item contained in a
test of crystallized intelligence. In this item, participants using the Nedelsky method are
asked to evaluate the degree of impact selecting a certain response alternative will have
relative to successfully finding their way back to safety.

If you lost your way in a dense forest in the afternoon, how might you
find your way back to a known area?

A. Use the sun to help you find your way
B. Follow a path
C. Shout for help
D. Wait for authorities to locate you

In the Nedelsky method, the subject-matter expert might decide that a borderline
examinee would be able to eliminate answer choices C and D because the options might
leave the stranded person lost indefinitely. Answer B is a reasonable option, but a path
may or may not be present, whereas option A, using the sun, is the best option—although
it is possible that the sun may not be shining. Establishing a cutoff score on a test for
an examinee based on a set of test items similar to the example above proceeds as fol-
lows. First, the probability of a correct response is calculated for each item on the test.
For example, the probability of a correct response by an examinee is 1 divided by the
number of remaining response alternatives—after the examinee has eliminated the wrong-
answer choices. So, in the example item above, the borderline examinee is able to elimi-
nate answer choices C and D, leaving the probability of a correct response as 1 divided
by 2 or 50%. After the probabilities for each test item are calculated, they are summed to
create an estimate of the cutoff score.
The Nedelsky method has at least two drawbacks. First, if a borderline examinee can
eliminate all but two answer choices, or perhaps all of the incorrect answer choices, then the prob-
ability of a correct response is either .5 or 1.0. No probabilities between .5 and 1.0 are
possible. Second, test item content can be substantially removed from what test exam-
inees are actually used to seeing in practice. For this reason, using actual item responses
from a pilot test is very useful to aid subject-matter experts in the procedure. Requiring
actual pilot data with item responses is problematic for any cutoff score technique that
focuses only on test items, giving little consideration to practical reality.

6.14 The Ebel Method

In the Ebel method (Ebel & Frisbie, 1991), subject-matter experts classify test items into
groups based on each item’s difficulty (easy, medium, or hard) and relevance or impor-
tance (essential, important, acceptable, and questionable). Next, subject-matter experts
select the probability that a borderline examinee will respond to each item correctly.
The same probability is specified for every item within a given group. Cutoff scores are derived
by taking each respective group of items (e.g., a group of 15 items) and multiplying the number
of items in the group by the subject-matter expert’s specified probability for that group. This step
is repeated for each group of items, and the products are then summed across groups to yield each
subject-matter expert’s cutoff score. To obtain the panel’s cutoff score, the subject-matter experts’
individual cutoff scores are averaged using the mean or possibly a trimmed mean if desired. A
limitation of the Ebel method is that a single probability judgment applies to every item within a
grouping (e.g., one judgment covering all 15 items in a group), regardless of the total number of
test items, so differences among the items within a grouping are not reflected in the cutoff score. However,
a strength of the method is that subject-matter experts must consider the relevance and
difficulty of each test item.

6.15 The Angoff Method and Modifications

The Angoff method (Angoff, 1984) was introduced in the early 1980s and is the most
commonly used approach to standard setting, although it is used mainly for certifica-
tion and licensing tests. The Angoff method (and variations of it) is the most researched
standard-setting method (Mills & Melican, 1988). In this method, subject-matter experts
are asked to (1) review the test item content and (2) make judgments about the propor-
tion of examinees in a target population that would respond to a test item correctly. The
target population or examinee group of interest is considered to be minimally competent,
which means that they are perceived as being barely able to respond correctly (or pass)
to a test item. This process is repeated for every item on the test. Finally, the sum of the
item scores represents the score for a minimally acceptable examinee. In a variation of
the Angoff method, for each test item, subject-matter experts are asked to state the prob-
ability that an acceptable number of persons (not just a single person) can be identified
as meeting the requisite qualifications as delineated by established standards for certi-
fication, licensure, or other type of credential. The probability is expressed as the pro-
portion of minimally acceptable examinees who respond correctly to each test item. In

Table 6.12.  Modified Angoff Method with Eight Raters, Two Ratings Each
Rater   Item 1   Item 2   Item 3   Item 4   Item 5   Item 6   Item 7   Item 8   Item 9   Item 10   Mean   SD
1a 100 90 100 100 90 80 80 80 70 70 86.00 11.14
1b 90 90 90 90 90 90 80 70 70 60 82.00 10.77
2a 100 100 100 90 90 90 80 80 80 70 88.00 9.80
2b 90 100 90 100 90 90 80 80 70 70 86.00 10.20
3a 90 100 90 90 90 80 90 70 80 80 86.00 8.00
3b 100 100 100 90 80 90 80 70 80 80 87.00 10.05
4a 100 90 100 90 90 80 80 80 80 70 86.00 9.17
4b 90 90 100 90 100 80 80 70 70 70 84.00 11.14
5a 90 100 90 100 100 90 90 80 80 80 90.00 7.75
5b 90 90 90 100 90 80 80 80 70 80 85.00 8.06
6a 100 100 100 90 90 80 90 80 80 70 88.00 9.80
6b 90 90 100 80 80 80 80 90 80 80 85.00 6.71
7a 90 90 90 90 90 80 80 70 70 70 82.00 8.72
7b 90 100 100 80 80 100 90 80 80 80 88.00 8.72
8a 90 90 80 90 90 80 80 70 70 70 81.00 8.31
8b 90 80 80 80 80 70 70 80 80 70 78.00 6.00
Mean(a) 95.00 95.00 93.75 92.50 91.25 82.50 83.75 76.25 76.25 72.50 85.88 8.25
Mean(b) 91.25 92.50 93.75 88.75 86.25 85.00 80.00 77.50 75.00 73.75 84.38 7.01
Note. Total number of items on crystallized intelligence test 2 is 25. Ratings are in 10-percentage point increments.
Totals in the shaded area represent rater average and standard deviation across 10 items.

preparing or training subject-matter experts to use the Angoff method, considerable time
is required to ensure that subject-matter experts thoroughly understand and can apply
the idea of a minimally acceptable examinee.
The modified Angoff method involves subject-matter experts contributing multiple
judgments over rounds or iterations of the exercise of assigning proportions of mini-
mally acceptable examinees. Table 6.12 provides an example of results based on the first
10 items on crystallized intelligence test 2.
To interpret Table 6.12, we can examine the rater averages across trials 1 and 2
(indexed as “a” and “b”). For example, using the trial 1 ratings, we observe a recommended
average passing percentage of 85.88 across all raters. A passing percentage of 85.88 corresponds
to approximately 8.6 of the 10 items answered correctly. Finally, another adaptation of the
Angoff method is available for standard setting based on constructed response-type test
items. Readers interested in this adaptation are encouraged to see Hambleton and Plake
(1995) for the methodological details.
The Angoff method proceeds by requesting subject-matter experts to assign a
probability to each item on a test that is expressed as the probability that a borderline
examinee will respond correctly to the item. If the test is composed of multiple-choice
items and a correct response yields 1 point, then the probability that an examinee will
respond correctly to an item is defined as the examinee’s expected score. By summing
the expected scores on all items on the test, one obtains the expected score for the
entire test. Using the probability correct for each item, one can find the expected score
for a borderline examinee on the total test. The subject-matter expert’s cutoff score is
determined by summing his or her judgments about the probability that a borderline
examinee will respond correctly to each item. The Angoff method is well established and
thoroughly researched. A disadvantage of the method is in not having actual pilot test
responses available to help subject-matter experts to become grounded in the practical
reality of examinees; as usual, judgments about examinee performance can be very dif-
ficult to estimate subjectively.
The Angoff method is also applicable to constructed response items with a slight
modification. To illustrate, suppose a test item is of such a form that an examinee is
required to construct a response that is subsequently scored on a score range of 1–10
points. Next, subject-matter experts are asked to estimate the average score that a group
of borderline examinees would obtain on the item. Furthermore, the score can be a non-
integer (e.g., subject-matter experts may estimate that the average score for a group of
borderline examinees is 6.5 on a scale of 1 to 10). Another subject-matter expert might
estimate the average score to be 5.5. Deriving an estimate of the cutoff score proceeds by
first summing the cutoff scores of the individual subject-matter experts and then taking
the average of the group of subject-matter experts.

6.16 The Bookmark Method

The bookmark method is used for test items scaled (scores) using item response theory
(IRT). (IRT is covered in detail in Chapter 7.) The protocol for establishing cutoff scores
using the bookmark method proceeds as follows. First, subject-matter experts are pro-
vided a booklet comprising test items that are ordered in difficulty from easy to hard. The
subject-matter expert’s task is to select the point in the progression of items where an
examinee is likely to respond correctly from a probabilistic standpoint. In the bookmark
method, the probability often used for the demarcation point where easy items shift to
hard items is .67. For example, the demarcation point establishes a set of easy items that
a borderline examinee answers correctly with a probability of at least .67. Conversely, the remain-
ing “harder” group of items would be answered correctly with a probability of less than
.67. An advantage of IRT scoring is that item difficulty (expressed as a scale score) and
examinee ability (expressed as an ability scale score) are placed on a common scale. So,
once a bookmark point is selected, an examinee’s expected score at a cutoff point is easily
determined.
An advantage of the bookmark method is that multiple cutoff scores are able to be
set in a set of test items (e.g., gradations of proficiency level such as novice, proficient,
and advanced). Also, the method works for constructed response test items as well as for
multiple-choice items. Subject-matter experts often find that working with items ordered
by increasing difficulty makes their task more logical and manageable. Of course, all
of the test items must be scored and calibrated using IRT prior to establishing the cut-
off score. Therefore, a substantial pilot-testing phase of the items is necessary. Another
potential challenge of using this method is that subject-matter experts not familiar with
IRT will likely have difficulty understanding the relationship between the number of
items answered correctly and the cutoff score on the test. For example, if the bookmark is
placed at item 19, one may think that the 18 preceding items must be answered correctly. However, the
relationship is different in IRT, where a transformation of item difficulty and person abil-
ity occurs and as a result the raw number correct cutoff score rarely matches the number
of questions preceding the bookmark.
Chapter 6 has reviewed several established methods for establishing cutoff scores.
The information presented is a part of a basic overview of a body of work that is substan-
tial in breadth and depth. Readers desiring more information on setting cutoff scores and
standard setting more generally are encouraged to consult the book Standard Setting by
Cizek and Bunch (2006) and Cutscores: A Manual for Setting Standards of Performance on
Educational and Occupational Tests by Zieky et al. (2008).

6.17 Summary and Conclusions

This chapter presented three major areas of the test and instrument development pro-
cess: test construction, item analysis, and standard setting. The topic of test construction
includes establishing a set of guidelines that a researcher follows to sequentially guide
his or her work. Additionally, the information on test and instrument construction pro-
vided was aimed at guiding the effective production of tests and instruments that maxi-
mize differences between persons (i.e., interindividual differences). The second section
of this chapter provided details on various techniques used for item analysis with applied
examples. The utility of each item analysis technique was discussed. The third section
introduced the topic of standard setting and described the four approaches that have been
used extensively and that most closely align with the focus of this book.

Key Terms and Definitions


Biserial correlation. The relationship between performance on a test item and the total test
score. It is appropriate to use when both variables are theoretically continuous and normally
distributed but one has been artificially reduced to two discrete categories (e.g., when a cutoff
score is applied for mastery or nonmastery decisions).
Bookmark method. A technique for establishing a cutoff score for a test based on item
response theory.
Borderline. A level of skill or knowledge barely acceptable for entry into an achievement,
ability, or performance level (Zieky et al., 2008, p. 206).
Content analysis. Identification and synthesis of the substantive area that the items on the
test are targeted to measure.
Criterion. A measure that is an accepted standard against which a test is compared to
validate the use of the test scores as a predictor (Ebel & Frisbie, 1991, p. 106).

Criterion-referenced. A form of test score interpretation that compares one person’s
score with scores each of which represents a distinct level of performance in some
specific content area, or with respect to a behavioral task (Ebel & Frisbie, 1991,
p. 34).
Cutoff score. A point on a score scale at or above which examinees are classified in one
way and below which they are classified in another way (Zieky et al., 2008, p. 206).
Domain of content. The information contained in test items as specified in the domain
outline. The effectiveness of the domain of content is based on how well the test item
writer translates the task descriptions into the items.
Item format. The type of question used to elicit a response from an examinee. For example,
a test question may be in the form of multiple choice, constructed response, or essay.
Item reliability index. A statistic designed to provide an indication of a test’s internal
consistency as reflected at the level of an individual item. For example, the higher the
item reliability index, the higher the reliability of scores on the total test.
Item validity index. A statistic indicating the degree to which a test measures what it
purports to measure as reflected at the level of an individual item. For example, the
higher the item validity index, the higher the validity of scores on the total test.
Nonprobability sampling. Type of sampling design where the researcher knows what
attributes are correlated with key survey statistics and successfully balances the sam-
ple on those attributes.
Normative population. A population or group upon which normative scores are pro-
duced. Normative scores or norms can be used to compare an examinee’s score to
that of a well-defined group.
Normative scores. Test score statistics for a specific group of examinees.

Norm-referenced. A method of deriving meaning from test scores by evaluating an
examinee’s score and comparing it to scores from a group of examinees. Test scores
are understood and interpreted relative to other scores on the same test.
Phi coefficient. A measure of strength of linear dependence between two variables, X
and Y, used when one needs to construct a test for mastery decisions. The Phi coef-
ficient can be used to estimate the discriminating power of an item related to an
outcome that is also dichotomous (e.g., pass or fail).
Point–biserial correlation. Used to estimate the relationship between a test item scored
1 (correct) or 0 (incorrect) and the total test score.
Probability sampling. Probability samples are those where every element (i.e., person)
has a nonzero chance of selection and the elements are selected through a random
process. Each element (person) must have at least some chance of selection, although
the chance is not required to be equal.
Proportionally stratified sampling. Subgroups within a defined population are identi-
fied as differing on a characteristic relevant to a researcher or test developer’s goal.
Using a proportionally stratified sampling approach provides a way to properly
account for these characteristics that differ among population constituents, thereby
preventing systematic bias in the resulting test scores.
Sampling. The selection of elements, following prescribed rules from a defined popula-
tion. In test development, the sample elements are the examinees or persons taking
the test (or responding to items on an instrument).
Score validity. A judgment regarding how well test scores measure what they purport to
measure. Score validity affects the appropriateness of the inferences made and any
actions taken.
Standard setting. The practice of establishing a cutoff score.
Standards-referenced method. A modified version of the criterion-referenced method
used in high-stakes educational achievement testing. The standards-referenced method
is primarily based on the criterion-referenced method (e.g., examinees must possess
a certain level of knowledge and skill prior to matriculating to the next grade).
Stratified random sampling. Every member in the stratum of interest (e.g., the demo-
graphic characteristics) has an equal chance of being selected (and is proportionally
represented) in the sampling process.
Subject-matter expert. A person who makes decisions about establishing a cutoff score
for a particular test within the context of a cutoff score study.
Table of specifications. A two-way grid used to outline the content coverage of a test.
Also known as a test blueprint.
Tetrachoric correlation. Useful in the test construction process when a researcher wants
to create artificial dichotomies from a variable (item) that is assumed to be normally
distributed (e.g., perhaps from a previously developed theory verified by empirical
research). This correlation has proven highly useful for factor analyzing a set of
dichotomously scored test items that are known to represent an underlying construct
normally distributed in the population.
7

Reliability

This chapter introduces reliability—a topic that is broad and has important implications
for any research endeavor. In this chapter, the classical true score model is introduced,
providing the foundation for the conceptual and mathematical underpinnings of reliability.
After the foundations of reliability are presented, several approaches to the estimation of
reliability are provided. Throughout the chapter, theory is linked to practical application.

7.1 Introduction

Broadly speaking, the term reliability refers to the degree to which scores on tests or other
instruments are free of errors of measurement. The degree to which scores are free from
errors of measurement dictates their level of consistency or reliability. Reliability of mea-
surement is a fundamental issue in any research endeavor because some form of mea-
surement is used to acquire data. The process of data acquisition involves the issues of
measurement precision (or imprecision) and the manner by which it is reported in rela-
tion to test scores. As you will see, reliability estimation is directly related to measurement
precision or imprecision (i.e., error of measurement). Estimating the reliability of scores
according to the classical true score model involves certain assumptions about a person’s
observed, true, and error scores. This chapter introduces the topic of reliability in light of
the assumptions of the true score model, how it is conceptualized, requisite assumptions
about true and error scores, and how various coefficients of reliability are derived.
Two issues central to reliability are (1) the consistency or degree of similarity of
at least two scores on a set of test items and (2) the stability of at least two scores on a
set of test items over time. Different methods of estimating reliability are based on spe-
cific assumptions about true and error scores and, therefore, address different sources
of error. The assumptions explicitly made regarding true and error scores are integral to
correctly reporting and interpreting score reliability. Although the term reliability is used
in a general sense in many instances, reliability is clearly a property of scores rather than
measurement instruments or tests. It is the consistency or stability of scores that provides
evidence of reliability when using a test or instrument in a particular context or setting.
This chapter is organized as follows. First, a conceptual overview of reliability is
presented followed by an introduction to the classical true score model—a model that
serves as the foundation for classical test theory. Next, several methods commonly used
to estimate reliability are presented using the classical test theory approach. Specifically, we
present three approaches to estimating reliability: (1) the test–retest method for estimating
the stability of scores over time, (2) the internal consistency method based on the model of
randomly parallel tests, and (3) the split-half method—also related to the model of paral-
lel tests. A subset of the dataset introduced in Chapter 2 that includes three components
of the theory of generalized intelligence—fluid (Gf ), crystallized (Gc), and short-term
memory (Gsm)—is used throughout the chapter in most examples. As a reminder, the
dataset used throughout this chapter includes a randomly generated set of item responses
based on a sample size N = 1,000 persons. For convenience, the data file is available in
SPSS (GfGc.sav), SAS (GfGc.sd7), or delimited file (GfGc.dat) formats and is download-
able from the companion website (www.guilford.com/price2-materials).

7.2 Conceptual Overview

As noted earlier, measurement precision is a critical component of reliability. For example,
a useful way to envision the concept of reliability is to determine how free a set of
scores is from measurement error. How one evaluates (or estimates) the degree of mea-
surement error in a set of scores is a primary focus of this chapter and is foundational
to understanding the various approaches to the estimation of reliability. Reliability is
perhaps most concretely illustrated in fields such as chemistry, physics, or engineering.
For example, measurements acquired in traditional laboratory settings are often acquired
within the context of well-defined conditions, with precisely calibrated instrumentation,
where the object of the measurement physically exists (i.e., directly observable and mea-
surable physical properties). Consider two examples from chemistry: (1) measurement
such as the volume of a gas in a rigid container at an exact temperature and (2) the pre-
cise amount of heat required to produce a chemical reaction. In the first example, say that
a researcher measures the volume of gas in a rigid container on 10 different occasions.
In summarizing the 10 measurements, one would expect a high degree of consistency,
although there will be some random error variability in the numerical values acquired
from the measurement due to fluctuations in instrumentation (e.g., calibration issues or
noise introduced through the instruments used for the data collection). When research is
conducted with human subjects, random error may occur due to distractions, guessing,
content sampling, or intermittent changes in a person’s mental state (see Table 7.1).
Another type of error is called systematic or constant error of measurement (Gulliksen,
1950b; 1987, p. 6). For example, systematic error occurs when all test scores are

Table 7.1.  General and Specific Origins of Test Score Variance Attributable
to Persons
General: Enduring traits or attributes
1. Skill in an area tested such as reading, mathematics, science
2. Test-taking ability such as careful attention to and comprehension of instructions
3. Ability to respond to topics or tasks presented in the items on the test
4. Self-confidence manifested as positive attitude toward testing as a way to measure ability, achieve-
ment, or performance

Specific: Enduring traits or attributes


1. Requisite knowledge and skill specific to the area or content being measured or tested
2. Emotional reactivity to a certain type of test item or question (e.g., the content of the item includes
a topic that elicits an emotional reaction)
3. Attitude toward the content or information included on the test
4. Self-confidence manifested as positive attitude toward testing as a way to measure ability, achieve-
ment, or performance

General: Limited or fluctuating


1. Test-taking anxiety
2. Test preparation (e.g., amount and quality of practice specific to the content of items on the test)
3. Impact of test-taking environment (e.g., comfort, temperature, noise)
4. Current attitude toward the test and testing enterprise
5. Current state of physical health and level of mental/physical fatigue
6. Motivation to participate in the testing occasion
7. Relationship with person(s) administering the test

Specific: Limited or fluctuating


1. Momentary changes in memory specific to factual information
2. Test preparation (e.g., amount and quality of practice specific to the content of items on the test)
3. Guessing correct answers to items on the test
4. Momentary shift in emotion triggered by information included on test item
5. Momentary shifts in attention or judgment
Note. Based on Cronbach (1970).

excessively high or low. In the physical sciences, consider the process of measuring the
precise amount of heat required to produce a chemical reaction. Such a reaction may be
affected systematically by an improperly calibrated thermometer being used to measure
the temperature—resulting in a systematic shift in temperature by the amount or degree
of calibration error. In the case of research conducted with human subjects, systematic
error may occur owing to characteristics of the person, the test, or both. For example,
in some situations persons’ test scores may vary in a systematic way that yields a consis-
tently lower or higher score over repeated test administrations. With regard to the crys-
tallized intelligence dataset used in the examples throughout this book, suppose that all
of the subtests on the total test were developed for a native English-speaking population.
Further suppose that a non-native English-speaking person responds to all questions on
the subtests. The person’s scores over repeated testing occasions will likely be consistently
lower (due to the language component) than their true or actual level of intellectual abil-
ity because English is not the respondents’ first or primary language. However, systematic
error is not part of the theoretical assumptions of the true score model—only random error is.
Therefore, systematic errors are not regarded as affecting the reliability of scores; rather,
they are a source of construct-related variance (an issue related to validity).
The example with non-native English speaking persons introduces one aspect of an
important topic in psychometrics and/or test theory known as validity (i.e., the test not
being used with the population for which it was developed). Evidence of test validity is related
to reliability such that reliability is a necessary but not sufficient condition to establish the valid-
ity of scores on a test. The validity example is important because errors of measurement place
limitations on the validity of a test. Furthermore, even if no measurement error existed,
complete absence of measurement error in no way guarantees the validity of test scores.
Validity, a comprehensive topic, is covered in Chapters 3 and 4 of this text. Table 7.1 pro-
vides examples of sources of error variability that may affect the reliability of scores (either
randomly or systematically) when conducting research in social and/or behavioral science.

7.3 The True Score Model

In 1904, Charles Spearman proposed a model-based framework of test theory known
as the true score model. For approximately a century, Spearman’s true score model has
largely dominated approaches to the estimation of reliability. This model rests on the
assumption that test scores represent fallible (i.e., less than perfectly objective or accu-
rate) measurements of human traits or attributes. Because perfect measurement can
never occur, observed scores always contain some error. Based on the idea that measure-
ments are fallible, Spearman (1904, 1907) posited that the observed correlation between
such fallible scores is lower than would be observed if one were able to use true objective
values. Over the past century, the true score model has been revised and/or expanded
with formal, comprehensive treatments published by Harold Gulliksen (1950b, 1987) in
The Theory of Mental Tests and Frederic Lord and Melvin Novick (1968) in their seminal
text Statistical Theories of Mental Test Scores. The true score model for a person is pro-
vided in Equation 7.1 (Lord & Novick, 1968, p. 56).

Equation 7.1. True score model

Xi = Ti + Ei

• Xi = observed fallible score for person i.


• Ti = true score for person i.
• Ei = error score for person i.

Although Equation 7.1 makes intuitive sense and has proven remarkably useful his-
torically, six assumptions are necessary in order for the equation to become practical
for use. Before introducing the assumptions of the true score model, some connections
between probability theory, true scores, and random variables are reviewed in the next
section (see the Appendix for comprehensive information on probability theory and ran-
dom variables).

7.4 Probability Theory, True Score Model, and Random Variables

Random variables are associated with a set of probabilities (see the Appendix). In the true
score model, test scores are random variables and, therefore, can take on a hypothetical
set of outcomes. The set of outcomes is expressed as a probability (i.e., expressed as a fre-
quency) distribution as illustrated in Table 7.2. For example, when a person takes a test,
the score he or she receives is considered a random variable (expressed in uppercase let-
ter X in Equation 7.1). The one time or single occasion a person takes the test, he or she
receives a score, and this score is one sample from a hypothetical distribution of possible out-
comes. Table 7.2 illustrates probability distributions based on a hypothetical set of scores
for three people. In the distribution of scores in Table 7.2, we assume that the same per-
son has taken the same test repeatedly and that each testing occasion is an independent

Table 7.2.  Probability of Obtaining a Particular Score on a 25-Item Test of Crystallized Intelligence on a Single Testing Occasion
Raw score (X)   Person A p(X)   Person B p(X)   Person C p(X)
4 0.01 0.04 0.00
5 0.01 0.05 0.00
6 0.02 0.10 0.00
7 0.05 0.28 0.02
8 0.06 0.45 0.03
11 0.08 0.08 0.12
13 0.40 0.00 0.13
14 0.23 0.00 0.18
15 0.10 0.00 0.40
17 0.02 0.00 0.07
18 0.02 0.00 0.04
20 0.00 0.00 0.01
Σ X·p(X)   12.54   7.45   14.02
Note. Each person has a unique score distribution independently determined for a
single person. The frequency distribution of scores in the table is not based on any
actual dataset used throughout this text; rather, it is only provided as an example.

event. The result is a distribution of scores for each person with an associated probability.
The probabilities expressed in Table 7.2 are synonymous with the relative frequency for
a score based on the repeated testing occasions. The implication of Table 7.2 for the true
score model or classical test theory is that the mean (or expectation) of the hypothetical
observed score distribution for a person based on an infinitely repeated number of independent
trials represents his or her true score within the classical true score model.
To clarify the role of the person-specific probability distribution, consider the follow-
ing example in Table 7.2. Summing each raw score (expressed as a random variable)
multiplied by the probability of obtaining that score demonstrates that person C
appears to possess the highest level of
crystallized intelligence for the 25-item test. Furthermore, by Equation 7.6, person C’s
true score is 14.02. Notice that for person C the probability (i.e., expressed as the rela-
tive frequency) of scoring a 15 is .40—higher than the other two persons. Person A has
a probability of .40 scoring a 13. Person B has a probability of .45 scoring an 8. Clearly,
person C’s probability distribution is weighted more heavily toward the high end of the
score scale than person A or B.
Although a person’s true score is an essential component of the true score model,
true score is only a hypothetical entity owing to the implausibility of conducting an infi-
nite number of independent testing occasions. True score is expressed as the expectation
of a person’s observed score over repeated independent testing occasions. Therefore, the
score for each person taking the test represents a different random variable regarding his or
her person-specific probability distribution (e.g., Table 7.2). The result is that such per-
sons have their own probability distribution—one that is specific to their hypothetical
distribution of observed scores (i.e., each person has an associated score frequency or
probability given their score on a test). In actual testing situations, the interest is usually
in studying individual differences among people (i.e., measurements over people rather
than on a single person). The true score model can be extended to accommodate the
study of individual differences by administering a test to a random sample of persons
from a population. Ideally, this process could be repeated an infinite number of times
(under standardized testing conditions), resulting in an observed score random variable
taking on specific values of score X. In the context described here, the error variance
over persons can be shown to be equal to the average, over persons (group-level), of
the error variance within persons (hypothetical repeated testing occasions for a single
person; Lord & Novick, 1968, p. 35). Formally, this is illustrated in Equation 7.5 in the
next section.
In the Appendix, equations for the expectation (i.e., the mean) of continuous and
discrete random variables are introduced along with examples. In the true score model,
total test scores for persons are called composite scores. Formally, such composite scores
are defined as the sum of responses (response to an item as a discrete number) to individ-
ual items. At this point, readers are encouraged to review the relevant parts of Chapter­ 2
and the Appendix before proceeding through this chapter; this will reinforce key founda-
tional information essential to understanding the true score model and reliability estima-
tion. Next, we turn to a presentation of the assumptions of the true score model.

7.5 Properties and Assumptions of the True Score Model

In the true score model, the human traits or attributes being measured are assumed to
remain constant regardless of the number of times they are measured. Imagine for a
moment that a single person is tested an infinite number of times repeatedly. For exam-
ple, say Equation 7.1 is repeated infinitely for one person and the person’s true state of
knowledge about the construct remains unchanged (i.e., is constant). This scenario is
illustrated in Figure 7.1.
Table 7.3 illustrates observed, true, and error scores for 10 individuals. Given this
scenario, the person’s observed score would fluctuate owing to random measurement
error. The hypothetical trait or attribute that remains constant and that observed score
fluctuates about is represented as a person’s true score or T. Because of random error
during the measurement process, a person’s observed score X fluctuates over repeated
trials or measurement occasions. The result of random error is that differences between
a person’s observed score and true score will fluctuate in a way that some are positive

observed True Score Ti


Parallel test score
xi1
error
score
1
ei1
xi2
2
xi3 ei2

3
xi4 ei3

4
ei4 xi5
5
ei5
6
xi6 ei6
7
xi8 ei7
xi7

ei8

–5 –4 –3 –2 –1 0 1 2 3 4 5

µerror = 0

Figure 7.1.  True score for a person. Adapted from Magnusson (1967, p. 63). Copyright 1967.
Reprinted by permission of Pearson Education, Inc. New York, New York.

Table 7.3.  Crystallized Intelligence Test Observed, True, and Error Scores
for 10 Persons
Person (i) Observed score (X) True score (T) Error score (E)
A 12.00 = 13.00 + –1.00
B 14.50 = 12.00 + 2.50
C 9.50 = 11.00 + –1.50
D 8.50 = 10.00 + –1.50
E 11.50 = 9.00 + 2.50
F 7.00 = 8.00 + –1.00
G 17.00 = 17.25 + –0.25
H 17.00 = 16.75 + 0.25
I 10.00 = 9.00 + 1.00
J 8.00 = 9.00 + –1.00
Mean 11.50   11.50   0.00
Standard deviation 3.43   3.11   1.45
Variance 11.75   9.66   2.11
Sum of cross products 96.50        
Covariance 9.65        
Note. Correlation of observed scores with true scores = .91. Correlation of observed scores with error scores = .42.
Correlation of true scores with error scores = 0. True score values are arbitrarily assigned for purposes of illustration.
Variance is population formula and is calculated using N. Partial credit is possible on test items. Covariance is the
average of the cross products of observed and true deviation scores.

and some are negative. Over an infinite number of testing occasions, the positive and nega-
tive errors cancel in a symmetric fashion, yielding an observed score equaling true score for a
person (see Equations 7.5 and 7.6).
Notice that in Table 7.4, all of the components are in place to evaluate the reliability
of scores based on errors of measurement.
In the situation where score changes or shifts occur systematically, the difference
between observed and true scores will be either systematically higher or lower by the fac-
tor of some constant value. For example, all test takers may score consistently lower on a
test because the examinees are non-English speakers, yet the test items were written and/
or developed for native English-speaking persons. Technically, such systematic influences
on test scores are not classified as error in the true score model (only random error is assumed
by the model). The error of measurement for a person in the true score model is illustrated
in Equation 7.2. Alternatively, in Figure 7.2, the relationship between observed and true

Table 7.4.  Correlations among Observed, True, and Error Scores for 10 Persons
1 2 3
1. Observed 1 0.91 0.42
2. True 1 0.00
3. Error 1

Note. rTE = 0.0; rOE = .42; rOT = .91; rXX = .82 (which is the reliability coefficient expressed
as the square of rOT = .91); r²OE = .18. The correlation between true and error scores is
actually .003 in the above example.

Equation 7.2. Error of measurement in the true score model for person i

Ei = Xi – Ti

• Ei = error score for person i.
• Xi = observed score for person i.
• Ti = true score for person i.

scores is expressed as the regression of true score on observed score (e.g., the correlation
between true and observed score is .91 and .91² = .82, the reliability coefficient).
Next, in Equation 7.3, the mean of the distribution of error is expressed as the
expected difference between the observed score and true score for a person over infinitely
repeated testing occasions (e.g., as in Table 7.3).
Because X and T are equal in expectation in the true score model (inasmuch as the mean of a
person’s observed score distribution over infinite occasions equals his or her true score), the
mean error over repeated testing occasions is also zero (Table 7.3; Figure 7.1; Equation 7.4;

Figure 7.2.  Regression line and scatterplot of true and observed scores for data in Table 7.3.

Equation 7.3. Mean error score for person i as expectation of the difference between observed score and true score

μEi = ℰ(Ei) = ℰ(Xi – Ti)

• μEi = mean error score for person i.
• ℰ = expectation operator.
• Ei = error score for person i.
• Xi = observed score for person i.
• Ti = true score for person i.

Equation 7.4. The expectation of random variable E for person i

ℰ(Ei) = 0

• ℰ = expectation operator.
• ℰ(Ei) = expected value of random variable Ei over an indefinite number of repeated trials.

Lord & Novick, 1968, p. 36; Crocker & Algina, 1986, p. 111). Also, since the error com-
ponent is random, then from classical probability theory (e.g., Rudas, 2008), the mean
error over repeated trials equals zero (Figure 7.1). Accordingly, the first assumption in
the true score model is that the mean error of measurement over repeated trials or test-
ing occasions equals zero (Equation 7.4). The preceding statement is true for (a) an infinite
number of persons taking the same test—regardless of their true score and (b) for a single
person’s error scores on an infinite number of parallel repeated testing occasions.

Assumption 1: The expectation (population mean) error for person i over an infinite
number of trials or testing occasions on the same test is zero.

Extension to the Group Level


The expectation (mean) error for a population of persons (i.e., represented at the group
level) over an infinite number of trials or testing occasions is zero. Equation 7.5 includes
the double expectation operator to illustrate that the error variance over persons can be
shown to be equal to the average over persons in a group of the error variance within per-
sons (Lord & Novick, 1968, pp. 34–37). Here, the group notation is denoted by subscript
j as presented in Crocker and Algina (1986, p. 111).

Equation 7.5. Mean error score for a population of persons

μE = ℰj ℰ(EXj)

and

μE = ℰj(0) = 0

• μE = mean error for a population or group of persons.
• ℰj ℰ = double expectation operator reflecting that the error variance over persons is equal to the average error variance within persons.
• ℰj = expectation for population or group j.
• ℰ(EXj) = expectation taken over the error scores of persons in group j.

A main caveat regarding Equation 7.5 is that for a random sample of persons from a
population, the average error may not actually be zero. The discrepancy between true score
theory and applied testing settings may be due to sampling error or other sources of error.
Also, in the true score model, one is hypothetically drawing a random sample of error
scores from each person in the sample of examinees. The expected value or population
mean of these errors may or may not be realized as zero.

Assumption 2: True score for person i is equal to the expectation (mean) of their
observed scores over infinite repeated trials or testing occasions (Equation 7.6;
Table 7.2).

Equation 7.6. True score for person i as expectation of mean observed score

Ti = ℰ(Xi) = μXi

• Ti = true score for person i.
• ℰ = expectation operator.
• Xi = observed score for person i.
• μXi = mean of observed score X for subject i over independent trials.

The fact that a person’s true score remains constant, yet unknown, over repeated
testing occasions makes using Equation 7.1 for the estimation of reliability with empiri-
cal data intractable because without knowing a person’s true score, deriving errors of
measurement is impossible. To overcome the inability of knowing a person’s true score,
items comprising a test are viewed as different parallel parts of a test, enabling estimation of
the reliability coefficient. Given that items serve as parallel components on a test, reli-
ability estimation proceeds in one of two ways. First, the estimation of reliability can
proceed by evaluating the internal consistency of scores by using a sample of persons
tested once, with test items serving as component pieces (each item being a “micro test”)
within the overall composite or total test score. Second, the estimation of reliability can
proceed by deriving the stability of scores as the correlation coefficient for a sample of
persons tested twice with the same instrument or on a parallel form of a test. Later in this
chapter, several methods for estimating the reliability of scores are presented based on the
true score model—all of which are based on the assumption of parallel tests.

Extension to the Group Level


True score for a group of persons is equal to the expectation (mean) of their observed
scores over infinite repeated trials or testing occasions (Equation 7.7; Lord & Novick,
1968, p. 37; Gulliksen, 1950b, p. 29; Crocker & Algina, 1986, p. 111).
At this point, the properties of true and error scores within the true score model can
be summarized as follows: (1) the mean of the error scores in a population or group of
persons equals zero and (2) the expected population or group mean of observed scores
equals the mean of true scores. We now turn to Assumption 3.

Assumption 3: In the true score model, the correlation between true and error
scores on a test in a population of persons equals zero (Equation 7.8; Table 7.4;
Figure 7.3).

Equation 7.7. True score as expectation of mean observed score for group j

Tj = ℰ(Xj) = μXj

• Tj = true score for a group j.
• ℰ = expectation operator.
• Xj = observed score for a group j.
• μXj = mean of observed score X for group j over independent trials.

Equation 7.8. Correlation between true and error scores in the true score model

ρTE = 0

• ρTE = correlation between true and error scores in a population.

A consequence of the absence of correlation between true and error scores (Assump-
tion 3, Equation 7.8) is that deriving the observed score variance is accomplished by
summing true score variance and error variance (as linear components in Equation 7.9).
This assumption implies that persons with low or high true scores do not exhibit system-
atically high or low errors of measurement because errors are randomly distributed (as in
Figure 7.3). To illustrate the relationships between true and error scores, we return to the
data in Table 7.3. In Table 7.4, we see that the correlation between true and error scores is
zero (readers should calculate this for themselves by entering the data into SPSS or Excel

Figure 7.3.  Correlation of true score with error score from data in Table 7.3.

Equation 7.9. Observed score variance as the sum of true score and error score variance

σ²X = σ²T + σ²E

• σ²X = observed score variance.
• σ²T = true score variance.
• σ²E = error score variance.

and conducting a correlation analysis). Next, because true score and error scores are
uncorrelated, observed score variance is simply the sum of true and error score variance.
To verify this statement, return to Table 7.3 and add the variance of true scores (9.66) to
the variance of error scores (2.11) and you will see that the result is 11.75—the observed
score variance. Formally, the additive, linear nature of observed score variance in the true
score model is illustrated in Equation 7.9.

Assumption 4: When an independent random sample of persons from a population
takes two separate tests that are parallel in structure and content, the correlation
between the error scores on the two tests is zero (Equation 7.10; Lord & Novick,
1968, pp. 47–49; Crocker & Algina, 1986, p. 111).

Equation 7.10. Correlation between two sets of random error scores from two tests in the true score model

$\rho_{E_1E_2} = 0$

• $\rho_{E_1E_2}$ = population correlation between random errors of measurement for test 1 and parallel test 2.

Intuitively, Assumption 4 should be clear to readers at this point based on the presentation thus far: random error scores bear no systematic relationship to one another, so the correlation between errors of measurement on two parallel tests is zero.

Assumption 5: Error scores on one test are uncorrelated with true scores on another
test (Equation 7.11). For example, the error component on one intelligence test is not
correlated with true score on a second, different test of intelligence.

Equation 7.11. Correlation between error scores on test 1 and true scores on test 2

$\rho_{E_1T_2} = 0$

• $\rho_{E_1T_2}$ = population correlation between error scores on test 1 and true scores on test 2 (equal to zero).

Assumption 6: Two tests are exactly parallel if, for every population, their true
scores and error scores are equal (Lord & Novick, 1968; Equation 7.12). Further, all
items on a test are assumed to measure a single construct. This assumption of measur-
ing a single construct is called unidimensionality and is covered in greater detail in
Chapters 8 and 9 on factor analysis and item response theory. If two tests meet the
assumptions of parallelism, they should be correlated with other external or criterion-
related test scores that are parallel based on the content of the test. The parallel tests
assumption is difficult to meet in practical testing situations because in order for the
assumption to be tenable, the testing conditions that contribute to error variability pre-
sented in Table 7.1 (e.g., fatigue, environment, etc.) must vary in the same manner in
each of the testing scenarios. Also, part of Assumption 6 is that every population of persons will exhibit equal observed score means and equal observed score variances on parallel tests (the variance expressing the degree of measurement precision, that is, how close scores are to one another, and thereby serving as an index of error).

Equation 7.12. Definition of parallel tests

$X_1 = T + E_1$

$X_2 = T + E_2$

$\sigma^2_{E_1} = \sigma^2_{E_2}$

• $X_1$ = observed score on test 1.
• $X_2$ = observed score on test 2.
• T = true score (assumed equal on both tests).
• $\sigma^2_{E_1}$ = error variance of test 1.
• $\sigma^2_{E_2}$ = error variance of test 2.

As previously stated, the model of parallel tests is important because it allows the
true score model to become functional with empirical data. In fact, without the model
of parallel tests, the true score model would be only theoretical because true scores are
not actually measurable. Also, without knowing true scores, calculation of error scores
would not be possible, making the model ineffective in empirical settings. To illustrate the
importance of the model of parallel tests relative to its role in estimating the coefficient
of reliability, consider Equations 7.13 and 7.14 (Crocker & Algina, 1986, pp. 115–116).

Equation 7.13. Deviation score formula for the correlation between parallel tests 1 and 2

$\rho_{X_1X_2} = \dfrac{\sum x_1 x_2}{N s_{X_1} s_{X_2}}$

• $\rho_{X_1X_2}$ = correlation between scores on two parallel tests.
• $x_1$ = observed deviation score on test 1.
• $x_2$ = observed deviation score on test 2.
• $s_{X_1}$ = observed standard deviation on test 1.
• $s_{X_2}$ = observed standard deviation on test 2.
• N = sample size.

Equation 7.14. Deviation score formula for the correlation between parallel tests 1 and 2, with substitution of portions of Equation 7.12 in the numerator

$\rho_{X_1X_2} = \dfrac{\sum (t_1 + e_1)(t_2 + e_2)}{N s_{X_1} s_{X_2}} = \dfrac{\sigma^2_T}{\sigma^2_X}$

• $\rho_{X_1X_2}$ = coefficient of reliability expressed as the correlation between parallel tests.
• $t_1$ = true score on test 1 in deviation score form.
• $t_2$ = true score on test 2 in deviation score form.
• $x_1$ = observed score on test 1 in deviation score form.
• $x_2$ = observed score on test 2 in deviation score form.
• $s_{X_1}$ = observed score standard deviation on test 1.
• $s_{X_2}$ = observed score standard deviation on test 2.
• N = sample size.
• $\dfrac{\sigma^2_T}{\sigma^2_X}$ = the coefficient of reliability expressed as the ratio of true score variance to observed score variance.

The first two lines in Equation 7.12 can be substituted into the numerator of
Equation 7.13 yielding an expanded numerator in Equation 7.14. Notice in Equation
7.14 that x, t, and e are now lowercase letters in the numerator. The lowercase let-
ters represent deviation scores (as opposed to raw test scores). A deviation score is
defined as follows: $x = X - \bar{X}$; $t = T - \bar{T}$; $e = E - \bar{E}$; that is, raw scores are subtracted from their respective means.
The final bullet point in Equation 7.14, the coefficient of reliability expressed as the
ratio of true score variance to observed score variance, is the most common definition of
reliability in the true score model.

7.6 True Score Equivalence, Essential True Score Equivalence, and Congeneric Tests

Returning to the example data in Table 7.3, notice that the assumption of exactly parallel
tests is not met because, although the true and observed score means are equivalent, their
standard deviations (and therefore variances) are different. This variation on the model
of parallel tests is called tau-equivalence, meaning that only the true (i.e., tau) scores are
equal (Lord & Novick, 1968, pp. 47–50). Essential tau-equivalence (Lord & Novick,
1968, pp. 47–50) is expressed by further relaxing the assumptions of tau-equivalence,
thereby allowing true scores to differ by an additive constant (Lord & Novick, 1968;
Miller, 1995). Including an additive constant in no way affects score reliability since
the reliability coefficient is estimated using the covariance components of scores and is
expressed in terms of the ratio of true to observed score variance (or as the amount of
variance explained as depicted in Figure 7.1).
Finally, the assumption of congeneric tests (Lord & Novick, 1968, pp. 47–50;
Raykov, 1997, 1998) is the least restrictive variation on the model of parallel tests because
the only requirement is that true scores be perfectly correlated on tests that are designed
to measure the same construct. The congeneric model also allows for an additive and/or a multiplicative constant between each pair of item-level true scores, so the model is appropriate for estimating reliability in datasets with unequal means and variances. Table 7.5 summarizes variations on the assumptions of parallel tests within the
classical true score model.

7.7 Relationship between Observed and True Scores

To illustrate the relationship among observed, true, and error scores, we return to using
deviation scores based on a group of persons—a metric that is convenient for deriving
the covariance (i.e., the unstandardized correlation presented in Chapter 2) among these
score components. Recall that in Equation 7.1 the definition of observed score is the
sum of the true score and error score. Alternatively, Equation 7.15 illustrates the same

Table 7.5.  Four Measurement Models of Reliability Theory

Model assumption                                              Parallel   Tau-equivalent   Essentially tau-   Congeneric
                                                              tests      tests            equivalent tests   tests (a)
1. Equal expected observed scores                                X           X                 —                 —
2. Equal standard deviations (variances) of expected
   observed scores                                               X           —                 —                 —
3. Equal covariance components for expected observed
   scores for any set of parallel tests or for any single
   parallel test and another test of a different construct      X           X                 X                 —
4. Equal coefficients of covariance or correlation              X           —                 —                 —
5. Equal coefficients of reliability                            X           —                 —                 —

Note. Due to the axioms of classical test theory, expected observed scores equal true scores.
(a) In congeneric tests, there is no mathematically unique solution to the estimation of a reliability coefficient; thus only a lower bound should be reported.

Equation 7.15. Observed score, true score, and error score in deviation score units

$x = t + e$

• x = observed score on a test, derived as a raw score minus the mean of the group of observed scores.
• t = true score on a test, derived as a true score minus the mean of the group of true scores.
• e = error score, derived as an error score minus the mean of the group of error scores.

elements in Equation 7.1 as deviation scores. In the previous section, a deviation score
was defined as X - X ; T - T; E - E ; where raw scores are subtracted from their respective
means. An advantage of working through calculations in deviation score units is that the
derivation includes the standard deviations of observed, true, and error scores—elements
required for deriving the covariance among the score components. The covariance is
expressed as the sum of the products of the observed and true deviation scores divided by the sample size (N). For the data in Table 7.3, the covariance is 9.65: $\mathrm{COV}_{OT} = \left[\sum (X_O - \bar{X}_O)(X_T - \bar{X}_T)\right] / N$ (as an exercise, you should use the data in Table 7.3 and apply it to this equation to derive the covariance between true and observed scores). Notice that in

Equation 7.14 the covariance is incorporated into the derivation of the reliability index by
including the standard deviations of observed and true scores in the denominator.
Next, recall that the true score model is based on a linear equation that yields a com-
posite score for a person. By extension and analogy, a composite score is also expressed as
the sum of the responses to individual test items (e.g., each test item is a micro-level test).
Working with the covariance components of total or composite scores (e.g., observed,
true, and error components) provides a unified or connecting framework for illustrating
how the true score model works regarding the estimation of reliability with individual and
group-level scores in the true score model and classical test theory.

7.8 The Reliability Index and Its Relationship to the Reliability Coefficient

The reliability index (Equation 7.16; Crocker & Algina, 1986, pp. 114–115; Kelley,
1927; Lord & Novick, 1968) is defined as the correlation between observed scores
and true scores. From the example data in Table 7.4 we see that this value is .91.
The square of the reliability index (.91) is .82—the coefficient of reliability (see
Table 7.4). Equation 7.16 illustrates the calculation of the reliability index working
with deviation scores. Readers can insert the score data from Table 7.3 into Equation
7.16, then work through the steps and compare the results reported in Table 7.4 pre-
sented earlier.
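As a quick numerical check of Equation 7.16, a minimal Python sketch using the rounded summary values reported for Table 7.3 is shown below (the variable names are illustrative).

import math

var_true = 9.66       # true score variance from Table 7.3
var_observed = 11.75  # observed score variance from Table 7.3

reliability_index = math.sqrt(var_true) / math.sqrt(var_observed)
print(round(reliability_index, 2))       # approximately .91
print(round(reliability_index ** 2, 2))  # approximately .82, the coefficient of reliability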

7.9 Summarizing the Ways to Conceptualize Reliability

The observed score variance $\sigma^2_X$ can be expressed as the sum of the true score variance $\sigma^2_T$ plus the random error variance $\sigma^2_E$. Computing the observed score variance as a linear sum of separate, independent components is possible because true scores are uncorrelated with error scores. Next, using the component pieces of true score variance and error score variance, the coefficient of reliability can be conceptually expressed in Equation 7.17 as the ratio of true score variance to observed score variance.
Returning to the data in Table 7.3, we can insert the variance components from the
table in Equation 7.17 to calculate the reliability coefficient. For example, the true score
variance (9.66) divided by the observed score variance (11.75) equals .82, the coefficient
of reliability (Table 7.4). The type of reliability estimation just mentioned uses the vari-
ance to express the proportion of variability in observed scores explained by true scores.
To illustrate, notice that the correlation between true scores and observed scores in Table 7.4 is .91. Next, if we square .91, a value of .82 results, or the reliability coefficient. In linear
regression terms, the reliability (.82) is expressed as the proportion of variance in true
scores explained by variance in observed scores (see Figure 7.2).

Equation 7.16. The reliability index, or the correlation between observed scores and true scores, expressed as the ratio of the standard deviation of true scores to the standard deviation of observed scores

$\rho_{XT} = \dfrac{\sum (t + e)t}{N \sigma_X \sigma_T} = \dfrac{\sum t^2 + \sum te}{N \sigma_X \sigma_T} = \dfrac{\sum t^2}{N \sigma_X \sigma_T} + \dfrac{\sum te}{N \sigma_X \sigma_T}$

The last term cancels because, by assumption, the correlation between true and error scores is zero, and since

$\sigma^2_T = \dfrac{\sum t^2}{N}$, then

$\rho_{XT} = \dfrac{\sigma^2_T}{\sigma_X \sigma_T}$, simplifying to

$\rho_{XT} = \dfrac{\sigma_T}{\sigma_X}$

• $\rho_{XT}$ = reliability index.
• $\sigma_T$ = standard deviation of true scores.
• $\sigma_X$ = standard deviation of observed scores.
• t = true score in deviation score units.
• e = error score in deviation score units.
• $\sum$ = summation operator.
• N = population size.
• $\sigma^2_T$ = variance of true scores.
• $\sum t^2$ = sum of squared true deviation scores.

Finally, $\rho^2_{XT}$ = the index of reliability squared, which is the coefficient of reliability.

Equation 7.17. Coefficient of reliability expressed as a ratio of variances

$\rho^2_{XT} = \dfrac{\sigma^2_T}{\sigma^2_X} = \dfrac{\sigma^2_T}{\sigma^2_T + \sigma^2_E}$

• $\rho^2_{XT}$ = coefficient of reliability.
• $\sigma^2_T$ = true score variance.
• $\sigma^2_X$ = observed score variance.
• $\sigma^2_E$ = error score variance.

Equation 7.17 illustrates that the squared correlation between true and observed scores is the coefficient of reliability. Yet another way to think of reliability is in terms of the lack of error variance. For example, we may think of the lack of error variability expressed as $1 - (\sigma^2_E / \sigma^2_O)$. Referring to the data in Table 7.3, this value would be 1 − .18 = .82, or the coefficient of reliability. Finally, reliability may be described as the lack of correlation between observed and error scores, or $1 - \rho^2_{OE}$, which, based on the data in Table 7.3, is .82, or the coefficient of reliability.
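These equivalent expressions can be verified with a minimal Python sketch using the rounded variance components reported for Table 7.3 (small discrepancies simply reflect rounding; the variable names are illustrative).

var_true, var_error, var_observed = 9.66, 2.11, 11.75  # Table 7.3 (rounded)

ratio_form = var_true / var_observed         # true score variance / observed score variance
error_form = 1 - (var_error / var_observed)  # 1 - (error variance / observed variance)
# The third form, 1 minus the squared observed-error correlation, is numerically the same
# quantity as error_form, because that squared correlation equals var_error / var_observed.

print(round(ratio_form, 2), round(error_form, 2))  # both approximately .82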

7.10 Reliability of a Composite

Earlier in this chapter it was stated that individual items on a test can be viewed as par-
allel components of a test. This idea is essential to understanding how reliability coeffi-
cients are estimated within the model of parallel tests in the true score model. Specifically,
test items serve as individual, yet parallel, parts of a test providing a way to estimate the
coefficient of reliability from a single test administration. Recall that a score on an indi-
vidual item is defined by a point value assigned based on a person’s response to an item
(e.g., 0 for incorrect or 1 for correct). In this sense, an item is a “micro-level” testing unit,
and an item score is analogous to a “micro-level test.” The variance of each item can be
summed to yield a total variance for all items comprising a test. Equations 7.18a and
7.18b illustrate how the variance and covariance of individual test items can be used to
derive the total variance of a test.
Based on Equation 7.18a, we see that total test variance for a composite is deter-
mined by the variance and covariance of a set of items. In Table 7.6, the total variance
is the sum of the variances for each item (1.53), plus 2 times the sum of the individual covariance values (2 × 0.54 = 1.08), equaling a total test variance of 2.61.

Equation 7.18a. Test variance based on the sum of individual items

$\sigma^2_{\text{TEST}} = \sum \sigma^2_i + 2 \sum \rho_{ik}\sigma_i\sigma_k, \quad i > k$

• $\sigma^2_{\text{TEST}}$ = variance of the total test.
• $\sigma^2_i$ = variance of an individual item.
• $\rho_{ik}$ = correlation between items i and k.
• $\sigma_i$ = standard deviation of item i.
• $\sigma_k$ = standard deviation of item k.
• $\rho_{ik}\sigma_i\sigma_k$ = covariance of items i and k; there are n(n − 1) such covariance terms in all, or n(n − 1)/2 unique pairs.
• $2\sum \rho_{ik}\sigma_i\sigma_k$ = two times (2×) the sum over the n(n − 1)/2 unique covariance terms (i > k), which equals the sum of all n(n − 1) covariance terms.

Equation 7.18b. Test variance based on the data in Table 7.6

$\sigma^2_{\text{TEST}} = 1.53 + 2(.54) = 1.53 + 1.08 = 2.61$

Table 7.6.  Variance–Covariance Matrix Based on 10 Crystallized Intelligence Test Items
Item 1 2 3 4 5 6 7 8 9 10
1 0.10 –0.01 –0.01 –0.02 –0.01 –0.02 –0.02 –0.01 –0.04 0.08
2 0.10 –0.01 –0.02 –0.01 –0.02 –0.02 –0.01 0.07 –0.03
3 0.10 0.09 –0.01 0.09 –0.02 –0.01 0.07 –0.03
4 0.18 –0.02 0.18 0.07 0.09 0.13 –0.07
5 0.10 –0.02 –0.02 –0.01 –0.04 0.08
6 0.18 0.07 0.09 0.13 –0.07
7 0.18 0.09 0.02 –0.07
8 0.10 0.07 –0.03
9 0.27 –0.13
10                   0.23
Note. Variances are in bold on the diagonal and covariance elements are off-diagonal entries. Σ variances = 1.53; Σ covariances = 0.54.
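A minimal Python sketch of Equation 7.18a follows; the 3 × 3 covariance matrix is hypothetical, and the final line simply checks the summary values reported for Table 7.6 (the function name is illustrative).

import numpy as np

def total_test_variance(cov_matrix: np.ndarray) -> float:
    """Sum of the item variances (diagonal) plus two times the unique covariances."""
    return np.trace(cov_matrix) + 2 * np.triu(cov_matrix, k=1).sum()

hypothetical_cov = np.array([
    [0.10, 0.02, 0.03],
    [0.02, 0.18, 0.04],
    [0.03, 0.04, 0.25],
])
print(round(total_test_variance(hypothetical_cov), 2))  # 0.71 = 0.53 + 2*(0.09)

# Check against the Table 7.6 summary values (sum of variances and sum of covariances):
print(round(1.53 + 2 * 0.54, 2))  # 2.61, matching Equation 7.18b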

If we replace the “items” in Table 7.6 with “total test scores” (i.e., the total score being
based on the sum of items comprising a test), the same concept and statistical details will
apply regarding how to derive the total variance for a set of total test scores. Next, we
turn to the use of total test scores that are useful as individual components for deriving
a composite score.
In the true score model, total test scores are created by summing the item response
values (i.e., score values yielding points awarded) for each person. The total score for
a test derived in this manner is one form of a composite score. Another form of com-
posite score is derived by summing total test scores for two or more tests. In this case,
a composite score is defined as the sum of individual total test scores. Returning to the
data used throughout this book, suppose that you want to create a composite score
for crystallized intelligence by summing the total scores obtained on each of the four
subtests for crystallized intelligence. The summation of the four total test scores yields
a composite score that represents crystallized intelligence. Equation 7.19 illustrates
the derivation of a composite score for crystallized intelligence (labeled CIQ). The
composite score, CIQ, represents the sum of four subtests, each representing a different
measure of crystallized intelligence.
Given that composites are based on item total scores (for a single test) or total test
scores (for a linear composite comprised of two or more tests), these composites for-
mally serve as parallel components on a test. Applying the definition of parallel test com-
ponents, reliability estimation proceeds according to the technique(s) appropriate for
accurately representing the reliability of scores given the type of study. Specifically, the
estimation of reliability may proceed by one or more of the following techniques. First,
you may derive the stability of scores using the test–retest method. Second, you may
derive the equivalence of scores based on parallel test forms. Third, you may derive the
internal consistency of scores by using a sample of persons tested once with test items

Equation 7.19. Observed score composite based on the linear sum of four crystallized intelligence tests

CIQ = X1crystallized1 + X2crystallized2 + X3crystallized3 + X4crystallized4

• CIQ = composite score expressed as the linear combination of crystallized intelligence tests 1–4.
• X1crystallized1 = total score for crystallized intelligence test 1.
• X2crystallized2 = total score for crystallized intelligence test 2.
• X3crystallized3 = total score for crystallized intelligence test 3.
• X4crystallized4 = total score for crystallized intelligence test 4.

serving as parallel pieces within the overall composite using the split-half reliability
method or by deriving the internal consistency of scores using the Küder–Richardson for-
mula 20 (KR20) or (21) or Cronbach’s coefficient alpha. Each of the internal consistency
techniques is based on there being as many parallel tests as there are items on the test. To
derive the variance of the composite score, Equation 7.20a is required. Equation 7.20b
illustrates the application of Equation 7.20a with data from Table 7.7.
Based on Equation 7.20b, the total variance of the composite using the data in Table 7.7 is 323.73.
To conclude this section, recall that earlier in this chapter individual test items com-
prising a test were viewed as parallel parts of a test. The requirements for parallel tests or
measurements include (1) equal mean true scores, (2) equal (item or test) standard devi-
ations, and (3) equal item (or test) variances. Specifically, test items (or total test scores)

Equation 7.20a. Observed score variance of a composite score derived from crystallized tests 1–4

$\sigma^2_{\text{CIQ}} = \sigma^2_{\text{CRYSTALLIZED1}} + \sigma^2_{\text{CRYSTALLIZED2}} + \sigma^2_{\text{CRYSTALLIZED3}} + \sigma^2_{\text{CRYSTALLIZED4}} + \sum_{i \ne j} \rho_{ij}\sigma_i\sigma_j$

• $\sigma^2_{\text{CIQ}}$ = variance of a composite score expressed as crystallized intelligence based on the sum of individual total test scores.
• $\sigma^2_{\text{CRYSTALLIZED1}}$ = variance of crystallized intelligence test 1.
• $\sigma^2_{\text{CRYSTALLIZED2}}$ = variance of crystallized intelligence test 2.
• $\sigma^2_{\text{CRYSTALLIZED3}}$ = variance of crystallized intelligence test 3.
• $\sigma^2_{\text{CRYSTALLIZED4}}$ = variance of crystallized intelligence test 4.
• $\sum_{i \ne j} \rho_{ij}\sigma_i\sigma_j$ = sum of the k(k − 1) covariance terms (i.e., k = intelligence tests 1–4), where i and j represent any pair of tests.

Equation 7.20b. Observed score variance of a composite score derived from crystallized tests 1–4 based on data in Table 7.7

$\sigma^2_{\text{CIQ}} = 47.12 + 24.93 + 12.40 + 21.66 + 2(108.81) = 106.11 + 217.62 = 323.73$

Table 7.7.  Composite Scores for Crystallized Intelligence Tests 1–4

Crystallized total    Crystallized total    Crystallized total    Crystallized total
score, test 1         score, test 2         score, test 3         score, test 4
39 14 23 17
47 17 24 24
28 8 14 12
29 6 19 11
27 5 22 17
35 11 18 11
44 15 25 22
36 5 17 15
42 17 22 21
36 6 18 19

Mean 36.3 10.4 20.2 16.9


SD 6.86 4.99 3.52 4.65
Variance 47.12 24.93 12.40 21.66

Variance–covariance matrix
47.12 28.64 15.93 25.48
— 24.93 11.69 14.71
— — 12.40 12.36
— — — 21.66

Total variance of the composite = 323.73

serve as individual, yet parallel, parts of a test, providing a way to estimate the coefficient
of reliability from a single test administration. Equation 7.21 provides a general form for
deriving true score variance of a composite. Equations 7.20a and 7.21 are general because
they can be used to estimate the variance of a composite when test scores exhibit unequal
standard deviations and variances (i.e., the equations allow for the covariation between
all items whether equal or unequal).

Equation 7.21. General form for true score variance of a composite

$\sigma^2_{\text{CIQ}} = \sigma^2_{\text{TRUE SCORE CRYSTALLIZED1}} + \sigma^2_{\text{TRUE SCORE CRYSTALLIZED2}} + \sigma^2_{\text{TRUE SCORE CRYSTALLIZED3}} + \sigma^2_{\text{TRUE SCORE CRYSTALLIZED4}} + \sum_{i \ne j} \rho_{ij}\sigma_i\sigma_j$
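To see Equations 7.20a and 7.20b at work, the following minimal Python sketch computes the composite variance directly from the Table 7.7 subtest scores (column order: tests 1–4); the same value results from taking the variance of the summed composite scores.

import numpy as np

scores = np.array([
    [39, 14, 23, 17],
    [47, 17, 24, 24],
    [28,  8, 14, 12],
    [29,  6, 19, 11],
    [27,  5, 22, 17],
    [35, 11, 18, 11],
    [44, 15, 25, 22],
    [36,  5, 17, 15],
    [42, 17, 22, 21],
    [36,  6, 18, 19],
])

cov = np.cov(scores, rowvar=False)                # 4 x 4 variance-covariance matrix (n - 1 divisor)
sum_variances = np.trace(cov)                     # 47.12 + 24.93 + 12.40 + 21.66
sum_unique_covariances = np.triu(cov, k=1).sum()  # 108.81

print(round(sum_variances + 2 * sum_unique_covariances, 2))  # 323.73
print(round(np.var(scores.sum(axis=1), ddof=1), 2))          # 323.73, computed directly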

Using the foundations of the CTT model, in the next section, we review several
techniques for estimating the coefficient of reliability in specific research or applied
situations.

7.11 Coefficient of Reliability: Methods of Estimation Based


on Two Occasions

Coefficient of Stability: Test–Retest Method


Estimating the stability of test scores involves administering the same test to the same
persons twice in as similar situations as possible. Once the data are collected, one cor-
relates the scores of two test administrations. Reliability estimation under this approach
yields a coefficient of stability. For example, a researcher may want to know how con-
sistently persons respond to the same test at different times. In this context, the interest
is in how stable a person’s observed scores are in relation to his or her true score on a trait
or attribute of interest (e.g., intelligence).
The test–retest method relies on two assumptions. The first assumption is that a per-
son’s true score is stable over time and, therefore, does not change. The second assump-
tion is that a person’s error scores are stable over time. These two assumptions provide
the basis for establishing the degree to which a group of persons’ scores exhibit equal
reliability over time. The main challenge to the test–retest method concerns the assumption that true scores for persons do not change over time. There are three reasons for chal-
lenging this assumption. First, constructs that reflect “states” such as mood or anxiety
are unlikely to remain stable over time (i.e., state-type attributes are highly variable over
time such as days or weeks). For this reason, if a test is measuring mental “states,” the
test–retest method for estimating reliability is seldom useful. Conversely, the construct
of adult intelligence is classified as a “trait” or attribute that is stable over time. For con-
structs that reflect traits, the test–retest method is often useful because it provides a basis
for establishing the degree to which a group of persons’ scores on a trait is equally reliable
over time.
The second challenge to the assumption of the lack of change in a person’s true
score over time is attributed to the length of the interval between the first and second
test administrations. The longer the interval between the first and second testing periods,
the greater the likelihood of change in the psychological attribute. If the time between
the first and second testing periods is too short (i.e., less than 14 days), the chances of
a carryover (memory or practice) or contamination (additional information acquired by
persons) effect are high. The ideal time between the first and second test administrations
is between 14 and 28 days (Nunnally & Bernstein, 1994). Regarding the acceptable level
of test–retest reliability coefficients for tests of ability or achievement on which significant diagnostic or educational decisions often hinge, values of at least .90 are recommended.

For personality, attitude, or interest inventories, test–retest coefficients are usually lower,
and the recommended range is between .80 and .90.
The final challenge to the test–retest method is related to chronological age. For
example, although research has established that adult intelligence is stable over time
(Wechsler, 1997b), this is not the case with the intelligence of children.

Coefficient of Equivalence: Parallel (Alternate) Forms Method


As previously stated, one way to define the reliability coefficient is the correlation
between two strictly parallel tests. The parallel or alternate forms approach to reli-
ability estimation directly incorporates this definition. The alternate forms approach
to reliability estimation is useful when having parallel forms of a test is desirable. For
example, parallel test forms may be useful (1) when persons are required to repeat
an examination with a short time period between the two testing occasions or (2) to
reduce the possibility of cheating when a single group of persons is taking a test in the
same location.
To use the parallel forms technique, one creates two tests that, as nearly as possible,
meet the requirement of strictly parallel tests. Recall that this requirement means that, for
a group of persons, (1) the same set of true scores is being measured and the true scores
are equal, and (2) error scores (or variances) are equal. If the requirements for strict par-
allelism are tenable, the two test forms are administered by using (1) the same persons in
a retest situation or (2) a group of persons taking two forms of the test at the same time.
Once the scores from the two tests are obtained, one proceeds by conducting a correla-
tion analysis between the scores obtained.
Perhaps the strongest criticism of the alternate forms method is that one can
argue that because two tests are composed of different items, the two forms can never
be exactly parallel—at least theoretically speaking. A second criticism of the alterna-
tive forms method is related to carryover or memory effects. Earlier in this chapter,
it was stated that in the true score model of parallel tests, error scores are required
to be uncorrelated. However, if a carryover effect exists, as is sometimes the case, the
errors of measurement for a group of persons will be correlated—sometimes substan-
tially. For these reasons, if the parallel forms method involves retesting the same per-
sons with an alternate form, the same concerns cited in the test–retest method apply
(i.e., carryover effects due to memory or additional information gleaned by persons
between testing occasions). In applied testing situations, if the researcher can demon-
strate strong evidence that the assumptions of the true score model of parallel tests are
tenable, then the alternate forms coefficient of reliability may be reported. Addition-
ally, in order to provide comprehensive evidence, the parallel forms method is often
accompanied by an estimate of internal consistency reliability—a subject covered in
the next section.

7.12 Methods Based on a Single Testing Occasion

Split-Half Methods
Often it is not possible or desirable to compose and administer two forms of a test, as
discussed earlier. Here we describe a method for deriving the reliability of total test scores
based on parallel half tests. The split-half approach to reliability estimation involves
dividing a test composed of a set of items into halves that, to the greatest degree pos-
sible, meet the assumptions of exact parallelism. The resulting scores on the respective
half tests are then correlated to provide a coefficient of equivalence. The coefficient of
equivalence is actually the reliability based on one of the half tests. However, remember that
owing to the assumption of parallel test halves, we can apply a formula for deriving the
reliability of scores on the total test using the Spearman–Brown formula. For tests com-
posed of items with homogeneous content (a.k.a. item homogeneity; Coombs, 1950),
the split-half method proceeds according to the following steps. First, after scores on the
total test are obtained, items are assigned to each half test in either (a) a random fashion
or (b) according to order of item difficulty. This process yields one parallel subtest that is
composed of odd-numbered items, and a second half test is composed of even-numbered
items. The split-half technique described allows one to create two parallel half tests that
are of equal difficulty and have homogeneous item content.
Earlier it was stated that two parallel half tests can be created with the intent to tar-
get or measure the same true scores with a high degree of accuracy. One way to ascertain
if two tests are parallel is to ensure that the half tests have equal means and standard
deviations. Also, the test items in the two half tests should have the same content (i.e.,
exhibit item homogeneity). A high level of item homogeneity ensures that, as the corre-
lation between the two half tests approaches 1.0, the approximation to equal true scores
is as accurate as possible. If, however, the two half tests comprise items with partially
heterogeneous content, then certain parts of the two half tests will measure different
true scores. In this case, the two half tests should be created based on matching test halves,
where test items have been matched on difficulty and content. Table 7.8 provides example

Table 7.8.  Split-Half Data for 10 Persons from the 25-Item Crystallized Intelligence Test 2
Half test 1 Half test 2
Odd items (total score) Even items (total score)
Mean 10.30 4.20
Variance 6.23 5.96

Variance of total test: 21.17


Odd/even correlation (rii’): 0.69
Split-half reliability: 0.85
Guttman split-half reliability: 0.85

data for illustrating the split-half and Guttman (1946) methods for estimating reliability
based on half tests. Rulon’s formula (1939) (equivalent to Guttman’s formula) does not
assume equal standard deviations (and variances) on the half test components. Finally,
when the variances on the half tests are approximately equal, the Rulon formula and
Guttman’s equation yield the same result as the split-half method with the Spearman–
Brown formula.
The SPSS syntax for computing the split-half reliability based on the model of paral-
lel tests (not strictly parallel) is provided below.

RELIABILITY
/VARIABLES=cri2_01 cri2_02 cri2_03 cri2_04 cri2_05 cri2_06
cri2_07 cri2_08 cri2_09 cri2_10 cri2_11 cri2_12 cri2_13 cri2_14
cri2_15 cri2_16 cri2_17 cri2_18 cri2_19 cri2_20 cri2_21 cri2_22
cri2_23 cri2_24 cri2_25
/SCALE('ALL VARIABLES') ALL
/MODEL=PARALLEL.

The resulting output is provided in Tables 7.9a and 7.9b.


Equation 7.22 can be extended to deriving the reliability of any composite (e.g.,
the parallel components may be subtest total scores rather than individual items).
Equation 7.23 illustrates Rulon’s formula, as applied by Guttman, for total test score
reliability. Rulon’s formula is based on the error variances on half tests and the total
test variance.

Table 7.9a. Test for Model Goodness of Fit


Chi-Square Value -20.653
df 323
Sig 1.000
Log of Determinant of Unconstrained Matrix .000
Constrained Matrix -44.767
Under the parallel model assumption

Table 7.9b. Reliability Statistics


Common Variance .184
True Variance .028
Error Variance .156
Common Inter-Item .151
Correlation
Reliability of Scale .816
Reliability of Scale (Unbiased)    .857

Equation 7.22. Spearman–Brown formula for total test score reliability based on the correlation between parallel split-halves

$\rho_{XX'} = \dfrac{2\rho_{ii'}}{1 + \rho_{ii'}}$

• $\rho_{ii'}$ = correlation between half tests.
• $\rho_{XX'}$ = split-half reliability based on the Spearman–Brown formula.

Equation 7.23. Rulon's formula for total test score reliability based on the variances of the parallel split-halves and the total test

$\rho_{XX'} = 2\left[1 - \left(\dfrac{\sigma^2_{\text{HALF TEST 1}} + \sigma^2_{\text{HALF TEST 2}}}{\sigma^2_{\text{TOTAL TEST}}}\right)\right]$
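A minimal Python sketch of Equations 7.22 and 7.23 follows. The Spearman–Brown call uses a hypothetical half-test correlation of .60, whereas the Rulon/Guttman call uses the half-test and total-test variances reported in Table 7.8; the function names are illustrative.

def spearman_brown(r_halves: float) -> float:
    """Equation 7.22: total test reliability stepped up from the half-test correlation."""
    return 2 * r_halves / (1 + r_halves)

def rulon_guttman(var_half1: float, var_half2: float, var_total: float) -> float:
    """Equation 7.23: split-half reliability that does not assume equal half-test variances."""
    return 2 * (1 - (var_half1 + var_half2) / var_total)

print(round(spearman_brown(0.60), 2))              # 0.75, for a hypothetical half-test r of .60
print(round(rulon_guttman(6.23, 5.96, 21.17), 2))  # approximately .85, as reported in Table 7.8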

The SPSS syntax for computing the Guttman model of reliability is as follows:

RELIABILITY
/VARIABLES=cri2_01 cri2_02 cri2_03 cri2_04 cri2_05 cri2_06
cri2_07 cri2_08 cri2_09 cri2_10 cri2_11 cri2_12 cri2_13 cri2_14
cri2_15 cri2_16 cri2_17 cri2_18 cri2_19 cri2_20 cri2_21 cri2_22
cri2_23 cri2_24 cri2_25
/SCALE('ALL VARIABLES') ALL
/MODEL=GUTTMAN.

The Guttman model provides six lower-bound coefficients (i.e., expressed as lambda
coefficients). The output for the Guttman reliability model is provided in Table 7.10. The
lambda 3 (L3) is based on estimates of the true variance of scores on each item and is also
expressed as the average covariance between items and is analogous to coefficient alpha.
Guttman’s lambda 4 is interpreted as the greatest split-half reliability.

Table 7.10.  Reliability Statistics
Lambda 1 .783
2 .865
3 .816
4 .848
5 .830
6 .
N of Items 25

Internal Consistency: Methods Based on Covariation among Items


The final section of this chapter introduces approaches based on covariation among or
between test items. The methods presented here were developed to provide a way to
estimate the coefficient of reliability from a single test administration without splitting
the single test into parallel halves. Specifically, the methods presented in this chapter
include coefficient alpha, the Küder–Richardson 20, and the Küder–Richardson 21
formulas.

Coefficient Alpha
The first and most general technique for the estimation of internal consistency reliability
is known as coefficient alpha and is attributed to L. J. Cronbach (1916–2001). In his work
(1951), Cronbach provided a general formula for deriving the internal consistency of scores.
Coefficient alpha is a useful formula because of its generality. For example, alpha is effective
for estimating score reliability for test items that are scored dichotomously (correct/­incorrect),
or for items scored on an ordinal level of measurement (e.g., Likert-type or rating scale items)
and even for essay-type questions that often include differential scoring weights. For these
reasons, coefficient alpha is reported in the research literature more often than any other
coefficient. The general formula for coefficient alpha is provided in Equation 7.24. Table 7.11
includes summary data for 10 persons on the 25-item crystallized intelligence test 2 used in
the previous section on split-half methods.
The total test variance for the crystallized intelligence test 2 is 19.05 (defined as the sum of the squared deviations from the mean divided by the number of persons, N = 10) for this example data. Read-
ers are encouraged to conduct the calculation of coefficient alpha using the required parts
of Equation 7.24 by accessing the raw item-level Excel file: “Reliability_Calculation_
Examples.xlsx” on the companion website (www.guilford.com/price2-materials). Knowing
that the test is composed of 25 items, the total test variance is 19.05 and the sum of the

Equation 7.24. Coefficient alpha

$\hat{\alpha} = \dfrac{k}{k-1}\left(1 - \dfrac{\sum \hat{\sigma}^2_i}{\hat{\sigma}^2_X}\right)$

• $\hat{\alpha}$ = coefficient alpha.
• k = number of items.
• $\hat{\sigma}^2_i$ = variance of item i.
• $\hat{\sigma}^2_X$ = total test variance.

Table 7.11.  Item Summary Data for 10 Persons from Crystallized Intelligence Test 2
Proportion Proportion
correct incorrect Item variance
Item p q p*q
1 0.9 0.1 0.09
2 0.9 0.1 0.09
3 0.8 0.2 0.16
4 0.8 0.2 0.16
5 0.9 0.1 0.09
6 0.8 0.2 0.16
7 0.9 0.1 0.09
8 0.9 0.1 0.09
9 0.6 0.4 0.24
10 0.7 0.3 0.21
11 0.7 0.3 0.21
12 0.6 0.4 0.24
13 0.8 0.2 0.16
14 0.8 0.2 0.16
15 0.6 0.4 0.24
16 0.7 0.3 0.21
17 0.4 0.6 0.24
18 0.3 0.7 0.21
19 0.3 0.7 0.21
20 0.2 0.8 0.16
21 0.3 0.7 0.21
22 0.2 0.8 0.16
23 0.2 0.8 0.16
24 0.1 0.9 0.09
25 0.1 0.9 0.09
Σp = 14.5 Σp*q = 4.13

item-level variances is 4.13, we can insert these values into Equation 7.24 and derive the
coefficient alpha as .82.
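A minimal Python sketch of Equation 7.24 using these summary values (k = 25 items, sum of item variances = 4.13, total test variance = 19.05) is shown below; the function name is illustrative.

def coefficient_alpha(k: int, sum_item_variances: float, total_variance: float) -> float:
    """Equation 7.24: coefficient alpha from item variances and total test variance."""
    return (k / (k - 1)) * (1 - sum_item_variances / total_variance)

print(round(coefficient_alpha(25, 4.13, 19.05), 2))  # approximately .82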

7.13 Estimating Coefficient Alpha: Computer Program and Example Data

The SPSS syntax and SAS source code that produces output using the data file .sav is
provided on the next page. The dataset may be downloaded from the companion website
(www.guilford.com/price2-materials).

SPSS program syntax for coefficient alpha using data file Coefficient_Alpha_
Reliability_N_10_Data.SAV

RELIABILITY
/VARIABLES=cri2_01 cri2_02 cri2_03 cri2_04 cri2_05 cri2_06
cri2_07 cri2_08 cri2_09 cri2_10 cri2_11 cri2_12 cri2_13
cri2_14 cri2_15 cri2_16 cri2_17 cri2_18 cri2_19 cri2_20 cri2_21
cri2_22 cri2_23 cri2_24 cri2_25
/SCALE('ALL VARIABLES') ALL
/MODEL=ALPHA
/STATISTICS=DESCRIPTIVE SCALE
/SUMMARY=TOTAL.

Tables 7.12a–d are derived from the SPSS program.

Table 7.12a.  Reliability Statistics


Cronbach’s Alpha N of Items
.816 25

Table 7.12b.  Item Statistics


Mean Std. Deviation N
cri2_01 .90 .316 10
cri2_02 .90 .316 10
cri2_03 .90 .316 10
cri2_04 .80 .422 10
cri2_05 .90 .316 10
cri2_06 .80 .422 10
cri2_07 .80 .422 10
cri2_08 .90 .316 10
cri2_09 .60 .516 10
cri2_10 .70 .483 10
cri2_11 .70 .483 10
cri2_12 .60 .516 10
cri2_13 .80 .422 10
cri2_14 .80 .422 10
cri2_15 .60 .516 10
cri2_16 .70 .483 10
cri2_17 .40 .516 10
cri2_18 .30 .483 10
cri2_19 .30 .483 10
cri2_20 .20 .422 10
cri2_21 .30 .483 10
cri2_22 .20 .422 10
cri2_23 .20 .422 10
cri2_24 .10 .316 10
cri2_25 .10 .316 10

Table 7.12c.  Item–Total Statistics


Cronbach’s
Scale Mean if Scale Variance if Corrected Item- Alpha if Item
Item Deleted Item Deleted Total Correlation Deleted
cri2_01 13.60 21.378 -.106 .824
cri2_02 13.60 20.489 .202 .815
cri2_03 13.60 20.267 .281 .812
cri2_04 13.70 18.233 .765 .791
cri2_05 13.60 21.378 -.106 .824
cri2_06 13.70 18.233 .765 .791
cri2_07 13.70 19.344 .443 .806
cri2_08 13.60 19.156 .690 .799
cri2_09 13.90 18.544 .530 .800
cri2_10 13.80 22.178 -.274 .839
cri2_11 13.80 18.844 .498 .802
cri2_12 13.90 19.433 .322 .811
cri2_13 13.70 18.456 .699 .794
cri2_14 13.70 18.233 .765 .791
cri2_15 13.90 19.656 .272 .814
cri2_16 13.80 17.511 .847 .784
cri2_17 14.10 17.878 .692 .791
cri2_18 14.20 18.400 .611 .796
cri2_19 14.20 21.733 -.178 .834
cri2_20 14.30 20.233 .199 .816
cri2_21 14.20 20.844 .020 .825
cri2_22 14.30 19.344 .443 .806
cri2_23 14.30 20.233 .199 .816
cri2_24 14.40 20.267 .281 .812
cri2_25 14.40 21.156 -.031 .822

Table 7.12d.  Scale Statistics


Mean Variance Std. Deviation N of Items
14.50 21.167 4.601 25

SAS source code for coefficient alpha using SAS data file alpha_reliability_data

libname work 'LPrice_09';


data temp; set work.alpha_reliability_data;
proc corr data=temp nosimple alpha;
Title 'Coefficient Alpha using Crystallized Intelligence Example
Data N=10 ';
var cri2_01 - cri2_25;
run; quit;

Table 7.13 is produced from the SAS program.



Table 7.13.  SAS Output for Coefficient Alpha


Coefficient Alpha using Crystallized Intelligence Example Data N=10 1

10:45 Tuesday, November 15, 2011

The CORR Procedure


25 Variables:   CRI2_01 CRI2_02 CRI2_03 CRI2_04 CRI2_05 CRI2_06 CRI2_07 CRI2_08
                CRI2_09 CRI2_10 CRI2_11 CRI2_12 CRI2_13 CRI2_14 CRI2_15 CRI2_16
                CRI2_17 CRI2_18 CRI2_19 CRI2_20 CRI2_21 CRI2_22 CRI2_23 CRI2_24
                CRI2_25

Cronbach Coefficient Alpha


Variables Alpha
----------------------------
Raw 0.815836
Standardized 0.808206

Cronbach Coefficient Alpha with Deleted Variable


Raw Variables Standardized Variables

Correla- Correla-
Deleted tion with tion with
Variable Total Alpha Total Alpha Label
-----------------------------------------------------------------------
CRI2_01 -.106391 0.824370 -.117827 0.821956 cri2_01
CRI2_02 0.201823 0.814864 0.176187 0.809214 cri2_02
CRI2_03 0.280976 0.812357 0.269409 0.805024 cri2_03
CRI2_04 0.765257 0.791034 0.766489 0.781409 cri2_04
CRI2_05 -.106391 0.824370 -.139250 0.822857 cri2_05
CRI2_06 0.765257 0.791034 0.766489 0.781409 cri2_06
CRI2_07 0.443376 0.805534 0.423210 0.797949 cri2_07
CRI2_08 0.690412 0.798951 0.664913 0.786412 cri2_08
CRI2_09 0.529629 0.800271 0.518662 0.793454 cri2_09
CRI2_10 -.273526 0.838547 -.252589 0.827561 cri2_10
CRI2_11 0.498087 0.802297 0.516984 0.793534 cri2_11
CRI2_12 0.322139 0.811395 0.307313 0.803299 cri2_12
CRI2_13 0.699294 0.794074 0.689362 0.785216 cri2_13
CRI2_14 0.765257 0.791034 0.766489 0.781409 cri2_14
CRI2_15 0.271781 0.814019 0.293075 0.803948 cri2_15
CRI2_16 0.846512 0.783933 0.851875 0.777130 cri2_16
CRI2_17 0.692078 0.791202 0.700026 0.784693 cri2_17
CRI2_18 0.611315 0.796471 0.625044 0.788351 cri2_18
CRI2_19 -.177627 0.834356 -.192228 0.825069 cri2_19
CRI2_20 0.199188 0.815987 0.203809 0.807980 cri2_20
CRI2_21 0.020153 0.825438 -.003166 0.817071 cri2_21
CRI2_22 0.443376 0.805534 0.470516 0.795731 cri2_22
CRI2_23 0.199188 0.815987 0.215163 0.807471 cri2_23
CRI2_24 0.280976 0.812357 0.297003 0.803769 cri2_24
CRI2_25 -.030557 0.822068 -.048887 0.819032 cri2_25

7.14 Reliability of Composite Scores Based on Coefficient Alpha

In reality, tests rarely meet the assumptions required of strictly parallel forms. Therefore,
a framework is needed for estimating composite reliability when the model of strictly
parallel tests is untenable. Estimating the composite reliability of scores in the case of
essentially tau-equivalent or congeneric tests is accomplished using the variance of the
composite scores and all of the covariance components of the subtests (or individual
items if one is working with a single test). An estimate is provided that is analogous to
coefficient alpha and is simply an extension from the item-level data to subtest level data
structures. Importantly, alpha provides a lower bound to the estimation of reliability in the
situation where tests are nonparallel. The evidence that coefficient alpha provides a lower
bound estimate of reliability is established as follows. First, there will be at least one
subtest of those comprising a composite variable that exhibits a variance greater than or
equal to its covariance with any other of the subtests. Second, for any two tests that are
not strictly parallel, the sum of their true score variances is greater than or equal to twice
their covariance. Finally, the sum of the true score variance for nonparallel tests (k) will
be greater than or equal to the sum of their k(k – 1) covariance components divided by
(k – 1). Application of the inequality yields Equation 7.25.

Equation 7.25. Reliability of a composite equivalent to coefficient alpha

$\rho_{CC'} \ge \dfrac{k}{k-1}\left(1 - \dfrac{\sum \hat{\sigma}^2_i}{\hat{\sigma}^2_C}\right)$

• $\rho_{CC'}$ = reliability of the composite.
• $\sum \hat{\sigma}^2_i$ = sum of the variances of the subtests i.
• $\hat{\sigma}^2_C$ = total composite test variance.
• k = number of subtests.

Küder–Richardson Formulas 20 (KR20) and 21 (KR21)


In 1937, Küder and Richardson developed two formulas aimed at solving the problem of
the lack of a unique solution provided by the split-half method of reliability estimation.
Specifically, the Küder–Richardson approaches are based on item-level statistical proper-
ties rather than the creation of two parallel half tests. The two formulas developed, KR20
and KR21, are numbered according to the steps involved in their derivation. Both KR20
and KR21 are closely related to coefficient alpha. In fact, the two formulas can be viewed
as more restrictive versions of coefficient alpha. For example, the KR20 formula is only
applicable to dichotomously (correct/incorrect) scored items (Equation 7.26).
To explain, notice that the numerator inside the brackets of Equation 7.26 is the sum
of the product of the proportion of persons correctly responding to each item on the

Equation 7.26. Küder–Richardson formula 20

$KR_{20} = \dfrac{k}{k-1}\left(1 - \dfrac{\sum pq}{\hat{\sigma}^2_X}\right)$

• $KR_{20}$ = coefficient alpha for dichotomously scored items.
• k = number of items.
• pq = variance of item i as the product of the proportion of correct and the proportion of incorrect responses over persons.
• $\hat{\sigma}^2_X$ = total test score variance.

test multiplied by the proportion of persons responding incorrectly to each item on the
test. Comparing Equation 7.24 for coefficient alpha, we see that the numerator within
the brackets involves summation of the variance of all test items. The primary difference
between the two equations is that in KR20 the variance for dichotomous items is based
on multiplying proportions, whereas in coefficient alpha the derivation of item variance
is not restricted to multiplying the proportion correct times the proportion incorrect for
an item because items are allowed to be scored on an ordinal or interval level of mea-
surement (e.g., Likert-type scales or continuous test scores on an interval scale). Finally,
where all test items are of equal difficulty (e.g., the proportion correct for all items are
equal), the KR21 formula applies and is provided in Equation 7.27.
For a detailed exposition of the KR20, KR21, and coefficient alpha formulas with sam-
ple data, see the Excel file titled “Reliability_Calculation_Examples.xlsx” located on the
companion website (www.guilford.com/price2-materials).

Equation 7.27. Küder–Richardson formula 21

$KR_{21} = \dfrac{k}{k-1}\left[1 - \dfrac{\hat{\mu}(k - \hat{\mu})}{k\,\hat{\sigma}^2_X}\right]$

• k = number of items.
• $\hat{\mu}$ = mean of the total scores on the test.
• $\hat{\sigma}^2_X$ = total test score variance.
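A minimal Python sketch of Equations 7.26 and 7.27 is given below, plugging in the summary values for crystallized intelligence test 2 (k = 25, mean total score = 14.5, total test variance = 19.05, sum of p*q = 4.13); the function names are illustrative.

def kr20(k: int, sum_pq: float, total_variance: float) -> float:
    """Equation 7.26: KR20 for dichotomously scored items."""
    return (k / (k - 1)) * (1 - sum_pq / total_variance)

def kr21(k: int, mean_total: float, total_variance: float) -> float:
    """Equation 7.27: KR21, which additionally assumes items of equal difficulty."""
    return (k / (k - 1)) * (1 - (mean_total * (k - mean_total)) / (k * total_variance))

print(round(kr20(25, 4.13, 19.05), 2))  # approximately .82, the same value as coefficient alpha
print(round(kr21(25, 14.5, 19.05), 2))  # approximately .71; lower because item difficulties are not all equal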

7.15 Reliability Estimation Using the Analysis of Variance Method

Another useful and general approach to estimating the reliability of test scores is the
analysis of variance (Hoyt, 1941). Consider the formulas for coefficient alpha, KR20 and
KR21. Close inspection reveals that the primary goal of these formulas is the partitioning
of (1) variance attributed to individual items and (2) total variance collectively contrib-
uted by all items on a test. Similarly, in the analysis of variance (ANOVA), one can parti-
tion the variance among persons and items, yielding the same result as coefficient alpha.
The equation for the ANOVA method (Hoyt, 1941) is provided in Equation 7.28.
To illustrate Equation 7.28 using example data, we return to the data used in the
examples for coefficient alpha. Restructuring the data file as presented in Table 7.14
ensures the correct layout for running ANOVA in SPSS. Note that Table 7.14 only pro-
vides a partial listing of the data (because there are 25 items on the test) used in the
example results depicted in Table 7.15.
The data layout example in Table 7.14 continues until all persons, items, and scores
are entered. Next, the following SPSS syntax is used to produce the mean squares required
for calculation of the reliability coefficient.

SPSS syntax to produce Table 7.15

UNIANOVA score BY person item
  /METHOD=SSTYPE(3)
  /CRITERIA=ALPHA(.05)
  /DESIGN=person item person*item.

Inserting the mean squares for persons and the person by items interaction yields a reliability coefficient of .82, the same value as that which resulted using the formula for coefficient alpha. Applying the person and person by item mean squares to the ANOVA approach yields $\rho_{XX'}$ = (.847 − .156)/.847 = .82.

Equation 7.28. ANOVA method for estimating the coefficient of reliability

$\rho_{XX'} = \dfrac{MS_{\text{PERSONS}} - MS_{\text{PERSONS} \times \text{ITEMS}}}{MS_{\text{PERSONS}}}$

• $\rho_{XX'}$ = coefficient of reliability.
• $MS_{\text{PERSONS}}$ = variability attributed to persons.
• $MS_{\text{PERSONS} \times \text{ITEMS}}$ = variability attributed to persons and items together.

Table 7.14.  Data Layout for Reliability Estimation Using SPSS ANOVA
Person Item Score
1 1 1
2 1 1
3 1 1
4 1 1
5 1 1
6 1 0
7 1 1
8 1 1
9 1 1
10 1 1
Note. Table consists of 10 persons, the first item out of 25, and persons’
scores on item 1.

Table 7.15.  ANOVA Output: Tests of Between-Subjects Effects

Dependent Variable: score
Source            Type III Sum of Squares    df                       Mean Square
person              7.620                    (n − 1) = 9               .847
item               19.600                    (k − 1) = 24              .817
person * item      33.680                    (n − 1)(k − 1) = 216      .156
Total             145.000                    250
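For readers who want to reproduce the ANOVA approach outside SPSS, a minimal Python sketch of Equation 7.28 is shown below. The small 0/1 score matrix is hypothetical (it is not the 10-person, 25-item dataset used above), and the function name is illustrative.

import numpy as np

def hoyt_reliability(scores: np.ndarray) -> float:
    """Equation 7.28: (MS_persons - MS_persons_x_items) / MS_persons."""
    n_persons, n_items = scores.shape
    grand_mean = scores.mean()
    ss_persons = n_items * ((scores.mean(axis=1) - grand_mean) ** 2).sum()
    ss_items = n_persons * ((scores.mean(axis=0) - grand_mean) ** 2).sum()
    ss_residual = ((scores - grand_mean) ** 2).sum() - ss_persons - ss_items
    ms_persons = ss_persons / (n_persons - 1)
    ms_residual = ss_residual / ((n_persons - 1) * (n_items - 1))
    return (ms_persons - ms_residual) / ms_persons

hypothetical_scores = np.array([
    [1, 1, 1, 0, 1],
    [1, 0, 1, 0, 0],
    [1, 1, 0, 1, 1],
    [0, 0, 1, 0, 0],
    [1, 1, 1, 1, 0],
])
print(round(hoyt_reliability(hypothetical_scores), 2))  # 0.44, identical to coefficient alpha for this matrix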
       

7.16 Reliability of Difference Scores

An important aspect of score reliability for certain types of research relates to how change over
time affects the reliability of scores (Linn & Slinde, 1977; Zimmerman & Williams, 1982;
Rogosa, Brandt, & Zimowski, 1982). For example, consider the case where a difference score
based on fluid intelligence and crystallized intelligence is of interest for diagnostic reasons.
Although the primary research question may be about whether the change in score level is
statistically different, a related question focuses on how reliability is affected by the change in
score level. To address the previous question, we consider the reliability of change scores as
a function of (1) the reliability of the original scores used for computation of the difference

score, and (2) the correlation between the scores obtained on the two tests. Based on these
two components, the usefulness of calculating the reliability of change scores depends on the
psychometric quality of the measurement instruments.
The research design of a study plays a crucial role in the application and interpreta-
tion of the reliability of change scores. For example, if groups of subjects selected for a
study are based on a certain range of pretest score values, then the difference score will
be a biased estimator of reliable change (e.g., due to restricted range of pretest scores).
Elements of the research design also play an important role when using change scores.
For example, random assignment to study groups provides a way to make inferential
statements that are not possible when studying intact groups. Equation 7.29 provides the
formula estimating the reliability of difference scores based on pretest to posttest change.
Note that Equation 7.29 incorporates all of the elements of reliability theory presented
thus far in this chapter. Within the true score model, one begins with the fact that it is
theoretically possible to calculate a difference score. Given this information, the usual
true score algebraic manipulation (i.e., true scores to observed scores) applies. Equation
7.29 illustrates the reliability of difference scores.
To illustrate the use of Equation 7.29, we use crystallized (serving as test 1) and fluid
intelligence (serving as test 2) subtest total scores. In Equation 7.30, our score data are applied. The following information is obtained from the GfGc.sav dataset and is based on the total sample (N = 1,000).

Equation 7.29. Reliability of difference scores

$\rho_{DD'} = \dfrac{\hat{\rho}_{X_1X_1'}\sigma^2_{X_1} + \hat{\rho}_{X_2X_2'}\sigma^2_{X_2} - 2\rho_{X_1X_2}\sigma_{X_1}\sigma_{X_2}}{\sigma^2_{X_1} + \sigma^2_{X_2} - 2\rho_{X_1X_2}\sigma_{X_1}\sigma_{X_2}}$

• $\rho_{DD'}$ = reliability of a difference score.
• $\hat{\rho}_{X_1X_1'}$ = reliability of test 1.
• $\hat{\rho}_{X_2X_2'}$ = reliability of test 2.
• $\sigma^2_{X_1}$ = variance of scores on test 1.
• $\sigma^2_{X_2}$ = variance of scores on test 2.
• $2\rho_{X_1X_2}$ = two times the correlation between tests 1 and 2.
• $\sigma_{X_1}\sigma_{X_2}$ = product of the standard deviations of tests 1 and 2.

Equation 7.30. Application of the equation for the reliability of difference scores using the statistics in Table 7.16

$\rho_{DD'} = \dfrac{.95(502.21) + .89(129.50) - 2(.463)(22.41)(11.38)}{502.21 + 129.50 - 2(.463)(22.41)(11.38)}$

$= \dfrac{477.10 + 115.25 - (.926)(255.02)}{631.71 - (.926)(255.02)}$

$= \dfrac{592.35 - 236.15}{631.71 - 236.15}$

$= \dfrac{356.20}{395.56}$

$= .90$

Table 7.16.  Descriptive Statistics and Reliability Estimates for Crystallized and Fluid Intelligence Tests
Crystallized intelligence subtest Fluid intelligence subtest
total score (test 1) total score (test 2)
Mean 81.57 33.00
Standard deviation 22.41 11.38
Reliability 0.95 0.89
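A minimal Python sketch of Equations 7.29 and 7.30, using the Table 7.16 values and the correlation of .463 between the two tests, is shown below (the function name is illustrative).

def difference_score_reliability(rel1, rel2, sd1, sd2, r12):
    """Equation 7.29: reliability of the difference between scores on two tests."""
    var1, var2 = sd1 ** 2, sd2 ** 2
    covariance_term = 2 * r12 * sd1 * sd2
    return (rel1 * var1 + rel2 * var2 - covariance_term) / (var1 + var2 - covariance_term)

print(round(difference_score_reliability(0.95, 0.89, 22.41, 11.38, 0.463), 2))  # approximately .90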

7.17 Application of the Reliability of Difference Scores

To ensure the existence of highly reliable difference scores, the following conditions should
be present. Both tests (i.e., scores) should exhibit high reliability but be correlated with each
other at a low to moderate level (e.g., .30–.40). This situation produces reliability of differ-
ence scores that are high. Finally, the psychometric quality of the tests used to derive dif-
ference scores for the analysis of change is crucial to produce reliable change scores. The
concept of the reliability of change scores over time can also be extended beyond the analysis
of discrepancy between different constructs (e.g., crystallized and fluid intelligence presented
here) or basic pretest to posttest analyses to analyze change over time. For example, analytic
techniques such as longitudinal item response theory (IRT; covered in Chapter 10) and hier-
archical linear and structural equation modeling provide powerful frameworks for the analy-
sis of change (Muthen, 2007; Zimmerman, Williams, & Zumbo, 1993; Raudenbush, 2001;
Card & Little, 2007).

7.18 Errors of Measurement and Confidence Intervals

Reliability has been presented so as to provide information regarding the consistency or stability of test scores. Alternatively, it is also useful to view how "unreliable" test scores are. Such unreliability is regarded as a discrepancy between observed scores and true scores and is expressed as the error of measurement relative to scores on a test. In this section, three different approaches to deriving estimates of errors of measurement are presented, along with the interpretation of each using example data. These three approaches are from Lord and Novick (1968, pp. 67–68). The first technique presented is the standard error of measurement, $\hat{\sigma}_{X.T} = \hat{\sigma}_E = \sigma_X\sqrt{1 - \hat{\rho}_{XX'}}$, and is based on the error in predicting a person's observed score given the person's true score on randomly parallel tests. The second technique is the standard error of estimation, $\hat{\sigma}_{T.X} = \sigma_X\sqrt{\hat{\rho}_{XX'}(1 - \hat{\rho}_{XX'})}$, and is based on the error in predicting a person's true score from his or her observed score. It is useful for establishing confidence limits and intervals for true scores based on observed scores (i.e., based on the standard deviation of the errors of estimation of true score given an observed score). The third technique is the standard error of prediction, $\hat{\sigma}_{Y.X} = \sigma_Y\sqrt{1 - \rho^2_{XX'}}$, and is useful for predicting scores on test form Y from parallel test form X. The next section provides application of the SEM and the standard error of prediction.

7.19 Standard Error of Measurement

The standard error of measurement (SEM; ŝE) provides an estimate of the discrepancy
between a person’s true score and observed score on a test of interest. Measurement error
for test scores is often expressed in standard deviation units, and the SEM indexes the stan-
dard deviation of the distribution of measurement error. Formally, the SEM (ŝE) is defined as
the standard deviation of the discrepancy between a person’s true score and observed score
over infinitely repeated testing occasions. Gulliksen (1950b, p. 43) offered an intuitive defi-
nition of the SEM as “the error of measurement made in substituting the observed score for
the true score.” Equation 7.31 illustrates the standard error of measurement.

Equation 7.31. Population SEM

$\hat{\sigma}_E = \sigma_X\sqrt{1 - \hat{\rho}_{XX'}}$

• $\hat{\sigma}_E$ = population standard error of measurement.
• $\sigma_X$ = observed score population standard deviation.
• $\hat{\rho}_{XX'}$ = coefficient of reliability based on scores on a test.

When applying Equation 7.31 to score data, sample estimates rather than population
parameters are typically used to estimate the SEM.
The SEM provides a single index of measurement error for a set of test scores. It can
be used for establishing confidence limits and developing a confidence interval around
a person’s observed score given the person’s estimated true score. Within classical test theory,
a person’s true score is fixed (or constant), and it is the observed and error scores that ran-
domly fluctuate over repeated testing occasions (Lord & Novick, 1968, p. 56). One can
derive confidence limits and an associated interval for observed scores using the SEM.
However, because a person’s true score is of primary interest in the true score model, one
should first estimate the true score for a person prior to using Equation 7.31 to derive
confidence intervals.
Two problems occur when not accounting for true score: (1) a regression effect (i.e.,
the imperfect correlation between observed and true scores, which produces a regression
toward the group mean), and (2) the impact of heteroscedastic (nonuniform) errors
across the score continuum (Nunnally & Bernstein, 1994, p. 240). Consequently, sim-
ply using the SEM has the effect of overcorrecting owing to larger measurement error in
observed scores as compared to true scores. Confidence intervals established without
estimating true scores will lack symmetry (i.e., lack the correct precision across the score
scale) around observed scores. To address the issue of regression toward the mean due
to errors of measurement, Stanley (1970), Nunnally and Bernstein (1994), and Glutting,
McDermott, and Stanley (1987) note that one should first estimate true scores for a per-
son and then derive estimated true score–based confidence intervals that can be used
with observed scores. This step, illustrated in Equation 7.32, overcomes the problem
of lack of symmetry from simply applying the SEM to derive confidence intervals for
observed scores.
As an example, consider estimating a true score for a person who obtained an
observed score of 17. Returning to Tables 7.3 and 7.4, we see that the mean is 11.50, the

Equation 7.32. Estimated true score derived as a deviation-based observed score multiplied by the reliability estimate and corrected in relation to the group mean

$\hat{T} = \hat{\rho}_{XX'}(X_i - \bar{X}) + \bar{X}$

• $\hat{T}$ = estimated true score.
• $X_i$ = observed score for a person.
• $\bar{X}$ = mean score for a group of persons.
• $X_i - \bar{X}$ = deviation score for person i.
• $\hat{\rho}_{XX'}$ = coefficient of reliability.

standard deviation of observed scores is 4.3, and the reliability is .82. Applying this information to Equation 7.32 yields the result shown in Equation 7.33.
As noted earlier, lack of symmetry for confidence intervals derived with an SEM
without first estimating true scores neglects accounting for a regression effect. The regres-
sion effect causes biased scores either upward or downward, depending on their location
relative to the group mean. For example, high observed scores typically fall further from the group mean than the corresponding true scores (i.e., they exhibit an upward bias), and low observed scores are typically biased downward relative to the corresponding true scores. For these reasons, it is
correct to establish confidence intervals or probable ranges for a person’s observed score given
their (fixed or regressed) true score. Using the estimated true score for a person from Equation 7.33, one can apply the SEM in Equation 7.34a to derive a symmetric confidence interval for true scores that can then be applied to a person's observed scores. The SEM in Equation 7.34a is written as $\hat{\sigma}_{X.T}$ to show that applying it to estimated true scores yields the prediction of

Equation 7.33. Estimated true score expressed as a regressed observed score using reliability of .82, observed score of 17, and group mean of 11.50

$\hat{T} = .82(17 - 11.5) + 11.5 = .82(5.5) + 11.5 = 4.51 + 11.5 = 16.01$

Equation 7.34a. SEM expressed as the prediction of observed score on true score

$\hat{\sigma}_{X.T} = \sigma_X\sqrt{1-\hat{\rho}_{XX'}}$

• $\hat{\sigma}_{X.T}$ = standard error of measurement as the prediction of observed score from true score.
• $\sigma_X$ = observed score population standard deviation.
• $\hat{\rho}_{XX'}$ = coefficient of reliability based on scores on a test.

Equation 7.34b. Illustration of Equation 7.34a

$\hat{\sigma}_{X.T} = \sigma_X\sqrt{1-\hat{\rho}_{XX'}} = 4.3\sqrt{1-.82} = 4.3(.42) = 1.82$

• $\hat{\sigma}_{X.T}$ = standard error of measurement as the prediction of observed score from true score.
• $\sigma_X$ = observed score population standard deviation.
• $\hat{\rho}_{XX'}$ = coefficient of reliability based on scores on a test.

observed scores from true scores. The resulting confidence intervals will be symmetric
about a person’s true score but asymmetric about their observed score. This approach
to developing confidence intervals is necessary in order to account for regression toward the
mean test score.
Equation 7.35a provides the following advantages. First, Stanley's method is based on a score metric that is expressed in estimated true score units (i.e., $\hat{T} - T'$, where $T'$ is the predicted true score) (Glutting et al., 1987). Second, as Stanley (1970) demonstrated, his

Equation 7.35a. Stanley’s method for establishing confidence


limits—expressed in true score units—based on estimated true scores

Tˆ ± (Z )( sˆ X.T )( rˆ XX ′ )

• T̂ = estimated true score.


• z = standard normal deviate (e.g., 1.96).
• ŝX.T = standard error of measurement as the prediction of
observed score from true score.
• ­r̂XX¢ = coefficient of reliability.

Equation 7.35b. Application of Stanley's method for establishing a 95% confidence interval for observed scores based on an estimated true score of 16.01

$\hat{T} \pm (1.96)(1.82)(.82) = 16.01 \pm (1.96)(1.5) = 16.01 \pm 2.94 = 13.07$ to $18.95$

• $\hat{T}$ = estimated true score (16.01).
• z = standard normal deviate (e.g., 1.96).
• $\hat{\sigma}_{X.T}$ = standard error of measurement as the prediction of observed score from true score (1.82).
• $\hat{\rho}_{XX'}$ = coefficient of reliability (.82).

method adheres to the classical true score model assumption that states, for a population
of examinees, errors of measurement exhibit zero correlation with true scores.

Interpretation
To facilitate understanding that a person’s true score will fall within a confidence interval
based on that person’s observed score, consider the following scenario. First, using the
previous example, let’s assume that a person’s true score is 16, the reliability is .82, and the
standard error of measurement is 1.82. Next, let’s assume that this person is repeatedly
tested 1,000 times. Of the 1,000 repeated testing occasions, 950 (95%) would lie within
2.94 points of their true score (e.g., between 13.07 and 18.95). Fifty scores would fall
outside of the interval 13.07 to 18.95. Finally, if a confidence interval is derived for each
of the person’s 1,000 observed scores, 950 of the intervals would be generated around
observed scores between 13.07 and 18.95 (each interval would contain the person’s true
score). From the previous explanation, we see that 5% of the time the person’s true score
would not fall within the interval 13.07 to 18.95. However, there is a 95% chance that the
confidence interval generated around the observed score of 16 will contain the person’s
true score.
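The values above can be reproduced with a few lines of SPSS syntax. The following sketch is illustrative only—the variable names x, xbar, sdx, and rxx are arbitrary placeholders rather than names from the chapter's data—and minor differences from the hand calculations reflect rounding of intermediate values.

SPSS syntax (illustrative) for the estimated true score, SEM, and Stanley's 95% confidence limits

* Equations 7.32, 7.34a, and 7.35a with the chapter's example values.
DATA LIST FREE / x xbar sdx rxx.
BEGIN DATA
17 11.50 4.3 .82
END DATA.
COMPUTE that = rxx * (x - xbar) + xbar.
COMPUTE semxt = sdx * SQRT(1 - rxx).
COMPUTE lo95 = that - 1.96 * semxt * rxx.
COMPUTE hi95 = that + 1.96 * semxt * rxx.
FORMATS that semxt lo95 hi95 (F8.2).
LIST.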
A common alternate approach to establishing confidence limits and intervals offered
by Lord and Novick (1968, pp. 68–70) does not always meet the classical true score
model requirement of zero correlation between true and error scores—unless the reliabil-
ity of the test is perfect (i.e., 1.0). Lord and Novick’s (1968, p. 68) approach is expressed
in obtained score units (e.g., Tˆ − T) and is provided in Equation 7.36a.

Continuing with Lord and Novick’s approach, we will next illustrate the probability
that a person’s true score will fall within a confidence interval based on their observed
score. Again, we assume that a person’s true score is 16 and that the standard error of
measurement is 1.82. Next, let’s assume that this person is repeatedly tested 1,000 times.
Of the 1,000 repeated testing occasions, 950 (95%) would lie within 3.25 points of their
true score (e.g., between 12.76 and 19.26). Notice that the confidence interval is wider
in Lord and Novick’s method (see Equation 7.36a) because the product of the z-ordinate
and the estimated standard error is multiplied by the square root of the reliability. Fifty
scores would fall outside of the interval 12.76 to 19.26. Finally, if a confidence interval
was derived for each of the person’s 1,000 observed scores, 950 of the intervals would be
generated around observed scores between 12.76 and 19.26 (each interval would contain
the person’s true score). It is apparent from the previous explanation that 5% of the time
the person’s true score would not fall within the interval 12.76 to 19.26. However, there
is a 95% chance that the confidence interval generated around the observed score of 16
will contain the person’s true score.

Equation 7.36a. Lord and Novick’s method for establishing confi-


dence limits—expressed in obtained score units—based on estimated
true scores

Tˆ ± (Z )( σˆ X.T ) ρˆ XX ′
• T̂ = estimated true score.
• z = standard normal deviate (e.g., 1.96).
• ŝX.T = standard error of measurement as the prediction of
observed score from true score.
• r̂XX ′ = square root of coefficient of reliability or the reli-
ability index.

Equation 7.36b. Application of Lord and Novick's method for establishing a 95% confidence interval for observed scores based on an estimated true score of 16.01

$\hat{T} \pm (1.96)(1.82)(.91) = 16.01 \pm (1.96)(1.66) = 16.01 \pm 3.25 = 12.76$ to $19.26$
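Comparing Equations 7.35b and 7.36b makes the source of the wider interval explicit. Both methods multiply the same standard error (1.82) by a reliability-based factor, but Stanley's method uses $\hat{\rho}_{XX'} = .82$ whereas Lord and Novick's method uses $\sqrt{\hat{\rho}_{XX'}} = .91$:

$(1.96)(1.82)(.82) \approx 2.94 \quad \text{versus} \quad (1.96)(1.82)(.91) \approx 3.25$

Because $\sqrt{\hat{\rho}_{XX'}} \geq \hat{\rho}_{XX'}$ for any reliability between 0 and 1, the Lord and Novick interval is always at least as wide as Stanley's.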

7.20 Standard Error of Prediction

The standard error of prediction is useful for predicting the probable range of scores on
one form of a test (e.g., Y), given a score on an alternate parallel test (e.g., X). For exam-
ple, using the crystallized intelligence test example throughout this chapter, one may be
interested in what score one can expect to obtain on a parallel form of the same test. To
derive an error estimate to address this question, Equation 7.37a is required.

Equation 7.37a. Standard error of prediction expressed as the prediction of test score Y on parallel test score X

$\hat{\sigma}_{Y.X} = \sigma_Y\sqrt{1-\rho_{XX'}^2}$

• $\hat{\sigma}_{Y.X}$ = standard error of prediction.
• $\sigma_Y$ = standard deviation of test Y.
• $\rho_{XX'}^2$ = squared reliability of test X.

Equation 7.37b. Derivation of the standard error of prediction

$\hat{\sigma}_{Y.X} = 4.3\sqrt{1-.82^2} = 4.3\sqrt{.327} = 4.3(.572) = 2.46$

Equation 7.37c. Application of the standard error of prediction for establishing a 95% confidence interval for observed scores based on an estimated true score of 16.01

$\hat{T} \pm (1.96)(2.46) = 16.01 \pm 4.82 = 11.19$ to $20.83$

Applying the same example data as in Equations 7.32 and 7.33 to Equation 7.37a
yields the error estimate in Equation 7.37b.
Next, we can apply the standard error of prediction derived in Equation 7.37b, as shown in Equation 7.37c, to develop a 95% confidence interval.

Interpretation
Using the standard error of prediction, the probability that a person’s true score will fall
within a confidence interval based on that person’s observed score is illustrated next.
Again we assume that a person’s true score is 16, the standard deviation of test X is 4.3, and
the reliability estimate is .82. Next, we assume that this person is repeatedly tested 1,000
times. Of the 1,000 repeated testing occasions, 950 (95%) would lie within 4.82 points of
the person’s true score (e.g., between 11.19 and 20.83). Notice that the confidence interval
is wider in the previous examples. Fifty scores would fall outside of the interval 11.19 to
20.83. Finally, if a confidence interval was derived for each of the person’s 1,000 observed
scores, 950 of the intervals would be generated around observed scores between 11.19
and 20.83 (each interval would contain the person’s true score). It is apparent from the
previous explanation that 5% of the time the person’s true score would not fall within the
interval 11.19 to 20.83. However, there is a 95% chance that the confidence interval gener-
ated around the observed score of 16 will contain the person’s true score.
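For completeness, the standard error of estimation introduced in Section 7.18 can also be derived from the same example values:

$\hat{\sigma}_{T.X} = \sigma_X\sqrt{\hat{\rho}_{XX'}(1-\hat{\rho}_{XX'})} = 4.3\sqrt{.82(1-.82)} = 4.3(.384) = 1.65$

Because $\sigma_X\sqrt{\hat{\rho}_{XX'}(1-\hat{\rho}_{XX'})} = \hat{\sigma}_{X.T}\sqrt{\hat{\rho}_{XX'}}$, a 95% interval based on the standard error of estimation, $16.01 \pm (1.96)(1.65)$, closely reproduces the Lord and Novick interval of 12.76 to 19.26 shown in Equation 7.36b.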

7.21 Summarizing and Reporting Reliability Information

Summarizing and reporting information regarding measurement error is essential to the


proper use of any instrument. More broadly, any assessment procedure that uses some
form of instrumentation or measurement protocol for the assessment of knowledge, skill,
or ability is prone to error. Ideally, the quality of score reliability is evaluated by conducting independent replication studies that focus specifically on reliability (AERA, APA, & NCME, 1999; 2014, p. 27). The following points are essential in
reporting errors of measurement: (1) sociodemographic details about the study group or
examinee population, (2) sources of error, (3) magnitude of errors, (4) degree of gener-
alizability across alternate or parallel forms of a test, and (5) degree of agreement among
raters or scorers. Information on the reliability of scores may be reported in terms of one
or more coefficients (depending on the use of the scores) such as (1) stability—test–
retest, (2) equivalence—alternate forms, and (3) internal consistency—coefficient alpha
or split-half. When decisions are based on judgment, coefficients of interscorer or rater
consistency are required.
Errors of measurement and reliability coefficients involving decisions based on
judgments have many sources. For example, evaluator biases, scoring subjectivity, and
between-examinee factors are all sources of error. To meet these additional challenges,
when errors of measurement and reliability are being reported for decisions based on judg-
ments resulting in classifications, generalizability theory (Cronbach et al., 1972) provides

a comprehensive framework (presented next in Chapter 8) that allows for many types of
applied testing scenarios. Reliability information may also be reported in terms of error
variance or standard deviations of measurement errors. For example, when test scores are
based on classical test theory, the standard error of measurement should be reported along
with confidence intervals for score levels. For IRT, information functions should be reported because they provide the magnitude of error across the score range. Also, when a
test is based on IRT, information on the individual item characteristic functions should be
reported along with the test characteristic curve. The item characteristic and test functions
provide essential information regarding the precision of measurement at various ability
levels of examinees. Item response theory will be covered thoroughly in Chapter 10.
Whenever possible, reporting conditional errors of measurement is also encouraged
because errors of measurement are not uniform across the score scale and this has implica-
tions for the accuracy of score reporting (AERA, APA, & NCME, 1999, p. 29). For approaches
to estimating conditional errors of measurement see Kolen, Hanson, and Brennan (1992),
and for conditional reliability, see Raju, Price, Oshima, and Nering (2007).
When comparing and interpreting reliability information obtained from using a test
for different groups of persons, consideration should be given to differences in variability
of the groups. Also, the techniques used to estimate the reliability coefficients should be
reported along with the sources of error. Importantly, it is essential to present the theo-
retical model by which the errors of measurement and reliability coefficients were derived
(e.g., classical test theory, IRT, or generalizability theory). This step is critical because
interpretation of reliability coefficients varies depending on the theoretical model used
for estimation.
Finally, test score precision should be reported according to the type of scale by
which they have been derived. For example, raw scores or IRT-based scores may reflect
different errors of measurement and reliability coefficients than standardized or derived
scores. This is particularly true at different levels of a person’s ability or achievement.
Therefore, measurement precision is substantially influenced by the scale in which the
test scores are reported.

7.22 Summary and Conclusions

Reliability refers to the degree to which scores on tests or other instruments are free from
errors of measurement. This dictates their level of consistency, repeatability, or reliability.
Reliability of measurement is a fundamental issue in any research endeavor because some
form of measurement is used to acquire data—and no measurement process is error free.
Identifying and properly classifying the type and magnitude of error is essential to esti-
mating the reliability of scores. Estimating the reliability of scores according to the clas-
sical true score model involves certain assumptions about a person’s observed, true, and
error scores. Reliability studies are conducted to evaluate the degree of error exhibited
in the scores on a test (or other instrument). Reliability studies involving two separate
test administrations include the alternate form and test–retest methods or techniques.

The internal consistency approaches are based on covariation among or between test
item responses and involve a single test administration using a single form. The inter-
nal consistency approaches include (1) split-half techniques with the Spearman–Brown
correction formula, (2) coefficient alpha, (3) the Küder–Richardson 20 formula, (4) the
Küder–Richardson 21 formula, and (5) the analysis of variance approach. The reliability
of scores used in the study of change is an issue important to the integrity of longitudinal
research designs. Accordingly, a formula was presented that provides a way to estimate
the reliability of change scores.
It is also useful to view how “unreliable” test scores are. The unreliability of scores is
viewed as a discrepancy between observed scores and true scores and is expressed as the
error of measurement. Three different approaches to deriving estimates of errors of mea-
surement and associated confidence intervals were presented, along with the interpretation
of each using example data. The three approaches commonly used are (1) the standard
error of measurement, (2) the standard error of estimation, and (3) the standard error of
prediction.

Key Terms and Definitions

Attributes. Identifiable qualities or characteristics represented by either numerical ele-


ments or categorical classifications of objects that can be measured.
Classical test theory. Based on the true score model, a theory concerned with observed,
true, and error score components.
Classical true score model. A model-based theory of properties of test scores relative to
populations of persons based on true, observed, and error components. Classical test
theory is based on this model.
Coefficient alpha. An estimate of internal consistency reliability that is based on item
variances and covariances and that does not require strictly parallel or true score
equivalence between its internal components or half tests. The alpha coefficient is the
mean of all possible randomly split-half tests using Rulon’s formula. In relation to theo-
retical or true score estimates of reliability, alpha produces a lower-bound estimate
of score reliability.
Coefficient of equivalence. Calculated as the correlation between scores on two admin-
istrations of the same test.
Coefficient of reliability. The ratio of true score variance to observed score variance.

Coefficient of stability. Correlation coefficient between scores on two administrations of


the same test on different days; calculated using the test–retest method.
Composite score. The sum of responses to individual items where a response to an item
is a discrete number.
Confidence interval. A statistical range with a specified probability that a given param-
eter lies within the range.

Confidence limits. Either of two values that provide the endpoints of a confidence interval.

Congeneric tests. Axiom specifying that a person’s observed, true, and error scores on
two tests are allowed to differ.
Constant error. Error of measurement that occurs systematically and constantly due to char-
acteristics of the person, the test, or both. In the physical or natural sciences, this type of
error occurs by an improperly calibrated instrument being used to measure something
such as temperature. This results in a systematic shift based on a calibration error.
Deviation score. A raw score subtracted from the mean of a set of scores.

Essential tau-equivalence. Axiom specifying that a person's true scores on two tests are allowed to differ, but only by an additive constant.
Generalizability theory. A highly flexible technique for studying error that allows for the
degree to which a particular set of measurements on an examinee are generalizable
to a more extensive set of measurements.
Guttman’s equation. An equation that provides a derivation of reliability estimation
equivalent to Rulon’s method that does not necessarily assume equal variances on the
half-test components. This method does not require the use of the Spearman–Brown
correction formula.
Heteroscedastic error. A condition in which nonuniform or nonconstant error is exhibited
in a range of scores.
Internal consistency. Determines whether several items that propose to measure the
same general construct produce similar scores.
Item homogeneity. Test items composed of similar content as defined by the underlying
construct.
Küder–Richardson Formula 20 (KR-20). A special case of coefficient alpha that is
derived when items are measured exclusively on a dichotomous level.
Küder–Richardson Formula 21 (KR-21). A special case of coefficient alpha that is
derived when items are of equal difficulty.
Measurement precision. How close scores are to one another and the degree of mea-
sure of error on parallel tests.
Parallel tests. The assumption that when two tests are strictly equal, true score, observed,
and error scores are the same for every individual.
Random error. Errors of measurement that vary in a random or nonsystematic manner.
Reliability. The consistency of measurements based on repeated sampling of a sample
or population.
Reliability coefficient. The squared correlation between observed scores and true scores.
A numerical statistic or index that summarizes the properties of scores on a test or
instrument.
Reliability index. The correlation between observed scores and true scores.

Rulon’s formula. A split-half approach to reliability estimation that uses difference scores
between half tests and that does not require equal error variances on the half tests.
This method does not require the use of the Spearman–Brown correction formula.
Spearman–Brown formula. A method in which tests are correlated and corrected back
to the total length of a single test to assess the reliability of the overall test.
Split-half reliability. A method of estimation in which two parallel half tests are created,
and then the Spearman–Brown correction is applied to yield total test reliability.
Standard error of estimation. Used to predict a person's true score from his or her observed score. Useful for establishing confidence intervals for true scores.
Standard error of measurement. The accuracy with which a single score for a person
approximates the expected value of possible scores for the same person. It is the
weighted average of the errors of measurement for a group of examinees.
Standard error of prediction. Used to predict a person's score on one test (Y) based on his or her score on another parallel test (X). Useful for establishing confidence intervals for predicted scores.
Tau-equivalence. Axiom specifying that a person has equal true scores on parallel forms
of a test.
True score. Hypothetical entity expressed as the expectation of a person’s observed score
over repeated independent testing occasions.
True score model. A score expressed as the expectation of a person’s observed score
over infinitely repeated independent testing occasions. True score is only a hypo-
thetical entity due to the implausibility of actually conducting an infinite number of
independent testing occasions.
Validity. The degree to which evidence and theory support the interpretations of test
scores entailed by proposed use of a test or instrument. Evidence of test validity is
related to reliability, such that reliability is a necessary but not sufficient condition to
establish the validity of scores on a test.
8

Generalizability Theory

This chapter introduces generalizability theory—a statistical theory about the dependabil-
ity of measurements. In this chapter, the logic underlying generalizability is introduced
followed by practical application of the technique. Emphasis is placed on the advan-
tages generalizability theory provides for examining single and multifaceted measurement
problems.

8.1 Introduction

In Chapter 7, reliability was introduced within the classical test theory (CTT) frame-
work. In CTT, a person’s true score is represented by his or her observed score that is a
single measurement representative of many possible scores based on a theoretically infi-
nite number of repeated measurements. The CTT approach to reliability estimation is
based on the variation in persons’ (or examinees) observed scores (Xi) being partitioned
into true (Ti) and error (Ei) components. The true component is due to true differences
among persons, and the error part is an aggregate of variation due to systematic and random sources of error. In generalizability theory, a person's observed score, true score, and error score are expressed as Xpi, Tpi, and Epi, respectively, where p represents persons (examinees) and i represents items. For any person (p) and item (i), Xpi is a random variable, and a person's true score is represented by the expectation of Xpi over replications (i.e., the long-run average over many repeated measurements).
Aggregating systematic and random sources of error in CTT is less than ideal because
we lose important information about the source of systematic and/or random error and
the impact each has on measurement precision. For example, variation (differences)
in item responses arise from (1) item difficulty, (2) person performance, and (3) the


interaction between persons and items confounded by other sources of systematic and
random error. Classical test theory provides no systematic way to handle these complexi-
ties. Another example where CTT is inadequate is when observers rate examinees on
their performance on a task. Typically, this type of measurement involves multiple rat-
ers on a single task or multiple tasks. As an example, consider the situation where test
items are used to assess level of performance on a written or constructed response using
a quality-based rating scale. In this case, it is the quality of the written response that is
being assessed. CTT does not provide a framework for teasing apart multiple sources
of error captured in (Ei). Generalizability theory extends CTT (Cronbach et al., 1972;
Brennan, 2010) by providing a framework for increasing measurement precision by esti-
mating different sources of error unique to particular testing or measurement conditions.
Generalizability theory is easily extended to complex measurement scenarios where CTT
is inadequate. Throughout this chapter the usefulness of generalizability theory is illus-
trated through examples.

8.2 Purpose of Generalizability Theory

The primary goal of generalizability theory is to provide a framework for increasing


the dependability of measurements. Dependability of measurement is increased by using
information acquired from a generalizability study (G-study) to reduce or eliminate
unwanted sources of error in future measurement procedures. Information obtained from
a G-study is used to guide a decision or D-study. In fact, the purpose of a D-study is to
make sample-based decisions based on improved dependability of measurement rather
than to generalize to populations. For instance, consider the following two examples
where different conditions (a.k.a. facets in the language of generalizability theory) of the
measurement process are of interest. First, we want to ensure that the level of difficulty
of the test items falls within a certain range (i.e., items are not excessively difficult and
not excessively easy). Second, we may want to ensure that ratings of writing quality meet
a desired level of reliability (a.k.a. dependability in generalizability theory) when using
multiple raters. For example, we may want to know how many raters are necessary to
obtain an acceptable level of dependability in the ratings.
In a decision or D-study, the measurement conditions are considered a random sam-
ple from the universe of conditions that are employed in the generalizability study that
preceded it. Dependability in generalizability theory is analogous to reliability in CTT. In
generalizability studies, dependability of measurement is expressed as a generalizability
coefficient (G coefficient) and is synonymous with the estimate of reliability coefficient
alpha (a) in CTT—under certain measurement circumstances. For example, in the situa-
tion where a sample of persons responds to a set of test items on a single occasion, apply-
ing generalizability theory analysis can yield the same results as those of coefficient alpha
(a). This type of generalizability theory analysis and others are described in the following
sections.

8.3 Facets of Measurement and Universe Scores

Generalizability theory provides a flexible framework for a variety of measurement and


D-study design conditions. For example, the measurement goals within a particular
study may be simple, moderate, or complex. In a generalizability study (i.e., G-study),
the conditions being studied are called facets. As an example, the items on a test constitute
the item facet; persons or examinees represent the person facet. Another facet commonly
used in a generalizability study is an observer or rater facet, where observers rate persons
on a task (e.g., the quality of a written response to an essay question). A simple generaliz-
ability study (i.e., a one-facet design) might include only items and persons as the focus
of measurement (e.g., an item × person design). In this design, the single facet is items; all persons respond to all items (the symbol × represents "crossed with").
A more complex scenario may include items, persons, and observers/raters (i.e., a
two-facet design), with observers rating some aspect of examinee performance during
testing such as (1) the quality of a written response to an essay question or (2) their level
of performance based on application of a cognitive strategy. This study yields a person × item × rater design (the symbol × represents "crossed with," meaning that all persons
respond to all items and are rated by all raters). In this example, the two facets are repre-
sented by items and raters.
An even more complex two-facet design may include items, persons, raters, and
occasions as facets of measurement (i.e., creating a person × item × rater: occasion design).
In this more complex design, the occasion facet is nested within the rater facet (i.e., each
observer rates performance on more than one occasion or time point; the symbol “:” rep-
resents the fact that occasions are nested within raters).
These examples do not exhaust the possible designs available in generalizability the-
ory; rather, they only provide examples of commonly used designs.
In generalizability theory, a person’s test score or performance rating is a sample from
an infinitely large universe of scores that represents or indexes a person’s true ability, state
of knowledge, or performance. In generalizability theory a person’s average score over
a theoretically infinite number of measurement occasions is his or her universe score
(analogous to true score in CTT). As you may imagine, a critical issue in generalizability
studies is the accuracy of the generalization from sample to universe. A universe in general-
izability theory may be multifaceted, consisting of more than one facet of measurement,
testing occasion, test form, and observer/rater. The flexibility of generalizability theory
lies in its ability to provide a framework for capturing and isolating a variety of different
sources of variation attributable to the measurement procedure. The steps of anticipat-
ing the relevant conditions of measurement and sources of variance are the focus of a
G-study. Armed with the results of the G-study, a D-study can be planned in a way that
provides a highly informative set of results for a particular sample. The magnitude of the
variation within each facet (i.e., known as a variance component) is estimated using
analysis of variance (ANOVA) procedures. Analysis of variance is presented in more
detail later in the chapter.

The next section presents the ways that generalizability theory extends CTT and
introduces types of score-based decisions that are available when using generalizability
theory and two types of G-studies: generalizability (G) and decision (D) studies.

8.4 How Generalizability Theory Extends Classical Test Theory

Generalizability theory extends CTT in four ways. First, the procedure estimates the size
of each source of error attributable to a specific measurement facet in a single analysis.
By identifying specific error sources, the reliability or dependability of measurement can
be optimized using this information (e.g., score reliability in generalizability theory
is labeled a G coefficient). Second, generalizability theory estimates the variance com-
ponents that quantify the magnitude of error from each source. Third, generalizability
theory provides a framework for deriving relative and absolute decisions. Relative deci-
sions include comparing one person’s score or performance with others (e.g., as in ability
and achievement testing). Absolute decisions focus on an individual’s level of perfor-
mance regardless of the performance of his or her peers. For example, absolute decisions
implement a standard (i.e., a cutoff score) for classifying mastery and nonmastery, as in
certification and licensure examinations or achievement testing where a particular level
of mastery is required prior to progressing to a more challenging level. Fourth, generaliz-
ability theory includes a two-part analytic strategy; G-studies and D-studies. The purpose
of conducting a G-study is to plan a D-study that will have adequate generalizability to the
universe of interest. To this end, all of the relevant sources of measurement error are identi-
fied in a G-study. Using this information, a D-study is designed in a way that maximizes
the quality and efficiency of measurement and will accurately generalize to the target
universe. Finally, G-studies and D-studies may feature (1) nested or crossed designs and (2) random or fixed facets of measurement, or both, within a single analysis. This chapter focuses primarily on crossed designs illustrated with examples.
Additionally, descriptions of random and fixed facets are provided with examples of when
each is appropriate.

8.5 Generalizability Theory and Analysis of Variance

At the heart of generalizability theory is the variance component. A variance component


captures the source of variation in observed scores of persons and is the fundamental unit
of analysis within a G-study. For example, we want to accurately quantify the amount
of variability in a set of scores (or performance ratings) if our measurement is to be use-
ful for describing differences between a person’s psychological attributes. The analysis
of variance (ANOVA) is a statistical model based on a special case of the general linear
model most often used to analyze data in experimental studies where researchers are
interested in determining the influence of a factor or treatment (e.g., the effect of an

intervention) on an outcome (dependent) variable (e.g., reading achievement or success


in treating a medical disease).
In the previous example on reading achievement, each subject has a reading achieve-
ment score (the dependent or outcome variable) and the independent variable is the
treatment (one group receives the treatment and one group does not). For example, in
ANOVA the variation in reading scores may be partitioned by factors (i.e., independent
variables) such as study group and sex. Additionally, there may be another independent
variable such as socioeconomic status, with classification levels such as low, medium,
and high. ANOVA can be used to partition subjects’ scores into effects for the indepen-
dent (factor) variables, interactions, and error. Also, ANOVA may include single-factor,
two-factor, and higher study designs. In G- and D-studies, the conditions of measurement or
facets are the factors in ANOVA terminology.
Generalizability theory consists of a general analytic framework that encompasses
elements of CTT and the statistical mechanics of ANOVA. Figure 8.1 illustrates the con-
ceptual connections between CTT and generalizability theory.
Variance in observed scores of persons may be due to (1) item difficulty (i), (2) per-
son (p) performance or behavior factors, and (3) the effect of raters (r) on persons'
scores. Sources of variation in generalizability theory are classified into facets (i.e., fac-
tors in ANOVA). In the simplest case, a one-facet design within generalizability the-
ory includes one source of measurement error and consists of measurements acquired
from a sample of admissible observations from a universe of all possible observations.

Figure 8.1. Precursors and conceptual framework of generalizability theory (G-study and D-study). From Brennan (2010, p. 5). Copyright 2010 by Springer. Reprinted by permission.

In
G-studies, a universe of admissible observations refers to measurements (and their vari-


ance) acquired specific to item, rater, and person facets. Recall that in a G-study we want
to anticipate all of the measurement conditions specific to the universe of admissible
observations so that we can use this information in planning and conducting a D-study.
For example, in a one-facet design a variance component for persons is expressed with
the symbol s P2, for test items s I2 and for the residual or error as s PI,
2
E. The residual vari-
ance component accounts for the interaction of persons and items plus random error of
measurement. Using the variance components just described, we can identify sources
and size of error, and total error can be estimated leading to the estimation of a G (gen-
eralizability) coefficient.
A facet is defined as a set of similar conditions of measurement (Brennan, 2010). In
G-studies item, person, and rater facets are commonly used (although others are pos-
sible, such as test form or occasion of measurement facets). Generalization from results
of a G-study proceeds under scenarios such as (1) an item facet where there is a gen-
eralization from a set of items to a set of items from a universe of items; (2) a test form
facet where there is a generalization from one test form to a set of forms from a universe
of forms, and (3) a measurement occasion facet where there is a generalization from one
occasion to another from a universe of occasions (e.g., days, weeks, or months).
Generalizability theory is also flexible in that many study designs are possible,
depending on the goal of the G-study. For example, measurement designs may be
(1) crossed (i.e., all persons respond to all test questions), (2) nested (e.g., each per-
son is rated by three raters and raters rate each person on two separate occasions),
or (3) partially nested (e.g., different raters rate different persons on two separate
occasions). Additionally, facets may be of the random or fixed variety. If a facet is con-
sidered random, the conditions comprising the facet are representative of the universe of
all possible facet conditions. Specifically, a facet is considered random when (1) the size
of the sample is substantially smaller than the size of the universe and (2) the sample
either is drawn randomly or is considered to be exchangeable with any other sample
of the same size drawn from the same universe (Brennan, 2010; Shavelson & Webb,
1991, p. 11).
The implications of conducting a G-study related to the universe of generalization
are that the facet conditions used to estimate the generalizability coefficient should be
representative of the universe of conditions so that when planning a D-study we will
have confidence that the fixed-facet conditions are indeed one subset of possible condi-
tions. In the case of fixed facets, the term fixed means that we are only interested in
the variance components of specific characteristics of a particular facet (i.e., we will
not generalize beyond the characteristics of the facet). A mixed-facet generalizability
theory study includes random and fixed facets within a single study. In generalizability
theory, ANOVA is used to partition a subject’s score into (1) a universe score effect
(for the object of measurement—usually the person), (2) an effect for each facet
(e.g., items), (3) the interaction among the facets (e.g., person x items interaction),
and (4) a residual or error component reflecting unsystematic or random error left
unexplained.

8.6 General Steps in Conducting a Generalizability Theory Analysis

The following general steps can be used to plan and conduct generalizability (G) and
decision (D) studies.

1. Decide on the goals of the analysis, including score-based decisions that are to be
made (e.g., relative or absolute) if applicable.
2. Determine the universe of admissible observations.
3. Select the G-study design that will provide the observed score variance compo-
nent estimates to generalize to a D-study.
4. Decide on random and fixed facets or conditions of measurement relative to the
goal(s) of the D-study.
5. Collect the data and conduct the G-study analysis using ANOVA.
6. Calculate the variance components and the generalizability (G) coefficient for
the G-study.
7. Calculate the proportion of variance for each facet (measurement condition) to
provide a measure of effect.
8. If applicable (e.g., for relative or absolute decisions) calculate the standard error
of measurement (SEM) for the G-study that can be used to derive confidence
intervals for scores in a D-study.

8.7 Statistical Model for Generalizability Theory

Recall that the fundamental unit of analysis in generalizability theory is the variance com-
ponent. The general linear equation (Equation 8.1; Brennan, 2010; Crocker & Algina, 1986,
p. 162) can be used to estimate the variance components for a generalizability theory analy-
sis. Notice that Equations 8.1 and 8.2 constitute a linear, additive model. This is convenient
because using the linear, additive model the individual parts of variation from person, items,
and raters can be summed to create a measure of total variation. To understand the compo-
nents that the symbols in Equations 8.1 and 8.2 represent, we turn to Tables 8.1 and 8.2. Table
8.1 illustrates how deviation scores and the variance are derived for a single variable. Table 8.2
provides the item responses and selected summary statistics for our example for 20 persons
responding to the 10-item short-term memory test 2 of auditory memory in the GfGc data.
Next, the variance components must be estimated, and therefore we need the devia-
tion scores for persons from the grand mean (Equation 8.2; Brennan, 2010; Crocker &
Algina, 1986, p. 162).
Using Equation 8.2, we can obtain an effect for persons and items and a residual (error
component) that captures the error of measurement (random and systematic combined).
Next, we review how the variance is derived to aid an understanding of variance components.

Equation 8.1. General linear equation for generalizability theory analysis

$X_{pi} = \mu + (\mu_p - \mu) + (\mu_i - \mu) + e_{pi}$

• $X_{pi}$ = score for a person on an item (or rating).
• $\mu$ = mean over persons of the universe scores; also known as the grand mean.
• $\mu_p$ = mean for a person over items.
• $\mu_i$ = mean for an item (or rating) over persons.
• $e_{pi}$ = residual or error of measurement based on persons and items.

Equation 8.2. Deviation score for a person

$X_{pi} - \mu = (\mu_p - \mu) + (\mu_i - \mu) + e_{pi}$

• $X_{pi}$ = score for a person on an item.
• $\mu$ = mean over persons of the universe scores (the grand mean).
• $\mu_p - \mu$ = person effect.
• $\mu_i - \mu$ = item (or rater) effect.
• $e_{pi}$ = residual or error of measurement based on persons and items or persons and raters; includes the correlation among raters plus random error.
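As a brief illustration of Equation 8.2, consider person 1's score on item 1 in Table 8.2, using the sample means in the table as stand-ins for the corresponding population means: $X_{pi} = 3$, the person mean is 0.3, the item mean is 3.0, and the grand mean is approximately 1.34.

person effect: $\mu_p - \mu = 0.3 - 1.34 = -1.04$
item effect: $\mu_i - \mu = 3.0 - 1.34 = 1.66$
residual: $e_{pi} = X_{pi} - \mu_p - \mu_i + \mu = 3 - 0.3 - 3.0 + 1.34 = 1.04$

Adding the grand mean and the three effects recovers the observed score: $1.34 + (-1.04) + 1.66 + 1.04 = 3.00$.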

The variance of a set of scores (see Chapter 2 for a review) is obtained by (1) deriving
the mean for a set or distribution of scores, (2) calculating the deviation of each person’s
score from the mean (i.e., deriving deviation scores), (3) squaring the deviation scores,
and (4) computing the mean of the squared deviations. Table 8.1 illustrates the sequen-
tial parts for estimating the variance using the total (sum) score on short-term memory
test 2 for 20 randomly selected persons representing the universe of persons on short-term
memory test 2. The sample of 20 persons is considered exchangeable with any other ran-
domly drawn sample of size 20 from this universe of scores. Therefore, the person facet
is considered random. The item facet in design 1 (illustrated in the next section) is fixed
(i.e., we are only interested in how this particular set of items functions with our random
sample of persons). An important point to note here is that both persons and items could be
random if we were also interested in generalizing to a larger set of items from a possible uni-
verse of items measuring short-term memory.

Table 8.1.  Calculation of the Variance for Sample Data in Table 8.2
Score (X)    Mean (μ)    Deviation (X – μ)    Squared deviation
3 13.35 –10.35 107.1225
5 13.35 –8.35 69.7225
5 13.35 –8.35 69.7225
9 13.35 –4.35 18.9225
9 13.35 –4.35 18.9225
11 13.35 –2.35 5.5225
11 13.35 –2.35 5.5225
12 13.35 –1.35 1.8225
12 13.35 –1.35 1.8225
13 13.35 –0.35 0.1225
13 13.35 –0.35 0.1225
14 13.35 0.65 0.4225
14 13.35 0.65 0.4225
16 13.35 2.65 7.0225
16 13.35 2.65 7.0225
17 13.35 3.65 13.3225
20 13.35 6.65 44.2225
22 13.35 8.65 74.8225
22 13.35 8.65 74.8225
23 13.35 9.65 93.1225
Sum (ΣX) = 267        SS = Σ(X – μ)² = 614.55
Grand mean (X̄; μ) = 13.35        Variance = σ² = 30.72
Standard deviation = σ = 5.54
Note. The denominator for the variance of this random sample is based on n = 20, not n – 1 = 19. The symbol $s^2$ is the variance for a sample. The symbol $\sigma^2$ is for the population and is used throughout this chapter to represent the variance. The symbol $s$ is the standard deviation for a sample, and the symbol $\sigma$ is the standard deviation for the population.
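The summary values in Table 8.1 are easy to verify with SPSS (the variable name total is assumed here for illustration). Note that the DESCRIPTIVES procedure reports the variance with an n – 1 denominator (approximately 32.34 for these scores); multiplying that value by (n – 1)/n = 19/20 yields the 30.72 shown in the table.

SPSS syntax (illustrative) for the summary statistics in Table 8.1

DESCRIPTIVES VARIABLES=total
  /STATISTICS=MEAN STDDEV VARIANCE MINIMUM MAXIMUM.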

With an understanding of how the variance is derived using deviation scores, we are
in a position to estimate the variance components necessary for use in our first example
of generalizability theory analysis. Specifically, we need estimates of the following variance components based on the data in Table 8.2:

• $\hat{\sigma}^2_P$ = variance of persons' universe scores.
• $\hat{\sigma}^2_I$ = variance of item means, $\mu_i$.
• $\hat{\sigma}^2_{E|i}$ = variance of $e_{pi}$ for item i.
• $\hat{\sigma}^2_E$ = average of $\sigma^2_{E|i}$ over all items.
• $\hat{\sigma}^2_{X|i}$ = variance of $X_{pi}$ for item i.

Note. The symbol "^" is included on top of a variance to indicate that it is an estimate based on a sample rather than a population value.

The next section illustrates our first generalizability theory analysis.

Table 8.2.  Item Scores and Statistics on Short-Term Memory Test 2

                    Items
Person    1    2    3    4    5    6    7    8    9    10    Person mean    Person variance
1 3 0 0 0 0 0 0 0 0 0 0.3 0.90
2 3 1 1 0 0 0 0 0 0 0 0.5 0.94
3 3 2 0 0 0 0 0 0 0 0 0.5 1.17
4 3 3 3 0 0 0 0 0 0 0 0.9 2.10
5 3 3 3 0 0 0 0 0 0 0 0.9 2.10
6 3 3 2 1 1 0 1 0 0 0 1.1 1.43
7 3 3 2 1 1 0 1 0 0 0 1.1 1.43
8 3 3 3 1 1 0 1 0 0 0 1.2 1.73
9 3 3 3 1 1 0 1 0 0 0 1.2 1.73
10 3 3 3 1 1 1 1 0 0 0 1.3 1.57
11 3 3 3 1 1 1 1 0 0 0 1.3 1.57
12 3 3 2 2 2 0 2 0 0 0 1.4 1.60
13 3 3 2 2 2 0 2 0 0 0 1.4 1.60
14 3 3 3 2 2 1 2 0 0 0 1.6 1.60
15 3 3 3 2 2 1 2 0 0 0 1.6 1.60
16 3 3 3 2 2 2 2 0 0 0 1.7 1.57
17 3 3 3 3 3 2 3 0 0 0 2 2.00
18 3 3 3 3 3 1 3 1 1 1 2.2 1.07
19 3 3 3 3 3 1 3 3 0 0 2.2 1.73
20 3 3 3 3 3 3 3 2 0 0 2.3 1.57
Item mean 3 2.7 2.4 1.4 1.4 0.65 1.4 0.3 0.05 0.05 1.34
Item variance 0 0.64 0.99 1.19 1.19 0.77 1.19 0.64 0.05 0.05  

8.8 Design 1: Single-Facet Person-by-Item Analysis

In Design 1, we use short-term memory test 2 from our GfGc data measuring an audi-
tory component of memory. The range of possible raw scores is 0 to 3 points possible
for each item. Table 8.2 provides a random sample of 20 persons from the target uni-
verse of persons; these data will be used to illustrate Design 1, and the person facet is
random. In Design 1 (and in most G-studies), persons’ scores are the object of mea-
surement. In this example and for other designs throughout this chapter, we use the
mean score across the 10 items for the 20 persons as opposed to the sum or total score
mainly for convenience in explaining how a generalizability theory analysis works.
Additionally, using mean scores and the variance is consistent with ANOVA. Design 1
is known as a crossed design because all persons respond to all items. In Design 1, we
assume that the 10 items on the short-term memory test have been developed as one
representative set from a universe of possible items that measures this aspect of memory,
as posited by the general theory of intelligence. The item facet in this example is con-
sidered fixed (i.e., we are only interested in how the 10 items function for our random
sample of persons).

Returning to the person’s facet, if we are willing to assume that scores in Table 8.2
reflect universe scores accurately, we have a universe score for each person. Since the goal
in generalizability theory is to estimate the universe score for persons, we use persons’
observed score as representative of their universe score (i.e., the expectation of observed
score equals true score). Based on this assumption, our sample of 20 persons is consid-
ered exchangeable with any other random sample from the universe. Ultimately, we want
to know how accurate our score estimates are of the target universe.
To calculate the variance components using the data in Table 8.2, we can use the
mean square estimates from an ANOVA. Before proceeding to the ANOVA, Table 8.3
illustrates how to structure the data for the ANOVA analysis in this example. Using this
information, you should duplicate the results presented here to understand the process
from start to finish. The layout in Table 8.3 is for the first two items only from the data
in Table 8.2. The data layout in Table 8.3 is for a one-facet (p × i) analysis. Note that the
complete dataset for the example analysis will include 200 rows (20 persons × 10 items),
with the appropriate score assigned to each person and item row.
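If the item responses are stored in the usual wide format (one row per person and one column per item), the long layout in Table 8.3 can be produced with a restructuring step such as the sketch below; the variable names person and item1 to item10 are assumed for illustration.

SPSS syntax (illustrative) for restructuring wide item data into the layout of Table 8.3

VARSTOCASES
  /MAKE score FROM item1 TO item10
  /INDEX=items(10)
  /KEEP=person.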
Next, we can conduct ANOVA in SPSS to estimate the variance components using
the SPSS program below.

SPSS syntax for estimating variance components used in a G-study

UNIANOVA score BY persons items


/METHOD=SSTYPE(1)
/INTERCEPT=INCLUDE
/EMMEANS=TABLES(OVERALL)
/EMMEANS=TABLES(persons)
/EMMEANS=TABLES(items)
/EMMEANS=TABLES(persons*items)
/PRINT=DESCRIPTIVE
/CRITERIA=ALPHA(.05)
/DESIGN=persons items persons*items.

The resulting SPSS output is provided in Table 8.4.


Another way of understanding how the variance components work together and
separately is by way of a Venn diagram (e.g., the variance components in Table 8.5 from
the ANOVA results can be visualized using a Venn diagram as in Figure 8.2).
In Figure 8.2 notice how the unique parts of the total variation in persons and items are partitioned as (1) the variance estimate $\hat{\sigma}^2_P = .285$ attributable only to persons, (2) the variance estimate attributable to persons and items, $\hat{\sigma}^2_{PI,E} = .389$, and (3) the variance estimate attributable only to items, $\hat{\sigma}^2_I = 1.16$ (note that the sizes of the ellipses in Figure 8.2 are not to scale). Finally, in ANOVA these variance "parts" can be summed to account for the total variance.
The results in Tables 8.4 and 8.5 reveal the degree to which each facet affects short-
term memory by the size of the variance component. However, interpreting the vari-
ance component is difficult because it depends on the size of the effect and on the scale

Table 8.3.  Data Layout for Single-Facet


(p × i) Generalizability Theory Analysis
Person Score Item
1 3 1
2 3 1
3 3 1
4 3 1
5 3 1
6 3 1
7 3 1
8 3 1
9 3 1
10 3 1
11 3 1
12 3 1
13 3 1
14 3 1
15 3 1
16 3 1
17 3 1
18 3 1
19 3 1
20 3 1
1 0 2
2 1 2
3 2 2
4 3 2
5 3 2
6 3 2
7 3 2
8 3 2
9 3 2
10 3 2
11 3 2
12 3 2
13 3 2
14 3 2
15 3 2
16 3 2
17 3 2
18 3 2
19 3 2
20 3 2

Table 8.4.  Univariate ANOVA Output


ANOVA
Source    Type I Sum of Squares    df    Mean Square
Corrected Model 340.555 199 1.711
Intercept 356.445 1 356.445
Persons (p) 61.455 19 3.234
Items (i) 212.505 9 23.612
persons * items (p x i; Res) 66.595 171 .389
Total 697.000 200
Corrected Total 340.555 199
Note. Mean Squares are derived by dividing the Sum of Squares by the degrees of freedom.

Table 8.5.  ANOVA Formulas and Notation for G-Study p × i Design


Effect | df | SS | MS | $\hat{\sigma}^2$
p | $n_p - 1$ | SS(p) | MS(p) | $\hat{\sigma}^2(p) = \dfrac{MS(p) - MS(pi)}{n_i} = .285$
i | $n_i - 1$ | SS(i) | MS(i) | $\hat{\sigma}^2(i) = \dfrac{MS(i) - MS(pi)}{n_p} = 1.16$
pi, e | $(n_p - 1)(n_i - 1)$ | SS(pi, e) | MS(pi, e) | $\hat{\sigma}^2(pi, e) = MS(pi, e) = .389$

$SS(p) = n_i\sum_p \bar{X}_p^2 - n_p n_i \bar{X}^2$
$SS(i) = n_p\sum_i \bar{X}_i^2 - n_p n_i \bar{X}^2$
$SS(pi, e) = \sum_p\sum_i X_{pi}^2 - n_i\sum_p \bar{X}_p^2 - n_p\sum_i \bar{X}_i^2 + n_p n_i \bar{X}^2$

Note. Adapted from Brennan (2010, p. 26). Copyright 2010 by Springer. Adapted by permission. p, persons; i, items;
pi, persons by items interaction; df, degrees of freedom; SS, sum of squared deviations from mean; MS, mean squared
deviation derived as SS divided by degrees of freedom; ŝ2, variance component estimate for a particular effect; e,
residual for persons and items.

Figure 8.2.  Variance components ($\hat{\sigma}^2_p$, $\hat{\sigma}^2_{pi,e}$, $\hat{\sigma}^2_i$) in a one-facet design. Figure segments are not to scale.

of measurement. To facilitate interpretation, (1) each variance component is compared to the other variance components in the analysis, and (2) the ratio of each variance component to the total variance is interpreted as the proportion of total variance explained.
For example, using the results in Table 8.4, we can derive the relative contribution of each
variance component to the total variation. In many, if not most, measurement or testing
situations, the person effect is of primary interest (i.e., the object of measurement) because
we want to know if (a) the test captures variability among examinees (i.e., individual dif-
ferences) in terms of their score performance and (b) the size of the variance components.
Such decisions are relative because our interest is in knowing how persons or examinees
are ranked relative to one another (e.g., studying individual differences among persons
such as in intelligence testing). In Table 8.6, the variance component for persons is .285,
the smallest of the variance components. The variance component for persons is derived
using Equation 8.3 (Brennan, 2010, p. 27) and the mean square estimates provided in
Table 8.4.
We calculate the variance component for items using Equation 8.4 (Brennan, 2010,
p. 27).
Another way within SPSS to obtain the variance component estimates in Equations
8.3 and 8.4 is by using the following syntax.

SPSS syntax for estimating variance components using variance components procedure

VARCOMP score BY persons items


/RANDOM=persons items
/METHOD=MINQUE(1)
/DESIGN
/INTERCEPT=INCLUDE.

Note. In the METHOD command, the ANOVA option with the desired sum of squares
(in parentheses) can also be used.

Table 8.6.  Variance Components


for the Person × Items Random Design
Variance Estimates
Component Estimate
Var(persons) .285
Var(items) 1.161
Var(persons * items) .389
Var(Error) .000a
Dependent Variable: score
Method: Minimum Norm Quadratic Unbiased Estimation (Weight =
1 for Random Effects and Residual)
a. This estimate is set to zero because it is redundant.

Equation 8.3. Variance component for persons using mean squares

$\hat{\sigma}^2_P = \dfrac{MS_P - MS_{Res}}{n_i} = \dfrac{3.23 - .389}{10} = .285$

• σ̂2P = variance component for persons.


• MSp = mean square for persons.
• MSRes = mean square residual.
• ni = number of items.

Equation 8.4. Variance component for items

$\hat{\sigma}^2_I = \dfrac{MS_I - MS_{Res}}{n_p} = \dfrac{23.61 - .389}{20} = 1.16$

• σ̂2I = variance component for items.


• MSi = mean square for items.
• MSRes = mean square residual.
• np = number of persons.

8.9 Proportion of Variance for the p × i Design

Using Equation 8.5, we can derive the proportion of variance for the person effect. The
proportion of variance provides information about how much each facet explains in the
analysis. Using the proportion of variance is advantageous because it is a measure of
effect size expressed in a unit that is comparable across studies (or different designs).
Using the estimates from Table 8.4 or 8.6, we can derive the proportion of variance values
as follows.
We see from Equation 8.5 that the person effect accounts for approximately 16%
of the variability in memory scores. In our example the sample size is only 20 persons
(very small); the person variability may be much larger with an increased, more realistic
sample size.

Equation 8.5. Proportion of variance for persons

σ̂²_P / (σ̂²_P + σ̂²_I + σ̂²_RES) = .285 / (.285 + 1.16 + .389) = .285 / 1.83 = .16

Next, in Equation 8.6 we calculate the proportion of variance for the item effect.
We see from Equation 8.6 that the item effect accounts for approximately 63% of the variability in memory scores (i.e., the differences between items are large). From this information we conclude that the item effect is relatively large (i.e., the items vary substantially in their level of difficulty). Next, we derive the residual variance in Equation 8.7.
The residual variance component is about one-third the size (21%) relative to the
item variance component (63%). Also, the variance component for persons (16%) is
small relative to the item variance component. The large variance component for items
indicates that the items do not discriminate equally and are therefore of unequal difficulty
across persons. In Table 8.2 (p. 266), we see that the item means (averaged over persons) range from .05 to 2.7 (range = 2.65), whereas the person means range from .3 to 2.3 (range = 2.0), a smaller range than that for the items. This partially explains why the item variance component is larger than the person variance component.
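
The proportions reported in Equations 8.5 through 8.7 are simple ratios of each variance component to their sum. A minimal Python check, assuming the component estimates reported in Table 8.6:

# Proportion of total variance for each effect in the p x i design
# (Equations 8.5-8.7), using the variance components from Table 8.6.

var_p, var_i, var_res = 0.285, 1.16, 0.389
total = var_p + var_i + var_res              # approximately 1.83

print(round(var_p / total, 2))    # persons  -> .16
print(round(var_i / total, 2))    # items    -> .63
print(round(var_res / total, 2))  # residual -> .21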
The final statistic that is calculated in a generalizability theory analysis is the
coefficient of generalizability (i.e., G coefficient). Under certain conditions, the G

Equation 8.6. Proportion of variance for items

σ̂²_I / (σ̂²_P + σ̂²_I + σ̂²_RES) = 1.16 / (.285 + 1.16 + .389) = 1.16 / 1.83 = .63

Equation 8.7. Proportion of variance for residual

σ̂²_RES / (σ̂²_P + σ̂²_I + σ̂²_RES) = .389 / (.285 + 1.16 + .389) = .389 / 1.83 = .21

coefficient is synonymous with the reliability coefficient in CTT. For example, in a


single-facet crossed design when the measurement facet (e.g., items in our example)
is fixed and each person provides one response to each item, the G coefficient is
analogous to the reliability coefficient derived in CTT. In this case, the G coefficient
represents how dependable or reliable a person’s observed score is relative to his or
her universe score and also relative to other persons (i.e., the focus is on individual
differences among persons).

8.10 Generalizability Coefficient and CTT Reliability

In CTT, under the assumption of strictly parallel tests, recall that item means and vari-
ances are equal for two parallel tests. In the language of generalizability theory, the result
of this assumption means that the item effect or variance component is zero. Because the
item effect is zero under the strictly parallel assumptions of CTT, the analysis reduces to individual differences among persons. Finally, because the items are considered to be of equal difficulty, the error term (.389) in the right-hand side of the denominator of Equation 8.8 is divided by the number of items (i.e., averaged over items).

Equation 8.8. Generalizability coefficient for the one-facet p × i


design

ρ_XX′ = σ̂²_P / (σ̂²_P + σ̂²_RES / N′_I) = .285 / (.285 + .389/10) = .285 / (.285 + .0389) = .285 / .323 = .88

• ρ_XX′ = generalizability coefficient for relative decisions among persons.
• σ̂²_P = variance component for persons.
• σ̂²_RES = variance component for the residual.
• N′_I = number of items in the D-study (the residual is divided by the number of items).

Note. The crossed person by item design yields an equivalent coeffi­


cient alpha (or KR20 for dichotomous items) reliability estimate as
those based on CTT introduced in Chapter 7. The focus of this
design is on individual differences (relative decisions).

In Equation 8.8 we see that, by using the variance components estimated from the variance components procedure but dividing the error by the number of items, we arrive at the same result we would obtain by calculating coefficient alpha (α) for these data on 20 persons and 10 items (i.e., α = .88; you can verify this result for yourself in SPSS).
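
A short Python sketch of the calculation in Equation 8.8 shows how dividing the residual by the number of items produces the alpha-equivalent G coefficient (this is only an arithmetic check of the reported values, not a replacement for the SPSS verification suggested above):

# Generalizability coefficient for the crossed p x i design (Equation 8.8):
# the residual variance is divided by the n' = 10 items used in the D-study.

var_p, var_res, n_items = 0.285, 0.389, 10
g_coefficient = var_p / (var_p + var_res / n_items)
print(round(g_coefficient, 2))    # -> 0.88, matching coefficient alpha for these data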
Next, we turn to a different design where the condition of measurement is ratings of
performance. For example, in Design 2 observers rate the performance of persons on test
items where performance can be rated on a scale based on gradations of quality.

8.11 Design 2: Single-Facet Crossed Design


with Multiple Raters

In Design 2, the example research design again involves conducting a G-study and using the
results to plan a D-study. Design 2 is highly versatile because we can use the variance compo-
nents estimated from ANOVA to plan a variety of D-study scenarios where ratings are used.
In a D-study, raters are different from those used in the G-study. So, the question is: “How
generalizable are the results from our G-study with respect to planning our D-study?” Our
example for design 2 is based on ratings of person performance on an item from subtest 2 on
the auditory component of short-term memory. For clarity and ease of explanation, we use
a single item to illustrate the analysis. In Table 8.7 there are three different observers (raters)
providing ratings on each of the 20 persons for item number one. Notice that this is a crossed
design because all persons are rated by all raters. The ratings are based on a 10-point scale
with 1 = low to 10 = high. ANOVA is used to obtain the mean squares needed to estimate the variance components in the G- and D-studies.
The variance components we need from the data in Table 8.7 for this analysis are:

• σ̂²_P = variance of persons' universe scores.
• σ̂²_R = variance of the rater means, μ_r.
• σ̂²_E|R = variance of the errors e_pr for rater r.
• σ̂²_E = σ̂²_E|R averaged over all raters.
• σ̂²_X|R = variance of the observed scores X_pr for rater r.
Note. The symbol “^” is included on top of the variance to indicate that it is
an estimate rather than a population value.

The following SPSS program provides the mean square statistics we need for calculating the variance components. The technique employed is a two-factor repeated measures ANOVA model with a within-subjects factor (raters, because the repeated measures are the ratings of one item for 20 persons from three different raters) and a between-subjects factor (persons). For example, each person signifies one level of the person factor and allows us to estimate the between-persons effect for the ratings. Each rater represents one level of the rater factor, with each combination of rater and person contained within each cell in

Table 8.7.  Design 2 Data: Single Facet with 20 Persons and Three Raters
Person Item Rater 1 Rater 2 Rater 3 XpI

1 1 2 3 2 2.33
2 1 7 5 7 6.33
3 1 3 3 2 2.67
4 1 4 2 6 4.00
5 1 4 3 5 4.00
6 1 5 4 7 5.33
7 1 7 2 6 5.00
8 1 8 2 3 4.33
9 1 8 4 2 4.67

10 1 5 6 4 5.00
11 1 6 6 7 6.33
12 1 8 5 5 6.00
13 1 7 3 6 5.33
14 1 4 3 4 3.67
15 1 5 4 3 4.00
16 1 6 4 4 4.67
17 1 4 3 6 4.33
18 1 5 3 7 5.00
19 1 5 2 5 4.00
20 1 6 2 4 4.00
Average    5.45    3.45    4.75    4.55 (X̄_pI)
Note. The last column (X̄_pI) is the mean rating across the three raters for each person; the bottom row gives the mean rating across the 20 persons for each rater; 4.55 is the grand mean of the ratings.

the data matrix. Given this ANOVA design, there is only one score for each rater–person
combination. Tables 8.8a–8.8b provide the results of the SPSS analysis.

SPSS program for repeated measures ANOVA for the person x rater design

GLM rater_1 rater_2 rater_3 BY persons


/WSFACTOR=raters 3 Polynomial
/METHOD=SSTYPE(3)
/EMMEANS=TABLES(OVERALL)
/EMMEANS=TABLES(persons)
/EMMEANS=TABLES(raters)
/EMMEANS=TABLES(persons*raters)
/CRITERIA=ALPHA(.05)
/WSDESIGN=raters
/DESIGN=persons.

Table 8.8a.  Repeated Measures ANOVA Output for the Person × Rater Design
Tests of Within-Subjects Effects
Measure: MEASURE_1
Source                                              Type III Sum of Squares   df   Mean Square   F   Sig.
raters (Sphericity Assumed)                         5.233                     2    2.617         .   .
raters * persons (Residual) (Sphericity Assumed)    20.100                    38   .529          .   .
Note. Parts of the output have been omitted for ease of interpretation.

Table 8.8b.  Repeated Measures ANOVA Output for the Person × Rater Design
Tests of Between-Subjects Effects

Measure: MEASURE_1
Transformed Variable: Average
Source Type III Sum of Squares df Mean Square F Sig.
Intercept 1118.017 1 1118.017 . .
persons 95.650 19 5.034 . .
Error .000 0 .

Next, the variance components are calculated using mean squares from the ANOVA
results. The variance component estimate for persons is provided in Equation 8.9, and
the estimate for raters is provided in Equation 8.10.
The variance component estimate for error or the residual is provided in Equation 8.11.
To illustrate how the generalizability coefficient obtained in our G-study can be used
within a D-study, let’s assume that the raters used in our G-study are representative of the
raters in the universe of generalization. Under this assumption, our best estimate is the

Equation 8.9. Variance component for persons

σ̂²_P = (MS_PERSONS − MS_RESIDUAL) / n_RATERS = (5.03 − .53) / 3 = 1.50

Equation 8.10. Variance component for raters

σ̂²_RATERS = (MS_RATERS − MS_RESIDUAL) / n_P = (2.62 − .53) / 20 = .104

Equation 8.11. Variance component for the residual (error)

σ̂²_E = MS_RESIDUAL = .53

average observed score variance for all the raters in the universe. The average score variance is captured in the sum σ̂²_P + σ̂²_E. Because we are willing to assume that our raters are representative of the universe of raters, we can estimate the coefficient of generalizability in Equation 8.12 from our sample data. An important point here is that raters are not usually randomly sampled from all possible raters in the universe of generalization, which is one difficulty with this design.
The value of .90 indicates that the raters are highly reliable in their ratings. Using
this information, we can plan a D-study in a way that ensures that rater reliability will
be adequate by changing the number of raters. For example, if the number of raters is
reduced to two in the D-study, the variance component for persons changes to 2.25.
Using the new variance component for persons in Equation 8.13 yields a generalizability
coefficient of .81 (which is still acceptably high).
Next, we turn to the proportion of variance as illustrated in Equation 8.14 as a way
to understand the magnitude of the effects.
In G theory studies, the proportion of variance provides a measure of effect size
that is comparable across studies. The proportion of variance is reported for each
facet in a study. For example, the proportion of variance for persons is provided in
Equation 8.14.
Equation 8.14 shows that the person effect accounts for approximately 61% of the
variability in rating scores among persons. Next, in Equation 8.15 we calculate the pro-
portion of variance for the rater effect.
We see from Equation 8.15 that the rater effect accounts for approximately 32% of
the variability in memory score performance ratings. From this information we conclude

Equation 8.12. Generalizability coefficient for rating data

ρ̂²_RATERS* = σ̂²_P / (σ̂²_P + σ̂²_E) = 5.03 / (5.03 + .53) = 5.03 / 5.56 = .90

Note. The asterisk (*) signifies that the G coefficient can be used
for a D-study with persons crossed with raters (i.e., the measure-
ment conditions). Notation is from Crocker and Algina (1986,
p. 167).

Equation 8.13. Revised generalizability coefficient for rating data


with two raters

ρ̂²_RATERS* = σ̂²_P / (σ̂²_P + σ̂²_E) = 2.25 / (2.25 + .53) = 2.25 / 2.78 = .81

Note. The asterisk (*) signifies that the G coefficient can be used
for a D-study with persons crossed with the average number of raters
(i.e., the measurement conditions).
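
The two D-study calculations in Equations 8.12 and 8.13 can be verified with a few lines of Python. The person and error values (5.03, 2.25, and .53) are simply the estimates reported in the chapter; nothing new is estimated here.

# D-study planning for the person x rater design using the reported values.

var_error = 0.53

var_p_three_raters = 5.03                 # value used in Equation 8.12
print(round(var_p_three_raters / (var_p_three_raters + var_error), 2))   # -> .90

var_p_two_raters = 2.25                   # value used in Equation 8.13
print(round(var_p_two_raters / (var_p_two_raters + var_error), 2))       # -> .81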

Equation 8.14. Proportion of variance for persons

σ̂²_P / (σ̂²_P + σ̂²_R + σ̂²_RESIDUAL) = 5.03 / (5.03 + 2.62 + .53) = 5.03 / 8.18 = .61

Equation 8.15. Proportion of variance for raters

σ̂²_R / (σ̂²_P + σ̂²_R + σ̂²_RESIDUAL) = 2.62 / (5.03 + 2.62 + .53) = 2.62 / 8.18 = .32

that the rater effect is relatively small compared with the person effect (i.e., raters account for a smaller share of the variability in the ratings). Another way of interpreting this finding is that the raters are relatively similar, or consistent, in their ratings.

8.12 Design 3: Single-Facet Design with the Same Raters


on Multiple Occasions

In Design 3, we cover a G-study where the ratings are averaged, a strategy used to reduce
the error variance in the measurement condition. We can average over raters because
the same observers are conducting the ratings on each occasion for persons (i.e., raters are
not different for persons). Averaging over raters involves dividing the appropriate error
component by the number of raters. For example, in Equation 8.16 the error variance

Equation 8.16. Generalizability coefficient for rating data averag-


ing over raters

ρ̂²_RATERS* = σ̂²_P / (σ̂²_P + σ̂²_e / N′_RATERS) = 5.03 / (5.03 + .53/3) = 5.03 / (5.03 + .17) = .96

Note. The asterisk (*) signifies that the G coefficient can be used
for a D-study with persons crossed with the average number of
­raters (i.e., the measurement conditions). Capital notation for
RATERS signifies that the error variance is divided by 3, the number
of raters in a D-study. The symbol N′RATERS signifies the number of
ratings to form the average. Notation is from Crocker and Algina
(1986, p. 167).

component is divided by 3 (i.e., .53/3). In our example data, the change realized in the G
coefficient by averaging over raters is from .90 to .96 (Equation 8.16).
There is a substantial increase in the G coefficient (i.e., from .90 in Design 2 to .96
in Design 3), telling us that when it is reasonable to do so, averaging over raters is an
excellent strategy.
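
A small Python sketch of Equation 8.16 illustrates the effect of averaging over raters: the error term is divided by the number of raters whose ratings are averaged, and setting that number to 1 recovers the Design 2 coefficient of .90 (the variance values are those reported in the chapter).

# Averaging over raters (Equation 8.16): the error variance is divided by the
# number of raters contributing to the average rating.

var_p, var_error = 5.03, 0.53

def g_averaged(n_raters):
    return var_p / (var_p + var_error / n_raters)

print(round(g_averaged(1), 3))   # about .905 -> the .90 of Design 2 (Equation 8.12)
print(round(g_averaged(3), 3))   # about .966 -> the .96 reported in Equation 8.16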

8.13 Design 4: Single-Facet Nested Design with Multiple Raters

In Design 3, we illustrated the situation in which each person is rated by the same raters
on multiple occasions. In Design 4, each person has three ratings (on three occasions),
but each person is rated by a different rater. For example, this may occur in the event that
a large pool of raters is available for use in a G-study. In this scenario, raters are nested
within persons. Symbolically, this nesting effect is expressed as r : p or r(p). In this design,
differences among persons are influenced by (1) rater differences plus (2) universe score
differences for persons and (3) error variance. To capture this variance, the observed
score variance for this design is σ2P + σ2RATERS + σ2E, where the variance component symbols
are the same as in Design 2. Using the same mean square information in Equations 8.9,
8.10, and 8.11, we find that the G coefficient for Design 4 is provided in Equation 8.17.
We see that there is substantial reduction in the G coefficient from .90 (Design 2) or
.96 (Design 3) to .70 (Design 4). Knowing this information about the reduction of the
G coefficient to an unacceptable level, we can plan accordingly by using Design 2 or 3
rather than Design 4.

Equation 8.17. Generalizability coefficient for Design 4

ρ̂²_RATERS = σ̂²_P / (σ̂²_P + σ̂²_RATERS + σ̂²_RESIDUAL) = 1.5 / (1.5 + .104 + .53) = 1.5 / 2.13 = .70

Note. No asterisk (*) is included in the equation after “raters,” signify-


ing that this is a D-study and the measurement condition of ratings
is nested within persons.

8.14 Design 5: Single-Facet Design with Multiple Raters Rating


on Two Occasions

In Design 4, the scenario was illustrated where different raters rate each person and
each person is rated on two occasions. Our strategy in Design 5 with multiple raters and
occasions of measurement is to average over ratings. The G coefficient for Design 5 is
provided in Equation 8.18.
Table 8.9 summarizes the formulas for the four G coefficients based on the designs
covered to this point (excluding Design 5, which is a modification of Design 4).

Equation 8.18. Generalizability coefficient for Design 5

ρ̂²_RATERS = σ²_P / [σ²_P + (σ²_RATERS + σ²_ERROR) / N′_RATERS]
          = 5.03 / [5.03 + (.104 + .53)/3]
          = 5.03 / (5.03 + .634/3)
          = 5.03 / (5.03 + .21)
          = 5.03 / 5.24 = .96

Note. The word RATERS in capital letters signifies that the mea-
surement condition, ratings, are averaged over raters. The sym-
bol N′RATERS signifies the number of ratings to form the average.
Notation is from Crocker and Algina (1986, p. 167).

Table 8.9.  Generalizability Coefficients for Four Single-Facet D-Study Designs

Design 1: p × i (crossed); number of measurement conditions = 1; observed score variance = σ²_P + σ²_E; generalizability coefficient ρ²_i* = σ²_P / (σ²_P + σ²_E).
Design 2: p × i (crossed); number of measurement conditions = n′_i; observed score variance = σ²_P + σ²_E/n′_I; generalizability coefficient ρ²_I* = σ²_P / (σ²_P + σ²_E/n′_I).
Design 3: i : p (nested); number of measurement conditions = 1; observed score variance = σ²_P + σ²_I + σ²_E; generalizability coefficient ρ²_i* = σ²_P / (σ²_P + σ²_I + σ²_E).
Design 4: i : p (nested); number of measurement conditions = n′_i; observed score variance = σ²_P + (σ²_I + σ²_E)/n′_I; generalizability coefficient ρ²_I* = σ²_P / [σ²_P + (σ²_I + σ²_E)/n′_I].

Note. Adapted from Crocker and Algina (2006). Copyright 2006 by South-Western, a part of Cengage Learning, Inc. Adapted by permission. www.cengage.com/permissions. Crossed, all persons respond to all questions or are rated by all raters; nested, condition of measurement is nested within persons (e.g., the condition may be the number of raters or occasions of ratings); n_i, the number of raters (or test items) in a G-study; n′_i, number of raters in a D-study; I, score is an average over the raters. Note that the only difference between ρ²_i* and ρ²_I* is that in ρ²_I* the error σ²_E is divided by the number of raters in the D-study.
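
The four formulas in Table 8.9 can be written as two small functions, one for crossed and one for nested single-facet designs. This is an illustrative sketch (the function names and arguments are not from the book), with the error averaged over the n′ measurement conditions planned for the D-study.

# D-study generalizability coefficients for the single-facet designs in Table 8.9.
# var_p, var_i, var_e are G-study variance components; n is the number of
# measurement conditions (items or raters) planned for the D-study.

def g_crossed(var_p, var_e, n=1):
    # Designs 1 and 2: p x i (crossed); error averaged over n conditions.
    return var_p / (var_p + var_e / n)

def g_nested(var_p, var_i, var_e, n=1):
    # Designs 3 and 4: i : p (nested); item and error variance averaged over n.
    return var_p / (var_p + (var_i + var_e) / n)

# Example: the p x i values from Design 1 reproduce the coefficient in Equation 8.8.
print(round(g_crossed(0.285, 0.389, n=10), 2))   # -> 0.88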

8.15 Standard Errors of Measurement: Designs 1–5

In D-studies, the standard error of measurement (SEM) is used in a similar way as was pre-
sented in CTT. Recall that the SEM provides a single summary of measurement error with
which we can construct confidence intervals around observed scores. Recall also that
the observed (Xpi) score for a person is based on the expectation of the person’s true (Tpi)
score on an item (or rating); and that this process is applied to all persons in the sample.
Finally, the error score for a person is (Epi). Given this information about observed score
(Xpi) representing true score (Tpi), a confidence interval is based on a person’s true score.
Symbolically, the confidence interval for a person’s score is Xpi ± (SEM). Using this nota-
tion, we can create a confidence interval for any observed score in a D-study. To construct
a confidence interval, we need the error variance for the design being used in a D-study.
For example, in Design 1 where persons and test items were crossed, the residual or error
variance was .389. To return to standard deviation units, we take the square root of the
variance yielding s = .623.
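
For example, a 95% confidence interval for an observed score in this design can be formed as X ± 1.96(SEM). A minimal Python illustration, using the Design 1 residual variance of .389 and a hypothetical observed score of 25 (the score value is only for illustration):

# Confidence interval around an observed score using the D-study error variance.
from math import sqrt

sem = sqrt(0.389)                   # approximately .62, as reported above
observed_score = 25                 # hypothetical observed score
lower = observed_score - 1.96 * sem
upper = observed_score + 1.96 * sem
print(round(sem, 2), round(lower, 2), round(upper, 2))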

8.16 Two-Facet Designs

This chapter concludes with an example of a two-facet design. Many measurement prob-
lems involve more complex scenarios than were presented in the previous section on
single-facet designs. To address increased measurement and/or design complexity, we can
use a two-facet G-study to estimate the necessary variance components. Two examples

Table 8.10.  Two-Facet Design


             Observer/Rater 1                        Observer/Rater 2                        Observer/Rater 3
Person   Auditory  Visual  Working memory    Auditory  Visual  Working memory    Auditory  Visual  Working memory    Person mean
1 3 3 2 2 3 3 3 4 1 2.67
2 1 3 3 3 1 2 4 5 3 2.78
3 2 7 2 3 3 5 5 7 3 4.11
4 5 7 6 2 5 5 7 6 6 5.44
5 7 8 9 3 7 6 5 7 8 6.67
Mean     3.6       5.6     4.4               2.6       3.8     4.2               4.8       5.8     4.2               4.33
Note. The bottom row gives the mean rating across the five persons for each rater and subtest; 4.33 is the grand mean of the ratings.

are provided to illustrate two-facet G theory designs. In our first example, we use five per-
sons from the GfGc data to illustrate how to apply a two-facet G-study. Specifically, our
focus is on short-term memory as the broad construct of interest. In our first example,
short-term memory consists of the subtests auditory, visual, and working memory. Next,
ratings by three observers on auditory, visual, and working memory serve as our out-
come measures of interest. Ratings signify the quality (expressed as accuracy) of response
and are based on a 1–10 scale with 1 = low level of short-term memory and 10 = a high
level of short-term memory on each of the items (1–3). In this situation we have two
facets of measurement: an item (or in this case a test) facet and an observer (rater) facet.
In this example, persons are the object of measurement and are included as a random
effect. The design is crossed because all five persons are rated by all three observers on
the three memory subtests. The primary research question of interest for this analysis is
whether the persons elicited different mean ratings averaged across subtests and raters. In
ANOVA, the main effect for persons reflects differences among persons’ averages.
Table 8.10 illustrates the design structure and (1) the person means, (2) rater means,
and (3) grand mean for persons for this two-facet example.
The corresponding data file layout for an SPSS ANOVA analysis is illustrated in Table
8.11.
Next, the SPSS syntax is provided that yields the ANOVA results necessary for deriving
mean squares for estimating the generalizability coefficient for the two-facet generalizabil-
ity theory analysis. Table 8.12 provides the results of the ANOVA for the two-facet design.

SPSS ANOVA syntax for two-facet generalizability theory analysis

UNIANOVA score BY person item rater


/METHOD=SSTYPE(1)
/INTERCEPT=EXCLUDE
/EMMEANS=TABLES(OVERALL)
/EMMEANS=TABLES(person)
/EMMEANS=TABLES(item)

Table 8.11.  Data Layout: Two-Facet Design


Person Item Rater Score
1 1 1 3
2 1 1 1
3 1 1 2
4 1 1 5
5 1 1 7
1 2 1 3
2 2 1 3
3 2 1 7
4 2 1 7
5 2 1 8
1 3 1 2
2 3 1 3
3 3 1 2
4 3 1 6
5 3 1 9
1 1 2 2
2 1 2 3
3 1 2 3
4 1 2 2
5 1 2 3
1 2 2 3
2 2 2 1
3 2 2 3
4 2 2 5
5 2 2 7
1 3 2 3
2 3 2 2
3 3 2 5
4 3 2 5
5 3 2 6
1 1 3 3
2 1 3 4
3 1 3 5
4 1 3 7
5 1 3 5
1 2 3 4
2 2 3 5
3 2 3 7
4 2 3 6
5 2 3 7
1 3 3 1
2 3 3 3
3 3 3 3
4 3 3 6
5 3 3 8

/EMMEANS=TABLES(rater)
/EMMEANS=TABLES(person*item)
/EMMEANS=TABLES(person*rater)
/EMMEANS=TABLES(item*rater)
/EMMEANS=TABLES(person*item*rater)
/PRINT=DESCRIPTIVE
/CRITERIA=ALPHA(.05)
/DESIGN=person item rater item*person person*rater item*rater.

Table 8.13 provides the main effects and two-way interactions for our two-facet design.
Next, the equations for calculating variance components are presented in Tables
8.14 and 8.15.
Table 8.16 provides an interpretation for the results on Tables 8.13 and 8.15.

Table 8.12.  Two-Facet Design ANOVA Results


Tests of Between-Subjects Effects
Dependent Variable: score
Source Type I Sum of Squares df Mean Square F Sig.
Model 1021.133a 29 35.211 21.780 .000
person 952.333 5 190.467 117.814 .000
item 14.800 2 7.400 4.577 .027
rater 15.600 2 7.800 4.825 .023
person * item 14.533 8 1.817 1.124 .399
person * rater 15.067 8 1.883 1.165 .376
item * rater 8.800 4 2.200 1.361 .291
Error 25.867 16 1.617
Total 1047.000 45
a. R Squared = .975 (Adjusted R Squared = .931)

Table 8.13.  Mean of Ratings: Main Effects and Interactions


Main effects
Person   Mean        Rater   Mean        Item      Mean
1        2.7         1       4.5         Auditory  3.7
2        2.8         2       3.5         Visual    5.1
3        4.1         3       4.9         Working   4.3
4        5.4
5        6.7

Two-way interactions

Person × rater
Person   Rater 1   Rater 2   Rater 3
1        2.7       2.7       2.7
2        2.3       2.0       4.0
3        3.7       3.7       5.0
4        6.0       4.0       6.3
5        8.0       5.3       6.7

Person × item
Person   Auditory   Visual   Working
1        2.7        3.3      2.0
2        2.7        3.0      2.7
3        3.3        5.7      3.3
4        4.7        6.0      5.7
5        5.0        7.3      7.7

Rater × item
Rater    Auditory   Visual   Working
1        3.6        5.6      4.4
2        2.6        3.8      4.2
3        4.8        5.8      4.2

Table 8.14.  Equations for Estimating Variance Components in the Person × Rater × Item Model

Effect              Equation
Person              σ²_P = (MS_P − MS_PI − MS_PR + MS_RES) / (n_R n_I)
Subtest             σ²_I = (MS_I − MS_PI − MS_RI + MS_RES) / (n_P n_R)
Rater               σ²_R = (MS_R − MS_PR − MS_RI + MS_RES) / (n_P n_I)
Person × subtest    σ²_PI = (MS_PI − MS_RES) / n_R
Person × rater      σ²_PR = (MS_PR − MS_RES) / n_I
Subtest × rater     σ²_RI = (MS_RI − MS_RES) / n_P
Residual            σ²_RES = MS_RES

Table 8.15.  Variance Component Estimates in the Person × Rater × Item Model

Effect              Variance component                                        % variance
Person              σ²_P = (190.47 − 1.88 − 1.82 + 1.62) / (3 × 3) = 20.93    .890
Subtest             σ²_I = (7.4 − 1.82 − 2.2 + 1.62) / (5 × 3) = .33          .014
Rater               σ²_R = (7.8 − 1.88 − 2.2 + 1.62) / (5 × 3) = .36          .015
Person × subtest    σ²_PI = (1.82 − 1.62) / 3 = .06                           .002
Person × rater      σ²_PR = (1.88 − 1.62) / 3 = .09                           .004
Subtest × rater     σ²_RI = (2.2 − 1.62) / 5 = .12                            .005
Residual            σ²_RES = 1.62                                             .068
Total               23.51                                                     1.00
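
The calculations in Tables 8.14 and 8.15 can be reproduced directly from the mean squares in Table 8.12. The following Python sketch (an independent arithmetic check, not the SPSS procedure) recovers the variance components and their proportions within rounding.

# Two-facet (person x rater x item) variance components from the mean squares
# in Table 8.12, following the formulas in Table 8.14.

MS = {"p": 190.467, "i": 7.400, "r": 7.800,
      "pi": 1.817, "pr": 1.883, "ri": 2.200, "res": 1.617}
n_p, n_i, n_r = 5, 3, 3

var = {
    "person":           (MS["p"] - MS["pi"] - MS["pr"] + MS["res"]) / (n_r * n_i),
    "subtest":          (MS["i"] - MS["pi"] - MS["ri"] + MS["res"]) / (n_p * n_r),
    "rater":            (MS["r"] - MS["pr"] - MS["ri"] + MS["res"]) / (n_p * n_i),
    "person x subtest": (MS["pi"] - MS["res"]) / n_r,
    "person x rater":   (MS["pr"] - MS["res"]) / n_i,
    "subtest x rater":  (MS["ri"] - MS["res"]) / n_p,
    "residual":         MS["res"],
}
total = sum(var.values())     # approximately 23.51
for effect, estimate in var.items():
    print(f"{effect:18s} {estimate:7.2f} {estimate / total:6.3f}")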

Table 8.16.  Interpretation of Variance Components in the Person × Rater × Item Model

Effect: Person
Interpretation: Persons exhibit different mean ratings averaged across the raters and items.
Example: Person 3 receives a higher average rating than person 2.

Effect: Item
Interpretation: Items (subtests) were awarded different mean ratings averaged across persons and raters.
Example: Item 2 has a higher average rating than item 1.

Effect: Rater
Interpretation: Raters provided different mean ratings averaged across persons and items.
Example: Rater 3 provides higher average ratings than rater 2.

Effect: Person × item
Interpretation: Persons were ranked differently across items relative to their ratings averaged across raters.
Example: On item 1 person X was rated higher than person Y, but on item 2 person X was rated lower than person Y.

Effect: Person × rater
Interpretation: Persons were ranked differently across raters relative to their ratings averaged across items.
Example: Rater 1 rates person X higher than person Y, but rater 2 rates person X lower than person Y.

Effect: Item × rater
Interpretation: Items (subtests) were ranked differently by raters relative to the ratings averaged across persons.
Example: Rater 1 rates item 1 (Auditory memory) higher than item 2 (Visual memory), but rater 2 rates item 1 (Auditory memory) lower than item 2 (Visual memory).

Effect: Residual
Interpretation: Variance in ratings not captured by any of the above effects.

8.17 Summary and Conclusions

This chapter presented generalizability theory—a statistical theory about the depend-
ability of measurements useful for studying a variety of complex measurement problems.
In this chapter, the logic underlying generalizability theory was introduced, followed by practical application of the technique in single-facet and two-facet measurement designs. Generalizability theory was discussed as providing a way to extend and improve on the classical test theory model for situations where measurement is affected by multiple facets or conditions. Reliability of scores according to generalizability theory was
discussed in relation to the CTT model, and the advantages of estimating score reliability
in generalizability theory were highlighted. Finally, emphasis was placed on the advan-
tages generalizability theory provides for examining single and multifaceted measure-
ment problems.

Key Terms and Definitions


Absolute decisions. Focus on the level of performance of an individual regardless of the
performance of his or her peers.
Analysis of variance. A statistical model based on a special case of the general linear
model most often used to analyze data in experimental studies where researchers
are interested in determining the influence by a factor or treatment (e.g., the effect of
an intervention) on an outcome (dependent) variable (e.g., reading achievement or
success in treating a medical disease). In generalizability theory, a factor is labeled
as a facet.
Classical test theory. Based on the true score model, a theory concerned with observed,
true, and error score components.
Coefficient of generalizability. Represents how dependable or reliable a person’s
observed score is relative to his or her universe score and also relative to other per-
sons (i.e., the focus is on individual differences among persons).
Confidence interval. A statistical range with a specified probability that a given param-
eter lies within the range.
Crossed design. All persons respond to all test questions, or all persons are exposed to
all study conditions.
D-study. A generalizability study used to make sample-based decisions predicated on
improved dependability of measurement rather than generalized to populations.
Facets. A set of similar conditions of measurement (Brennan, 2010, p. 5).

Fixed facet of measurement. Interest lies in the variance components of specific char-
acteristics of a particular facet (i.e., we will not generalize beyond the characteristics
of the facet).
G-study. A generalizability study with the purpose of planning, then conducting a D-study
that will have adequate generalizability to the universe of interest.
Generalizability coefficient. Synonymous with the estimate of the reliability coefficient alpha (α) in CTT under certain measurement circumstances.
Generalizability theory. A highly flexible technique for studying error that allows for the
degree to which a particular set of measurements on an examinee are generalizable
to a more extensive set of measurements.
Item facet. Generalization from a set of items, defined under a set of similar conditions
of measurement to a set of items from a universe of items.
Measurement precision. How close scores are to one another and the degree of measurement error on parallel tests.
Nested design. A design where each person is rated by three raters and the raters rate
each person on two separate occasions (i.e., persons are nested within raters and
occasions).

Object of measurement. The focus of measurement; usually persons but may also be
items.
Occasion facet. A generalization from one occasion to another from a universe of occa-
sions (e.g., days, weeks, or months).
Partially nested. Different raters rate different persons on two separate occasions.

Random error of measurement. Variability of errors of measurement function in a ran-


dom or nonsystematic manner (e.g., error variance of measurement is randomly dis-
persed over a score distribution).
Random facet of measurement. The conditions comprising the facet are representative
of the universe of all possible facet conditions.
Relative decisions. Comparing one person’s score or performance with others (e.g., as
in ability and achievement testing).
Reliability. The consistency of measurements based on repeated sampling of a sample or
population; also known as dependability in generalizability theory.
Score reliability. The dependability of measurement expressed as a G coefficient in
generalizability theory.
Standard error of measurement. The accuracy with which a single score for a person
or persons approximates the expected value of possible scores for the same person
or persons. It is the weighted average of the errors of measurement for a group of
examinees.
Test form facet. Generalization from one test form under a particular set of conditions to
a set of forms from a universe of forms.
Universe score. In generalizability theory, a person’s average score over a theoretically
infinite number of measurement occasions.
Variance component. Captures the source of variation in observed scores of persons
and is the fundamental unit of analysis within a G-study.
9

Factor Analysis

This chapter introduces factor analysis as a technique for reducing multiple themes embed-
ded in tests to a simpler structure. An overview of the concepts and process of conducting
a factor analysis is provided as it relates to the conceptual definitions underlying a set of
measured variables. Additionally, interpretation of the results of a factor analysis is included
with examples. The chapter concludes by presenting common errors to avoid when con-
ducting factor analysis.

9.1 Introduction

The GfGc model of general intelligence presented in Chapter 1 is factor analytically


derived and based on subtests comprising fluid and crystallized intelligence and short-
term memory. Factor analysis (FA) provides a way for researchers to reduce multiple
themes embedded in tests to a simpler structure. It accomplishes this goal by arriving at
a more parsimonious representation of the underlying structure of a collection of cor-
relations on a set of measured variables (e.g., test items or total test scores). This more
parsimonious structure provides results that are more easily interpretable in light of the
goals of a particular study. FA accomplishes this goal by telling us what measures (e.g.,
total test scores or individual items) belong together and the degree to which they do so.
Using FA allows researchers (1) to reduce the number of test items (a.k.a. variables) and
(2) to locate underlying themes or dimensions in tests.
There are several approaches to reducing the complexity of a set of variables to a sim-
pler structure. For example, principal components analysis (PCA), structural equation
modeling (SEM), cluster analysis, and multidimensional scaling (MDS) are all techniques
that allow for variable reduction to a more parsimonious overall structure. In this chapter,
FA is presented within the context of psychological measurement and test development


where the basic principles of the FA approach to variable reduction are useful owing to the complex correlational structure of psychological attributes and/or constructs.
This chapter presents an overview of the process of conducting FA and the mechanics of FA as
it relates to the conceptual definitions underlying a set of measured variables (Fabrigar &
Wegner, 2012, p. 144). The presentation here therefore focuses on the factor-analytic tradi-
tion to variable reduction targeting simple structure based on the common factor model.
Recall that throughout this book we have used data representing part of the gen-
eral theory of intelligence represented by the constructs crystallized intelligence, fluid
intelligence, and short-term memory. In our examples, we use score data on 10 subtests
acquired from a sample of 1,000 examinees. Chapters 3 and 4 introduced the issue of
score accuracy. For example, do examinee scores on tests really represent what they are
intended to represent? Establishing evidence that scores on subtests display patterns of
association in a way that aligns with a working hypothesis or existing theory is part of the
test or instrument validation process. The degree to which the subtests cluster in patterns
that align with a working hypothesis or theory provides one form of evidence that the
subtests actually reflect the constructs as they exist relative to a theoretical framework.
Therefore, an important question related to the validation of the general theory of intel-
ligence in our examples is whether scores on the items and subtests comprising each
theoretical construct reflect similar patterns. The number and composition of the subtest
clusters are determined by the correlations among all pairs of subtests.
To provide a conceptual overview of FA, we return to the GfGc data used through-
out this book. The relationships among the 10 subtests are summarized according to
their intercorrelations (e.g., in Table 9.2). Note that in this chapter, to help present concepts
involved in conducting factor analysis, we use subtest total score data rather than item-level
data. Alternatively, FA can also be conducted at the level of individual items comprising
a test. With regard to the correlation matrix in Table 9.2, although we see basic infor-
mation about the relationships among the subtests, it is difficult to identify a discernible
pattern of correlations. Using the correlation matrix as a starting point, FA provides a
way for us to identify order or relational structures among the correlations. In identifying
relational structures among our 10 subtests, FA can be used in an exploratory mode. For
example, exploratory factor analysis (EFA) is used in the early stages of test or instru-
ment development, and confirmatory factor analysis (CFA) is used to test or confirm
an existing theory on the basis of the tests.
We begin the chapter with a conceptual overview and brief history of FA. Next, an
example FA is presented using the GfGc data, with an emphasis on basic concepts. The
presentation aims to facilitate an understanding of FA by considering associated research
questions. Core questions common to correctly conducting and interpreting a factor ana-
lytic study (adapted from Crocker & Algina, 1986, p. 287) include:

1. What role does the pattern of intercorrelations among the variables or subtests
play in identifying the number of factors?
2. What are the general steps in conducting a factor-analytic study?

3. How are factors estimated?


4. How are factor loadings interpreted?
5. How are factor loadings used to identify the number of factors in an observed
correlation matrix?
6. What are factor rotations, and how are they useful?
7. What decisions are required for a researcher to determine which factor loadings
to interpret?
8. How do orthogonal and oblique rotations differ, and when is one preferred over
the other?
9. How is communality interpreted?
10. How is uniqueness interpreted?
11. How is reliability related to factor analysis?
12. What is the difference between exploratory and confirmatory factor analysis?

9.2 Brief History

FA was created by Charles Spearman in 1904, related to his work on formulating a theory
of general intelligence (McArdle, 2007, p. 99). Spearman observed that variables from a
carefully specified domain (e.g., intelligence) are often correlated with each other. Since
variables are correlated with one another, they share information about the theory under
investigation. When variables in a domain are correlated, factor analysis is a useful tech-
nique for determining how variables work together in relation to a theory. The primary
goals of FA include (1) exploration and identification of a set of variables in terms of a
smaller number of hypothetical variables called factors, based on patterns of associa-
tion in the data (i.e., EFA; see Cattell, 1971; Mulaik, 1987; Fabrigar & Wegner, 2012);
(2) confirmation that variables fit a particular pattern or cluster to form a certain dimen-
sion according to a theory (i.e., CFA; see McDonald, 1999; Fabrigar & Wegner, 2012;
Brown, 2006); and (3) synthesis of information about the factors and their contribution
as reflected by examinee performance on the observed variables (e.g., scores on tests).
Additionally, when researchers conduct a CFA, they attempt to understand why the
variables are correlated and to determine the degree or level of accuracy the variables and
factors provide relative to a theory. Factor-analytic theory posits that variables (i.e., test
total scores or test items) correlate because they are determined in part by common but unobserved influences. These common influences are due to common factors, meaning
that variables are correlated to some degree—thus the name common factor model. The
unobserved influences are manifested as a latent factor (or simply a factor) in FA.
Several approaches to FA are possible depending on the goal(s) of the research. The
most common type of FA is the R-type where the focus is on grouping variables (e.g.,
subtests in the GfGc data) into similar clusters that reflect latent constructs. R-Type FA

is used widely in test and scale development, and we use it in this chapter to illustrate
how it works with the GfGc data. Other variations of FA include Q-type (i.e., FA of
persons into clusters with like attributes; Kothari, 2006, p. 336; Thompson, 2000) and
P-type, which focuses on change within a single person or persons captured by repeated
measurements over time (Browne & Zhang, 2007; Molenaar, 2004). The reasons for con-
ducting an R-type factor analysis in the test development process include the following
(Comrey & Lee, 1992, pp. 4–5):

1. Determination of constructs that might explain the intercorrelations among


variables.
2. A need to test a theory about the number and composition of the factors needed
to account for the intercorrelations among the variables being studied.
3. A need to evaluate the effect on the factor-construct relationships brought about
by changes in the variables and the conditions under which the measurements
were taken.
4. A desire to verify previous findings using the same population or a new sample
from a different population.
5. A need to test the effect on obtained results produced by a variation in the factor-
analytic procedures used.

Figure 9.1 provides a decision tree for conducting an FA or planning a factor-analytic


study.
The next section provides a practical application of how FA works using the GfGc
data. In subsequent sections, the components of the practical application are covered in
greater detail.

9.3 Applied Example with GfGc Data

This section illustrates FA using the subtest total scores from the GfGc data. Recall that
four subtests measure crystallized intelligence, three subtests measure fluid intelligence, and
three subtests measure short-term memory. Table 9.1 (introduced in Chapter 1) provides the
details of each of the subtests that comprise the factors or constructs. Figure 9.2 (introduced
in Chapter 1) illustrates the conceptual (i.e., theoretical) factor structure for the GfGc data.
Conducting FA begins with inspection of the correlation matrix of the variables (or
in our example, the subtests) involved. Table 9.2 provides the intercorrelation matrix for
the 10 GfGc subtests used in the examples in this chapter.
Table 9.2 reveals that the correlations within and between the subtests crystallized
intelligence, fluid intelligence, and short-term memory do in fact correlate in a way that
supports conducting a factor analysis. For example, the variations in shading in Table
9.2 show that the clusters of subtests correlate moderately with one another. The excep-
tion to this pattern is in the short-term memory cluster where subtest 10 (inductive and

1. Goal of research or study: Is the analysis confirmatory or exploratory? If confirmatory, use structural equation modeling; if exploratory, select the type of factor analysis (Q-type, R-type, or P-type).
2. Selecting a factor method: Is the total variance or the common variance to be analyzed? For total variance, extract using components analysis; for common variance, extract using common factor analysis.
3. Specify the factor matrix and select the rotational method: orthogonal (varimax, equimax, quartimax) or oblique (oblimin, promax, orthoblique).
4. Interpret the rotated factor matrix: Can significant loadings be found? Can factors be named? Are communalities sufficient? If not, respecify the factor model (Were any variables deleted? Do you want to change the number of factors? Do you want another type of rotation?) and repeat.
5. Validation of the factor matrix: split or multiple samples, separate analyses for subgroups, identification of influential cases.
6. Follow-up uses: selection of surrogate variables, computation of factor scores, or creation of summated scales.

Figure 9.1.  Guidelines for conducting FA. Adapted from Hair, Anderson, Tatham, and Black (1998, pp. 94, 101). Copyright 1998. Reprinted by permission of Pearson Education, Inc. New York, New York.

Table 9.1.  Subtest Variables in the GfGc Dataset


Ability measured                                   Name of subtest                  Number of items    Scoring

Fluid intelligence (Gf)


Quantitative reasoning—sequential Fluid intelligence test 1 10 0/1/2
Quantitative reasoning—abstract Fluid intelligence test 2 20 0/1
Quantitative reasoning—induction and
deduction Fluid intelligence test 3 20 0/1
Crystallized intelligence (Gc)      
Language development Crystallized intelligence test 1 25 0/1/2
Lexical knowledge Crystallized intelligence test 2 25 0/1
Listening ability Crystallized intelligence test 3 15 0/1/2
Communication ability Crystallized intelligence test 4 15 0/1/2
Short-term memory (Gsm)      
Recall memory Short-term memory test 1 20 0/1/2
Auditory learning Short-term memory test 2 10 0/1/2
Arithmetic Short-term memory test 3 15 0/1

deductive reasoning) does not correlate at even a moderate level with graphic orientation
and graphic identification. Additionally, inspection of the unshaded cells in Table 9.2
reveals that the subtests in the theoretical clusters also correlate moderately (with the
exception of subtest 10 on inductive and deductive reasoning) with subtests that are not
part of their theoretical cluster.

9.4 Estimating Factors and Factor Loadings

At the heart of FA is the relationship between a correlation matrix and a set of factor
loadings. The intercorrelations among the variables and the factors share an intimate
relationship. Although factor(s) are unobservable variables, it is possible to calculate the
correlation between factors and variables (e.g., subtests in our GfGc example). The cor-
relation between factors and the GfGc subtests are called factor loadings. For example,
consider questions 1–4 originally given in Section 9.1.

1. What role does the pattern of intercorrelations among the variables or subtests
play in identifying the number of factors?
2. What are the general steps in conducting a factor-analytic study?
3. How are factors estimated?
4. How are factor loadings interpreted?

Through these questions, we seek to know (1) how the pattern of correlations among
the variables inform what the factor loadings are, (2) how the loadings are estimated; and

Figure 9.2. General theory of intelligence. The smallest rectangles on the far right represent
items. The next larger rectangles represent subtests that are composed of the sum of the individual
items representing the content of the test. The ovals represent factors also known as latent or unob-
servable constructs posited by intelligence theory.

Table 9.2.  Intercorrelations for 10 GfGc Subtests

                                                          1       2       3       4       5       6       7       8       9       10
1. Short-term memory: based on visual cues               1       —       —       —       —       —       —       —       —       —
2. Short-term memory: auditory and visual components     .517**  1       —       —       —       —       —       —       —       —
3. Short-term memory: math reasoning                     .540**  .626**  1       —       —       —       —       —       —       —
4. Gc: measure of vocabulary                             .558**  .363**  .406**  1       —       —       —       —       —       —
5. Gc: measure of knowledge                              .602**  .326**  .384**  .717**  1       —       —       —       —       —
6. Gc: measure of abstract reasoning                     .572**  .413**  .478**  .730**  .667**  1       —       —       —       —
7. Gc: measure of conceptual reasoning                   .548**  .319**  .365**  .749**  .694**  .677**  1       —       —       —
8. Gf: measure of graphic orientation                    .420**  .407**  .545**  .391**  .394**  .528**  .377**  1       —       —
9. Gf: measure of graphic identification                 .544**  .480**  .588**  .392**  .374**  .544**  .397**  .654**  1       —
10. Gf: measure of inductive/deductive reasoning         .073*   .121**  .156**  0.01    0.04    .096**  0.03    .210**  .199**  1
Note. N = 1,000. Shaded cells highlight the intercorrelations among the subtests comprising each of the three areas/factors representing intelligence theory.

(3) how to properly interpret the loadings relative to a theory or other context (e.g., an
external criterion as discussed in Chapter 3 on validity). To answer these questions, we
can examine the relationship between the correlation matrix and factor loadings. Recall
that a factor is an unobserved or a latent variable. A relevant question is, “How is a fac-
tor loading estimated since a factor is unobserved?” An answer to this question is found
in part by using the information given in Chapter 7 on reliability. For example, in fac-
tor analysis, factors or latent variables are idealized as true scores just as true score was
defined in Chapter 7 on reliability. Recall that in Chapter 7 we were able to estimate the
correlation between an unobservable true score and an observed score based on the axi-
oms of the classical test theory (CTT) model. Also recall that the total variance for a set
of test scores can be partitioned into observed, true, and error components. Later in this
chapter the common factor model is introduced, and parallels are drawn with the clas-
sical true score model. At this point, it is only important to know that we can estimate
factors and their loadings using techniques similar to those presented in Chapter 7.
Continuing with our example using the correlation matrix and how factor loadings
are estimated, we use the seven subtests representing the two factors, crystallized and
fluid intelligence. The correlation matrix for the seven crystallized and fluid intelligence
subtests is presented in Table 9.3.
Related to questions 1–4, we want to know (1) how the subtests relate to the hypo-
thetical factors of crystallized and fluid intelligence (i.e., the size of the factor loadings)

Table 9.3.  Intercorrelations for Crystallized and Fluid Intelligence


                                                       1       2       3       4       5       6       7
Crystallized intelligence
  1. Gc measure of vocabulary                          1       —       —       —       —       —       —
  2. Gc measure of knowledge                           0.717   1       —       —       —       —       —
  3. Gc measure of abstract reasoning                  0.730   0.667   1       —       —       —       —
  4. Gc measure of conceptual reasoning                0.749   0.694   0.677   1       —       —       —
Fluid intelligence
  5. Gf measure of graphic orientation                 0.391   0.394   0.528   0.377   1       —       —
  6. Gf measure of graphic identification              0.392   0.374   0.544   0.397   0.654   1       —
  7. Gf measure of inductive and deductive reasoning   0.012   0.038   0.096   0.027   0.210   0.199   1

  Correlation between fluid intelligence subtests.


  Correlation between crystallized intelligence subtests.
  Correlation between crystallized and fluid intelligence subtests.

and (2) if a correlation between the factors exists. To illustrate how answers to these ques-
tions are obtained, a table of initial and alternate factor loadings is given for the seven sub-
tests measuring crystallized and fluid intelligence. Table 9.3 reveals that subtest 7, the Gf
measure of inductive and deductive reasoning, is problematic based on its correlation with
the graphic identification subtest (.19) and graphic orientation (.21) under fluid intelli-
gence and the four tests measuring crystallized intelligence (e.g., all correlations are <.10).
In practical terms, retaining this subtest in the GfGc theoretical model should be revisited
within the context of the proposed use of the test for a population of examinees (e.g., the
validity of test score use is another issue to consider here). Based on the initial informa-
tion, subtest 7 contributes little to the GfGc theoretical model. The loadings in Table 9.4
are produced by a process known as factor extraction. Several techniques are available to
extract the factors (e.g., see Fabrigar & Wegner, 2012, p. 40). In the current example, the
factor loadings were extracted using the principal axis factor (PAF) extraction technique
(e.g., see the /EXTRACTION PAF line in the SPSS syntax below).
Extraction techniques are reviewed in more detail later in this chapter.

SPSS factor analysis syntax to produce Table 9.4

FACTOR
/VARIABLES cri1_tot cri2_tot cri3_tot cri4_tot fi1_tot fi2_tot

fi3_tot
/MISSING LISTWISE
/ANALYSIS cri1_tot cri2_tot cri3_tot cri4_tot fi1_tot fi2_tot

fi3_tot
/PRINT UNIVARIATE INITIAL CORRELATION DET KMO AIC EXTRACTION
/FORMAT SORT
/PLOT EIGEN
/CRITERIA FACTORS(2) ITERATE(25)
/EXTRACTION PAF

/CRITERIA ITERATE(25)
/METHOD=CORRELATION.

Note. The /PRINT subcommand (KMO keyword) produces the Measure of Sampling Adequacy (MSA) and Bartlett's test of sphericity. The /EXTRACTION PAF subcommand yields loadings by the principal axis factor extraction technique.
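
SPSS is used throughout this chapter. For readers working in Python, the following sketch implements an iterated principal axis factoring of the 7 × 7 correlation matrix in Table 9.3. It is a bare-bones illustration, not the SPSS PAF routine itself; the two-factor loadings it returns should be comparable to the initial loadings in Table 9.4, apart from possible sign reflections and rounding.

import numpy as np

# Correlation matrix for the seven Gc/Gf subtests (Table 9.3), lower triangle mirrored.
R = np.array([
    [1.000, 0.717, 0.730, 0.749, 0.391, 0.392, 0.012],
    [0.717, 1.000, 0.667, 0.694, 0.394, 0.374, 0.038],
    [0.730, 0.667, 1.000, 0.677, 0.528, 0.544, 0.096],
    [0.749, 0.694, 0.677, 1.000, 0.377, 0.397, 0.027],
    [0.391, 0.394, 0.528, 0.377, 1.000, 0.654, 0.210],
    [0.392, 0.374, 0.544, 0.397, 0.654, 1.000, 0.199],
    [0.012, 0.038, 0.096, 0.027, 0.210, 0.199, 1.000],
])

def principal_axis(R, n_factors=2, n_iter=50):
    # Iterated principal axis factoring: replace the diagonal with communality
    # estimates (starting from squared multiple correlations), eigendecompose
    # the reduced matrix, and update until the communalities stabilize.
    h2 = 1.0 - 1.0 / np.diag(np.linalg.inv(R))   # SMC starting communalities
    for _ in range(n_iter):
        R_reduced = R.copy()
        np.fill_diagonal(R_reduced, h2)
        eigvals, eigvecs = np.linalg.eigh(R_reduced)
        idx = np.argsort(eigvals)[::-1][:n_factors]
        loadings = eigvecs[:, idx] * np.sqrt(np.maximum(eigvals[idx], 0))
        h2 = (loadings ** 2).sum(axis=1)
    return loadings

print(np.round(principal_axis(R), 2))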

The correlations in Table 9.4 are called factor loadings and are defined as the
correlation between factors and subtests. Factors are based on structures or patterns
produced by the covariation of the GfGc tests. For example, notice the pattern of cor-
relations within each factor for each subtest. We see that for the initial factor solution
the crystallized intelligence subtests correlate highly with factor 1 (i.e., all > .60). Con-
versely, the majority of the subtests (five of the seven) display a low correlation with factor 2 (i.e., loadings of .30 or lower in absolute value). Two of the fluid intelligence subtests display a high correlation with factor 1 (i.e., .64). However, as before, we see that subtest 7 is
problematic (i.e., a loading < .30). Taken together, the pattern of results in the left side
of Table 9.4 illustrate that there appears to be a single dominant factor represented by
six of the seven subtests. Additionally, we see that the graphic orientation and graphic

Table 9.4.  Initial Factor Loadings for Crystallized and Fluid Intelligence

Initial factor Alternate factor


loadings (F) loadings (F’)
Construct Subtest 1 2 1 2
Crystallized 1. Gc measure of vocabulary .84 –.30 .38 .68
intelligence
2. Gc measure of knowledge .78 –.23 .39 .71
3. Gc measure of abstract reasoning .85 –.02 .59 .62
4. Gc measure of conceptual reasoning .80 –.26 .38 .75
Fluid 5. Gf measure of graphic orientation .64 .49 .80 .10
intelligence
6. Gf measure of graphic identification .64 .48 .79 .11
7. Gf measure of inductive and deductive reasoning –.12 .26 .09 .27
  Factor 1 loadings on crystallized intelligence.
Factor 1 loadings on fluid intelligence.

  Factor 2 loadings on fluid intelligence.


Factor 2 loadings on crystallized intelligence.

  Substantial loadings on both factors 1 and 2.


Note. Correlation between factor and a subtest is a factor loading. Alternate factor loadings are derived for factor 1
by summing the initial loadings for each factor and multiplying the result by .707 (i.e., setting the variance of factor
1 to 1.0).

identification subtests appear to cross-load on both factors 1 and 2 to some degree,


indicating an unclear picture of what these two subtests are measuring. The next sec-
tion provides an example intended to clarify the relationship of the correlation between
subtests and how factor loadings are derived.
Equation 9.1a (Comrey & Lee, 1992; Crocker & Algina, 1986, p. 289) shows how to
estimate the relationship of the correlation between pairs of subtests and the loadings on
factors 1 and 2. Recall that the initial factor loadings for the crystallized and fluid intel-
ligence subtests are presented in the left half of Table 9.4. Using these initial loadings,
Equations 9.1a and 9.1b present the relationship of the correlation between two subtests
and factor loadings.
To illustrate Equation 9.1a using seven of the subtests in the GfGc data, consider
the correlation between the crystallized intelligence subtest word “knowledge” and the
fluid intelligence subtest of graphic orientation. In Table 9.2 we see that the correla-
tion between these two subtests is .394. Inserting values for factor loadings from Table
9.4 into Equation 9.1a, we have Equation 9.1b. The result in Equation 9.1b verifies the
relationship between factor loadings and subtest correlations. For example, by using the
factor loadings we can reproduce the correlation of .394 between the crystallized intelli-
gence subtest word “knowledge” and the fluid intelligence subtest of graphic orientation.
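
A two-line Python check (not from the book) of this result, using the initial loadings from Table 9.4 and the relationship given in Equations 9.1a and 9.1b below:

# Reproduce the correlation between the word knowledge and graphic orientation
# subtests from their initial factor loadings (Table 9.4, left panel).

knowledge = (0.78, -0.23)        # loadings on factors 1 and 2
orientation = (0.64, 0.49)
r_reproduced = knowledge[0] * orientation[0] + knowledge[1] * orientation[1]
print(round(r_reproduced, 2))    # -> 0.39, close to the observed correlation of .394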
Next, we modify the notation of Equation 9.1a to signify that an alternative set of
loadings are used (presented in the right side in Table 9.4). We use the same subtests as in

Equation 9.1a. Relationship between pairwise correlation of two


subtests and factors

rij = ai1aj1 + ai2aj2

• rij = correlation between tests i and j.


• ai1 = loading of test i on factor 1.
• aj1 = loading of test j on factor 1.
• ai2 = loading of test i on factor 2.
• aj2 = loading of test j on factor 2.

Equation 9.1b. Relationship between pairwise correlation of two


subtests and factors

r_ij = a_i1 a_j1 + a_i2 a_j2 = .78(.64) + (−.23)(.49) = .39

the previous example (i.e., crystallized intelligence subtest word knowledge and the fluid
intelligence subtest of graphic orientation) in Equation 9.2.
Inserting the alternate factor loadings into Equation 9.2, we see nearly the same result
(.38; the difference is due to rounding error) as before in Equation 9.1b. These results illus-
trate an important point in factor analysis: that there are an infinite number of sets of factor
loadings that satisfy Equation 9.1a. This infinite number is called factor indeterminacy.
Table 9.4 presents alternate loadings created to illustrate the point that there is
always more than one factor solution that satisfies Equation 9.1a. The alternate loadings
(i.e., the right-hand side of Table 9.4) were derived using Equation 9.3 (Paxton, Curran,
Bollen, Kirby, & Chen, 2001; Crocker & Algina, 1986, p. 291).
Applying Equation 9.3 to create alternate factor loadings in Table 9.4 reveals two
points. First, there appears to be a general factor underlying the seven subtests. This pattern

Equation 9.2. Alternate loadings on word knowledge and graphic



reasoning tests

ρ_15 = a′_12 a′_15 + a′_22 a′_52 = .39(.80) + .71(.10) = (.31) + (.07) = .38

• ρ_15 = correlation between the word knowledge and graphic orientation tests.
• a′_12 = factor 1 loading on the word knowledge subtest.
• a′_15 = factor 1 loading on the graphic orientation subtest.
• a′_22 = factor 2 loading on the word knowledge subtest.
• a′_52 = factor 2 loading on the graphic orientation subtest.

Equation 9.3. Derivation of alternate loadings in Table 9.4

F1′ = .707(F1 ) + .707(F2 ) = .707(F1 + F2 )

F2′ = .707(F1 ) − .707(F2 ) = .707(F1 − F2 )

• F1′ = alternate factor 1.


• F2′ = alternate factor 2.
• F1 = initial loading on factor 1.
• F2 = initial loading on factor 2.
• .707 = scaling quantity that sets the variance of factor 1 to
a value of 1.0.

Equation 9.4. General equation for relating intercorrelations to factor loadings

rij = ai1aj1 + ai2aj2 + . . . + aimajm

• rij = correlation between tests i and j.
• m = number of factors.
• aik = loading of test i on factor k.
• ajk = loading of test j on factor k.

of loadings supports at least two components of the general theory of intelligence. Second, in Equation 9.3, F2′ represents the difference between the loadings on factors 1 and 2 (i.e., notice the sign of the operations in each equation). The difference between the factor loadings is tantamount to saying that the two factors tap different parts of general intelligence. The idea of two factors aligns with our example of crystallized and fluid intelligence.
Finally, Equation 9.1a can be modified to be applicable to any number of subtests as
expressed in Equation 9.4.
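To make the indeterminacy concrete, the transformation in Equation 9.3 can be applied with a few lines of matrix code. The numpy sketch below (an illustration only, not part of the chapter's SPSS workflow) rotates the two initial loading rows and confirms that the reproduced correlation is the same for either set of loadings.

Python sketch: alternate loadings reproduce the same correlation

import numpy as np

# Rows are subtests (word knowledge, graphic orientation); columns are factors 1 and 2.
A = np.array([[0.78, -0.23],
              [0.64,  0.49]])

# Equation 9.3 written as a transformation matrix: F1' = .707(F1 + F2), F2' = .707(F1 - F2).
T = 0.707 * np.array([[1.0,  1.0],
                      [1.0, -1.0]])
A_alt = A @ T  # alternate loadings, approximately (.39, .71) and (.80, .10)

# Equation 9.4 (the sum of a_ik * a_jk over factors) yields the same value either way.
print(round(float((A @ A.T)[0, 1]), 2))          # 0.39
print(round(float((A_alt @ A_alt.T)[0, 1]), 2))  # 0.39 (within rounding)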

9.5 Factor Rotation

Recall that an infinite number of sets of factor loadings satisfy Equations 9.1a (i.e., for two factors) and 9.4 (i.e., for more than two factors). Because (1) multiple sets of factor loadings are possible and (2) most factor extraction methods yield initial loadings that are not easily interpreted, the initial results are often unclear or difficult to interpret. Fabrigar and Wegner (2012) and Kerlinger and Lee (2000) argue that it is necessary to rotate factor matrices if they are to be adequately interpreted. Rotation is helpful because original factor matrices are arbitrary inasmuch as any number of reference frames (i.e., factor axes) can be derived that reproduce any particular correlation matrix. Factor rotation is the process of transforming the initial loadings using a set of equations (such as in Equation 9.2) to achieve simple structure. The idea underlying simple structure is to identify as pure a set of variables as possible (e.g., each variable or subtest loads on as few factors as possible, with as many near-zero loadings as possible in the rotated factor matrix; see Kerlinger & Lee, 2000).
The guidelines for simple structure (based on Fabrigar & Wegner, 2012, p. 70, and Kerlinger & Lee, 2000, p. 842) include the following:

1. Each row of the factor matrix should have at least one loading close to zero.
2. For each column of the factor matrix, there should be at least as many variables
with zero or near-zero loadings as there are factors.

3. For every pair of factors (columns), there should be several variables with load-
ings in one factor (column) but not in the other.
4. When there are four or more factors, a large proportion of the variables should
have negligible (close to zero) loadings on any pair of factors.
5. For every pair of factors (columns) of the factor matrix, there should be only a
small number of variables with appreciable (nonzero) loadings in both columns.

For any rotation technique used, the original factor loadings are related by a math-
ematical transformation. Factor rotation is accomplished geometrically as illustrated
graphically in Figures 9.3 through 9.5. Importantly, when any two sets of factor load-
ings are obtained through the rotation process, the two sets contain loadings that reflect
the correlations among the subtests equally well. Although factor rotation techniques
produce loadings that represent the correlations among subtests equally well, the mag-
nitude or size of the factor loadings varies, and a different set of factors represent each set
of factor loadings. This final point means that interpretations of the factors differ based on
the rotational technique applied. There are two classes of rotational techniques: orthogonal and oblique (Brown, 2006, pp. 30–32; Lattin et al., 2003). Applying the orthogonal

.7
Factor 2 plotted on the Y-axis
.6
Gf graphic
.5 orientation ●
● Gf graphic
.4 identification

.3 Gf inductive and
● deductive

.2 reasoning
Factor 1 plotted on the X-axis

.1

−.9 −.8 −.7 −.6 −.5 −.4 −.3 −.2 −.1 .1 .2 .3 .4 .5 .6 .7 .8 .9


● Gc abstract
Factor matrixa −.1
reasoning
Gc word
Factor −.2 knowledge●
1 2 Gc conceptual

Gc meas of abstract reasoning
.851 −.020
reasoning −.3 ● Gc vocabulary
Gc meas of vocabulary .840 −.301
Gc meas of conceptual
.801 −.261
reasoning −.4
Gc meas knowledge .779 −.233
Gf meas of graphic
.644 .488
identification −.5
Gf meas of graphic
orientation .640 .493
Gf meas of inductive
.118 .263 −.6
and deductive reas
Extraction method: principal axis factoring.
a. 2 factors extracted. 8 iterations required. −.7

Figure 9.3.  Unrotated factor loadings for crystallized and fluid intelligence.

technique yields transformed factors that are uncorrelated (i.e., factors are oriented at 90° angles in multidimensional space; see Figure 9.4). Applying the oblique technique yields transformed factors that are correlated (i.e., factor orientations other than 90° are permitted). Figures 9.4 and 9.5 illustrate orthogonal and oblique rotations for the crystallized and fluid intelligence subtests.
Table 9.5 provides a comparison of the initial factor loadings and the obliquely rotated
loadings. From this table we see that the two loading solutions reveal that two interpreta-
tions are plausible. First, in the unrotated or initial solution, we see that six out of seven
subtests exhibit high and positive loadings, suggesting a single dominant factor for the seven
subtests. Similarly, for factor 2 we see high and positive loadings for two out of three subtests

[Figure 9.4 plots the orthogonally rotated loadings, with factor 1 on the x-axis, factor 2 on the y-axis, and the rotated axes F1 and F2 at a 90° angle. The accompanying rotated factor matrix (extraction method: principal axis factoring; rotation method: Varimax with Kaiser normalization; rotation converged in 3 iterations) reports loadings of .883/.122 for Gc vocabulary, .831/.139 for Gc conceptual reasoning, .799/.154 for Gc knowledge, .763/.376 for Gc abstract reasoning, .339/.733 for Gf graphic orientation, .345/.731 for Gf graphic identification, and –.017/.288 for Gf inductive and deductive reasoning.]

Figure 9.4.  Orthogonally rotated factor loadings for crystallized and fluid intelligence. Rotated
scale metric or perspective is only an approximation. In orthogonal rotation, the angle is constrained
to 90 degrees, meaning that the factor axes are uncorrelated in multidimensional space.

[Figure 9.5 plots the obliquely rotated loadings, with factor 1 on the x-axis, factor 2 on the y-axis, and the rotated reference axes F1′ and F2′ separated by less than 90°. The accompanying pattern matrix (extraction method: principal axis factoring; rotation method: Promax with Kaiser normalization; rotation converged in 3 iterations) reports loadings of .957/–.119 for Gc vocabulary, .888/–.083 for Gc conceptual reasoning, .845/–.056 for Gc knowledge, .702/.221 for Gc abstract reasoning, .049/.778 for Gf graphic orientation, .057/.773 for Gf graphic identification, and –.153/.351 for Gf inductive and deductive reasoning.]

Figure 9.5. Factor loadings for crystallized and fluid intelligence after oblique rotation.
In oblique rotation, the angle is less than 90 degrees, meaning that the factor axes are correlated
in multidimensional space.

(e.g., recall that subtest 7 has been consistently identified as problematic in our examples). Negative loadings are interpreted as differences in abilities as measured by crystallized and fluid intelligence. We interpret these loadings as differences in ability because the original correlation matrix (Table 9.2) shows that the seven subtests all correlate positively. Alternatively, inspection of the obliquely rotated factor loadings on the right side of Table 9.5 reveals a much clearer picture of how the seven subtests reflect the two factors.
The obliquely rotated factor loadings provide the clearest picture of the factor struc-
ture for the seven subtests. However, interpreting factor loadings from an oblique solu-
tion is slightly more complicated than interpreting loadings from an orthogonal rotation.
For example, the factor loadings obtained from an oblique rotation do not represent
simple correlations between a factor and an item or subtest (as is the case of loadings
in an orthogonal rotation) unless there is no overlap among the factors (i.e., the fac-
tors are uncorrelated). Specifically, because the factors correlate, the correlations between

Table 9.5.  Unrotated and Obliquely Rotated Factor Loadings for Crystallized and Fluid Intelligence Subtests

Columns: initial (unrotated) factor loadings on factors 1 and 2, followed by obliquely rotated factor loadings on factors 1 and 2.

Crystallized intelligence
  1. Gc measure of vocabulary  .84  –.30  |  .96  –.12
  2. Gc measure of knowledge  .78  –.23  |  .85  –.06
  3. Gc measure of abstract reasoning  .85  –.02  |  .70  .22
  4. Gc measure of conceptual reasoning  .80  –.26  |  .89  –.08
Fluid intelligence
  5. Gf measure of graphic orientation  .64  .49  |  .05  .78
  6. Gf measure of graphic identification  .64  .48  |  .06  .77
  7. Gf measure of inductive and deductive reasoning  –.12  .26  |  –.15  .35

Shading in the original table distinguishes factor 1 loadings on crystallized intelligence, factor 1 loadings on fluid intelligence, factor 2 loadings on fluid intelligence, factor 2 loadings on crystallized intelligence, and loadings on both factors 1 and 2.

Note. Correlation between factor and a subtest is a factor loading. Unrotated loadings are also called initial loadings or solutions. Rotation technique used is oblique (Promax). Correlation between factors is .59.

indicators (variables or tests) and factors may be inflated (e.g., a subtest may correlate
with one factor in part through its correlation with another factor). When interpreting
loadings from an oblique rotation, the contribution of a subtest to a factor is assessed
using the pattern matrix. The factor loadings in the pattern matrix represent the unique
relationship between a subtest and a factor while controlling for the influence of all the
other subtests. This unique contribution is synonymous with interpreting partial regres-
sion coefficients in multiple linear regression analysis (see the Appendix for a thorough
presentation of correlation and multiple regression techniques). One final point related to
regression and factor analysis is that the regression weights representing factor loadings
in an oblique solution are standardized regression weights.
In Table 9.6, we see that the orthogonally rotated factor loadings provide a clearer
picture of the factor structure than did the initial loadings, but not as clear as those
obtained from the oblique rotation. In an orthogonal rotation, the factors are constrained to
be uncorrelated (i.e., 90° in multidimensional space; see Figure 9.4). From a geometric
perspective, because the cosine (90°) of an angle is equal to zero, this amounts to saying
that the factors have no relationship to one another. One perceived advantage of using

Table 9.6.  Unrotated and Orthogonally Rotated Factor Loadings for Crystallized and Fluid Intelligence Subtests

Columns: unrotated factor loadings on factors 1 and 2, followed by orthogonally rotated factor loadings on factors 1 and 2.

Crystallized intelligence
  1. Gc measure of vocabulary  .84  –.30  |  .88  .12
  2. Gc measure of knowledge  .78  –.23  |  .79  .15
  3. Gc measure of abstract reasoning  .85  –.02  |  .76  .37
  4. Gc measure of conceptual reasoning  .80  –.26  |  .83  .14
Fluid intelligence
  5. Gf measure of graphic orientation  .64  .49  |  .33  .73
  6. Gf measure of graphic identification  .64  .48  |  .34  .73
  7. Gf measure of inductive and deductive reasoning  –.12  .26  |  –.02  .28

Shading in the original table distinguishes factor 1 loadings on crystallized intelligence, factor 1 loadings on fluid intelligence, factor 2 loadings on fluid intelligence, factor 2 loadings on crystallized intelligence, and loadings on both factors 1 and 2.

Note. Correlation between factor and a subtest is a factor loading. Orthogonal rotation technique used was Varimax.

orthogonal rotations is that the loadings between a factor and a subtest are interpreted as
a correlation coefficient, making interpretation straightforward. However, in an oblique
rotation, the pattern matrix provides loadings that are interpreted as standardized partial
regression coefficients (e.g., as are regression coefficients in multiple regression analy-
ses; for a review see Chapter 3 or the Appendix). Thus, the increase in simple structure
obtained by using an oblique rotation in conjunction with the availability and proper
interpretation of the pattern matrix is usually the best way to proceed—unless the factors
are uncorrelated by design (e.g., subtests comprising each factor are correlated with one another, but the factors are uncorrelated with each other). A variety of rotation techniques
have been developed; Table 9.7 provides an overview of the techniques. The most com-
monly used orthogonal technique is varimax, and the most used oblique techniques are
promax and oblimin.
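The rotation criteria in Table 9.7 are implemented in SPSS and SAS, but the core idea of an orthogonal rotation can also be sketched directly. The following Python function is a plain-numpy implementation of the varimax criterion; it is a sketch rather than the SPSS routine (SPSS additionally applies Kaiser normalization), so the rotated values will only approximate those reported in Table 9.6.

Python sketch: varimax rotation of the initial loadings

import numpy as np

def varimax(loadings, gamma=1.0, max_iter=100, tol=1e-6):
    """Orthogonally rotate a loading matrix toward the varimax criterion."""
    p, k = loadings.shape
    R = np.eye(k)
    d_old = 0.0
    for _ in range(max_iter):
        L = loadings @ R
        u, s, vt = np.linalg.svd(
            loadings.T @ (L**3 - (gamma / p) * L @ np.diag(np.diag(L.T @ L)))
        )
        R = u @ vt
        d = s.sum()
        if d_old != 0 and d / d_old < 1 + tol:
            break
        d_old = d
    return loadings @ R

# Initial (unrotated) loadings for the seven subtests, transcribed from Table 9.5.
A = np.array([[0.84, -0.30], [0.78, -0.23], [0.85, -0.02], [0.80, -0.26],
              [0.64,  0.49], [0.64,  0.48], [-0.12, 0.26]])
print(np.round(varimax(A), 2))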

9.6 Correlated Factors and Simple Structure

Recall that applying an orthogonal rotation results in factors being uncorrelated and that
applying an oblique rotation results in factors being correlated. In this section we examine

Table 9.7.  Rotation Techniques

Varimax (SPSS, SAS; orthogonal). Goal: minimize complexity of factors by maximizing variance of loadings on each factor. Comments: most commonly used rotation; recommended as a default option.

Quartimax (SPSS, SAS; orthogonal). Goal: minimize complexity of variables by maximizing variance of loadings on each variable. Comments: first factor tends to be general, with other subclusters of variables.

Orthogonal with gamma (orthomax) (SAS; orthogonal). Goal: simplifies either variables or factors, depending on the value of gamma. Comments: gamma is a continuously scaled variable.

Equamax (SPSS, SAS; orthogonal). Goal: simplifies both variables and factors; compromise between quartimax and varimax. Comments: research indicates erratic behavior.

Direct oblimin (SPSS; oblique). Goal: simplify factors by minimizing cross-products of loadings. Comments: continuous values of gamma or delta available; allows a wide range of factor intercorrelations.

Direct quartimin (SPSS, SAS; oblique). Goal: simplify factors by minimizing cross-products of squared loadings in the pattern matrix. Comments: permits fairly high correlations among factors; achieved in SPSS by setting delta = 0 with direct oblimin.

Promax (SPSS, SAS; oblique). Goal: orthogonal factors rotated to oblique solutions. Comments: fast and inexpensive with respect to computational time.

Orthoblique (SPSS, SAS; orthogonal and oblique). Goal: rescale factor loadings to yield an orthogonal solution; nonrescaled loadings may be correlated.

Note. From Tabachnick and Fidell (2007, p. 639). Copyright 2007. Reprinted by permission of Pearson Education, Inc. New York, New York.

the role correlated factors play in the mechanics of factor analysis. To begin, we return
to Table 9.5, which provides the factor loadings for the oblique solution. Notice that the
correlation between factors 1 and 2 is .59 (see the note at the bottom of Table 9.5).
Recall that in an oblique rotation, the factor loadings do not represent correlations
between subtests and factors; rather, the loadings are standardized regression weights
(i.e., a unique relationship between a subtest and a factor while controlling for the influ-
ence of all the other subtests is based on partial regression coefficients). To illustrate how
the correlation between factors relates to the relationship between two subtests, consider
the crystallized intelligence subtest of abstract reasoning. In Table 9.5 we see that the
loading on factor 1 for abstract reasoning is .70 and .22 on factor 2. Because oblique
rotations allow factors to be correlated, after accounting for the statistical relationship
between factor 1 and abstract reasoning (i.e., partialing out factor 1), abstract reason-
ing and factor 2 are related by a standardized regression weight of .22. A loading of .22
in this context is a partial (standardized) regression weight—not simply the correlation

between a factor and a subtest. Also, a factor loading of .70 for the abstract reasoning
subtest on factor 1 indicates a strong relationship for this subtest on crystallized intel-
ligence (unique factor) after controlling for factor 2 (the graphic identification compo-
nent of fluid intelligence). Modifying Equation 9.1a, we can account for the correlation
between factors and the relationship between any two subtests by Equation 9.5a (Crocker
& Algina, 1986, p. 293).
For example, we know from the results of the factor analysis with an oblique rota-
tion that the correlation between factors 1 and 2 is .59 (see the note in Table 9.5). To illus-
trate Equation 9.5a, we use the crystallized intelligence subtest abstract reasoning and the
fluid intelligence subtest graphic identification. Applying the factor loading values for
these subtests (from Table 9.5) into Equation 9.5a, we have Equation 9.5b.
Returning to Table 9.2, which presented the original correlation matrix for all seven
subtests, we can verify that Equation 9.5b holds by noting that the correlation between
the crystallized intelligence subtest abstract reasoning and fluid intelligence subtest
graphic identification is in fact .54.

Equation 9.5a. Relationship between pairwise correlation of two subtests and factors

rij = ai1aj1 + ai2aj2 + ai1aj2f + ai2aj1f

• rij = correlation between tests i and j.
• ai1 = loading of test i on factor 1.
• aj1 = loading of test j on factor 1.
• ai2 = loading of test i on factor 2.
• aj2 = loading of test j on factor 2.
• f = correlation between factors.

Equation 9.5b. Relationship between pairwise correlation of two subtests and factors

rij = ai1aj1 + ai2aj2 + ai1aj2f + ai2aj1f

    = .70(.06) + .22(.77) + .70(.77)(.59) + .22(.06)(.59)

    = .042 + .17 + .32 + .008

    = .54
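The arithmetic in Equation 9.5b can be checked in the same way as before. The Python lines below (illustrative only) use the rounded pattern loadings and the factor correlation of .59 from Table 9.5.

Python sketch: reproducing a correlation from oblique loadings and the factor correlation

# Pattern loadings (factor 1, factor 2) for the two subtests, plus the factor correlation.
a_abstract_reasoning = (0.70, 0.22)
a_graphic_identification = (0.06, 0.77)
phi = 0.59

r = (a_abstract_reasoning[0] * a_graphic_identification[0]
     + a_abstract_reasoning[1] * a_graphic_identification[1]
     + a_abstract_reasoning[0] * a_graphic_identification[1] * phi
     + a_abstract_reasoning[1] * a_graphic_identification[0] * phi)
print(round(r, 2))  # 0.54, matching the observed correlation in Table 9.2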

9.7 The Factor Analysis Model, Communality, and Uniqueness

Earlier in this chapter, Equations 9.1a and 9.2 illustrated how correlations between sub-
tests in the GfGc data are related to factor loadings. Equation 9.6 presents the factor
analysis model, a general equation that subsumes and extends Equations 9.1a and 9.2 in
a way that allows for the estimation of common factors and unique factors. A common
factor is one with which two or more subtests are correlated. These subtests are also cor-
related with one another to some degree. Conversely, a unique factor is correlated with
only one subtest (i.e., its association is exclusive to a single subtest). The common factor
model assumes that unique factors are uncorrelated (1) with each common factor and
(2) with unique factors for different tests. Thus, unique factors account for no correlation
between subtests in a factor analysis.
Related to the FA model are communality and uniqueness. One of the primary goals
of an FA is to determine the amount of variance that a subtest accounts for in relation to
a common factor. The communality of a subtest reflects the portion of the subtest’s vari-
ance that is associated with the common factor. For the case where the factors are uncor-
related, Equation 9.7 is applicable to estimation of the communality.
For example, consider the vocabulary subtest of crystallized intelligence. Using the
orthogonally (uncorrelated) derived factor loadings provided in Table 9.6, we find that

Equation 9.6. Common factor model

zi = aikFk + Ei

• zi = z-score on test i.
• aik = loadings of test i on factor k.
• Fk = scores on common factor k.
• Ei = scores on the factor unique to test i.

Equation 9.7. Communality

hi2 = ai12 + ai22 + . . . + aik2

• hi2 = communality expressed as the variance of subtest i.
• aik2 = squared loading of test i on factor k.

the loading for the vocabulary subtest is .88 on factor 1 and .12 on factor 2. Next, inserting these loadings into Equation 9.7 results in a communality of .78, as illustrated in Equation 9.8.
Raw scores (scores in their original units of measurement) are converted to z-scores
in factor analysis. As a result of this transformation, the variance of the scores equals 1.0.
The communality represents the portion of a particular subtest variance that is related
to the factor variance. The communality estimate is a number between 0 and 1 since any
distribution of z-scores has a mean of 0 and a standard deviation (and variance) of 1.0.
When two factors are correlated, as in the oblique solution summarized in Table 9.5, Equation 9.7 is modified as in Equation 9.9.
Applying Equation 9.9 to the obliquely estimated factor loadings in Table 9.5, we have Equation 9.10.
Based on the results of Equation 9.10, we see that when factors 1 and 2 are corre-
lated, the communality of the vocabulary subtest is substantially lower than is observed
when the factors are constrained to be uncorrelated (i.e., for the orthogonal solution).
The unique part of the variance for any particular subtest is expressed in Equation
9.11.
Using Equation 9.11 and inserting the communality from Equation 9.8, we obtain Equation 9.12a for the orthogonal solution; inserting the communality from Equation 9.10 yields Equation 9.12b for the oblique solution.

Equation 9.8. Communality for vocabulary subtest

hi2 = (.88)2 + (.12)2 = .77 + .01 = .78

Equation 9.9. Communality for a subtest when factors are correlated

hi2 = ai12 + ai22 + 2ai1ai2f

• hi2 = communality expressed as the variance of subtest i.
• ai12 = squared loading of test i on factor 1.
• ai22 = squared loading of test i on factor 2.
• 2ai1ai2 = twice the product of the loadings of test i on factors 1 and 2.
• f = correlation between factor 1 and 2.

Equation 9.10. Communality for vocabulary subtest when factors


are correlated

Equation 9.11. Unique variance for a subtest

ui2 = 1 − hi2

• hi2 = communality expressed as the variance of subtest i.
• ui2 = squared variance on the unique factor of test i.

Equation 9.12a. Uniqueness for vocabulary subtest when factors are uncorrelated

ui2 = 1 − hi2

    = 1 − .78
    = .22

Equation 9.12b. Uniqueness for vocabulary subtest when factors are correlated

ui2 = 1 − hi2

    = 1 − .54
    = .46
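These communality and uniqueness values are easy to verify. The short Python check below uses the orthogonally rotated vocabulary loadings from Table 9.6. Note that the unrounded communality is about .79; the chapter reports .78 because each squared loading is rounded to two decimals before summing.

Python sketch: communality and uniqueness for the vocabulary subtest

a1, a2 = 0.88, 0.12   # orthogonally rotated loadings on factors 1 and 2
h2 = a1**2 + a2**2    # communality (Equation 9.7), about .79 unrounded
u2 = 1 - h2           # uniqueness (Equation 9.11), about .21 unrounded
print(round(h2, 2), round(u2, 2))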

Equation 9.13. Unique variance partitioned into specific and error variance

1 − hi2 = si2 + ei2

• hi2 = communality expressed as the variance of subtest i.
• si2 = squared specific variance of test i on factor 1.
• ei2 = squared error variance of test i on factor 1.

Continuing with our illustration using the vocabulary subtest, we find that the unique variance can be partitioned into two components—specific variance (si2) and error variance (ei2). The specific variance is the part of the vocabulary subtest true score variance that is not related to the true score variance on any other subtest. At this point, you may notice a connection to classical test theory by the mention of true score. In fact, the common factor model provides another framework for measurement theory as related to true, observed, and error scores (e.g., see McDonald, 1999, for an excellent treatment). Based on the definition of unique variance as the sum of specific variance and error variance, we can partition the unique variance into these two additive parts, as in Equation 9.13.
Earlier in the chapter we raised a question regarding how factors are estimated since
they are unobservable variables. Recall that in Chapter 7 the topic of reliability was intro-
duced within the framework of classical test theory. In the common factor analysis model,
the reliability of the test scores can be conceived as the sum of the communality and the
specific variance for a subtest as presented in Equation 9.13. Figure 9.6 illustrates how
the variance in a set of test scores is partitioned in CTT (top half) and FA (bottom half).
The following conceptual connections can be made between the equations in Figure 9.6. In Equation 9.6, Fk plays the role of the true score Ti, zi corresponds to the observed score Xi, and Ei corresponds to the error score Ei of the true score model. Using the assumptions of the
true score model and extending these ideas to factor analysis, we can estimate common
factors that are unobservable.

9.8 Components, Eigenvalues, and Eigenvectors

Foundational to understanding factor analysis are eigenvalues, eigenvectors, and principal components. We begin with an introduction of a component and how it is used in a
principal components analysis; later the idea of a principal component and eigenvalues are
related to factor analysis. A principal component is derived based on a linear combination
of optimally weighted observed variables (e.g., variables being subtests on the GfGc tests of
intelligence and memory). Starting with a set of correlated variables (e.g., the 10 subtests
in our GfGc data), the goal of a principal components analysis is to end with a set of uncor-
related components that account for a large portion of the total score variance (e.g., ideally

[Figure 9.6 diagrams the variance partition. Top half (classical test theory): XO = XT + XE, so the variance of observed scores (VO) is divided into variance from true scores (VT; 80% in the illustration) and variance from error scores (VE; 20%). Bottom half (factor analysis): VT = VCO + VSP + VE, so the total variance is divided into variance from common factors (VCO; here factor 1, VA, and factor 2, VB, which together form the communality, h2), variance from specific scores (VSP), and variance from error scores (VE). Expressed as proportions, 1.00 = VA/VT + VB/VT + VSP/VT + VE/VT, and the region labeled test reliability (rtt) spans the common and specific components. Together the two halves present the unified common factor analysis model and classical true score (test theory) model expressed as variance (V) components.]

Figure 9.6. Variance partition in classical test theory and factor analysis. V denotes the vari-
ance; CO signifies common factor.

> 75%) in the original variables (subtests). An important difference between PCA and common factor analysis is that in PCA, the full correlation matrix is used during estimation of the loadings (i.e., the main diagonal in the matrix contains “1’s”). Therefore, in PCA the unique (error) variances of the measured variables are assumed to be zero. To this end, PCA is a variable reduction technique that provides an explanation of the contribution of each component to the
total set of variables. The fundamental unit of analysis in PCA is the correlation matrix. In a
PCA, all of the values along the diagonal in the correlation matrix to be analyzed are set to
unity (i.e., a value of 1). The intercorrelations for the 10 GfGc subtests are revisited in Table
9.8 (notice that all of the values along the diagonal or darkest shaded cells are set to 1.0).
Because all values along the diagonal of the correlation matrix are 1.0, all of the vari-
ance between the observed variables is considered shared or common. The components result-
ing from a PCA are related to the variables by way of the factor–component relationship.
The first component derived from a PCA is a linear combination of subtests that represents
the maximum amount of variance; the variance of this first component is equal to the
largest eigenvalue of the sample covariance matrix (Fabrigar & Wegner, 2012, pp. 30–35;
McDonald, 1999). The second principal component is a second linear combination of
the 10 ­subtests that is uncorrelated with the first principal component. For example, con-
sider the four subtests comprising crystallized intelligence in the GfGc dataset. To obtain

Table 9.8.  Intercorrelations for 10 GfGc Subtests


1 2 3 4 5 6 7 8 9 10
1. Short-term memory: based on 1 — — — — — — — — —
visual cues
2. S
 hort-term memory: auditory and .517** 1 — — — — — — — —
visual components
3. Short-term memory: math reasoning .540** .626** 1 — — — — — — —

4. Gc: measure of vocabulary .558** .363** .406** 1 — — — — — —

5. Gc: measure of knowledge .602** .326** .384** .717** 1 — — — — —

6. Gc: measure of abstract reasoning .572** .413** .478** .730** .667** 1 — — — —

7. Gc: measure of conceptual reasoning .548** .319** .365** .749** .694** .677** 1 — — —

8. Gf: measure of graphic orientation .420** .407** .545** .391** .394** .528** .377** 1 — —

9. Gf: measure of graphic identification .544** .480** .588** .392** .374** .544** .397** .654** 1 —

10. G
 f: measure of inductive/deductive .073* .121** .156** 0.01 0.04 .096** 0.03 .210** .199** 1
reasoning
Note. N = 1,000. **Correlation is significant at the 0.01 level (2-tailed).
*Correlation is significant at the 0.05 level (2-tailed). Shaded cells highlight the intercorrelations among the
subtests comprising each of the three areas of intelligence.

examinee-­level scores on a particular principal component, observed scores for the 1,000
examinees for the 10 intelligence subtests are optimally weighted to produce an optimal
linear combination. The weighted subtests are then summed yielding a principal compo-
nent. An eigenvalue (also called a latent or characteristic root) is a value resulting from the
consolidation of the variance in a matrix. Specifically, an eigenvalue is defined as the column
sum of squared loadings for a factor. An eigenvector is an optimally weighted linear combi-
nation of variables used to derive an eigenvalue. The coefficients applied to variables to form
linear combinations of variables in all multivariate statistical techniques originate from eigen-
vectors. The variance that the solution accounts for (e.g., the variance of the first principal
component) is directly associated with or represented by an eigenvalue.
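Eigenvalues and the proportion of variance they account for can be computed from any correlation matrix with standard linear algebra routines. The Python sketch below uses a small 3 × 3 submatrix of the correlations in Table 9.8 (Gc vocabulary, Gc knowledge, and Gf graphic orientation) purely for illustration; the PCA reported later in the chapter uses all 10 subtests.

Python sketch: eigenvalues of a correlation matrix

import numpy as np

# Correlations among Gc vocabulary, Gc knowledge, and Gf graphic orientation (Table 9.8).
R = np.array([[1.000, 0.717, 0.391],
              [0.717, 1.000, 0.394],
              [0.391, 0.394, 1.000]])

eigenvalues = np.sort(np.linalg.eigvalsh(R))[::-1]  # largest first
proportions = eigenvalues / eigenvalues.sum()       # eigenvalues sum to the number of variables
print(np.round(eigenvalues, 3), np.round(proportions, 3))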

9.9 Distinction between Principal Components Analysis


and Factor Analysis

Using the intercorrelations in Table 9.1, we see that FA is used (1) to identify underlying
patterns of relationships for the 10 subtests and (2) to determine whether the information
can be represented by a factor or factors that are smaller in number than the total number
of observed variables (i.e., the 10 subtests in our example). A technique related to but
not the same as factor analysis, principal components analysis (Hotelling, 1933, 1936;
Tabachnick & Fidell, 2007), is used to reduce a large number of variables into a smaller
number of components. In PCA, the primary goal is to explain the variance in observed
measures in terms of a few (as few as possible) linear combinations of the original vari-
ables (Raykov & Marcoulides, 2011, p. 42). The resulting linear combinations in PCA
are identified as principal components. Each principal component does not necessarily
reflect an underlying factor because the goal of PCA is strictly variable reduction based
on all of the variance among the observed variables. PCA is a mathematical maximization
technique with mainly deterministic (descriptive) goals. Strictly speaking, PCA is not a
type of FA because its use involves different scientific objectives than factor analysis. In
fact, PCA is often incorrectly used as a factor-analytic procedure (Cudeck, 2000, p. 274).
For example, principal components analysis is not designed to account for the correla-
tions among observed variables, but instead is constructed to maximally summarize the
information among variables in a dataset (Cudeck, 2000, p. 275).
Alternatively, consider the goal of FA. One type of FA takes a confirmatory approach
where researchers posit a model based on a theory and use the responses (scores) of
examinees on tests based on a sample to estimate the factor model. The scores used in an
FA are evaluated for their efficacy related to the theory that is supposed to have gener-
ated or caused the responses. This type is confirmatory in nature (i.e., confirmatory factor
analysis). Recall that in CFA, researchers posit an underlying causal structure where one
or more factors exist rather than simply reducing a large number of variables (e.g., tests
or test items) to a smaller number of dimensions (factors).
Researchers also use FA in an exploratory mode (i.e., EFA). For example, suppose that
the theory of general intelligence was not well grounded empirically or theoretically. Using

the 10 subtests in the GfGc data, you might conduct an EFA requesting that three factors
be extracted and evaluate the results for congruence with the theory of general intelligence.
From a statistical perspective, the main difference between FA and PCA resides in
the way the variance is analyzed. For example, in PCA the total variance in the set of
variables is analyzed as compared to factor analysis where common and specific variance
is partitioned during the analysis. Figure 9.7 illustrates the way the variance is partitioned
in PCA versus FA. For an accessible explanation of how a correlation matrix containing a
set of observed variables is used in PCA versus common factor analysis, see Fabrigar and
Wegner (2012, pp. 40–84). Notice in the top portion of Figure 9.7 that there is no provision for the shared or overlapping variance among the variables. For this reason, PCA is
considered a variance-maximizing technique and uses the correlation matrix in the analy-
sis (e.g., 1’s on the diagonal and correlation coefficients on the off-diagonal of the matrix).
Conversely, in FA the shared variance is analyzed (see the lower half of Figure
9.7), thereby accounting for the shared relationship among variables. Capturing the
relationship(s) among variables while simultaneously accounting for error variance (ran-
dom and specific) relative to a theoretical factor structure is a primary goal of factor analysis
(and particularly true for confirmatory factor analysis). FA therefore uses a reduced cor-
relation matrix consisting of communalities along the diagonal (see Equation 9.11) of the
matrix and correlation coefficients on the off-diagonal of the matrix. The communalities
are the squared multiple correlations and represent the proportion of variance in that variable that is accounted for by the remaining tests in the battery (Fabrigar & Wegner, 2012).
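A common choice for the communalities placed on the diagonal of the reduced correlation matrix is the squared multiple correlation (SMC) of each variable with all of the others. The brief numpy sketch below is illustrative; R is assumed to be the 10 × 10 correlation matrix built from Table 9.8.

Python sketch: squared multiple correlations for the reduced correlation matrix

import numpy as np

def squared_multiple_correlations(R):
    """SMC for each variable: 1 - 1/r_ii, where r_ii is the i-th diagonal
    element of the inverse of the correlation matrix R."""
    return 1.0 - 1.0 / np.diag(np.linalg.inv(R))

# reduced_R = R.copy()
# np.fill_diagonal(reduced_R, squared_multiple_correlations(R))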
In FA, the common variance is modeled as the covariation (see Chapter 2 for how the
covariance is derived) among the subtests (i.e., the covariance is based on the deviation
of each examinee score from the mean of the scores for a particular variable). In this way,
common or shared variance is accounted for among the variables by working in deviation
scores, thereby keeping the variables in their original units of measurement. Additionally,
the variance (i.e., the standard deviation squared) is included along the diagonal of the
matrix (see Table 9.9).

[Figure 9.7 contrasts the diagonal elements of the matrix analyzed by each approach. Principal components analysis places unity (1's) on the diagonal of the matrix to be analyzed and extracts the total variance. Factor analysis places communalities (shared variance) on the diagonal and extracts the common variance; the specific and error variance is not extracted ("variance lost").]

Figure 9.7.  Variance used in principal components analysis and factor analysis. Adapted from
Hair, Anderson, Tatham, and Black (1998, p. 102). Copyright 1998. Reprinted by permission of
Pearson Education, Inc. New York, New York.

Table 9.9.  Variance–Covariance Matrix for 10 GfGc Subtests


1 2 3 4 5 6 7 8 9 10
1. cri1_tot 403.36
2. cri2_tot 187.27 103.03
3. cri3_tot 179.81 86.37 95.88
4. cri4_tot 208.28 102.80 95.75 123.15
5. fi1_tot 38.07 15.17 28.04 16.86 44.37
6. fi2_tot 35.24 12.24 27.62 16.00 33.74 46.10
7. fi3_tot –57.56 –32.72 –29.67 –34.83 –9.17 –11.07 22.82
8. stm1_tot 72.84 36.20 34.28 38.20 8.63 12.79 –18.01 19.52
9. stm2_tot 36.27 12.27 21.22 13.19 22.61 26.32 –13.98 12.87 40.65
10. stm3_tot 21.51 5.49 10.48 6.43 7.58 11.53 –11.54 7.56 18.89 17.09
Note. Shaded cells include the variance for a specific variable. Off-diagonal cell values are covariances.

Although statistical programs such as SPSS and SAS internally derive the variance–
covariance matrix within FA routines when requested, the following program creates a
correlation matrix for the set of 10 GfGc subtests used in our example and then transforms
the matrix into a variance–covariance matrix. You may find the program useful for calcu-
lating a variance–covariance matrix that can subsequently be used in a secondary analysis.

SPSS program for creating a variance–covariance matrix

CORRELATIONS VARIABLES= cri1_tot cri2_tot cri3_tot cri4_tot
  fi1_tot fi2_tot fi3_tot stm1_tot stm2_tot stm3_tot
  /MATRIX=OUT(*).
MCONVERT /MATRIX=IN(*) OUT("C:\gfgccovmatrix.sav").
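For readers working outside SPSS, the same matrices can be produced in a few lines of Python with pandas. This is only a rough analogue of the program above: the file name is hypothetical, and the column names are assumed to match the SPSS variable names.

Python sketch: correlation and variance–covariance matrices for the GfGc subtests

import pandas as pd

df = pd.read_csv("gfgc_subtests.csv")  # hypothetical file holding the subtest total scores
cols = ["cri1_tot", "cri2_tot", "cri3_tot", "cri4_tot",
        "fi1_tot", "fi2_tot", "fi3_tot",
        "stm1_tot", "stm2_tot", "stm3_tot"]
corr_matrix = df[cols].corr()  # correlations (1's on the diagonal)
cov_matrix = df[cols].cov()    # variances on the diagonal, covariances off the diagonal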

Next, we turn to an illustration of how PCA and FA produce different results using the
10 subtests in the GfGc data. First, the 10 subtests in the GfGc dataset are used to conduct
a PCA. The correlation matrix derived from the subtests in Table 9.1 is used to conduct the
PCA (refer to Figure 9.7 to recall how the variance is used in PCA). A partial output is dis-
playing the eigenvalue solution and principal components in Table 9.10 for the PCA. In Table
9.10, 10 eigenvalues are required to account for 100% of the variance in the 10 subtests. An
eigenvalue (reviewed in greater detail shortly) is a measure of variance accounted for by a
given dimension (i.e., factor). If an eigenvalue is greater than 1.0, the component is deemed
significant or practically important in terms of the variance it explains (Fabrigar & Wegner,
2012, p. 53). However, the eigenvalue greater than one rule has several weaknesses, and alter-
native approaches should also be used (e.g., see Fabrigar & Wegner, 2012, pp. 53–64). Spe-
cifically, parallel analysis, likelihood ratio tests of model fit, and minimum average partial
correlation techniques all offer improvements over the eigenvalue greater than one rule.
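Parallel analysis, mentioned above as an improvement over the eigenvalue-greater-than-one rule, is straightforward to sketch: the observed eigenvalues are compared with average eigenvalues obtained from random data of the same size. The Python code below is a minimal illustration; it assumes an observed correlation matrix R_observed built from the GfGc data and is not the implementation found in any particular package.

Python sketch: a parallel analysis criterion

import numpy as np

def random_data_eigenvalues(n_obs, n_vars, n_reps=100, seed=0):
    """Mean eigenvalues of correlation matrices computed from random normal data."""
    rng = np.random.default_rng(seed)
    eigs = np.empty((n_reps, n_vars))
    for r in range(n_reps):
        X = rng.normal(size=(n_obs, n_vars))
        eigs[r] = np.sort(np.linalg.eigvalsh(np.corrcoef(X, rowvar=False)))[::-1]
    return eigs.mean(axis=0)

random_eigs = random_data_eigenvalues(n_obs=1000, n_vars=10)
# observed_eigs = np.sort(np.linalg.eigvalsh(R_observed))[::-1]
# retain = observed_eigs > random_eigs  # keep components exceeding the random benchmark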

Table 9.10.  Eigenvalue Solution from Principal Components Analysis


Total Variance Explained

Initial Eigenvalues Extraction Sums of Squared Loadings

Component Total % of Variance Cumulative % Total % of Variance Cumulative %

1 5.111 51.109 51.109 5.111 51.109 51.109

2 1.389 13.892 65.001 1.389 13.892 65.001

3 .908 9.077 74.078

4 .667 6.672 80.750

5 .450 4.498 85.248

6 .359 3.590 88.838

7 .336 3.361 92.199

8 .294 2.938 95.137

9 .262 2.618 97.755

10 .224 2.245 100.000

Extraction Method: Principal Component Analysis.

Returning to our interpretation of the PCA and inspecting Table 9.10, we see that only two of the eigenvalues meet the 1.0 criterion for retaining or classifying a component as significant. In Table 9.10, the first principal component has an eigenvalue of 5.1 and accounts for or explains 51% of the variance in the 10 subtests. The second principal component has an eigenvalue of 1.39 and accounts for or explains an additional 14% of the variance in the 10 subtests; these principal components are uncorrelated, so their contributions can be summed to derive a total cumulative variance. Together, components one and two account for 65% of the cumulative variance in the 10 subtests.
Following is the SPSS program that produced Table 9.10.

SPSS program that produced Table 9.10

FACTOR
  /VARIABLES stm1_tot stm3_tot stm2_tot cri1_tot cri2_tot cri3_tot cri4_tot
    fi1_tot fi2_tot fi3_tot
  /MISSING LISTWISE
  /ANALYSIS stm1_tot stm3_tot stm2_tot cri1_tot cri2_tot cri3_tot cri4_tot
    fi1_tot fi2_tot fi3_tot
  /PRINT INITIAL EXTRACTION
  /CRITERIA MINEIGEN(1) ITERATE(25)
  /EXTRACTION PC
  /ROTATION NOROTATE
  /METHOD=CORRELATION.

9.10 Confirmatory Factor Analysis

To this point in the chapter, FA has been presented in an exploratory and descriptive
manner. Our goal has been to infer factor structure from patterns of correlations in
the GfGc data. For example, using the crystallized and fluid intelligence subtests, we
reviewed how FA works and how it is used to identify the factor(s) underlying the
crystallized and fluid intelligence subtests. To accomplish this review, we allowed every
subtest to load on every factor in the model and then used rotation to aid in interpret-
ing the factor solution. Ideally, the solution is one that approximates simple structure.
However, the choice of a final or best model could only be justified according to sub-
jective criteria. CFA makes possible evaluation of the overall fit of the factor model,
along with the ability to statistically test the adequacy of model fit to the empirical data.
In CFA, we begin with a strong a priori idea about the structure of the factor model.
CFA provides a statistical framework for testing a prespecified theory in a manner that
requires stronger statistical assumptions than the techniques presented thus far. For an
applied example of CFA with a cross validation using memory test data, see Price et al.
(2002).

9.11 Confirmatory Factor Analysis and Structural


Equation Modeling

Structural equation modeling (SEM) provides a statistical framework for estimating a set of relationships between one or more independent variables and one or more dependent variables. In fact, CFA is a type of SEM that deals specifically with
measurement models—that is, the relationships between observed measures or indi-
cators (e.g., test items, test scores, or behavioral ratings) and latent variables or fac-
tors (Brown, 2006, p. 1). The independent and dependent variables may be latent
or observed variables, and the level of measurement may be discrete or continuous.
SEM is also known as causal modeling, covariance structure modeling, or simulta-
neous equation modeling. Path analysis and CFA are two special types of SEM in
which certain restrictions are imposed on the model to be estimated. SEM provides a
powerful framework for testing a priori hypotheses about a variety of causal models.
Specific to CFA, SEM provides a rigorous approach to testing the factorial structure
of a set of measured variables (e.g., the crystallized and fluid intelligence subtests in
our examples).
Certain common conventions (Schumacker & Lomax, 2010) are used in SEM: (1)
Measured variables are graphically depicted as rectangles or squares (Figure 9.8) and
are called observed variables, indicators, or manifest variables, and (2) factors have two
or more indicators and are called latent variables, constructs, or unobserved variables.

[Figure 9.8 depicts the measurement model: the latent factor F (crystallized intelligence) is related through factor loadings λ1 through λ4 to four observed variables, X1 (language development), X2 (lexical knowledge), X3 (listening ability), and X4 (communication ability), each with a measurement error E. The factor analysis equation in matrix notation is X = λF + E.]

Figure 9.8.  Measurement model for crystallized intelligence.

Factors are represented by ovals in a path diagram. Relationships in an SEM are repre-
sented by lines—either straight (signifying a direct relationship) or curved (representing
a covariance or correlation). Furthermore, the lines may have one or two arrows. For
example, a line with a single arrow represents a hypothesized direct relationship between
two variables; the variable with the arrow pointing to it is the dependent variable. A line
that includes arrows at both ends represents a covariance or correlation between two
variables with no implied direct effect.
In a latent variable SEM, two parts comprise the full SEM: a measurement model and a structural model. In our example using crystallized and fluid intelligence, the
measurement model relates the subtest scores to the factor. For example, the measure-
ment model for crystallized intelligence is provided in Figure 9.8.
Figure 9.9 illustrates the common factor model introduced earlier.
Figure 9.10 illustrates an orthogonal common factor model based on the examples
in this chapter. Figure 9.11 illustrates an oblique or correlated factors model based on
crystallized and fluid intelligence.

[Figure 9.9 diagrams the exploratory common factor model: crystallized intelligence with the observed variables language development, lexical knowledge, listening ability, and communication ability, and fluid intelligence with sequential reasoning, abstract reasoning, and induction/deduction. Each observed variable has a unique factor or measurement error (Errors 1 through 7), and dashed arrows indicate possible cross-loadings.]

Figure 9.9.  Common factor model represented as a path diagram. Exploratory common fac-
tor model is one where each factor is allowed to load on all tests. The dashed arrows are not
hypothesized to “cross-load,” but in an exploratory analysis, this is part of the analysis. Also, a
common factor is a factor that influences more than one observed variable. For example, language
development, lexical knowledge, listening ability, and communication ability are all influenced
by the crystallized intelligence factor. The common factor analysis model above is orthogonal
because the factors are not correlated (e.g., a double-headed arrow connecting the two factors is
not present).

SEM provides a thorough and rigorous framework for conducting factor analysis of
all types. However, conducting FA using an SEM approach requires a thorough under-
standing of covariance structure modeling/analysis in order to correctly use the tech-
nique. Additionally, interpretation of the results of a confirmatory (or exploratory) factor
analysis using SEM involves familiarity with model fit and testing strategies. Readers
interested in using SEM for applied factor analysis work or in factor-analytic research
studies are encouraged to see Schumacker and Lomax (2010), and Brown (2006).
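To connect the path diagrams in this section to the algebra, the covariance structure implied by an oblique two-factor measurement model can be written as Sigma = Lambda Phi Lambda′ + Theta. The numpy sketch below is purely illustrative: the indicator loadings are placeholder values (the chapter does not report indicator-level CFA estimates), and only the factor correlation of .59 is carried over from the earlier exploratory results.

Python sketch: model-implied correlation matrix for an oblique two-factor model

import numpy as np

# Placeholder standardized loadings: rows are the seven indicators in Figure 9.11,
# columns are crystallized and fluid intelligence.
Lambda = np.array([[0.9, 0.0],   # language development
                   [0.8, 0.0],   # lexical knowledge
                   [0.8, 0.0],   # listening ability
                   [0.7, 0.0],   # communication ability
                   [0.0, 0.8],   # sequential reasoning
                   [0.0, 0.7],   # abstract reasoning
                   [0.0, 0.6]])  # induction/deduction

Phi = np.array([[1.00, 0.59],    # factor variances and the factor correlation of .59
                [0.59, 1.00]])

# Unique variances chosen so that each indicator has unit variance.
Theta = np.diag(1 - np.sum((Lambda @ Phi) * Lambda, axis=1))

Sigma = Lambda @ Phi @ Lambda.T + Theta  # model-implied correlation matrix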

[Figure 9.10 shows the same two-factor structure as Figure 9.9 (crystallized intelligence with language development, lexical knowledge, listening ability, and communication ability; fluid intelligence with sequential reasoning, abstract reasoning, and induction/deduction), with each observed variable having its own measurement error (Errors 1 through 7) and no path connecting the two factors.]

Figure 9.10. Orthogonal factor model represented as a path diagram. A common factor is a


factor that influences more than one observed variable. For example, language development, lexi-
cal knowledge, listening ability, and communication ability are all influenced by the crystallized
intelligence factor. The common factor analysis model above is orthogonal because the factors are
not correlated (e.g., a double-headed arrow connecting the two factors is not present).

9.12 Conducting Factor Analysis: Common Errors to Avoid

Given that the integrity of an FA hinges on the design of the study and the actual use of the
technique, there are many possible ways for researchers to commit errors. Comrey and
Lee (1992, pp. 226–228) and Fabrigar and Wegner (2012, pp. 143–151) offer the follow-
ing suggestions regarding errors to avoid when conducting an FA:

1. Collecting data before planning how the factor analysis will be used.
2. Using data variables with poor distributions and inappropriate regression forms:
a. Badly skewed distributions, for example, with ability tests that are too easy or
too hard for the subjects tested.
b. Truncated distributions.

[Figure 9.11 shows the same two-factor structure, with each observed variable and its measurement error (errors 1 through 7) and with the crystallized and fluid intelligence factors connected to represent their correlation.]

Figure 9.11.  Oblique factor model represented as a path diagram. A common factor is a fac-
tor that influences more than one observed variable. For example, language development, lexical
knowledge, listening ability, and communication ability are all influenced by the crystallized intel-
ligence factor. The common factor analysis model above is oblique because the factors are corre-
lated (e.g., a double-headed arrow connecting the two factors is present).

c. Bimodal distributions.
d. Distributions with few extreme cases.
e. Extreme splits in dichotomized variables.
f. Nonlinear regressions.
3. Using data variables that are not experimentally independent of one another:
a. Scoring the same item responses on more than one variable.
b. In a forced-choice item, scoring one response alternative on one variable and
the other on a second variable.
c. Having one variable as a linear combination of others, for example, in the
GfGc data used in this book, crystallized intelligence and fluid intelligence
comprise part of the construct of general intelligence, so the total score for
general intelligence should not be factor analyzed as a single variable.

4. Failing to overdetermine the factors. For example, the number of variables


should be several times as large as the number of factors. There should be at
least 5 variables for each anticipated factor; 10 or more variables for each fac-
tor may be required, depending on the type and quality of the measures being
used.
5. Using too many complex data variables. The best variables for determining fac-
tors are relatively factor pure. Only a few multiple-factor data variables should
be used. For example, if complex variables measuring both factors A and B are
included, there must be some variables that measure A and not B and others that
measure B and not A.
6. Including highly similar variables in the analysis that produce factors at a very
low level in the hierarchy of factors when constructs of greater generalizability
are being sought.
7. Failing to provide good marker tests or variables for a factor that may be pres-
ent in other factor-complex data variables that are included. Without adequate
marker variables, the factor will be difficult to locate, although variance for that
factor will be present in the analysis and must appear on some factor.
8. Using poor sampling procedures:
a. Taking a sample of cases that is too small to obtain stable correlations.
b. Combining two distinct groups with different factor structures in the same
sample for factor-analytic purposes.
c. Losing a factor through biased sampling that restricts the range of variability on that factor.
9. Using inappropriate correlation coefficients, such as the phi coefficient or tetrachoric correlation, in situations that violate the assumptions of their use.
10. Using inappropriate communality estimates, for example, 1.0, in the matrix diag-
onals when the objectives of the study are concerned only with common-factor
variance.
11. Extracting too few factors, forcing a factor solution for m factors into a space of
fewer dimensions, with consequent distortion of the factor solution.
12. Employing poor rotation procedures:
a. Failing to rotate at all.
b. Using an orthogonal rotation when an oblique solution is necessary to give
an accurate picture of the results.
c. Permitting an unwarranted degree of obliquity between factors in pursuit of
simple structure.
d. Using a rotational technique that has not been determined to be appropriate
for the kind of data involved.
e. Rotating of extra-small factors by a rotation method, such as Varimax, that
spreads the variance out to minor factors to an inappropriate degree.
f. Failing to plan the study so that a suitable rotational criterion can be employed.

13. Interpreting the first extracted factor as a general factor.


14. Leaping to conclusions about the nature of a factor on the basis of insufficient
evidence—for example, low loadings and lack of outside confirmatory informa-
tion or evidence. The interpretations must be verified on the basis of evidence
outside the factor analysis itself. Follow-up factor analytic and construct valida-
tion studies are an important part of this verification process.

9.13 Summary and Conclusions

Factor analysis is a technique for reducing multiple themes embedded in tests to a simpler
structure. This technique is used routinely in the psychometric evaluation of tests and
other measurement instruments. It is particularly useful in establishing statistical evi-
dence for the construct validity of scores obtained on tests. An overview of the concepts
and process of conducting an FA was provided as they relate to the conceptual definitions
underlying a set of measured variables. Core questions common to correctly conduct-
ing and interpreting a factor-analytic study were provided. Starting with the correlation
matrix comprising a set of tests, the process of how FA works relative to the common
factor model was introduced by way of applied examples. Exploratory and confirmatory
approaches to FA were described, with explanations of when their use is appropriate.
The distinction was made between principal components analysis and factor analysis—­
conceptually and statistically.
Structural equation modeling was introduced as a technique that provides a flexible
and rigorous way to conduct CFA. The chapter concluded by presenting common errors
to avoid when conducting factor analysis.

Key Terms and Definitions


Common factor. A factor with which two or more subtests are correlated.

Common factor model. Factor analytic model where variables are correlated in part due
to common unobserved influence.
Communality. Reflects the portion of the subtest’s variance associated with the common
factor. It is the sum of the squared loadings for a variable across factors.
Confirmatory factor analysis. A technique used to test (confirm) a prespecified relation-
ship (e.g., from theory) or model representing a posited theory about a construct or
multiple constructs; the opposite of exploratory factor analysis.
Eigenvalue. The amount of total variance explained by each factor, with the total amount
of variability in the analysis equal to the number of original variables (e.g., each
variable contributes one unit of variability to the total amount, due to the fact that the
variance has been standardized; Mertler & Vannatta, 2010, p. 234).
Eigenvector. An optimally weighted linear combination of variables used to derive an
eigenvalue.

Exploratory factor analysis. A technique used for identifying the underlying structure
of a set of variables that represent a minimum number of hypothetical factors. EFA
uses the variance–covariance matrix or the correlation matrix where variables (or test
items) are the elements in the matrix.
Factor. An unobserved or a latent variable representing a construct. Also called an inde-
pendent variable in ANOVA terminology.
Factor indeterminacy. The situation in estimating a factor solution where an infinite num-
ber of possible sets of factor loadings are plausible.
Factor loading. The Pearson correlation between each variable (e.g., a test item or total
test score) and the factor.
Factor rotation. The process of adjusting the factor axes after extraction to achieve a
clearer and more meaningful factor solution. Rotation aids in interpreting the factors
produced in a factor analysis.
Latent factor. Manifested as unobserved influences among variables.

Measurement model. A submodel in structural equation modeling that specifies the


indicators (observed variables) for each construct (latent variable). Additionally, the
reliability of each construct may be estimated using measurement models.
Oblique rotation. Technique yielding transformed factors that are correlated (i.e., factor orientations other than 90° are permitted).
Orthogonal rotation. Technique yielding transformed factors that are uncorrelated (i.e.,
factors are oriented at 90° angles in multidimensional space).
Pattern matrix. The pattern of unique relationships that exist between a subtest and a
factor while controlling for the influence of all the other subtests.
P-type factor analysis. A type of factor analysis that focuses on change within a single
person or persons captured by repeated measurements over time.
Q-type factor analysis. Analysis that forms groups of examinees based on their similari-
ties on a set of characteristics (similar to cluster analysis).
R-type factor analysis. A type of factor analysis whose focus is on grouping variables
(e.g., subtests in the GfGc data) into similar clusters that reflect latent constructs.
Reduced correlation matrix. A correlation matrix consisting of communalities along the
diagonal of the matrix.
Simple structure. Identification of as pure a set of variables as possible (e.g., each variable or subtest loads on as few factors as possible, with as many near-zero loadings as possible in the rotated factor matrix; Kerlinger & Lee, 2000).
Specific variance (si2). Systematic variance not shared by any other measure.

Structural equation modeling. A multivariate technique that combines multiple regression (examining dependence relationships) and factor analysis (representing unmeasured concepts or factors comprised of multiple items) to estimate a series of interdependent relationships simultaneously (Hair et al., 1998, p. 583).
Structural model. A set of one or more dependence relationships linking the model’s
hypothesized constructs.
Unique factor. A factor that is correlated with only one subtest (i.e., its association is
exclusive to a single subtest).
10

Item Response Theory

In this chapter, an alternative to the classical test theory model is presented. Item response
theory (IRT) is a model-based approach to measurement that uses item response patterns
and ability characteristics of individual persons or examinees. In IRT, a person’s responses
to items on a test are explained or predicted based on his or her ability. The response
patterns for a person on a set of test items and the person’s ability are expressed by a
monotonically increasing function. This chapter introduces Rasch and IRT models and their
assumptions and describes four models used for tests composed of dichotomously scored
items. Throughout the chapter examples are provided, using data based on the general-
ized theory of intelligence.

10.1 Introduction

The classical test theory (CTT) model serves researchers and measurement special-
ists well in many test development situations. However, as with any method, there are
shortcomings that give rise to the need for more sophisticated approaches. Recall from
Chapter 7 that application of the CTT model involves using only the first and second
moments of a distribution of scores (i.e., the mean and variance or covariances) to index
a person’s performance on a test. In CTT, the total score for an examinee is derived by
summing the scores on individual test items. Using only the total score and first and sec-
ond moments of a score distribution (i.e., the mean and standard deviation) is somewhat
limiting because the procedure lacks a rigorous framework by which to test the efficacy
of the scores produced by the final scale. An alternative approach is to have a psychomet-
ric technique that provides a probabilistic framework for estimating how examinees will
perform on a set of items based on their ability and characteristics of the items (e.g., how
difficult an item is). Item response theory (IRT), also known as modern test theory, is a
system of modeling procedures that uses latent characteristics of persons or examinees
and test items as predictors of observed responses (Lord, 1980; Hambleton & Swaminathan,
1985; de Ayala, 2009). Similar to other statistical methods, IRT is a model-based theory
of statistical estimation that conveniently places persons and items on the same metric
based on the probability of response outcomes. IRT offers a powerful statistical frame-
work that is particularly useful for experts in disciplines such as cognitive, educational,
or social psychology when the goal is to construct explanatory models of behavior and/
or performance in relation to theory.
This chapter begins by describing the differences between IRT and CTT and provides
historical and philosophical perspectives on the evolution of IRT. The chapter proceeds by
describing the assumptions, application, and interpretation of the Rasch, one-parameter
(1-PL), two-parameter (2-PL), and three-parameter (3-PL) IRT models for dichotomous
test item responses. Throughout the chapter, applied examples are provided using the
generalized theory of intelligence test data introduced in Chapter 2.

10.2 How IRT Differs from CTT

IRT is a probabilistic, model-based test theory that originates from the pattern of exam-
inees’ responses to a set of test items. Fundamentally, it differs from CTT because in CTT
total test scores for examinees are based on the sum of the responses to individual items.
For example, each test item within a test can be conceptualized as a “micro” test (e.g.,
an item on one of the subtests on crystallized intelligence used throughout this book)
within the context of the total test score (e.g., the composite test score conceptualized in
a “macro” perspective). The sum score for an examinee in CTT is considered a random
variable. One shortcoming of the CTT approach is that the statistics used in evaluating
the performance of persons are sample dependent (i.e., deterministic rather than probabilistic). The impact of a particular sample on item statistics and the total test score can be restrictive during the process of test development. For example, when a sample of persons or examinees comes from a high-ability level on a particular trait (e.g., intelligence), they are often unlike the persons comprising the overall population. Also, the manner in which persons at the extreme sections of a distribution (e.g., our high-ability example) perform differs from the performance of samples composed of a broad range of ability.
Another restriction when using CTT is the need to adhere to the assumption of
parallel test forms (see Chapter 7 for a review). In CTT, the assumption of parallel forms
rests on the idea that, in theory, an identical set of test items meeting the assumption of
strictly parallel tests is plausible—an assumption rarely, if ever, met in practice. Further-
more, because CTT incorporates group-based information to derive estimates of reliabil-
ity, person or examinee-specific score precision (i.e., error of measurement) is lacking
across the score continuum. In fact, Lord (1980) noted that increased test score validity is
achieved by estimating the approximate ability level and the associated error of measure-
ment of each examinee with ability (θ).
A third restriction of CTT is that it includes no probabilistic mechanism for estimat-
ing how an examinee might perform on a given test item. For example, a probabilistic
framework for use in test development is highly desirable if the goals are (1) to predict
test score characteristics in one or more populations or (2) to design a test specifically
tailored to a certain population. Finally, other limitations of CTT include the inability
to develop examinee-tailored tests through a computer environment (e.g., in computer
adaptive testing [CAT]), less than desirable frameworks for identifying differential item functioning (DIF), and difficulties in equating test scores across different test forms (de Ayala,
2009; Lord, 1980; Hambleton, Swaminathan, & Rogers, 1991).

10.3 Introduction to IRT

IRT posits, first, that an underlying latent trait (e.g., a proxy for a person’s ability) can
be explained by the responses to a set of test items designed to capture measurements
on some social, behavioral, or psychological attribute. The latent trait is represented as a
continuum (i.e., a continuous distribution) along a measurement scale. This idea closely
parallels the factor analysis model introduced in Chapter 9 where an underlying unob-
servable dimension or dimensions (e.g., constructs) can be explained by a set of
variables (e.g., test or survey items/questions) through an optimum mathematical func-
tion. Unidimensional IRT models incorporate the working assumption of unidimension-
ality, meaning that responses to a set of items are represented by a single underlying
latent trait or dimension (i.e., the items explain different parts of a single dimension).
A second assumption of standard IRT models is local independence, meaning that
there is no statistical relationship (i.e., no correlation) between persons’ or examinees’
responses to pairs of items on a test once the primary trait or attribute being measured is
held constant (or is accounted for).
The advantages of using IRT as opposed to CTT in test development include (1) a
more rigorous model-based approach to test and instrument development, (2) a natural
framework for equating test forms, (3) an adaptive or tailored testing approach relative
to a person’s level of ability to reduce the time of testing (e.g., on the Graduate Record
Examination), and (4) innovative ways to develop and maintain item pools or banks for
use in computer adaptive testing. Moreover, when there is an accurate fit between an item
response model and an acquired set of data, (1) item parameter estimates acquired from
different groups of examinees will be the same (except for sampling errors); (2) exam-
inee ability estimates are not test dependent and item parameters are not group dependent;
and (3) the precision of ability estimates are known through the estimated standard errors
of individual ability estimates (Hambleton & Swaminathan, 1985, p. 8; Hambleton et al.,
1991; Baker & Kim, 2004; de Ayala, 2009). The last point illustrates that IRT provides
a natural framework for extending notions of score reliability. For example, IRT makes
it possible to estimate conditional standard errors of measurement and reliability at the
person ability level (Raju et al., 2007; Price, Raju, & Lurie, 2006; Kolen et al., 1992; Feldt
& Brennan, 1989; Lord, 1980). Estimating and reporting conditional standard errors of
measurement and score reliability is highly recommended by AERA, APA, and NCME
(1999) and is extremely useful in test development and score interpretation. Additionally,
using IRT to scale or calibrate a set of test items provides an estimate of the reliability
based on the test items.

10.4 Strong True Score Theory, IRT, and CTT

IRT is formally classified as a strong true score theory. In a psychometric sense, this
theory implies that the assumptions involved in applying models correctly to real data
are substantial. For example, the degree to which item responses fit an ideal or proposed
model is crucial. In fact, strong true score models such as IRT can be statistically tested
for their adequacy of fit to an expected or ideal model. Alternatively, consider CTT
where item responses are summed to create a total score for a group of examinees. In
Chapter 7, it was noted that the properties of CTT are based on long-run probabilistic
sampling theory using a mainly deterministic perspective. In CTT, a person’s true score
is represented by a sum score that is based on the number of items answered correctly.
The number correct or the sum score for a person serves as an unbiased estimate of the
person’s true score. In CTT, the total score (X) is a person’s unbiased estimate of his or
her true score (T). True score is based on the expectation over a theoretically infinite
number of sampling trials (i.e., long-run probabilistic sampling). Classical test theory
is not a falsifiable model, meaning that a formal test of the fit of the CTT model to the
data is not available.
In IRT, the probability that a person with a particular true score (e.g., estimated by
an IRT model) will exhibit a specific observed score makes IRT a probabilistic approach
to how persons or examinees will likely respond to test items. In IRT, the relationship
between observed variables (i.e., item responses) and unobserved variables or latent
traits (i.e., person abilities) is specified by an item response function (IRF) graphed as
an item characteristic curve (ICC). An examinee’s true score is estimated or predicted
based on his or her observed score. Thus, IRT is the nonlinear regression of observed
score on true score across a range of person or examinee abilities. Establishing an esti-
mated true score for a person by this probabilistic relationship formally classifies IRT
as a strong true score theory. Conversely, CTT is classified as weak theory and therefore
involves few assumptions.
To summarize, IRT is based on the following two axioms. The first axiom is that the
probability of responding correctly to a test item is a mathematical function of a person’s
underlying ability formally known as his or her latent trait or ability. The second axiom
states that the relationship between persons’ or examinees’ performance on a given item
and the trait underlying their performance can be described by a monotonically increas-
ing IRF graphically depicted as an ICC. The ICC is nonlinear or S-shaped owing to the
fact that the relationship between the probability of a correct response to an item (dis-
played on the Y-axis) is expressed as a proportion (a range of 0.0 to 1.0); the proportion
is mapped onto the cumulative normal distribution function (the X-axis) representing
a person’s ability or latent trait. The shape of the IRF/ICC is illustrated shortly using an
example from the intelligence data used throughout the book.
From a statistical perspective, an important difference between CTT and IRT is the
concept of falsifiability. Item response models can be falsified: an item response model cannot be demonstrated to be correct or incorrect in an absolute sense (or accepted simply by tautology, meaning that it is taken as valid without question). Instead, the appropri-
ateness of a particular IRT model relative to a particular set of observed data is established
by conducting goodness-of-fit testing for persons and items. For example, the tenability
of a particular IRT model given a set of empirical data is possible after inspection of the
discrepancy between the observed versus predicted residuals (i.e., contained in an error
or residual covariance matrix) after model fitting. Readers may want to return to Chapter
2 and the Appendix to review the role of the covariance matrix in statistical operations in
general, and regression specifically.
Finally, because all mathematical models used to describe a set of data are based on
a set of assumptions, the process of model selection occurs relative to the item develop-
ment and proposed uses of the test (e.g., the target population of examinees for which
the scores will be used).

10.5 Philosophical Views on IRT

Central to IRT is a unified model-based approach that provides a probabilistic framework for how examinees of differing levels of ability are expected to respond to a set of
test items. Historically, two philosophical approaches have been proposed in relation to
IRT. The first approach aligns with classical probability theory in that over a theoreti-
cally repeated number of test administrations (i.e., frequentist long-run notion of prob-
ability), a person will respond to a test item correctly a specific proportion of the time
(i.e., proportion being a probability). Because the probability of responding correctly or
incorrectly to a test item is attainable and can be linked to frequentist probability theory,
this approach is extremely useful in the context of test construction. For example, Bush
and Mosteller (1955) view IRT as a probabilistic learning model based on choices and
decisions that are inescapable features of intelligent behavior. Further, they argue that
any data gathered in the social and behavioral sciences are statistical in nature. Similarly, in his seminal work, Rasch (1960) subscribed to the approach of assigning every person
a probability of answering an item correctly based on the person’s ability. In the Rasch
approach to measurement, person ability is the driving factor in the model. Thus, a prob-
ability is assigned to a person rather than simply fitting a model or models to a set of item
responses. The Rasch modeling strategy is to develop a set of test items that conform to
how persons are expected to respond based on their abilities. From a probabilistic per-
spective, Rasch's approach provides a cogent framework for what is described as sample-free measurement and item analysis (Wright & Stone, 1979).
The second philosophical approach to IRT focuses on a sampling or data-driven
approach (Wainer, Bradlow, & Wang, 2007; Thissen & Wainer, 2001). In the data-driven
(sampling) approach, item-level scores (i.e., responses to test items) are related to a ran-
domly selected person’s given ability in mathematical functions that elicit the probability
of each possible outcome on an item (Lord, 1980; Lord & Novick, 1968). In the sampling
approach to IRT, the process of fitting statistical models to a set of item responses focuses
initially on a set of examinees’ item scores rather than on person ability. This differs
from Rasch’s sample-free approach to measurement where person ability is the dominant
component in the probabilistic model. In the Rasch model, test items are constructed or
designed to “fit” the properties of Rasch measurement theory. For readers interested in
the details of Rasch measurement theory, see Wright and Stone (1979) and Bond and Fox
(2001). With regard to the philosophical stances between the Rasch and IRT approaches,
as Holland and Hoskins (2003) note, item parameters and person abilities are always
estimated in relation to a sample obtained from a population. In this sense, it is illusory to
believe that a sample-free measurement exists. In the end, both philosophical approaches
have merit and should be considered when deciding on an approach to address practical
testing problems.
Table 10.1 provides the taxonomy of Rasch and IRT models. From this table, we
see that many Rasch and IRT models are available to meet a variety of measurement and
testing scenarios. In this chapter, we focus on four models (highlighted in gray in Table
10.1) that are foundational to understanding and using IRT: the Rasch, one-parameter,
two-parameter, and three-parameter unidimensional models for dichotomous items.
Once the foundations of Rasch and IRT models are presented, readers are encouraged to
expand their knowledge by reading the suggested references that introduce variations of
the models in this chapter.

10.6 Conceptual Explanation of How IRT Works

To illustrate how Rasch analysis and IRT work, an example is presented using our example intelligence test data. The example given in the next sections is based on the Rasch model. The Rasch model is formally introduced in Section 10.18 and is used in the sec-
tions that immediately follow because it is foundational to item response modeling. We
begin by illustrating how person ability and test items are related on a single continuum.
Next, the assumptions of Rasch and IRT models are reviewed, and applied examples of
how to evaluate the assumptions are provided.
Returning to the example intelligence test data used in this book, we see that a
person with a higher level of intelligence should be more likely to respond correctly to a
particular item in relation to a person with a lower level of intelligence. Recall that intel-
ligence is a latent trait or attribute that is not directly observable. Graphically, an example
of a continuum representing a latent attribute is provided in Figure 10.1. The values on
the horizontal line in the figure are called logits and are derived using the logistic equa-
tion (see Equations 10.5 and 10.6). Logit values are based on a transformation that yields
item locations that are linearized by applying the logarithmic transform to nonlinear data
(e.g., the probability of a correct response based on binary test item responses). Notice
in Figure 10.1 that both the location of items and the ability of a person are located on
the same scale.
Table 10.1.  Taxonomy of Unidimensional, Multidimensional, Nonparametric, and Bayesian IRT Models

Type of data                 Model                                         References
Unidimensional
  Dichotomous                Linear latent                                 Lazarsfeld & Henry (1968)
                             Perfect scale                                 Guttman (1944)
                             Latent distance                               Lazarsfeld & Henry (1968)
                             Rasch                                         Rasch (1960)
                             1-, 2-, 3-parameter normal ogive              Lord (1952)
                             1-, 2-, 3-parameter logistic                  Birnbaum (1957, 1958a, 1958b, 1968); Lord & Novick (1968); Lord (1980); Rasch (1960); Wright & Stone (1979)
                             4-parameter logistic                          McDonald (1967)
  Multicategory              Nominal response                              Bock (1972)
                             Graded response                               Samejima (1969)
                             Partial credit model                          Masters (1982)
  Continuous                 Continuous response                           Samejima (1972)
Multidimensional (compensatory)
  Dichotomous                Multidimensional 2- and 3-parameter logistic  Reckase (1985, 2009)
                             Multidimensional 2-parameter normal ogive     Bock & Aitkin (1982)
  Multicategory/polytomous   Loglinear multidimensional                    Kelderman (1992, 1997)
  Dichotomous/polytomous     Multidimensional linear logistic for change   Fischer (1983)
  Dichotomous/polytomous     Multidimensional factor-analytic IRT          McDonald (1982, 1999)
Multidimensional (noncompensatory)
                             Multicomponent response                       Whitely (Embretson) (1980)
Nonparametric
  Dichotomous                                                              Mokken & Lewis (1982)
  Polytomous                                                               Molenaar (2002)
Bayesian
  Dichotomous                                                              Albert (1992); Albert & Chib (1993); Bradlow, Wainer, & Wang (1999)
  Polytomous                                                               Wainer, Bradlow, & Wang (2007); Fox (2010)
  Testlet                                                                  Wainer, Bradlow, & Wang (2007)


Figure 10.1.  Latent variable continuum based on six items mapped by item location/difficulty (person ability θ runs from lower to higher ability along the top; item difficulty δ is marked from –3 to 3; items 1 through 6 are placed on the same scale). θ = theta; δ = delta.

To interpret Figure 10.1, consider a hypothetical person who exhibits an ability of 0.0 on the ability (θ) scale. Easier items (e.g., 1 and 2) are on the left side of the continuum; moderately difficult items (e.g., 3 and 4) are in the middle; and harder items (e.g., 5 and 6) are on the positive side of the continuum. From a probabilistic perspective, this person with ability = 0.0 will be less likely to correctly answer item 5 than item 3 because item 5 has a difficulty of δ5 = 2.0 on the logit scale, compared to item 3 with a difficulty of 0.0. That is, the discrepancy between the difficulty of item 5 and the ability of the person (0.0) is larger than the discrepancy between item 3 and the ability of the person. For example, an item with a difficulty of 2.0 is more difficult than, say, an item with δ4 = 1.0. Conversely, the same person with ability θ = 0.0 responding to item 1 (δ1 = –3.0) will be very likely to respond correctly to the item, given that the item location is on the extreme lower end of the item location/difficulty and theta (ability) continuum.
The key idea in Figure 10.1 is that the greater the discrepancy between the person abil-
ity and item location, the greater the probability of correctly predicting how the person will
respond to an item or question (i.e., correct/incorrect or higher/lower on an ordinal-type
scale). In the Rasch model, the only item characteristic being measured is item difficulty, δ. Under these circumstances, as the discrepancy between the person ability and
item locations nears zero, the probability of a person responding correctly to an item
approaches .50, or 50%.
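
Although the Rasch model is formally presented later in the chapter, the probability logic just described can be sketched in a few lines of code. The sketch below assumes the standard logistic form P(correct) = 1/(1 + exp(–(θ – δ))); only the difficulties of items 1, 3, 4, and 5 are stated explicitly in the text, so the remaining values are assumed from Figure 10.1.

import math

def rasch_probability(theta, delta):
    """Rasch model probability of a correct response:
    P(correct) = 1 / (1 + exp(-(theta - delta)))."""
    return 1.0 / (1.0 + math.exp(-(theta - delta)))

# Item difficulties (delta) approximating the six items in Figure 10.1; only items
# 1, 3, 4, and 5 are stated explicitly in the text, the others are assumed here.
item_difficulties = [-3.0, -2.0, 0.0, 1.0, 2.0, 3.0]

theta = 0.0  # the hypothetical person located at 0.0 on the ability scale
for item, delta in enumerate(item_difficulties, start=1):
    p = rasch_probability(theta, delta)
    print(f"Item {item} (delta = {delta:+.1f}): P(correct | theta = 0.0) = {p:.2f}")

The output reproduces the pattern described above: the probability is well above .50 for the easy items, .50 for the item located exactly at the person's ability, and well below .50 for the hard items.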

10.7 Assumptions of IRT Models

IRT is a statistical model, and like most statistical models it involves assumptions. The
first assumption to consider prior to the application of any Rasch or IRT model is the
dimensionality of the set of items comprising the test or instrument. The dimensionality
of a test specifies whether there are one or more underlying abilities, traits, or attributes
being measured by the set of items. The term dimension(s) is used synonymously with
person ability or latent trait in IRT. Abilities or traits modeled in IRT can reflect educa-
tional achievement, attitudes, interests, or skill proficiency—all of which may be mea-
sured and scaled on a dichotomous, polytomous (ordinal), or unordered categorical level.
The most widely used Rasch and IRT model is unidimensional and assumes that there
is a single underlying ability that represents differences between persons and items on a
test. Strictly speaking, the assumption of unidimensionality is rarely able to be perfectly
met in practice owing to the interplay of a variety of factors such as test-taking anxiety,
guessing, and the multidimensional nature of human cognitive skills and abilities. How-
ever, the performance of Rasch and IRT models has been shown to be robust to minor
violations of the dimensionality assumption, provided that a single overriding or dominant
factor influences test performance (Hambleton et al., 1991). Users of conventional unidi-
mensional IRT models assume that a single ability sufficiently explains the performance
of an examinee or examinees on a set of test items.

10.8 Test Dimensionality and IRT

The dimensionality of a test is closely related to the idea of a single underlying factor (or
latent trait in IRT terminology) represented by a set of items or questions. Evaluating
the dimensionality of a test can proceed in a number of ways (Hattie, 1985). This sec-
tion begins with early approaches to dimensionality assessment related to IRT and then
transitions to more sophisticated approaches now commonly used. In early applications
of IRT, Lord (1980) recommended examining the eigenvalues produced from a linear
factor analysis in relation to the number of dominant factors present in a particular set of
items. For readers unfamiliar with factor analysis, the topic is presented in Chapter 9 and
should be reviewed to fully understand the key ideas presented here. In factor analysis, an
eigenvalue represents the amount of variance accounted for by a given factor or dimen-
sion. Figure 10.2 illustrates the situation where a single dominant factor (i.e., a distinct
eigenvalue between 4 and 5 on the Y-axis followed by a 90-degree elbow at the second fac-
tor) exists by way of a scree plot, a test attributed to Cattell (1966). A scree plot is a graph
of the number of factors depicted by the associated eigenvalues generated using principal
axis factor analysis. The eigenvalues that appear after the approximate 90-degree break
in the plot line (e.g., eigenvalue 2 and beyond) are termed “scree” synonymous with rem-
nants or rubble at the bottom of a mountain. In Figure 10.2, eigenvalues are plotted as a
function of the number of factors in a particular set of item responses or variables.
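
As a rough sketch of the eigenvalue logic behind the scree plot (not the LISREL or PRELIS procedure), the following computes eigenvalues directly from a small correlation matrix; the 4 × 4 matrix is hypothetical and stands in for the 25 × 25 item correlation (or tetrachoric) matrix used in practice.

import numpy as np

# Hypothetical 4 x 4 item correlation matrix (for illustration only); in practice
# this would be the 25 x 25 correlation (or tetrachoric) matrix for the test items.
R = np.array([
    [1.00, 0.55, 0.48, 0.50],
    [0.55, 1.00, 0.52, 0.47],
    [0.48, 0.52, 1.00, 0.51],
    [0.50, 0.47, 0.51, 1.00],
])

# Eigenvalues of the correlation matrix, largest first; a single large eigenvalue
# followed by a sharp drop (the "elbow" in a scree plot) suggests one dominant factor.
eigenvalues = np.linalg.eigvalsh(R)[::-1]
print("Eigenvalues:", np.round(eigenvalues, 3))
print("Proportion of total variance:", np.round(eigenvalues / R.shape[0], 3))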

10.9 Type of Correlation Matrix to Use in Dimensionality Analysis

Traditional factor analysis techniques such as those introduced in Chapter 9 are appropri-
ate for interval or continuous data. When the item-level response data are dichotomous
Figure 10.2.  Scree plot generated from principal axis factor analysis (eigenvalues on the Y-axis plotted against the number of factors, 0 to 12, on the X-axis).

with an assumed underlying distribution on the latent trait being normal, the tetrachoric
correlation matrix is the appropriate matrix to use for analysis (Lord, 1980; Lord &
Novick, 1968). The tetrachoric correlation coefficient (introduced in the Appendix) is
a measure of the relationship between two dichotomous variables where the underly-
ing distribution of performance on each variable is assumed to be normal (McDonald
& Ahlawat, 1974). According to Lord and Novick (1968), a sufficient condition for the
existence of unidimensionality for a set of dichotomously scored test items is that the
result of factor analyzing a matrix of tetrachoric correlations results in a single common
factor. To illustrate factor analysis of our example item response data, the LISREL/PRELIS
8 program (Jöreskog & Sörbom, 1999a) is used. The following PRELIS program produces
a factor analysis using polychoric/tetrachoric correlation matrices. For a review of poly-
choric/tetrachoric correlation coefficients and how they differ from Pearson correlation
coefficients see the Appendix. The syntax language in the PRELIS program below can be
referenced in the PRELIS 2 User’s Reference Guide (Jöreskog & Sörbom, 1999b, pp. 7–8).

PRELIS DATA PREP PROGRAM FOR CRYSTALLIZED INTELLIGENCE TEST 2


!PRELIS SYNTAX
SY=CRI2.PSF
OU MA=PM SM=F:\CRI2.PML AC=F:\CRI2.ACP XM
Note. The CRI2.PSF is a data file that is created in PRELIS after importing the data from a text or SPSS
file. The .PML file contains the polychoric correlation matrix. The .ACP file contains the asymptotic
covariance matrix.
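
Before turning to the LISREL step, it may help to see how a single tetrachoric correlation can be obtained from a 2 × 2 table of dichotomous responses. The sketch below is only a conceptual illustration (it matches the bivariate normal probability to the (0, 0) cell rather than using the full maximum likelihood approach implemented in PRELIS), and the counts at the bottom are hypothetical.

import numpy as np
from scipy.stats import norm, multivariate_normal
from scipy.optimize import brentq

def tetrachoric(table):
    """Rough tetrachoric correlation estimate from a 2x2 table of counts.

    table[i][j] = number of examinees scoring i on the first item and j on the
    second item (0 = incorrect, 1 = correct). A bivariate normal latent variable
    is assumed to underlie the pair of dichotomous responses.
    """
    table = np.asarray(table, dtype=float)
    n = table.sum()
    # Thresholds: z-scores cutting each latent variable at the proportion of 0s.
    tau1 = norm.ppf(table[0, :].sum() / n)
    tau2 = norm.ppf(table[:, 0].sum() / n)
    p00 = table[0, 0] / n  # observed proportion answering both items incorrectly

    # Find rho so the bivariate normal probability below both thresholds equals p00.
    def gap(rho):
        bvn = multivariate_normal(mean=[0.0, 0.0], cov=[[1.0, rho], [rho, 1.0]])
        return bvn.cdf([tau1, tau2]) - p00

    return brentq(gap, -0.99, 0.99)

# Hypothetical 2 x 2 table of counts for a pair of dichotomous items.
print(round(tetrachoric([[40, 10], [8, 42]]), 3))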

Once the output files are created and saved (output files are saved using the shaded
line in the program syntax above), the following LISREL program can be used to run
the factor analysis on tetrachoric correlations to evaluate the dimensionality of the set of
items. The syntax in the LISREL program below can be referenced in the LISREL 8 User’s
Reference Guide (Jöreskog & Sörbom, 1996, pp. 248–249).

FACTOR ANALYSIS OF DICHOTOMOUS ITEMS ON CRYSTALLIZED INTELLIGENCE TEST 2
DA NI=25 NO=1000 MA=PM
PM FI='F:\CRI2.PML'; AC FI='F:\CRI2.ACP'
MO NX=25 NK=1 LX=FR PH=ST
OU SE TV RS

Next, an abbreviated output from the LISREL program factor analysis results is provided
that includes the fit of the item-level data to a one-factor model.

ABBREVIATED OUTPUT FROM FACTOR ANALYSIS OF DICHOTOMOUS ITEMS ON CRYSTALLIZED INTELLIGENCE TEST 2

Goodness of Fit Statistics

Degrees of Freedom = 275
Minimum Fit Function Chi-Square = 52.76 (P = 1.00)
Normal Theory Weighted Least Squares Chi-Square = 51.66 (P = 1.00)
Satorra-Bentler Scaled Chi-Square = 1302.86 (P = 0.0)
Chi-Square Corrected for Non-Normality = 19856.79 (P = 0.0)
Estimated Non-centrality Parameter (NCP) = 1027.86
90 Percent Confidence Interval for NCP = (919.27 ; 1143.95)

Minimum Fit Function Value = 0.053
Population Discrepancy Function Value (F0) = 1.03
90 Percent Confidence Interval for F0 = (0.92 ; 1.15)
Root Mean Square Error of Approximation (RMSEA) = 0.061
90 Percent Confidence Interval for RMSEA = (0.058 ; 0.065)
P-Value for Test of Close Fit (RMSEA < 0.05) = 0.00

Qplot of Standardized Residuals

[Text-based Q-plot of standardized residuals: normal quantiles plotted against standardized residuals, both ranging from –3.5 to 3.5. Asterisks represent multiple data points; X's represent single data points.]

Reviewing the output from the LISREL factor analysis, we see that the one-factor model is supported, as evidenced by the root mean square error of approximation (RMSEA) of 0.06 (a value less than .08 is an established cutoff for adequate model–data fit in factor analysis (FA) conducted in structural equation modeling). Also presented is the
Q-residual plot, which illustrates how well the one-factor model fits the data from the
view of the residuals (i.e., the observed versus predicted values). A residual is defined as
the discrepancy between the actual (sample) data and the fitted covariance matrix. For
example, if we see excessively large residuals (i.e., > 3.5 in absolute terms) or a severe
departure from linearity, or if the plotted points do not extend entirely along the diagonal
line, there is at least some degree of misfit (Jöreskog & Sörbom, 1996, p. 110). In the
Q-plot, the standardized residuals are defined as the fitted residual divided by the large
sample standard error of the residual. Although the data presented in the Q-plot do not
reflect a perfect fit, inspection of it along with the fit statistics reported provides sufficient
evidence for supporting the existence of the one-factor model (i.e., we can be confident
the set of item responses is unidimensional).
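
The RMSEA in the output can be reproduced from a chi-square value, its degrees of freedom, and the sample size. The sketch below applies the usual formula to the Satorra-Bentler scaled chi-square reported above; treating that as the statistic LISREL used for the RMSEA is an assumption, although the result matches the reported value.

import math

def rmsea(chi_square, df, n):
    """Root mean square error of approximation:
    sqrt(max(chi_square - df, 0) / (df * (n - 1)))."""
    return math.sqrt(max(chi_square - df, 0.0) / (df * (n - 1)))

# Satorra-Bentler scaled chi-square, degrees of freedom, and sample size from the output above.
print(round(rmsea(1302.86, 275, 1000), 3))  # about 0.061, matching the reported RMSEA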

10.10 Dimensionality Assessment Specific to IRT

Several methods of dimensionality assessment were developed specifically for use with IRT. These methods fall into the classification of nonparametric approaches to
assessing the dimensionality of a test. Nonparametric techniques make no assumption
about the shape of distribution in the population (e.g., that the data reflect a normal
distribution). The primary nonparametric approach developed for IRT is attributed
to Nandakumar and Stout (1993) and Stout (1987). The nonparametric approach is
similar to parametric factor-analytic methods, except there is no specific form of the
item response function assumed other than that it is monotonically increasing (e.g.,
the S-shaped form of the item response curve). The nonparametric approach as incor-
porated in the computer program DIMTEST (Stout, 1987, 2006) provides a test of uni-
dimensionality that (1) evaluates the assumption of local item independence and (2)
tests the number of dimensions that exist in a set of item responses. Unique to DIM-
TEST is the implementation of a statistical test of essential unidimensionality (Stout,
1987). The DIMTEST program provides a T-statistic that tests the null hypothesis that
a set of test items equals a single dimension. Additionally, the IRT assumption of local
item independence (presented in detail in the next section) is simultaneously evalu-
ated with the DIMTEST T-statistic. Similar to factor-analytic approaches to dimen-
sionality assessment, the DIMTEST program operates in exploratory and confirmatory
modes. Below are the results of a DIMTEST analysis for the 25-item crystallized intel-
ligence test 2.

DIMTEST analysis results for crystallized intelligence test 2

DIMTEST SUMMARY OUTPUT
--------------------------------------------------
Original Data Set:                          F:\CRI2.DAT
Number of Items Used:                       25
Number of Examinees Used to Calculate
  DIMTEST Statistic:                        1000
Minimum Cell Size for Calculating
  DIMTEST Statistic:                        2
Number of Examinees After Deleting
  Sparse Cells:                             978
Proportion of Examinees Used to Calculate
  DIMTEST Statistic:                        0.9780
Number of Simulations Used to
  Calculate TGbar:                          100
Randomization Seed:                         99991
Estimate of Examinee Guessing on Test:      0.0000
--------------------------------------------------
AT List: 5  8  9  10  11  12  14  15
PT List: 1  2  3  4  6  7  13  16  17  18  19  20  21  22  23  24  25
--------------------------------------------------
TL = sum(TL,k)/sqrt(sum(S2,k))   {using original data}
TG = sum(TL,k)/sqrt(sum(S2,k))   {using simulated data}
TGbar = mean of ** TGs
T = (TL - TGbar)/sqrt(1 + 1/**)
--------------------------------------------------
DIMTEST STATISTIC
--------------------------------------------------
     TL        TGbar         T     p-value
 8.0022       7.0387    0.9587      0.1688
We see from the DIMTEST T-statistic (last line in the output) that the null hypothesis that the set of items forms a single dimension is not rejected (p = .168), providing support for unidimensionality for crystallized intelligence test 2.
An alternative to DIMTEST is full information item factor analysis—a factor-analytic
technique that uses tetrachoric correlations to estimate an item response function (IRF)
based on the item responses. The full information item factor analysis technique is imple-
mented in the program TESTFACT (Bock, Gibbons, & Muraki, 1988, 1996). Another
program that is similar to TESTFACT and very useful for IRT-based dimensionality
assessment is the Normal Ogive Harmonic Analysis Robust Method (NOHARM; Fraser
& McDonald, 2003). Returning to the TESTFACT program, we list below the TESTFACT
syntax for conducting a test of dimensionality for crystallized intelligence test 2 using full
information item factor analysis.

TESTFACT program for full information item factor analysis

>TITLE
CRYSTALLIZED INTELLIGENCE
TEST 2 - 25 ITEMS;
>PROBLEM NIT=25, RESPONSE=3;
>RESPONSE ' ','0','1';
>KEY 1111111111111111111111111;
>TETRACHORIC NDEC=3, LIST;
>RELIABILITY ALPHA;
>PLOT PBISERIAL, FACILITY;
>FACTOR NFAC=1, NROOT=3;
>FULL CYCLES=20;
>TECHNICAL NOADAPT PRECISION=0.005;
>INPUT WEIGHT=PATTERN, FILE='F:\CRI2.DAT';
(25A1)
>STOP
>END

The results of the TESTFACT analysis concur with our factor analysis conducted using LISREL and reveal one underlying dimension for the set of 25 items. For example, the largest eigenvalue (latent root) is 9.513384 (see below); the next largest eigenvalue drops substantially to 2.086334, and the values continue to decline until the sixth eigenvalue reaches .336450.

Partial output from TESTFACT program for full information item factor analysis

DISPLAY 8. THE NROOT LARGEST LATENT ROOTS OF THE CORRELATION MATRIX

            1          2          3          4          5          6
 1   9.513384   2.086334   0.869951   0.510628   0.392143   0.336450
DISPLAY 15. PERCENT OF VARIANCE ACCOUNTED FOR IN 1-FACTOR MODEL

            1
 1   0.464062

AVERAGE TETRACHORIC CORRELATION = 0.3437
STANDARD DEVIATION = 0.2069
NUMBER OF VALID ITEM PAIRS = 272

We can evaluate whether a one- or two-factor model best fits the data by conducting
two separate analyses with TESTFACT by simply changing the NFAC keyword in the
program (highlighted in gray), then comparing the results using a chi-square difference
test. Calculating the difference between the one-factor model chi-square and the two-
factor model yields a chi-square of 270.68; the results are provided below.

Chi-square fit statistics

One-Factor Model
Chi-square = 4552.15 and degrees of freedom = 449.00
Two-Factor Model
Chi-square = 4281.47 and degrees of freedom = 425.00

The difference between the degrees of freedom for the two models is 24. Consulting a chi-square table, we find that a chi-square difference of 270.68 on 24 degrees of freedom is statistically significant, indicating that the two-factor model fits better from the point of view of a statistical test. However, the one-factor model accounts for a substantial amount of
explained variance (relative to the two-factor model) in the set of items. Additionally, the
pattern and size of the eigenvalues (latent roots) do not differ much between the models.
Therefore, we can be reasonably confident that there is a single underlying dimension
that is explained by the 25 items.
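
The chi-square difference test above can also be carried out without consulting a printed table; a brief sketch using the values from the TESTFACT output (SciPy is assumed to be available):

from scipy.stats import chi2

# One- and two-factor TESTFACT solutions (values from the output above).
chisq_1f, df_1f = 4552.15, 449
chisq_2f, df_2f = 4281.47, 425

diff_chisq = chisq_1f - chisq_2f   # 270.68
diff_df = df_1f - df_2f            # 24

# Upper-tail probability of the chi-square difference.
p_value = chi2.sf(diff_chisq, diff_df)
print(f"Chi-square difference = {diff_chisq:.2f}, df = {diff_df}, p = {p_value:.3g}")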
Finally, when more than one dimension is identified (i.e., no single dominant fac-
tor emerges) to account for examinee performance on a set of test items, researchers
must either revise or remove certain test items to meet the unidimensionality assump-
tion or use a multidimensional approach to IRT (McDonald, 1985a; Bock et al., 1988;
Reckase, 1985, 2009; McDonald, 1999; Kelderman, 1997; Adams, Wilson, & Wang,
1997). One relatively new approach to use when multidimensionality is present is
called mixture modeling and is based on identifying mixtures of distributions of per-
sons within a population of examinees. This approach to IRT is based on latent class
analysis (LCA) of homogeneous subpopulations of persons existing within a sample
(de Ayala, 2009).
10.11 Local Independence of Items

A second assumption of IRT is local independence, also known as conditional item inde-
pendence. Recall that in IRT, a latent trait or dimension influences how a person or exam-
inee will respond to an item. Operationally, once examinees’ ability is accounted for (i.e.,
statistically controlled), no covariation (or correlation) remains between responses to dif-
ferent items. When local item independence holds, a particular test item in no way pro-
vides information that may be used to answer another test item. From classical probability
theory, when local item independence is present, the probability of a pattern of responses
to test items for an examinee is derived as the product of the individual probabilities of
correct and incorrect responses on each item (e.g., by applying the multiplicative rule of
probability). To formalize the local independence assumption within standard IRT ter-
minology, let q represent the complete set of latent abilities influencing examinee per-
formance on a set of test items, and Ui represent the response to item j (across the vector
of items j = 1, 2, 3, . . ., n). Using conditional probability theory, let P(U|q) represent the
probability of the response of a randomly chosen examinee from a population given abil-
ity q, with P(1|q) as a correct response and P(0|q) as an incorrect response. Equation 10.1
illustrates the probability of conditionally independent responses to items by a randomly
chosen examinee with a given level of ability (Hambleton et al., 1991, p. 33).

Equation 10.1. The probability of response to a set of items by a randomly chosen examinee from a population of examinees

$$P(U_1, U_2, U_3, \ldots, U_n \mid \theta) = P(U_1 \mid \theta)\,P(U_2 \mid \theta)\,P(U_3 \mid \theta) \cdots P(U_n \mid \theta) = \prod_{j=1}^{n} P(U_j \mid \theta)$$

• P = probability of response to an item.
• Un = probabilistic interpretation of the response to an item, either 1 for correct or 0 for incorrect.
• P(U1|θ) = probability of a randomly chosen examinee responding to a set of items given their ability.
• θ = person ability or theta.
• P(U1|θ)P(U2|θ)P(U3|θ)···P(Un|θ) = the product of the probabilities of a correct response to items 1 through n.
To illustrate local independence, an example is provided using a small subsample of 25 examinee responses from crystallized intelligence test 2 on items 1 and 2 (Table 10.2). To
examine the assumption of local independence, we start by considering what this means in
a statistical sense. One longstanding test used to evaluate whether two variables are inde-
pendent of one another is the chi-square test, and it is used here to illustrate the concept
of local independence. The problem of interest in IRT is whether test items are statistically
independent of one another for persons with the same level of ability on a latent trait or
attribute. For example, the examinees at the same ability level will likely have the same
number correct score on the items in the test (or very close). Below are responses from 25
examinees on items 1 and 2. Examining the response patterns in Table 10.2, the 25 exam-
inees appear to be at approximately the same level of ability (i.e., because the pattern of 0’s
and 1’s across the 25 examinees on items 1 and 2 match, except for examinees 10 and 13).
To test the assumption of local independence, we enter the data in Table 10.2 into
SPSS and construct a two-way table using the SPSS crosstabs procedure. The syntax and
result are shown in Table 10.3. Also produced by the SPSS crosstabs procedure is a chi-
square test of independence to evaluate the hypothesis of statistical independence. The
results of this test are provided in Table 10.4.

Table 10.2.  Responses to Crystallized Intelligence Test 2 on Items 1 and 2 for 25 Examinees

Examinee  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25
Item 1    1  1  1  1  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0
Item 2    1  1  1  1  0  0  0  0  0  1  0  0  1  0  0  0  0  0  0  0  0  0  0  0  0

Table 10.3.  Item 1 * Item 2 Table Used for Cross Tabulation

CROSSTABS
  /TABLES=item1 BY item2
  /FORMAT=AVALUE TABLES
  /STATISTICS=CHISQ
  /CELLS=COUNT
  /COUNT ROUND CELL.

Count
                         item2
                    .00     1.00    Total
item1      .00       17        2       19
          1.00        2        4        6
Total                19        6       25

Note. 0 = incorrect; 1 = correct.


Table 10.4.  Chi-Square Tests Based on Table 10.3

                                               Asymp. Sig.   Exact Sig.   Exact Sig.
                                Value    df     (2-sided)    (2-sided)    (1-sided)
Pearson Chi-Square             7.879ᵃ     1        .005
Continuity Correctionᵇ          5.102     1        .024
Likelihood Ratio                7.129     1        .008
Fisher's Exact Test                                             .015         .015
Linear-by-Linear Association    7.564     1        .006
N of Valid Cases                   25

a. 3 cells (75.0%) have expected count less than 5. The minimum expected count is 1.44.
b. Computed only for a 2x2 table.

Examining the results in Table 10.4, we reject the chi-square hypothesis test that
the items are independent of one another at an exact probability of p = .015. Fisher’s
Exact Test is appropriate when at least some cells in the analysis include less than five
scores; this is the case in the present analysis. In rejecting the hypothesis of indepen-
dence, we conclude that the assumption of local item independence does not hold for
these items for the 25 examinees; however, this is a very simple example using only two items. In practice, local independence is evaluated based on the item response pat-
terns across all ability levels for all examinees in a sample. Computationally, this step is
challenging and is (1) performed in conjunction with testing the dimensionality of a set
of items as described earlier in the DIMTEST program explanation or (2) evaluated by
using a separate analysis, as presented next.
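
The SPSS results in Tables 10.3 and 10.4 can be reproduced with a short sketch (shown here in Python with SciPy rather than SPSS); the values should agree with the tables up to rounding.

import numpy as np
from scipy.stats import chi2_contingency, fisher_exact

# 2 x 2 cross-tabulation of items 1 and 2 from Table 10.3
# (rows = item 1, columns = item 2; 0 = incorrect, 1 = correct).
table = np.array([[17, 2],
                  [2, 4]])

chi2_stat, p_value, df, expected = chi2_contingency(table, correction=False)
_, p_fisher = fisher_exact(table)  # two-sided by default

print(f"Pearson chi-square = {chi2_stat:.3f}, df = {df}, p = {p_value:.3f}")
print(f"Fisher's exact test (two-sided) p = {p_fisher:.3f}")
print("Expected counts:\n", np.round(expected, 2))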
Another approach for evaluating the assumption of local item independence is using
Yen’s (1984, 1993) Q3 statistic. The advantage of using the Q3 statistic approach is that
(1) it is relatively easy to implement, (2) it requires no specialized software, and (3) it
yields reliable performance across a wide range of sample size conditions (Kim, de Ayala,
Ferdous, & Nering, 2007). The Q3 technique works by examining the correlation of the
residuals between pairs of items. A residual is defined as the difference between an exam-
inee’s observed response to an item and his or her expected response to the item. Two
residuals are necessary in order to implement the Q3: a person-level residual for each of the two items in a pair. The person-level residual for item j is given in Equation 10.2 (de Ayala,
2009, p. 132).
The person-level residual for an item (k) is given in Equation 10.3 (de Ayala, 2009,
p. 133).
With the two residual components known, one can calculate the Q3 statistics as pro-
vided in Equation 10.4 (de Ayala, 2009, p. 133).
Applying the Q3 technique involves evaluating the magnitude and sign of the
pairwise correlations in Equation 10.4. The main point of the technique is to evaluate
the dependence between item pairs across all examinees in a sample. For example, a
Equation 10.2. Residual for a person on item j

$$d_{ij} = x_{ij} - p_j(\hat{\theta}_i)$$

• dij = a residual value for a person on item j.
• xij = observed response for person indexed as i on item indexed as j.
• pj = probability of a correct response on item j.
• θ̂i = estimate of person ability.

Equation 10.3. Residual for a person on item k

$$d_{ik} = x_{ik} - p_k(\hat{\theta}_i)$$

• dik = a residual value for a person on item k.
• xik = observed response for person indexed as i on item indexed as k.
• pk = probability of a correct response on item k.
• θ̂i = estimate of person ability.

Equation 10.4. The Q3 statistic

$$Q_{3jk} = r_{d_j d_k}$$

• Q3jk = value based on the correlation between the residuals from two unique items.
• r = Pearson correlation.
• dj = residual on item j.
• dk = residual on item k.

correlation of 0.0 between item pairwise residuals means that the primary condition for
the assumption of local item independence is tenable. However, a 0.0 correlation may also
result from a nonlinear relationship. For this reason, a Q3 value of 0.0 is a necessary but
not sufficient condition that local item independence is evident in the test items. For
comprehensive details of implementing the Q3 technique, refer to Yen (1984, 1993) and
de Ayala (2009).
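
A minimal sketch of the Q3 computation is given below. It assumes a hypothetical Rasch model and simulated item responses purely for illustration; it is not a substitute for the procedures described by Yen (1984, 1993) or de Ayala (2009).

import numpy as np

def q3_matrix(responses, theta_hat, prob):
    """Yen's Q3: correlations between item-pair residuals (Equations 10.2-10.4).

    responses : (n_persons, n_items) array of 0/1 item scores
    theta_hat : (n_persons,) array of ability estimates
    prob      : prob(theta, j) -> model-implied probability of a correct
                response to item j at ability theta
    """
    _, n_items = responses.shape
    # d_ij = x_ij - P_j(theta_hat_i): observed minus model-expected response.
    expected = np.array([[prob(t, j) for j in range(n_items)] for t in theta_hat])
    residuals = responses - expected
    # Q3 for items j and k is the Pearson correlation of their residual columns.
    return np.corrcoef(residuals, rowvar=False)

# Illustration with a hypothetical three-item Rasch model and simulated data.
rng = np.random.default_rng(1)
deltas = np.array([-1.0, 0.0, 1.0])                      # hypothetical difficulties
thetas = rng.normal(size=200)                            # hypothetical abilities
p = lambda t, j: 1.0 / (1.0 + np.exp(-(t - deltas[j])))
true_p = np.array([[p(t, j) for j in range(3)] for t in thetas])
data = (rng.random((200, 3)) < true_p).astype(int)

Q3 = q3_matrix(data, thetas, p)
print(np.round(Q3, 2))  # off-diagonal values near 0 are consistent with local independence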
10.12 The Invariance Property

In Section 10.2 comparisons were presented between CTT and IRT. Arguably, the most
important difference between the two theories and the results they produce is the prop-
erty of invariance. In IRT, invariance means that the characteristics of item parameters
(e.g., difficulty and discrimination) do not depend on the ability distribution of exam-
inees, and conversely, the ability distribution of examinees does not depend on the item
parameters. In Chapter 7, the CTT item indexes introduced included the proportion of
examinees responding correctly to items (i.e., proportion-correct) and the discrimination
of an item (i.e., the degree to which an item separates low- and high-ability examinees).
In CTT, these indexes change in relation to the group of examinees taking the test (e.g.,
they are sample dependent). However, when the assumptions of IRT hold and the model
adequately fits a set of item responses (i.e., either exactly or as a close approximation),
the same IRF/ICC for the test items is observed regardless of the distribution of ability of the
groups used to estimate the item parameters. For this reason the IRF is invariant across
populations of examinees. This situation is illustrated in Figure 10.3.
The property of invariance is also a property of the linear regression model. We
can make connections between the linear regression model and IRT models because IRT
models are nonlinear regression models. Recall from Chapter 2 and the Appendix that
the regression line for predicting Y from X is displayed as a straight line connecting the
conditional means of the Y values with each value or level of the X variable (Lomax,
2001, p. 26; Pedhazur, 1982). If the assumptions of the linear regression model hold,
the regression line (i.e., the slope and intercept) will be the same for each subgroup of
persons within each level of the X variable. In IRT, we are conducting a nonlinear regres-
sion of the probability of a correct response (Y) on the observed item responses (X). To
illustrate the property of invariance and how the assumption can be evaluated, we return
to the crystallized intelligence test 2 data, made up of 25 dichotomously scored items,
and focus on item number 11. First, two random subsamples of size 500 were created
from the total sample of 1,000 examinees. SPSS was used to create the random subsam-
ples, but any statistical package can be used to do this. Next, the classical item statistics
proportion-correct and point–biserial are calculated for each sample. To compare our
random subsamples derived from CTT item statistics with the results produced by IRT,
a two-parameter IRT model is fit to each subsample (each with N = 500) and the total
sample (N = 1,000). Although the two-parameter IRT model is yet to be introduced, it is
used here because item difficulty and discrimination are both estimated making compari-
sons between IRT and CTT possible. A summary of the CTT item statistics and the IRT parameter estimates is presented in Table 10.5, and Figure 10.4 illustrates the two item characteristic curves for random subsamples 1 and 2.
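
To see what the near-identical parameter estimates imply, the sketch below (not part of the book's software) evaluates the two-parameter logistic IRF for item 11 at several ability values, using the subsample estimates reported in Figure 10.4; the scaling constant D = 1.7 is assumed here because the output is labeled "normal metric."

import math

def irf_2pl(theta, a, b, D=1.7):
    """Two-parameter logistic item response function:
    P(correct | theta) = 1 / (1 + exp(-D * a * (theta - b)))."""
    return 1.0 / (1.0 + math.exp(-D * a * (theta - b)))

# Item 11 estimates for the two random subsamples (see Figure 10.4).
subsamples = {"subsample 1": (0.918, 0.038), "subsample 2": (0.928, -0.077)}

for theta in (-2.0, -1.0, 0.0, 1.0, 2.0):
    row = {name: round(irf_2pl(theta, a, b), 3) for name, (a, b) in subsamples.items()}
    print(f"theta = {theta:+.1f}: {row}")
# The probabilities are nearly identical at every ability level, which is the
# invariance property in practice: the same IRF is recovered from both subsamples.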
Inspection of the classical item statistics (CTT) in Table 10.5 (top half of the table)
for item 11 reveals that the two samples are not invariant with respect to the ability of
the two groups (i.e., the proportion-correct and point–biserial coefficients are unequal).
Next, comparing the parameters estimated for item 11 using the two-parameter IRT
[Figure 10.3 plots the probability of a yes response (0 to 1.00, with the .50 point marked) against ability (–3 to 3) and the corresponding IQ scale (55 to 145), with a low-ability group and a high-ability group distribution shown along the ability axis and an item located at 0.0.]
Figure 10.3.  Invariance of item response function across different ability distributions. A test item has the same IRF/ICC regardless of the ability distribution of the group. For an item location/difficulty of 0.0, the low-ability group will be less likely to respond correctly because a person in the low-ability group is located at –1.16 on the ability scale whereas a person in the high-ability group is located at 0.0 on the ability scale.

model, in Table 10.5 we see that the item difficulty or location estimates for the samples
are very close (.04 vs. –.07) and the discrimination parameters (labeled as “slope”)
are the same (.92 vs. .92). Finally, conducting a chi-square difference test between the
groups on item 11 yields no statistical difference, indicating that the item locations and
discrimination parameters are approximately equal. To summarize, (1) invariance holds
regardless of differences in person or examinee ability in the IRT model and (2) invari-
ance does not hold for the two random subsamples when using the CTT model (i.e., the
Table 10.5.  Classical Item Statistics and 2-PL IRT Parameter Estimates for Two Random Samples

Statistic                 Sample 1   Sample 2   Total sample
CTT
  Proportion correct          0.49       0.52       0.52
  Point–biserial              0.60       0.53       0.61
  Biserial correlation        0.67       0.67       0.70
IRT
  Logit                       0.02      –0.06      –0.06
  Intercept (γ)              –0.04       0.07       0.07
  Slope (a or α)              0.92       0.92       0.99
  Threshold (δ or b)          0.04      –0.07      –0.07

Note. Correlation between all 25-item thresholds (locations) for samples 1 and 2 = .97. Logit is derived based on the slope–intercept parameterization of the exponent in the 2-PL IRT model: α(θ) + γ. The relationship between an item's location or difficulty, intercept, and slope is δ = −γ/α, and for the total sample the item difficulty/location is derived as –1.52/1.48 = 1.03. The relationship between IRT discrimination and CTT biserial correlation for the total sample is
$$r_{bis} = \frac{\alpha_i}{\sqrt{1 + \alpha_i^2}} = \frac{.99}{\sqrt{1 + .99^2}} = \frac{.99}{\sqrt{1.98}} = \frac{.99}{1.41} = .70.$$
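
The two relationships stated in the note to Table 10.5 can be checked with a few lines of code; the sketch below uses the total-sample estimates from the table (the slope–intercept parameterization α(θ) + γ is taken from the note, and the code is illustrative rather than part of any IRT program).

import math

# Total-sample 2-PL estimates for item 11 from Table 10.5.
alpha, gamma = 0.99, 0.07

# Item location/difficulty from the slope-intercept parameterization: delta = -gamma / alpha.
delta = -gamma / alpha
print(f"delta = {delta:.2f}")           # approximately -0.07, matching the threshold row

# Relationship between the IRT discrimination and the CTT biserial correlation.
r_biserial = alpha / math.sqrt(1 + alpha ** 2)
print(f"biserial = {r_biserial:.2f}")   # approximately .70, matching the CTT biserial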

proportion-correct and point–biserial correlations are unequal). In the next section, the
process of simultaneously estimating the probability of item responses and person abil-
ity is introduced.

10.13 Estimating the Joint Probability of Item Responses Based on Ability

Using actual item responses from a set of examinees (e.g., in Table 10.2) and applying
Equation 10.1, we can estimate the joint probability of correct and incorrect responses to
each item in a test by a set of examinees. To do this, we can use the likelihood function in
Equation 10.5. In the Appendix, the symbol ∏ is defined as the multiplicative operator.
Applying the multiplicative operator to the likelihood values for individual item response
scores at a specific examinee ability level yields the total likelihood, in terms of probabili-
ties, for the response pattern of scores for a sample of examinees (see Equation 10.5 on
page 353 from Hambleton et al., 1991, p. 34).
The product resulting from the multiplicative operation yields very small values,
making them difficult to work with. To avoid this issue, the logarithms of the likelihood
functions (i.e., the log likelihoods) are used instead as given in Equation 10.6. Using
logarithms rescales the probabilities such that the log likelihood values are larger and
easier to work with. Furthermore, this step yields a linear model that allows for additive
operations. Because we now have an equation with additive properties, the summation
operator replaces the multiplicative operator. Equation 10.6 (Hambleton et al., 1991,
p. 35) illustrates these points about the use of log likelihoods.
[Figure 10.4 shows two item characteristic curves for ITEM0011, each plotting probability (0 to 1.0) against ability (–3 to 3). Panel 1 (subsample 1): a = 0.918, b = 0.038; annotated "2-Parameter Model, Normal Metric, Item: 11, Subtest: CRIT2, Chisq = 5.37, DF = 8.0, Prob < 0.7179." Panel 2 (subsample 2): a = 0.928, b = –0.077.]
Figure 10.4.  Item response functions for item 11 for subsamples 1 and 2. Vertical bars around the solid dots indicate the 95% level of confidence around the fit of the observed data relative to the predicted IRF based on the two-parameter IRT model. The nine dots represent the fit of different distribution points along the ability continuum.
Equation 10.5. Likelihood of observed response to a set of items by a randomly chosen examinee from a sample of examinees

$$L(u_1, u_2, \ldots, u_n \mid \theta) = \prod_{j=1}^{n} P_j^{u_j} Q_j^{1-u_j}$$

• L = likelihood of response to an item.
• L(u1, u2, ..., un|θ) = likelihood of an examinee responding correctly to an item given the examinee's ability or theta value.
• Pj = probability of responding to item j.
• uj = symbol for an item response for a sample as opposed to a population.
• $P_j^{u_j}$ = probability of responding correctly to item j.
• $Q_j^{1-u_j}$ = probability of responding incorrectly to item j.
• Qj = 1 – Pj.
• θ = person ability or theta.
• $\prod_{j=1}^{n} P_j^{u_j} Q_j^{1-u_j}$ = product of the probabilities of correct and incorrect responses to items 1 through n.

Equation 10.6. Summation of the logarithm of the likelihood function based on responses to a set of items by a randomly chosen examinee from a sample

$$\ln L(u \mid \theta) = \sum_{j=1}^{n} \left[ u_j \ln P_j + (1 - u_j) \ln(1 - P_j) \right]$$

• L = likelihood of response to an item.
• lnL(u|θ) = logarithm of the likelihood of the item response vector given ability.
• u = item response vector.
• Pj = probability of responding to item j.
• 1 – Pj = also signified in IRT as Qj.
• L(u1, u2, ..., un|θ) = likelihood of an examinee responding correctly to an item given his or her ability or theta value.
• θ = person ability or theta.
• $\sum_{j=1}^{n}$ = sum of the logarithms of the probabilities of correct and incorrect responses to items 1 through n.
Equation 10.7. Likelihood of response pattern for examinee number 4 in Table 10.6 with ability of 2.0

$$L(u_1, u_2, u_3, u_4, u_5 \mid \theta) = (P_1^1 Q_1^0)(P_2^1 Q_2^0)(P_3^0 Q_3^1)(P_4^1 Q_4^0)(P_5^1 Q_5^0)$$
$$= (.98 \times [1 - .98])(.95 \times [1 - .95])(.88 \times [1 - .88])(.79 \times [1 - .79])(.5 \times [1 - .5])$$
$$= (.02)(.05)(.11)(.16)(.25)$$
$$= .00000345$$

In logarithms of likelihoods:

$$-1.75 + (-1.34) + (-.976) + (-.788) + (-.602) = -5.456$$

Equation 10.7 (Hambleton et al., 1991, p. 34) illustrates the process of applying the
likelihoods for estimating the probability of the observed response pattern for examinee
number 4 based on the data in Table 10.6.
During an IRT analysis, the steps above are conducted for all examinees and all items in a sample. As presented in the Appendix, to facilitate locating the value of a person's ability at which the likelihood is at its maximum, the logarithm of the likelihood function is used, as illustrated in Equations 10.6 and 10.7. Using logarithms, we define the value of ability (θ) that maximizes the log likelihood for an examinee as the maximum likelihood estimate of ability, MLE (θ̂) (the "hat" on top of θ signifies

Table 10.6.  Item Responses for Six Examinees to Items 1–5 on Crystallized Test of Intelligence 2

Examinee   Item 1 (d = –2.0)   Item 2 (d = –1.0)   Item 3 (d = 0.0)   Item 4 (d = 1.0)   Item 5 (d = 2.0)
1                1                   1                   1                  0                  0
2                1                   0                   1                  1                  0
3                0                   1                   0                  0                  1
4                1                   1                   0                  1                  1
5                0                   1                   0                  1                  0
6                0                   0                   1                  0                  1
Note. Person ability for examinee 4 is assumed to be 2.0. Item difficulty (d) is on a z-score metric. This example assumes a Rasch model where item discriminations are all 1.0 and there is no guessing (i.e., c-parameter = 0.0).

that it is an estimate rather than a population parameter). The process of estimating the MLE is iterative: for example, ability for a sample of examinees is estimated based on initial item parameter estimates from the observed data, and the maximum of the likelihood for person ability is then located using calculus-based numerical methods. These numerical algorithms are included in IRT programs such as IRTPRO (2011), BILOG-MG (2003), PARSCALE (1997), MULTILOG (2003), WINSTEPS (2006), and CONQUEST (1998). The process of estimating ability and item parameters is iterative because locating the maximum likelihood estimate of ability requires searching for the location where the slope of the likelihood function is zero; this must be performed for all persons in a sample. Further explanation of the process of estimating item parameters and person ability is provided in Section 10.15.
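As a conceptual sketch of this search, the following Python code (an illustration only; a simple grid search stands in for the calculus-based routines used by the programs named above) evaluates the Rasch log likelihood for one hypothetical response pattern, using the item difficulties from Table 10.6, and reports the ability value at which the log likelihood is largest.

import math

def rasch_p(theta, b):
    # Rasch probability of a correct response (see Equation 10.9).
    return math.exp(theta - b) / (1 + math.exp(theta - b))

def log_likelihood(theta, responses, difficulties):
    # Log likelihood of a response pattern at a given ability (see Equation 10.6).
    total = 0.0
    for u, b in zip(responses, difficulties):
        p = rasch_p(theta, b)
        total += u * math.log(p) + (1 - u) * math.log(1 - p)
    return total

difficulties = [-2.0, -1.0, 0.0, 1.0, 2.0]  # item difficulties from Table 10.6
responses = [1, 1, 0, 1, 1]                 # examinee 4's response pattern

# Evaluate the log likelihood on a grid of ability values from -3.0 to 3.0
# and keep the theta at which it peaks (the maximum likelihood estimate).
grid = [x / 100 for x in range(-300, 301)]
theta_hat = max(grid, key=lambda t: log_likelihood(t, responses, difficulties))
print(round(theta_hat, 2))  # the grid value where the log likelihood peaks (near 2.0)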
To make Equations 10.1 through 10.4 more concrete, we use the Rasch model as
an illustrative framework. The Rasch model receives a formal introduction later, but for
now it is used to illustrate the estimation of the probability of responses to items for a
sample of six examinees. In the Rasch model, the probability of a response depends on
two factors, the examinee’s ability to answer the item correctly and the difficulty of the
item. To account for both examinee ability and item difficulty in a single step, we can use
Equation 10.3. To illustrate the combined role of Equations 10.1 through 10.4, we use a
small portion of the data from our sample of 1,000 persons on crystallized intelligence
test 2 as shown in Table 10.7.
In the Rasch model (Equation 10.9 on page 356), the two parameters involved are the
difficulty of an item and the ability of the examinee. Note that in Table 10.7, we assume
the six examinees all possess an ability of 2.0. The probability of responding correctly
to an item given person ability is expressed on a 0 to 1 metric; ability is expressed on a
standard or z-score scale. The z-score metric is useful because z-scores can be mapped
onto the normal distribution as an area or proportion under the normal curve. Because the item responses are dichotomous and the probability of a response to an item is bounded on a 0 to 1 metric, a logistic function based on the constant e (2.7183) is used, as in Equation 10.8. A convenient result of using the logistic

Table 10.7.  Item Responses for Six Examinees to Items 1–5 on Crystallized Test of Intelligence 2

Examinee   Item 1 (d = –2.0)   Item 2 (d = –1.0)   Item 3 (d = 0.0)   Item 4 (d = 1.0)   Item 5 (d = 2.0)
1                1                   1                   1                  0                  0
2                1                   0                   1                  1                  0
3                0                   1                   0                  0                  1
4                1                   1                   0                  1                  1
5                0                   1                   0                  1                  0
6                0                   0                   1                  0                  1
Note. Person ability for examinees 1–6 is assumed to be 2.0. Item difficulty (d) is on a z-score metric.

Equation 10.8. Logistic equation

p(x) = e^z / (1 + e^z)

• p(x) = probability of a correct response when the predictor takes on values of x.
• e = constant equal to 2.7183.
• z = linear combination of predictor variables.

equation is that by taking the exponent of the combination of predictor variables (in
this case, q – d), the result is a linear model that is much easier to work with. In fact,
logistic regression is a widely used alternative in statistical methods when the outcome
variable is on a 0.0 to 1.0 metric (e.g., a binomial distributed outcome variable rather
than a continuous one).
Next, inserting θ – δj into the logistic equation, as illustrated in Equation 10.9, yields the Rasch model. The key to understanding Equation 10.9 is to look closely at the exponent in the numerator and denominator. In this exponent, we see that item difficulty is subtracted from person ability. It is this difference that is plotted against the
probability of an examinee responding correctly to an item. The importance of the previous
sentence cannot be overstated because other, more advanced types of IRT models build
on this concept. Continuing with our example, the probability of a correct response is
mapped onto the cumulative normal distribution (i.e., a z-score metric; see Chapter 2
and the Appendix for a review of the cumulative normal distribution function). The item

Equation 10.9. Rasch model

P(x_j = 1 | θ, δ_j) = e^{(θ − δ_j)} / [1 + e^{(θ − δ_j)}]

• P(x_j = 1 | θ, δ_j) = the probability of a correct response given person location and item j difficulty.
• θ = the person location, also called ability or theta.
• δ_j = the item j difficulty or location.
• e = a constant equal to 2.7183.

[Figure 10.5 (graphic): a Rasch item response function with a = 1.000 and b = 0.000; probability of a correct response plotted against ability (–3 to 3).]

Figure 10.5.  A Rasch item response function.

difficulty and person ability are also represented on the z-score metric and are therefore
linked to the cumulative normal distribution.
The Rasch model, like other IRT models, incorporates the logistic function because
the relationship between the probability of an item response to person ability and item
difficulty is nonlinear (e.g., expressed as an S-shape curve). Figure 10.5 illustrates a Rasch
ICC where person ability is 0.0, item location or difficulty is 0.0, and the probability of
a response is .50 or 50%. In the figure, the ICC is based on the 1,000 item responses to
item 3 on the test of crystallized intelligence 2.
Continuing with our example, we can apply Equations 10.1, 10.5, and 10.6 to obtain
the probability of a correct response for examinee 2 regarding their response to item
number 4. For example, if we insert a value of 2.0 for the examinee’s ability (q), 1.0 for
the item 4 difficulty, and 1 for a correct response into Equation 10.9, we obtain the result
in Equation 10.10. To interpret, the probability is .73 that a person with ability 2.0 and
item difficulty 1.0 will answer the item correctly. In practice, a complete Rasch (or IRT)
analysis involves repeating this step for all examinees and all items on the test.
Finally, the goal in IRT is to estimate the probability of an observed item response
pattern for the entire set of examinees in a sample. To accomplish this, we estimate the
likelihood of observing an item response pattern using all 25 items on crystallized intel-
ligence test 2 for 1,000 examinees over a range of ability (a range of z = –3.0 to 3.0). We
return to the step of estimating the likelihood of unique response patterns for a sample
of examinees shortly.

Equation 10.10. Probability of an examinee with ability θ = 2.0 responding correctly to an item with difficulty δ = 1.0

P(x_j = 1 | θ, δ_j) = 2.7183^{(2.0 − 1.0)} / [1 + 2.7183^{(2.0 − 1.0)}] = 2.7183 / (1 + 2.7183) = .73
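The arithmetic in Equation 10.10 can be reproduced in a few lines of Python (a sketch for illustration, not output from an IRT program):

import math

def rasch_probability(theta, delta):
    # Probability of a correct response under the Rasch model (Equation 10.9).
    return math.exp(theta - delta) / (1 + math.exp(theta - delta))

# Examinee ability of 2.0 and item difficulty of 1.0, as in Equation 10.10.
print(round(rasch_probability(2.0, 1.0), 2))  # 0.73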

10.14 Item and Ability Information and the Standard Error of Ability

The Appendix introduces maximum likelihood estimation, noting that its use is par-
ticularly important for challenging parameter estimation problems. The challenges of
estimating person ability and item parameters in IRT make maximum likelihood estima-
tion an ideal technique to use. The Appendix provides an example to illustrate how MLE
works. The total likelihood is approximately normally distributed, and the estimate of its standard deviation serves as the standard error of the MLE.
Once the item parameters are estimated, they are fixed (i.e., they are a known entity), and
the sampling distribution of ability (q) and its standard deviation can be estimated. The
standard deviation of the sampling distribution of ability (q) is the standard error of the
MLE of ability (q). The dispersion of likelihoods resulting from the estimation process
may be narrow or broad depending on the location of the value of (q) for examinee and
item parameters.
Closely related to the item response function (IRF/ICC) is the item information
function (IIF). The IIF plays an important role in IRT because (a) it provides a way to
identify where a test item is providing the most information relative to examinee ability
and (b) a standard error of the MLE is provided, making it possible to identify the preci-
sion of ability along the score scale or continuum. Additionally, IIFs can be summed to
create an index of total test information. The IIF is presented in Equation 10.11.
Because the slope is set to 1.0 in the Rasch model, the information function simpli-
fies, as illustrated in Equation 10.11a.
Equation 10.11a is also applicable to the one-parameter IRT model because, although
the slope is not required to be set to 1.0, it is required to be set to a constant value; this
constant is dictated by the empirical data. To illustrate Equation 10.11a with our intel-
ligence test data, let’s assume that we are interested in looking at the information for item
11 in relation to an examinee with ability of 0.0. Using the item location of –.358 and abil-
ity of 0.0, we insert these values into Equation 10.11a as illustrated in Equation 10.11b
for the Rasch model, where the slope or discrimination is set to 1.0. For example, for item 11 with a location of –.358 (see Table 10.8 to verify this value), the information at an ability of 0.0 is approximately .23, as illustrated in Figure 10.6. Finally, in the Rasch model, item information reaches its maximum value of .25 at the point on the ability scale equal to the item location (here, d = –.358).

Equation 10.11. IRT information function

I_j(θ) = [P′_j]^2 / [P_j(1 − P_j)]

• I_j(θ) = information function for item j.
• P′_j = first derivative (slope) of the item response function at ability θ.
• P_j(1 − P_j) = variability at the point at which the slope of the IRF/ICC is derived.

Equation 10.11a. Information function for the Rasch and one-parameter IRT model

I_j(θ) = a^2 P_j(1 − P_j)

• I_j(θ) = information function for item j.
• a^2 = square of the slope (discrimination) of the item response function.
• P_j(1 − P_j) = variability at the point at which the slope is derived.

Equation 10.11b. Simplified item information function in the Rasch model

I_j(θ) = a^2 P_j(1 − P_j) = (1)^2 (.358)(1 − .358) = (.358)(.642) ≈ .23

• I_j(θ) = information for item j at ability θ.
• P_j = probability of a correct response to item j at the specific ability.
• a^2 = discrimination parameter squared (1.0 in the Rasch model).
• P_j(1 − P_j) = probability of a correct response times the probability of an incorrect response to item j.

Table 10.8.  Item Parameter Estimates for Crystallized Intelligence Test 2

ITEM        a-parameter (S.E.)   b-parameter (S.E.)   c-parameter (S.E.)   Chi-square (Prob)
ITEM0002    1.000 (0.024)        –5.357 (0.271)       0.000 (0.000)         0.0 (0.0000)
ITEM0003    1.000 (0.024)        –5.441 (0.176)       0.000 (0.000)         0.4 (0.8223)
ITEM0004    1.000 (0.024)        –2.654 (0.062)       0.000 (0.000)        14.7 (0.0230)
ITEM0005    1.000 (0.024)        –2.039 (0.055)       0.000 (0.000)        15.2 (0.0336)
ITEM0006    1.000 (0.024)        –1.338 (0.055)       0.000 (0.000)        11.4 (0.1813)
ITEM0007    1.000 (0.024)        –1.294 (0.056)       0.000 (0.000)        22.7 (0.0020)
ITEM0008    1.000 (0.024)        –2.169 (0.055)       0.000 (0.000)        35.3 (0.0000)
ITEM0009    1.000 (0.024)        –0.927 (0.048)       0.000 (0.000)        24.2 (0.0021)
ITEM0010    1.000 (0.024)        –0.550 (0.048)       0.000 (0.000)        28.6 (0.0004)
ITEM0011    1.000 (0.024)        –0.358 (0.047)       0.000 (0.000)         6.2 (0.6214)
ITEM0012    1.000 (0.024)        –0.001 (0.048)       0.000 (0.000)        14.7 (0.0657)
ITEM0013    1.000 (0.024)         0.011 (0.054)       0.000 (0.000)        64.7 (0.0000)
ITEM0014    1.000 (0.024)         0.048 (0.051)       0.000 (0.000)        20.9 (0.0040)
ITEM0015    1.000 (0.024)        –0.001 (0.047)       0.000 (0.000)        27.3 (0.0003)
ITEM0016    1.000 (0.024)         0.259 (0.048)       0.000 (0.000)        42.6 (0.0000)
ITEM0017    1.000 (0.024)         0.495 (0.051)       0.000 (0.000)        22.7 (0.0019)
ITEM0018    1.000 (0.024)         1.251 (0.053)       0.000 (0.000)        10.7 (0.0971)
ITEM0019    1.000 (0.024)         1.722 (0.053)       0.000 (0.000)         8.2 (0.2263)
ITEM0020    1.000 (0.024)         1.876 (0.053)       0.000 (0.000)        10.1 (0.0725)
ITEM0021    1.000 (0.024)         2.112 (0.060)       0.000 (0.000)        22.2 (0.0005)
ITEM0022    1.000 (0.024)         2.278 (0.059)       0.000 (0.000)        13.1 (0.0228)
ITEM0023    1.000 (0.024)         2.552 (0.063)       0.000 (0.000)        17.1 (0.0018)
ITEM0024    1.000 (0.024)         3.013 (0.067)       0.000 (0.000)         1.1 (0.8867)
ITEM0025    1.000 (0.024)         3.889 (0.090)       0.000 (0.000)         8.7 (0.0340)
Note. Standard errors appear in parentheses after each estimate; the value in parentheses after each chi-square is its associated probability. Item 1 is not provided because no maximum was achieved due to a perfect response string. This output is a partial listing from phase 2 of BILOG-MG. Item 11 was plotted in Figure 10.4. The a-parameter (slope) is set to a value of 1.0, conforming to the Rasch model assumptions. Also provided in BILOG-MG phase 2 output is the item loading, which is the correlation between the item and the latent construct.

[Figure 10.6 (graphic): item information curve for ITEM0011, plotted against the scale score; the maximum information and the item 11 information at b = –.358 are marked.]

Figure 10.6.  Item information function for item 11. I(θ) on the Y-axis is item information; the X-axis is the ability scale (θ). At an ability of 0.0 the information provided by the item is approximately .23. The information reaches its maximum of .25 at the item location (b = –.358), where the probability of a correct response is .50.

You can verify that the maximum item information in the Rasch model is .25 by inserting a probability of a correct response of .50 (with a = 1.0) into Equation 10.11a.
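One quick way to carry out this verification is the short Python sketch below (illustrative only). It evaluates Equation 10.11a for item 11 under the Rasch model, where the slope is 1.0, and confirms that information peaks at .25 when ability equals the item location (the point at which the probability of a correct response is .50).

import math

def rasch_p(theta, b):
    # Rasch probability of a correct response.
    return math.exp(theta - b) / (1 + math.exp(theta - b))

def item_information(theta, b, a=1.0):
    # Item information I(theta) = a^2 * p * (1 - p) (Equation 10.11a).
    p = rasch_p(theta, b)
    return a**2 * p * (1 - p)

b11 = -0.358  # location of item 11 from Table 10.8

# At the item location the probability of a correct response is .50,
# so the information is 1 * .5 * .5 = .25 (the Rasch maximum).
print(round(item_information(theta=b11, b=b11), 2))  # 0.25

# Information declines as ability moves away from the item location.
print(round(item_information(theta=2.0, b=b11), 2))  # about 0.08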
Because the slope is set to 1.0 in the Rasch model, the information function simplifies, as illustrated in Equation 10.11b.
Similarly, information can be estimated for the examinee ability estimate (θ). The ability estimated by an IRT model is θ̂, displayed as "theta hat" because it is an estimate rather than a population parameter.

Equation 10.12. Standard error of ability

SE(θ̂ | θ) = 1 / √I(θ) = 1 / √( Σ_{j=1}^{n} [P′_j]^2 / [P_j(1 − P_j)] )

• SE(θ̂ | θ) = standard error of the ability estimate theta hat given the population theta.
• I(θ) = information at a given level of ability.
• Σ_{j=1}^{n} [P′_j]^2 / [P_j(1 − P_j)] = the item information functions summed over the items on the test (the test information).

Equation 10.13. Confidence interval for ability

[θ̂ − z_{α/2} SE(θ̂), θ̂ + z_{α/2} SE(θ̂)]

• θ̂ = ability estimate of an examinee.
• SE(θ̂) = standard error of the ability estimate theta hat.
• z_{α/2} = upper 1 − (α/2) percentile point in the normal distribution (e.g., for the 95% level of confidence, α = .05 and z_{α/2} = 1.96).

The relationship between the IIF and the standard error of the MLE of ability is illustrated in Equation 10.12.
The information function for person ability estimates serves as a measure of preci-
sion in relation to item difficulty and discrimination parameters. Because the estimate
of person or examinee ability q̂ is normally distributed, Equation 10.13 can be used to
derive a confidence interval around the MLE of ability. The standard error of the estimate
of ability in Equation 10.12 is useful for deriving conditional errors of measurement and
an IRT-based form of conditional reliability along the score scale (Raju et al., 2007; Price
et al., 2006; Kolen et al., 1992).
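To illustrate Equations 10.12 and 10.13 together, the following Python sketch (hypothetical item locations under a Rasch model with slopes of 1.0; not BILOG-MG output) sums item information into test information, converts it to a standard error, and forms a 95% confidence interval around an ability estimate.

import math

def rasch_p(theta, b):
    # Rasch probability of a correct response.
    return math.exp(theta - b) / (1 + math.exp(theta - b))

def test_information(theta, difficulties):
    # Test information: the item informations p*(1 - p) summed over items (a = 1.0).
    return sum(rasch_p(theta, b) * (1 - rasch_p(theta, b)) for b in difficulties)

# Hypothetical item locations spread across the ability scale.
difficulties = [-2.0, -1.5, -1.0, -0.5, 0.0, 0.5, 1.0, 1.5, 2.0]

theta_hat = 0.0                                    # an examinee's ability estimate
info = test_information(theta_hat, difficulties)   # I(theta)
se = 1 / math.sqrt(info)                           # SE(theta_hat), Equation 10.12

# 95% confidence interval for ability (Equation 10.13), with z_(alpha/2) = 1.96.
lower, upper = theta_hat - 1.96 * se, theta_hat + 1.96 * se
print(round(se, 2), (round(lower, 2), round(upper, 2)))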

10.15 Item Parameter and Ability Estimation

At the outset of an IRT analysis, both item parameters and examinee ability are unknown
quantities and must be estimated. The estimation challenge is to find the ability of each

examinee and the item parameters using the responses to items on the test. It is beyond
the scope of this chapter to present a full exposition of the various estimation techniques
and the associated mechanics of how they are implemented. This section presents a con-
ceptual overview of how the estimation of ability and item parameters works. Readers
are referred to Baker and Kim (2004) and de Ayala (2009) for excellent treatments and
mathematical details of estimation techniques and their implementation in computer
programs currently employed in IRT.
Simultaneously estimating the item parameters and examinee abilities is computa-
tionally challenging. The original approach to estimating these parameters was joint maximum likelihood estimation (JMLE), which involves simultaneously estimating both examinee ability and item parameters (Baker & Kim, 2004, pp. 83–108). However, the JMLE approach produces inconsistent and biased estimates of person abilities and item parameters under circumstances such as small sample sizes and tests composed of fewer than 15 items. Another problem associated with JMLE is inflated chi-square tests of global fit of the IRT model to the data (Lord, 1980). For these reasons, the marginal
maximum likelihood estimation (MMLE) approach (Bock & Aitkin, 1982) is the tech-
nique of choice and is incorporated into most, if not all, IRT programs (e.g., IRTPRO,
BILOG-MG, PARSCALE, MULTILOG, and CONQUEST). In the MMLE technique, the
test items are estimated first and subsequently considered fixed (i.e., nonrandom). Next, the person abilities are estimated and are viewed as a random component sampled from a population. Treating person ability as a random component of the population provides a way to introduce population information without directly and simultaneously estimating an ability parameter for each examinee.
In practice, the item parameters are estimated first using MMLE. This step occurs
by first integrating out the ability parameters based on their known approximation
to the normal distribution. Specifically, in MMLE it is the unconditional (marginal-
ized) probability of a randomly selected person from a population with a continuous latent
distribution that is linked to the observed item response vector (de Ayala, 2009; Baker &
Kim, 2004; Bock & Aitkin, 1982). With person ability eliminated from the estima-
tion process through integration, the unconditional or marginal likelihood for item
parameter estimation becomes possible in light of the large number of unique per-
son ability parameters. Once the item parameters are estimated and model-data fit is
acceptable, the estimation of person ability is performed. The result of this estimation
process is a set of person abilities and item parameter estimates that have asymptotic
properties (i.e., item parameter estimates are consistent as the number of examinees
increases). When conducting an IRT analysis using programs such as BILOG-MG,
IRTPRO, PARSCALE, MULTILOG, and CONQUEST, the process of ability and item
parameter estimation is iterative (i.e., the program updates ability and item parameter
estimates until an acceptable limit or solution is reached). The process results in abil-
ity and item parameter estimates that have been refined in light of one another based
on numerical optimization.
IRT is a large-sample technique that capitalizes on the known properties of the cen-
tral limit theorem. For this reason, sample size is an important factor when estimating

ability and item parameters in any IRT analysis. Research has demonstrated (e.g., de
Ayala, 2009; Baker & Kim, 2004) that in general, for Rasch and one-parameter IRT model
estimation (also called Rasch or IRT calibration), a sample size of at least 500 examinees
is recommended. For the two- and three-parameter IRT models, a sample size of at least
1,000 is recommended. However, in some research and analysis situations these numbers
may be relaxed. For example, if the assumptions of the Rasch or IRT analysis are met
and inspection of the model-data fit diagnostics reveals excellent results, then the sample
size recommendations provided here may be modified. As an example, some simulation
research has demonstrated that sample sizes as low as 100 yield adequate model-data fit
and produce acceptable parameter estimates (de Ayala, 2009).
Now we return to the task of estimating the unobserved (latent) ability for per-
sons after item parameters are known (are a fixed entity). In MMLE, the population
distribution of ability for examinees or persons is assumed to have a specific form (usu-
ally normal). For explanation purposes, let’s assume that our population of interest is in
fact normally distributed. Knowing the statistical characteristics of the population, the
mechanics of ability estimation employs an empirical Bayesian statistical approach to
estimating all of the parameters of person ability within a range (usually under a standard
score range of –3.0 to +3.0). The Bayesian approach to probability and parameter estima-
tion is introduced in the Appendix. Readers should briefly review this information now.
Recall that in the MMLE approach the ability parameters are integrated out during the estimation of the item parameters. With the item parameters then fixed (known), the ability parameters (θ) can be estimated more efficiently. Two Bayesian approaches are used to estimate person ability: expected a posteriori (EAP) and maximum a posteriori (MAP). One of the two is selected based on the requirements of the analysis at hand, for example, characteristics of the sample such as sample size and distributional form in relation to the target population. The type of items that compose the test (i.e., dichotomous, partial-credit, or polytomous formats) must also be considered. In the Bayesian context, the population distribution of
ability (q) is called the prior, and the product of the likelihood of q and prior density gives
the posterior distribution of ability of q, given the empirical item response pattern (Du
Toit, 2003, p. 837). As a Bayesian point estimate (e.g., the mean or mode) of q, it is typical
to use the value of q at the mode of the posterior distribution (MAP) or the mean of the posterior distribution (EAP). The choice depends on the context of the testing scenario
(e.g., the type and size of sample and the type and length of test). The equation illustrat-
ing the estimation of the likelihood of an item response vector, given person ability and
item parameters a-, b-, and c-, is provided in Equation 10.14.
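Before turning to that equation, here is a conceptual Python sketch of the EAP approach just described (hypothetical item parameters, a grid of quadrature points, and a standard normal prior); the EAP estimate is the likelihood-weighted average of the grid points, and the MAP estimate would be the grid point at which the posterior is largest.

import math

def p_correct(theta, a, b):
    # Two-parameter logistic probability of a correct response.
    return 1 / (1 + math.exp(-a * (theta - b)))

def likelihood(theta, responses, item_params):
    # Likelihood of a response pattern at theta (product over items).
    value = 1.0
    for u, (a, b) in zip(responses, item_params):
        p = p_correct(theta, a, b)
        value *= p**u * (1 - p)**(1 - u)
    return value

# Hypothetical item parameters (a, b) and one examinee's responses.
item_params = [(1.0, -1.5), (1.2, -0.5), (0.8, 0.0), (1.1, 0.5), (0.9, 1.5)]
responses = [1, 1, 1, 0, 0]

# Quadrature points from -4.0 to 4.0 and standard normal prior weights.
points = [x / 10 for x in range(-40, 41)]
prior = [math.exp(-0.5 * t**2) for t in points]

# The posterior is proportional to prior * likelihood; EAP is its mean.
posterior = [w * likelihood(t, responses, item_params) for t, w in zip(points, prior)]
eap = sum(t * q for t, q in zip(points, posterior)) / sum(posterior)
print(round(eap, 2))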

10.16 When Traditional IRT Models Are Inappropriate to Use

There are two instances when the assumptions of IRT are violated and prevent the use of
standard IRT models. First, local independence is violated when examinees respond to test
items composed of testlets (Wainer & Kiely, 1987; Wainer, Bradlow, & Du, 2000). Testlets

Equation 10.14. Likelihood function of item responses given ability and item parameters

L(u_1, u_2, u_3, …, u_n | θ̂, a, b, c) = ∏_{i=1}^{N} ∏_{j=1}^{n} P_j^{u_{ij}} Q_j^{1−u_{ij}}

• L = likelihood of the item responses given ability and the item parameters.
• u_{ij} = response of examinee i to item j.
• a = item discrimination parameter.
• b = item difficulty parameter.
• c = pseudoguessing parameter.
• θ̂ = examinee's estimate of ability.
• ∏_{i=1}^{N} = multiplication over persons (abilities).
• ∏_{j=1}^{n} = multiplication over item responses.
• P_j^{u_{ij}} = probability of a correct response to an item.
• Q_j^{1−u_{ij}} = probability of an incorrect response to an item.

are a collection of items designed to elicit responses to a complex scenario (e.g., a multistep problem in mathematics, or a laboratory problem or sequence in science) expressed in a short paragraph. Such clusters of items are correlated by the structure of the item
format, thereby violating local item independence. Wainer et al. (2007) and Jannarone
(1997) provide rudimentary details and present a framework for developing IRT models
for items and tests that violate the conventional assumption of local independence.
Unidimensional IRT models are also inappropriate when a test is given under the
constraint of time (i.e., a speeded testing situation). For example, under a speeded test-
ing scenario two underlying abilities are being measured: cognitive processing speed and
achievement. Researchers interested in using IRT for timed or speeded tests are encour-
aged to read Verhelst, Verstralen, and Jansen (1997) and Roskam (1997), both of whom
provide comprehensive details regarding using IRT in these situations.
The next section presents Rasch and IRT models used in educational and psycho-
logical measurement. Specifically, the Rasch, one-, two-, and three-parameter logistic IRT
models for dichotomous data are presented. These models were the first to be developed
and are foundational to understanding more advanced types of Rasch and IRT mod-
els (e.g., tests and instruments that consist of polytomous, partial credit, or Likert-type
items, and multidimensional Rasch and IRT models).

10.17 The Rasch Model

Perhaps no other model has received more attention than Rasch’s model (1960). Georg
Rasch (1901–1980), a Danish mathematician, proposed that the development of items
comprising a test follow a probabilistic framework directly related to a person’s abil-
ity. Rasch, using a strict mathematical approach, proposed that a certain set of require-
ments must be met prior to obtaining objective-type measurement similar to those in the
physical sciences. Rasch’s epistemological stance was that in order for measurement to
be objective, the property of invariant comparison must exist. Invariant comparison is
a characteristic of interval or ratio-level measurement often used for analysis in applied
physics. According to Rasch (1960), invariant comparison (1) is a comparison between
two stimuli that should be independent of the persons who were used for the compari-
son, and (2) should be independent of any other related stimuli that might have been
compared. Thus, the process of Rasch measurement and modeling is different from
classic statistical modeling—and the other IRT modeling approaches presented in this
chapter. In the Rasch approach to measurement, the model serves as a standard or crite-
rion by which data can be judged to exhibit the degree of fit relative to the measurement
and statistical requirements of the model (Andrich, 2004). Also important to the Rasch
approach is the process of using the mathematical properties of the model to inform the
construction of items and tests (Wright & Masters, 1982; Andrich, 1988; Wilson, 2005;
Bond & Fox, 2001). Conversely, in general statistical approaches, models are used to
describe a given set of data, and parameters are accepted, rejected, or modified depend-
ing on the outcome. This latter approach is the one adopted and currently used by a large
proportion of the psychometric community regarding IRT.

10.18 The Rasch Model, Linear Models, and Logistic Regression Models

In the Rasch and other IRT models, the probability of a correct response on a dichot-
omous test item is modeled as a logistic function (Equation 10.8) of the difference
between a person’s ability and an item’s difficulty parameter (Equation 10.9). The logistic
function is used extensively in statistics to extend the linear regression model to outcome variables that are dichotomous. Although many dis-
tributions are possible for use with dichotomous variables, the logistic has the following
desirable properties. First, it is easy to use and is highly flexible. Second, interpretation
of the results is straightforward because application of the logistic function results in a
model that is linear based on the logarithmic transform, making interpretation similar to
a linear regression analysis. In linear regression, the key quantity of interest is the mean
of the outcome variable at various levels of the predictor variable.
There are two critical differences between the linear and logistic regression models.
First is the relationship between the predictor (independent) variables and the criterion
(dependent) variable. In linear regression, the outcome variable is continuous, but in

logistic regression (and in IRT), the outcome variable is dichotomous. Therefore, the outcome is based on the probability of a correct response (Y) conditional on the ability of a person (i.e., the x variable). In the linear regression model, the outcome variable is expressed as the conditional mean E(Y|x), the expected value of Y given x. In linear regression, we assume that this mean can be expressed as a linear equation.
The second major difference between the linear and logistic regression models involves the conditional distribution of the outcome variable (the probability of a correct response). In the logistic regression model, the outcome variable is expressed as y = p(x) + e. The symbol e is an error term and represents an observation's deviation from the conditional mean; p(x) is a probability based on the binomial distribution (i.e., bounded on a 0 to 1 metric). In linear regression, a common assumption about e is that it follows a normal distribution with mean 0.0 and constant variance across the levels of the independent variable; under this assumption, the conditional distribution of the outcome variable given x is also normally distributed. However, this is not true for dichotomous variables modeled on the range of 0 to 1. In the dichotomous case, e may assume only one of two values: if y = 1, then e = 1 – p(x) with probability p(x), and if y = 0, then e = –p(x) with probability 1 – p(x).
Inspection of Figure 10.3 reveals that as ability increases, the probability of a cor-
rect response increases. Also, as Figure 10.5 illustrates, the relationship E(Y|x) is now expressed as p(x) in the logistic model, that is, as the probability of a correct response given ability in the Rasch or any other IRT model.
Notice that because the conditional mean of Y (the probability) gradually approaches 0
or 1 (rather than directly in a linear sense), the IRF is depicted as an S-shaped curve. In
fact, the curve in Figure 10.5 resembles one-half of the cumulative normal distribution
(see the Appendix). The following logistic Equation 10.15 (and Equation 10.8 presented
earlier) and the Rasch model Equation 10.16 (Equation 10.9 presented earlier) yield
parameters that are linear in the logistic transformation.
To illustrate, in Figure 10.7 the probability of a person responding correctly to item
3 (from Figure 10.1 in the beginning of the chapter) is provided on the Y-axis, and

Equation 10.15. Logistic equation

p(x) = e^z / (1 + e^z)

• p(x) = probability of a correct response (a value of 1) when the predictor takes on values of x.
• e = constant equal to 2.7183.
• z = linear combination of predictor variables.

Equation 10.16. Rasch model

P(x_j = 1 | θ, δ_j) = e^{(θ − δ_j)} / [1 + e^{(θ − δ_j)}]

• P(x_j = 1 | θ, δ_j) = probability of a correct response given person location and item difficulty.
• θ = person location (also called ability or theta).
• δ_j = item difficulty or location.

[Figure 10.7 (graphic): item response function with a = 1.000 and b = 0.000; probability of a correct response plotted against ability (–3 to 3), with the item location marked on the ability axis.]

Figure 10.7.  IRF for a person with an ability of 0.0 and item difficulty or location of 0.0.

person ability is given on the X-axis. Notice that the item location or difficulty is 0.0 and
is marked on the X-axis by the letter b (this is denoted as d in the Rasch model). The
b-parameter (or d) is a location parameter, meaning that it “locates” the item response
function on the ability scale.
Using Equation 10.15 and inserting the values of 0.0 for person location and 0.0 for
item location into Equation 10.16, we see that the probability of a person responding cor-
rectly to item 3 is .50 (see Equation 10.17).

Equation 10.17. Application of the Rasch model

P(x_j = 1 | θ, δ_j) = 2.7183^{(0.0 − 0.0)} / [1 + 2.7183^{(0.0 − 0.0)}] = 1 / (1 + 1) = .50

In words, Equation 10.17 means that a person with ability 0.0 answering an item with a
location (difficulty) of 0.0 has a 50% probability of a correct response. Next, we calibrate
the item response data for intelligence test 2 with the Rasch model using BILOG-MG
(Mislevy & Bock, 2003). In Figure 10.8, we see the result of Rasch calibration using
BILOG-MG for item 11 on the crystallized intelligence test 2. In the BILOG-MG phase 2
output, the chi-square test of fit for this item under the Rasch model was observed to be

[Figure 10.8 (graphic): item characteristic curve for ITEM0011 under the Rasch model, a = 1.000, b = −0.358; probability of a correct response plotted against ability (–3 to 3).]

Figure 10.8.  Rasch logistic ICC for item 11 on crystallized intelligence test 2. The graph provides the fit of the observed response patterns versus the predicted pattern. The slope is constrained to 1.0. The solid dots indicate the number of segments by which the score distribution is divided. In the graph, notice that as person ability (X-axis) and item difficulty (the b-value in the graph) become closer together in the center area of the score distribution, the probability of responding correctly to the item is .5. The dots also indicate that as the discrepancy between ability and item difficulty becomes larger, the model does not fit the data within the 95% level of confidence (e.g., the 95% error bars do not include the dot).

0.62—indicating a good fit (i.e., the chi-square test of independence was not rejected for
this item). However, by inspecting the 95% confidence bars in Figure 10.8, we see that
at the extremes of the ability distribution, the observed versus predicted model–data fit
is not within the range we would like (i.e., the solid dots are not within the 95% level of
confidence). Later, we fit the 1-PL IRT model to these data and compare the results with
the Rasch analysis for item 11.
The BILOG-MG syntax below provided the ICC presented in Figure 10.8 (Du Toit,
2003).

One-Parameter Logistic Model with RASCH scaling BLM –
CRYSTALLIZED intelligence TEST 2 ITEMS 1–25
>COMMENTS BILOG-MG EXAMPLE FOR FIGURE 9.5
>GLOBAL NPARM=1, LOGISTIC, DFNAME='C:\rpbispoly.DAT';
>LENGTH NITEMS=25;
>INPUT NTOTAL=25, NGROUPS=1, NIDCHAR=9;
>ITEMS INUMBERS=(1(1)25);
>TEST TNAME=CRIT2;
(9A1,25A1)
>CALIB NQPT=10, CHI=(25,8), RASCH, CYCLES=15, CRIT=0.005, NEWTON=2, PLOT=1;

Figure 10.9 illustrates the results of the BILOG-MG analysis in relation to item
parameter and person ability estimates for all 25 items from the Rasch analysis. To aid
interpretation, the IQ metric is included to illustrate the direct relationship to the ability
scale (q) typically scaled on a z-score metric and item difficulty scale. In Rasch and IRT
analyses, scale transformation from the ability metric to other metrics (such as IQ) is

[Figure 10.9 (graphic): item–ability map. Person ability (–4.0 to +4.0 on the z-score metric) is displayed along the top, with regions labeled EASY, MODERATE, and DIFFICULT; the items are arranged below at their estimated locations (difficulty, δ) on the same –4.0 to +4.0 scale, and a corresponding IQ metric is shown at the bottom. Item 1 is not scaled because of a perfect response string; item 2 has a location of –5.3.]
Figure 10.9.  Item–ability graph for crystallized intelligence test 2 based on Rasch analysis.

possible owing to the property of scale indeterminacy. Scale indeterminacy exists because
multiple values of q and d lead to the same probability of a correct response. Therefore,
the metric is unique up to a linear transformation of scale.
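For example, a linear rescaling of the z-score ability metric to an IQ-type metric with a mean of 100 and a standard deviation of 15 (the same rescaling requested in the SCORE command of the BILOG-MG syntax shown later in the chapter) can be sketched in Python as follows; the ability values are hypothetical.

def theta_to_iq(theta, mean=100.0, sd=15.0):
    # Linear transformation of a z-score ability estimate to an IQ-type metric.
    return mean + sd * theta

# Hypothetical ability estimates on the z-score metric.
for theta in (-2.0, -1.0, 0.0, 1.0, 2.0):
    print(theta, "->", theta_to_iq(theta))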

10.19 Properties and Results of a Rasch Analysis

In the Rasch model (and all IRT models for that matter), a metric for the latent trait
continuum is derived as the nonlinear regression of observed score on true score, with
the person ability and item locations established on the same metric (i.e., a z-score met-
ric). To explain how examinees and items function under the Rasch model, if examinee
1 exhibits ability twice that of examinee 2, then this discrepancy is mathematically expressed by applying a multiplicative constant of 2 (i.e., η1 = 2η2, or equivalently θ1 = 2θ2). Also, if item 1 is twice as difficult as item 2, then δ1 = 2δ2. Provided that these properties of person ability and item difficulty hold, a ratio level of measurement is attained,
with the only changes being due to the value of the constant involved. Theoretically,
such a ratio level of measurement is applicable to any sample of persons and items as
long as the same constants are used. This allows for direct comparisons across differ-
ent samples of persons and items, a property known in the Rasch literature as specific
objectivity, or sample-free measurement. With regard to the question of minimum
sample size for a Rasch analysis, simulation research supports the recommendation of
a minimum of 100 examinees and test length of at least 15 items for accurate item and
ability parameter estimates (Baker & Kim, 2004; Hambleton et al., 1991); however, this
is only a recommendation because of the complexity of the characteristics of the sample, test
items, test length, and amount of missing data which have implications for the performance
of the model given the data.
Importantly, as in any statistical modeling scenario, evaluating the fit of the model
to the data is crucial regardless of sample recommendations. Table 10.8 provides the item
parameter results from a Rasch analysis of the 25-item crystallized test of intelligence 2
for the total sample of 1,000 examinees using BILOG-MG. The item parameter estimates
for item 11 are highlighted in gray.
The BILOG-MG syntax below provided the output for Table 10.8 (Du Toit, 2003).

One-Parameter Logistic Model.BLM - CRYSTALLIZED INT.TEST 2 ITEMS 1-25
>COMMENTS BILOG-MG EXAMPLE FOR FIGURE 9.2
>GLOBAL NPARM=1, LOGISTIC, DFNAME='C:\rpbispoly.DAT';
>LENGTH NITEMS=25;
>INPUT NTOTAL=25, NGROUPS=1, NIDCHAR=9;
>ITEMS INUMBERS=(1(1)25);
>TEST TNAME=CRIT2;
(9A1,25A1)
>CALIB NQPT=10, CHI=(25,8), RASCH, CYCLES=15, CRIT=0.005, NEWTON=2, PLOT=1;

Table 10.9 provides the proportion correct, person ability, and standard error esti-
mates for the crystallized intelligence test 2 data. The values in this table are provided in
the BILOG-MG phase 3 output file. Table 10.9 provides only a partial listing of the actual
output.
As shown in Table 10.9, an examinee or person with ability of approximately 0.0
answered 12 out of 24 items correctly (there are only 24 items in this analysis because
item 1 had no maximum and therefore no item statistics). Comparing Table 10.9 with
CTT proportion correct, we see that a person answering 50% of the items correctly cor-
responds to an ability of 0.0. Finally, notice that the standard error of ability is smallest at
ability of 0.0 (i.e., where information is highest in the Rasch model).
Graphically, Figure 10.8 illustrated an item characteristic curve for item 11 based on
a sample of 1,000 examinees. The item location parameter (i.e., difficulty) d or b = –.358
(see Table 10.8). Notice in Figure 10.8 that as a person’s ability parameter increases on
the X-axis, his or her probability of correctly responding to the item also increases on the
Y-axis. In Figure 10.8, the only item parameter presented is the item location or difficulty

Table 10.9.  Ability Estimates for Crystallized Intelligence Test 2

Items tried   Items correct   Percent   Ability   Standard error
24 2 8.33 –2.59 0.59
24 3 12.50 –2.26 0.56
24 4 16.67 –1.96 0.55
24 5 20.83 –1.66 0.53
24 6 25.00 –1.40 0.50
24 7 29.17 –1.15 0.50
24 8 33.33 –0.90 0.51
24 9 37.50 –0.65 0.48
24 10 41.67 –0.43 0.46
24 11 45.83 –0.21 0.48
24 12 50.00 0.03 0.50
24 13 54.17 0.27 0.47
24 14 58.33 0.48 0.46
24 15 62.50 0.70 0.48
24 16 66.67 0.95 0.50
24 17 70.83 1.19 0.49
24 18 75.00 1.42 0.49
24 19 79.17 1.67 0.51
24 20 83.33 1.94 0.52
24 21 87.50 2.22 0.52
24 22 91.67 2.50 0.54
24 23 95.83 2.81 0.56
24 24 100.00 3.12 0.56
Note. Item 1 is not provided because no maximum was achieved. Each ability estimate is
only shown once. The output phase 3 of BILOG-MG provides ability estimates for all 1,000
examinees.

because in the Rasch model the slope of the curve for all items is set to a value of 1.0
(verify this in Table 10.8). Also, in Table 10.8, we see that the discrimination parameters
(labeled column a) are all 1.0 and the c-parameter for pseudoguessing is set to 0.0 (this
parameter is introduced in the section on the three-parameter IRT model).

10.20 Item Information for the Rasch Model

Earlier in this chapter, Equation 10.11b and Figure 10.6 illustrated the information
function for item 11 under the Rasch model. Reviewing briefly, item information Ij(q)
is defined as the information provided by a test item at a specific level of person or
examinee ability (q). The IIF quantifies the amount of information available in estimat-
ing the person ability estimate (q). The information function capitalizes on the fact that
the items comprising a test are conditionally independent. Because of the independence
assumption, individual items can be evaluated for the unique amount of information they
contribute to a test. Also, individual items can be summed to create the total information
for a test. The test information function provides an overall measure of how well the test
is working specific to the information provided. In test development, item information
plays a critical role in evaluating the contribution an item makes relative to the underly-
ing latent trait (ability). For this reason, item information is a key consideration in
test development. Examining Equations 10.11a and 10.11b, we see that item information
is higher when an item’s b-value is closer to person ability (q) as opposed to further away
from (q). In fact, in the Rasch model, information is at its maximum at the location value
of d (or b in IRT). Item information can also be extended to the level of the total test
yielding a test information function (TIF) by summing the item information functions.
Summation of the IIFs is possible because of the assumption of local item independence
(i.e., responses to items by examinees are statistically independent of one another, allow-
ing for a linear summative model).
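The summation just described can be sketched in a few lines of Python (illustrative item locations, Rasch slopes of 1.0): the test information function is simply the sum of the item information functions at each ability value.

import math

def rasch_item_information(theta, b):
    # Rasch item information: p * (1 - p) with the slope fixed at 1.0.
    p = math.exp(theta - b) / (1 + math.exp(theta - b))
    return p * (1 - p)

# Hypothetical item locations for a short test.
locations = [-1.5, -0.5, 0.0, 0.5, 1.5]

# The test information function (TIF) is the sum of the item information
# functions, evaluated here at several points along the ability scale.
for theta in (-2.0, -1.0, 0.0, 1.0, 2.0):
    tif = sum(rasch_item_information(theta, b) for b in locations)
    print(theta, round(tif, 2))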

10.21 Data Layout

An example of how item responses and persons are included in the data matrix used for
a Rasch analysis is presented next. The item-level responses are represented as a two-
dimensional data matrix composed of N persons or examinees responding to a set of n
test items. The raw-data matrix is composed of a column vector (uij) of item responses of
length n. In Rasch’s original work, items were scored dichotomously (i.e., 0 or 1), so in
(uij) the subscript i represents items i = 1, 2, . . . , N, and subscript j represents persons
j = 1, 2, . . . , N. Given this two-dimensional data layout, each person or examinee is
represented by a unique column vector based on his or her responses to items of length
n. Because there are N vectors, the resulting item response matrix is n (items) X N (per-
sons). Figure 10.10 illustrates a two-dimensional matrix based on Rasch’s original item
response framework for a sample of persons.

                          Person
              1       2      …       N       (item total)
Item 1      u11     u12      …      u1N         u1.
Item 2      u21     u22      …      u2N         u2.
  .           .       .      .        .           .
Item i      ui1     ui2      …      uiN         ui.
  .           .       .      .        .           .
Item n      un1     un2      …      unN         un.
(person     u.1     u.2      …      u.N
 total)

Figure 10.10. Two-dimensional data matrix consisting of items (rows) and persons (col-
umns) in the original data layout for Rasch analysis. In IRT, the data layout is structured as items
being columns and persons or examinees as rows.

Referring to the data matrix in Figure 10.10, we find that the two parameters of
interest to be estimated are (1) a person’s ability and (2) the difficulty of an item. Origi-
nally, Rasch used the symbols ηj for person ability and δi for the difficulty of an item. In the Rasch model, these symbols represent properties of items and persons, although now the symbol η is presented as θ in the Rasch model and in the IRT models. The next sec-
tion transitions from the Rasch model to the one-parameter IRT model.

10.22 One-Parameter Logistic Model for Dichotomous Item Responses

The one-parameter (1-PL) logistic IRT model extends the Rasch model by including a variable scaling parameter α (signified as a in IRT). Understanding the role of α in relation to the Rasch model is perhaps best explained by thinking of it as a scaling factor in the

Equation 10.18. Rasch model as a one-parameter logistic IRT model

P(x_i = 1 | θ, α_i, δ_i) = e^{α_i(θ − δ_i)} / [1 + e^{α_i(θ − δ_i)}]

• P(x_i = 1 | θ, α_i, δ_i) = probability of a randomly selected examinee answering item i correctly given ability theta and the item's discrimination and difficulty.
• α_i(θ − δ_i) = difference between ability and item difficulty multiplied by the discrimination of the item.

regression of observed score on true score. For example, the α-parameter (the a-parameter in IRT language) scales or adjusts the slope of the IRF in relation to how examinees of different ability respond to an item or items. To this end, the scaling factor or slope of the ICC is not constrained to a value of 1.0, but may take on other values for test items. The addition of the scaling parameter α to the Rasch model is illustrated in Equation 10.18.
In the Rasch model, the scaling parameter a is set to a value of 1.0. However, in the
one-parameter IRT model the restriction of 1.0 is relaxed in a way that allows the slope
of the IRF to conform to the empirical data (e.g., in a way that provides the best fit of the
nonlinear regression line). Another way of thinking about this is that in the one-parameter
IRT model, the slope of the IRF is now scaled or adjusted according to the discrimina-
tion parameter a, and the discrimination parameter is estimated based on the empirical
item response patterns. Introducing the scaling factor a allows us to conceptualize the
IRT model in slope–intercept form (as in standard linear regression modeling). Equation
10.19 (de Ayala, 2009, p. 17) illustrates the slope–intercept equation using the symbolic
notation introduced so far.
The inclusion of the scaling parameter provides a way to express the Rasch or one-
parameter model in terms of a linear equation. Remember that IRT models are regression
models, so taking the approach of a linear equation allows us to think about IRT as a
linear regression model. For example, the effect of multiplying the scaling factor or item
discrimination parameter (i.e., a) with the exponent in the one-parameter IRT model
provides a way to rewrite the exponent as Equation 10.19 (de Ayala, 2009, p. 17). Obtain-
ing the item location or difficulty from the elements of Equation 10.19 involves rearranging terms to express the location in terms of the intercept γ and the discrimination, thereby yielding Equation 10.20a. Recall from earlier in the chapter that the linear equation αθ + γ yields the logit. Graphically, the
slope–intercept equation (expressed in logits) is depicted in Figure 10.11 for item 11 on
the crystallized intelligence test 2.

Equation 10.19. Slope–intercept equation

α(θ − δ) = αθ − αδ = αθ + γ

• α(θ − δ) = difference between ability and item difficulty multiplied by the discrimination of the item.
• αθ = discrimination or scaling factor multiplied by ability.
• αδ = discrimination or scaling factor multiplied by item difficulty.
• γ = −αδ = intercept in the slope–intercept parameterization.

[Figure 10.11 (graphic): the logit, α·θ + γ, plotted against ability (θ) for item 11; the logit is a straight line, ranging from about –3.0 to 3.0 over the ability range from –2.0 to 3.0.]

Figure 10.11.  The linear parameterization (logit) for item 11 using values from Table 10.5. Applying the linear equation to obtain the logit for the item 11 IRT parameters for the total sample of 1,000 examinees in Table 10.5: α(θ) + γ = .99(–.07) + .07 ≈ 0.0.

Equation 10.20a. Item location based on the slope–intercept equation

δ = −γ/α

• δ = difficulty or location of an item.
• γ = intercept in the slope–intercept parameterization.
• α = discrimination or scaling factor.

Equation 10.20a illustrates how, by rearranging terms, one can derive the item loca-
tion if the intercept and discrimination are known.
Next, using the item location or difficulty from our example in Figure 10.5 and
inserting it into Equation 10.20a, we have the result in Equation 10.20b for the item loca-
tion (i.e., difficulty).
Practically speaking, the slope–intercept equation tells us that as a changes, the slope
of the IRF changes across the continuum of person ability. This becomes more relevant
later when we introduce the two-parameter IRT model. In the two-parameter model, the
item discrimination parameter (a-parameter) is allowed to be freely estimated and there-
fore varies for each item. Figure 10.12 illustrates the IRF for item 11 based on the one-
parameter IRT model. Notice that the slope is 1.66 and the location or difficulty is –.303
as opposed to 1.0 and –.358 in the Rasch model. These new values are a direct result of
relaxing the constraints of the Rasch model in regard to the fit of the empirical data to the
one-parameter IRT model.
Next we have the result of a one-parameter IRT model estimated using the same data as the earlier Rasch analysis, with item 11 as the focal point. Figure
10.12 illustrates the ICC for item 11 based on the one-parameter IRT analysis.
The BILOG-MG syntax on p. 378 provided the graph presented in Figure 10.12 (Du
Toit, 2003).
Notice that the ability metric (0,1) has been rescaled to (100,15) in the SCORE
command.

Equation 10.20b. Application of the slope–intercept equation

δ = −γ/α = −.358/1 = −.358
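The back-and-forth between the slope–intercept (logit) form and the discrimination–location form can be sketched in Python as follows, using the item 11 values quoted in this section for illustration.

def logit(theta, a, gamma):
    # Linear (slope-intercept) form of the exponent: a*theta + gamma (Equation 10.19).
    return a * theta + gamma

def location_from_intercept(a, gamma):
    # Item location recovered from the intercept: delta = -gamma / a (Equation 10.20a).
    return -gamma / a

# Rasch values for item 11: slope 1.0 and location -.358, so the intercept is .358.
print(location_from_intercept(a=1.0, gamma=0.358))       # -0.358

# Figure 10.11 example for the total sample: a = .99, theta = -.07, gamma = .07.
print(round(logit(theta=-0.07, a=0.99, gamma=0.07), 2))  # approximately 0.0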

[Figure 10.12 (graphic): item characteristic curve for ITEM0011 under the one-parameter IRT model, a = 1.664, b = −0.303; probability of a correct response plotted against ability (–3 to 3).]

Figure 10.12.  One-parameter logistic IRF for item 11 with the slope, location, and intercept freely estimated based on the characteristics of the item responses of 1,000 examinees. The graph provides the fit of the observed response patterns versus the predicted pattern. The solid dots indicate the number of segments by which the score distribution is divided. In the graph, notice that as person ability (X-axis) and item difficulty (the b-value in the graph) become closer together in the center area of the score distribution, the probability of responding correctly to the item is .5. The dots also indicate that even as the discrepancy between ability and item difficulty becomes larger, the model still fits the data within the 95% level of confidence (e.g., the 95% error bars include the dots). The slope is now estimated at 1.664 rather than 1.0 as in the Rasch model.

One-Parameter Logistic Model.BLM - CRYSTALLIZED INT.TEST 2 ITEMS 1-25
>COMMENTS BILOG-MG EXAMPLE FOR FIGURE 9.2
>GLOBAL NPARM=1, LOGISTIC, DFNAME='C:\rpbispoly.DAT';
>LENGTH NITEMS=25;
>INPUT NTOTAL=25, NGROUPS=1, NIDCHAR=9;
>ITEMS INUMBERS=(1(1)25);
>TEST TNAME=CRIT2;
(9A1,25A1)
>CALIB NQPT=10, CHI=(25,8), RASCH, CYCLES=15, CRIT=0.005, NEWTON=2, PLOT=1;
>SCORE INFO=2, RSCTYPE=3, LOCATION=(100.0000), SCALE=(15.0000), POP, YCOMMON, METHOD=2;
>SAVE CAL='CRI2.CAL', SCO='CRI2_CAL.SCO', PARM='CRI2_CAL.PAR';

Inspecting Figure 10.12, we see that the one-parameter IRT model fits the data for
item 11 better than did the Rasch model, where the slope was constrained to 1.0 (e.g.,
all of the solid dots are now within the 95% error bars). Table 10.10 provides the item
parameter estimates for all 25 items on the test.
At this point, you may be wondering how to decide which model to use in a test
development situation. Recall that the philosophical tradition when using the Rasch
model is to construct a test composed of items that conform to the theoretical require-
ments or characteristics of the model. This differs from the IRT approach where the goal
is to fit a model that best represents the empirical item responses, after the item construc-
tion process and once data are acquired. In the current example, you will either have to
(1) remove or revise the items so that the requirements of the Rasch model are met or
(2) work within the data-driven approach of the IRT paradigm. Of course, in the IRT or
data-driven approach, individual items are still reviewed for their adequacy relative to
the model based on early activities within the test development process (e.g., theoretical
adequacy of items in terms of their validity). Returning to the current example using item
11, we observe the chi-square test of fit for this item using the 1-PL model to be 0.62—
indicating a good fit (i.e., the chi-square test of independence was not rejected). The item
fit chi-square statistics are provided in the phase 2 output of BILOG-MG, PARSCALE,
MULTILOG, and IRTPRO. An important point regarding evaluating item fit is that the chi-
square fit statistics are only accurate for tests of 20 items or longer (e.g., the accuracy of
the item parameter estimates is directly related to the number of items on the test). The
item difficulty parameter estimated in the one-parameter model is now labeled as b (as
opposed to d in the Rasch model). In the one-parameter IRT model, the b-parameter for
an item represents the point or location on the ability scale where the probability of an
examinee correctly responding to an item is .50 (i.e., 50%). The greater the value of the
b-parameter, the greater the level of ability (q) required for an examinee to exhibit a prob-
ability of .50 of answering a test item correctly.
As in the Rasch model, the item b- or difficulty parameter is scaled on a metric with
a mean of 0.0 and standard deviation of 1.0 (on a standard or z-score metric). In the one-
parameter IRT model, the point at which the slope of the ICC is steepest represents the value
of the b-parameter. The ability (q) estimate of an examinee or person is presented as q̂ and is
also scaled on the metric of a normal distribution (i.e., mean of 0 and standard deviation of 1).
Finally, we see that in the one-parameter IRT model, the test items provide the max-
imum amount of information for persons with ability (q) nearest to the value of the
b-parameter (in this case, a value of b = –.303). Derivation of the information is the same
as presented earlier in Equations 10.11a–10.11b. However, the maximum information
possible in the one-parameter model is not .25 because the slope of the IRF now may
take on values greater or less than 1.0. This result can be seen in Figure 10.13 where the
maximum information for item 11 is .69.
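The maximum of .69 can be verified with a short Python sketch (illustrative only): under Equation 10.11a, information at the item location, where the probability of a correct response is .50, equals a²/4, and the common slope from the one-parameter calibration is 1.664.

def item_information(p, a):
    # Item information I(theta) = a^2 * p * (1 - p) (Equation 10.11a).
    return a**2 * p * (1 - p)

a_common = 1.664  # common slope from the one-parameter calibration (Table 10.10)

# At the item location (b = -.303) the probability of a correct response is .50,
# so the information reaches its maximum of a^2 / 4.
print(round(item_information(p=0.50, a=a_common), 2))  # 0.69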
The next section introduces the two-parameter (2-PL) IRT model. In the two-­
parameter model, the slope parameter is freely estimated based on the empirical charac-
teristics of the item responses.

Table 10.10.  One-Parameter Model Item Parameter Estimates


Intercept a-parameter b-parameter
(¡ parameter) (a) (d) c-parameter Chi-square
ITEM (S.E.) (S.E.) (S.E.) (S.E.) (PROB)
ITEM0002 6.613 1.664 –3.975 0 0
0.450* 0.024* 0.271* 0.000* 0
ITEM0003 5.692 1.664 –3.422 0 0.4
0.294* 0.024* 0.176* 0.000* –0.8223
ITEM0004 2.8 1.664 –1.683 0 14.7
0.104* 0.024* 0.062* 0.000* –0.023
ITEM0005 2.185 1.664 –1.313 0 15.2
0.092* 0.024* 0.055* 0.000* –0.0336
ITEM0006 1.484 1.664 –0.892 0 11.4
0.091* 0.024* 0.055* 0.000* –0.1813
ITEM0007 1.44 1.664 –0.866 0 22.7
0.093* 0.024* 0.056* 0.000* –0.002
ITEM0008 2.315 1.664 –1.392 0 35.3
0.092* 0.024* 0.055* 0.000* 0
ITEM0009 1.073 1.664 –0.645 0 24.2
0.080* 0.024* 0.048* 0.000* –0.0021
ITEM0010 0.696 1.664 –0.418 0 28.6
0.081* 0.024* 0.048* 0.000* –0.0004
ITEM0011 0.504 1.664 –0.303 0 6.2
0.079* 0.024* 0.047* 0.000* –0.6214
ITEM0012 0.147 1.664 –0.089 0 14.7
0.080* 0.024* 0.048* 0.000* –0.0657
ITEM0013 0.135 1.664 –0.081 0 64.7
0.090* 0.024* 0.054* 0.000* 0
ITEM0014 0.098 1.664 –0.059 0 20.9
0.085* 0.024* 0.051* 0.000* –0.004
ITEM0015 0.147 1.664 –0.089 0 27.3
0.079* 0.024* 0.047* 0.000* –0.0003
ITEM0016 –0.113 1.664 0.068 0 42.6
0.080* 0.024* 0.048* 0.000* 0
ITEM0017 –0.349 1.664 0.21 0 22.7
0.084* 0.024* 0.051* 0.000* –0.0019
ITEM0018 –1.104 1.664 0.664 0 10.7
0.088* 0.024* 0.053* 0.000* –0.0971
ITEM0019 –1.576 1.664 0.947 0 8.2
0.088* 0.024* 0.053* 0.000* –0.2263
ITEM0020 –1.73 1.664 1.04 0 10.1
0.088* 0.024* 0.053* 0.000* –0.0725
ITEM0021 –1.966 1.664 1.182 0 22.2
0.100* 0.024* 0.060* 0.000* –0.0005
ITEM0022 –2.132 1.664 1.281 0 13.1
0.097* 0.024* 0.059* 0.000* –0.0228
ITEM0023 –2.405 1.664 1.446 0 17.1
0.104* 0.024* 0.063* 0.000* –0.0018
ITEM0024 –2.867 1.664 1.723 0 1.1
0.111* 0.024* 0.067* 0.000* –0.8867
ITEM0025 –3.743 1.664 2.25 0 8.7
  0.151* 0.024* 0.090* 0.000* –0.034
Note. Item 1 is not provided because no maximum was achieved due to a perfect response string. This output is a
partial listing from phase 2 of BILOG-MG. Item 11 was plotted in Figure 10.4. The a-parameter (slope) is set to a
value of 1.0, conforming to the Rasch model assumptions. Also provided in BILOG-MG phase 2 output is the item
loading, which is the correlation between the item and the latent construct.

[Figure: Item Information Curve for ITEM0011. Y-axis: Information; X-axis: Scale Score (–3 to 3). The curve for item 11 reaches a maximum information of .69 at b = –.303.]

Figure 10.13.  Item information based on the 1-PL in Figure 10.12. I(q) on the Y-axis is the
information function of ~.45. Proficiency is the ability scale (q) on the X-axis. The information
provided by the items reaches maximum (.69) when the b-parameter = –.303. Note the difference
in the maximum item information possible in the Rasch model for item 11 being .25 versus .69 in
the 1-PL IRT model. This change in maximum information is due to relaxing the assumptions of
the Rasch model during the estimation process.

10.23 Two-Parameter Logistic Model for Dichotomous Item Responses

The two-parameter (2-PL) IRT model marks a clear shift from the Rasch model in
that a second parameter, the item discrimination, is included in the estimation of the
item parameters. The assumptions of local item independence, unidimensionality, and

invariance presented earlier in this chapter are the same for the two-parameter model. In
this model, one works from a data-driven perspective by fitting the model to a set of item
responses designed to measure, for example, ability or achievement. However, the two-
parameter model estimates (1) the difficulty of the items and (2) how well the items dis-
criminate among examinees along the ability scale. Specifically, the two-parameter model
provides a framework for estimating two parameters: a, representing item discrimination
(previously defined as a in the Rasch or a- in the 1-PL IRT model), and b, representing
item difficulty expressed as the location of the ICC on the person ability metric (X-axis).
Increasing the number of item parameters to be estimated means that the sample size
must also increase in order to obtain reliable parameter estimates. The sample size recom-
mended for accurate and reliable item parameter and person ability estimates in the 2-PL
model is a minimum of 500 examinees on tests composed of at least 20 items; however, this is
only a recommendation because sample size requirements will vary in direct response to the
characteristics of the sample, test items, test length, and amount of missing data. The N = 500
general recommendation is based on simulation studies (de Ayala, 2009, p. 105; Baker
& Kim, 2004). Alternatively, some simulation research has demonstrated that one can
use as few as 200 examinees to calibrate item responses using the two-parameter model
depending on (1) the length of the test, (2) the quality of the psychometric properties of
the test items, and (3) the shape of the latent distribution of ability. However, as in any
statistical modeling scenario, evaluating the fit of the model to the data is essential, rather
than relying solely on recommendations from the literature.
In the two-parameter model, the varying levels of an item’s discrimination are
expressed as the steepness of the slope of the ICC. Allowing discrimination parameters
to vary provides a way to identify the degree to which test items discriminate along the
ability scale for a sample of examinees. Specifically, the ICC slope varies across test items,
with higher values of the a-parameter manifested by steeper slopes for an ICC. Items
with high a-parameter values optimally discriminate in the middle of the person ability
(q) range (e.g., q values ± 1.0). Conversely, items with lower values of the a-parameter
discriminate better at the extremes of the person ability (q) range (i.e., outside the range
of q ± 1.0). As is the case in the 1-PL IRT model, an examinee whose ability (q) equals an item's b-parameter has a 0.50 probability (i.e., a 50% chance) of answering that item correctly.
Once person ability (q), the a-parameter, and b-parameter of an item are known, the
probability of a person correctly responding to an item is estimated. The two-parameter
IRT model is given in Equation 10.21a.
To illustrate the two-parameter equation for estimating the probability of a correct response for an examinee on item 11 of the intelligence test data with ability of 0.0, we insert a person ability of 0.0, a location or difficulty of –0.07, and a discrimination of .99 into Equation 10.21a, yielding Equation 10.21b. Therefore, the probability of a correct response for an examinee at ability 0.0 on item 11 is .53.
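As a quick arithmetic check, the minimal sketch below (Python; the function name is ours and simply restates Equation 10.21a) evaluates the probability for item 11 at ability 0.0 using the rounded estimates quoted above.

import math

def p_2pl(theta, a, b, D=1.7):
    """Two-parameter logistic probability of a correct response (Equation 10.21a)."""
    z = D * a * (theta - b)
    return math.exp(z) / (1.0 + math.exp(z))

# Item 11, rounded estimates from Table 10.11: a = .99, b = -.07
print(round(p_2pl(theta=0.0, a=0.99, b=-0.07), 2))   # 0.53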
Notice in Equations 10.21a and 10.21b that the element D is introduced. This element
serves as a scaling factor for the exponent in the equation, as a result of which the logistic
equation and normal ogive equation differ by less than .01 over the theta range (Camilli,
1994). The normal ogive IRT model is the logistic model rescaled to the original metric of
the cumulative normal distribution. Next, we calibrate the 25-item crystallized intelligence

Equation 10.21a. Two-parameter logistic IRT model

P(x_j = 1 \mid \theta, a_j, b_j) = \frac{e^{D a_j(\theta - b_j)}}{1 + e^{D a_j(\theta - b_j)}}

• p(x_j = 1|\theta, a_j, b_j) = probability of a randomly selected examinee with ability theta, item discrimination a_j, and location b_j responding correctly to item j.
• a_j = item discrimination parameter for item j (previously a in the Rasch model).
• b_j = difficulty parameter for item j (previously d in the Rasch model).
• D = scaling factor of 1.702 that adjusts the shape of the logistic equation to closely align with the normal ogive.

Equation 10.21b. Application of the two-parameter IRT model equation

P(x_{11} = 1 \mid 0.0, .99, -.07) = \frac{e^{1.7(.99)(0.0 - (-.07))}}{1 + e^{1.7(.99)(0.0 - (-.07))}} = .53

test 2 item response data (N = 1,000) using the two-parameter IRT model using the fol-
lowing BILOG-MG program. Again, for comparison purposes with the one-parameter and
Rasch models, we focus on item 11. The ICC for item 11 is provided in Figure 10.14.
The BILOG-MG syntax below provided the output for Figure 10.14 and Tables 10.11
and 10.12 (Du Toit, 2003).

Two-Parameter Logistic Model.BLM - CRYSTALLIZED INT.TEST 2 ITEMS
1–25

>COMMENTS BILOG-MG EXAMPLE FOR 2-PL Model


>GLOBAL DFNAME='cri2_tot_sample_N1000.dat',NPARM=2, SAVE;
>SAVE CAL='CRI21000.CAL', SCO='CRI21000.SCO',
PARM='CRI21000.PAR';
>LENGTH NITEMS=(25);
>INPUT NTOTAL=25, NIDCHAR=9;
>ITEMS INUMBERS=(1(1)25);
>TEST TNAME=CRIT2;
(9A1,25A1)

[Figure: Item Characteristic Curve for ITEM0011 under the 2-PL model (a = 0.989, b = –0.071). Y-axis: Probability (0 to 1.0); X-axis: Ability (–3 to 3).]

Figure 10.14.  Two-parameter logistic model IRF for item 11.

>CALIB NQPT=10, CHI=(25,8), CYCLES=15, CRIT=0.005, NEWTON=2,


PLOT=1;
>SCORE INFO=2, RSCTYPE=3, LOCATION=(0.0000), SCALE=(1.0000),
POP, YCOMMON, METHOD=2;
Note. NPARM=2 means the 2-PL model will be used; including the SCORE command and the
METHOD=2 option means that the person ability estimates and standard errors for all 1,000 exam-
inees will be included in the output.

We see that item 11 fits the 2-PL model well as evidenced by (1) a nonsignificant
chi-square statistic (i.e., p = .395), and (2) the solid dots representing different levels
of ability falling within the 95% level of confidence (i.e., within the confidence level
error bars). Table 10.11 provides the item parameter estimates for the 2-PL model, and
Table 10.12 provides a partial listing of the phase 3 output from BILOG-MG. BILOG-MG
produces phases 1–3 for the one-, two-, or three-parameter models. In Table 10.11 we
see that the item parameters are now estimated for item 1 (recall that in the Rasch and 1-PL models no estimates were produced because no maximum of the likelihood was achieved). Notice that
in addition to the ability estimates for all 1,000 examinees, phase 3 produces a reliability
estimate for the total test (see the bottom portion of Table 10.12). The reliability pro-
vided in phase 3 is defined as the reliability of the test independent of the sample of persons
(based on the idea of invariance introduced earlier in this chapter). The way reliability
is conceptualized here (i.e., as a property of how the item-level scores for persons are
relative to a set of test items) is a major difference from CTT-based reliability introduced
in Chapter 7.

Table 10.11.  Item Parameter Estimates for 2-PL Model of Crystallized Intelligence
Test 2
Intercept a-parameter b-parameter c-parameter Chi-square
Item (S.E.) (S.E.) (S.E.) (S.E.) (PROB)
ITEM0001 3.4 0.562 –6.049 0 0.5
0.277* 0.155* 1.597* 0.000* –0.7764
ITEM0002 2.956 0.639 –4.622 0 1.4
0.228* 0.152* 0.908* 0.000* –0.7157
ITEM0003 1.372 0.63 –2.179 0 2.5
0.080* 0.073* 0.195* 0.000* –0.8654
ITEM0004 1.094 0.692 –1.582 0 6.4
0.067* 0.067* 0.123* 0.000* –0.6078
ITEM0005 0.922 1.105 –0.835 0 2.6
0.071* 0.085* 0.055* 0.000* –0.9568
ITEM0006 0.959 1.239 –0.774 0 7.3
0.077* 0.098* 0.050* 0.000* –0.4023
ITEM0007 1.086 0.564 –1.927 0 12.1
0.061* 0.057* 0.172* 0.000* –0.1468
ITEM0008 0.556 0.788 –0.705 0 26.9
0.053* 0.067* 0.065* 0.000* –0.0007
ITEM0009 0.376 0.897 –0.42 0 29.9
0.049* 0.070* 0.056* 0.000* –0.0002
ITEM0010 0.262 0.846 –0.31 0 13.5
0.046* 0.065* 0.056* 0.000* –0.0952
ITEM0011 0.07 0.989 –0.071 0 8.4
0.047* 0.073* 0.048* 0.000* –0.3965
ITEM0012 0.064 1.488 –0.043 0 20.4
0.056* 0.111* 0.038* 0.000* –0.0047
ITEM0013 0.039 1.21 –0.032 0 10.7
0.051* 0.092* 0.043* 0.000* –0.15
ITEM0014 0.069 0.889 –0.077 0 32.3
0.046* 0.069* 0.053* 0.000* 0
ITEM0015 –0.083 0.992 0.084 0 31.5
0.049* 0.077* 0.048* 0.000* 0
ITEM0016 –0.254 1.185 0.214 0 7.2
0.054* 0.092* 0.042* 0.000* –0.3047
ITEM0017 –0.756 1.219 0.62 0 5.1
0.066* 0.095* 0.046* 0.000* –0.6493
ITEM0018 –0.924 0.968 0.954 0 8.5
0.069* 0.087* 0.064* 0.000* –0.2019
ITEM0019 –0.991 0.932 1.064 0 15.3
0.076* 0.089* 0.068* 0.000* –0.0093
ITEM0020 –1.434 1.388 1.033 0 7.4
0.107* 0.121* 0.052* 0.000* –0.1925
ITEM0021 –1.333 1.099 1.214 0 10.9
0.102* 0.109* 0.068* 0.000* –0.0539
ITEM0022 –1.58 1.198 1.318 0 8.9
0.121* 0.118* 0.067* 0.000* –0.1146
ITEM0023 –1.644 0.946 1.738 0 1.9
0.114* 0.110* 0.124* 0.000* –0.7626
ITEM0024 –2.51 1.277 1.966 0 4.3
0.215* 0.168* 0.126* 0.000* –0.2293
ITEM0025 –3.614 1.62 2.231 0 6.6
  0.419* 0.269* 0.151* 0.000* –0.0364
Note. The intercept is based on the linear parameterization of the logistic model. The chi-square column is a test of
fit for each item. In the BILOG-MG phase 2 output, an additional column is provided labeled “loading.” This column
will be illustrated in the phase 2 output of the three-parameter model in the next section.

Table 10.12 provides a partial output from BILOG-MG phase 3 that includes propor-
tion correct, person or examinee ability, standard errors of ability, and reliability of ability
estimates.

Summary statistics for score estimates (from BILOG-MG phase 3 output)

======================================

CORRELATIONS AMONG TEST SCORES

CRIT2
CRIT2 1.0000

MEANS, STANDARD DEVIATIONS, AND VARIANCES OF SCORE ESTIMATES

TEST: CRIT2
MEAN: -0.0013
S.D.: 0.9805
VARIANCE: 0.9614

ROOT-MEAN-SQUARE POSTERIOR STANDARD DEVIATIONS

TEST: CRIT2
RMS: 0.3299
VARIANCE: 0.1088

EMPIRICAL
RELIABILITY: 0.8983
Note. Reliability here relates to the reliability of the test independent of the sample of persons, based
on the idea of invariance introduced earlier in this chapter. Because of the IRT property of invariance,
the reliability estimate above represents a major difference between CTT reliability in Chapter 5 and
IRT-based reliability.
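One way to see where the empirical reliability of 0.8983 comes from is to note that a common definition of marginal (empirical) reliability in IRT is the variance of the ability estimates divided by that variance plus the mean posterior (error) variance. The sketch below is our reconstruction from the quantities printed above rather than documented BILOG-MG output, but it reproduces the reported value.

# Quantities from the BILOG-MG phase 3 summary above
var_theta_hat = 0.9614                    # variance of the ability estimates
rms_posterior_sd = 0.3299                 # root-mean-square posterior standard deviation
mean_error_var = rms_posterior_sd ** 2    # ~0.1088

empirical_reliability = var_theta_hat / (var_theta_hat + mean_error_var)
print(round(empirical_reliability, 4))    # ~0.8983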

Table 10.12.  Person Ability Estimates, Standard Errors, and Marginal Probability
Tried Right Percent Ability S.E. Marginal prob
25 2 8 –2.2906 0.5648 0.0052
25 2 8 –2.2906 0.5648 0.0052
25 2 8 –2.2906 0.5648 0.0052
25 2 8 –2.2898 0.5647 0.0001
25 2 8 –2.2906 0.5648 0.0052
25 2 8 –2.2906 0.5648 0.0052
25 2 8 –2.2906 0.5648 0.0052
25 2 8 –2.2906 0.5648 0.0052
25 2 8 –2.2906 0.5648 0.0052
25 3 12 –1.9726 0.5364 0.0057
25 3 12 –1.9478 0.5336 0.0002
25 3 12 –1.7086 0.4837 0.0000
25 3 12 –1.7583 0.4983 0.0006
25 3 12 –1.9431 0.533 0.0029
25 3 12 –1.9431 0.533 0.0029
● ● ● ● ● ●
● ● ● ● ● ●
● ● ● ● ● ●
25 11 44 –0.2461 0.3813 0.0000
25 11 44 –0.3461 0.2961 0.0000
25 11 44 –0.3033 0.3384 0.0000
● ● ● ● ● ●
● ● ● ● ● ●
● ● ● ● ● ●
25 24 96 2.2445 0.3206 0.0032
25 24 96 2.3032 0.3251 0.0018
25 24 96 2.3032 0.3251 0.0018
25 24 96 2.3358 0.3382 0.0005
25 24 96 2.3172 0.33 0.0005
25 24 96 2.3032 0.3251 0.0018
25 25 100 2.6537 0.4968 0.0046

Note. This is a partial output from BILOG-MG phase 3.

Reviewing the item statistics in Table 10.11, we see that item 11 has a slope (dis-
crimination) of .99 and a location (difficulty) of –0.07. These are different from the Rasch
model where the discrimination was constrained to 1.0 and difficulty was observed as
–.358. Also, in comparison to the 1-PL model the differences are substantial with the
discrimination for item 11 being 1.66 and location or difficulty being –.303.

10.24 Item Information for the Two-Parameter Model

Item information in the two-parameter model is more complex than in the Rasch or one-
parameter IRT model because each item has a unique discrimination estimate. As the ICC
slope becomes steeper, the capacity of the item to discriminate among persons or examin-
ees increases. Also, the higher the discrimination of an item, the lower the standard error
of an examinee’s location on the ability scale. For items having varying discrimination
parameters, their ICCs will cross one another at some point along the ability continuum.
Item discrimination parameter values theoretically range from negative infinity to positive infinity (–∞ to +∞), although for purposes of item analysis, items with discrimination values of
0.8 to 2.5 are desirable. Negative item discrimination values in IRT are interpreted in a
similar way as in classic item analysis using the point–biserial correlation. For example,
negative point–biserial values indicate that the item should be discarded or the scoring
protocol reviewed for errors. Equation 10.22 illustrates the item information function for

Equation 10.22. Item information function for the two-parameter model

I_j(\theta) = a_j^2\, p_j(1 - p_j)

• I_j(\theta) = information for item j.
• a_j = discrimination for item j.
• p_j = probability of a correct response to item j.

[Figure: Item Information Curve for ITEM0011 under the 2-PL model. Y-axis: Information (0 to 2.0); X-axis: Ability (Scale Score, –3 to 3); the curve reaches a maximum of approximately .70 at the item's location (labeled b = .07 in the figure).]

Figure 10.15.  Item information function based on the 2-PL IRF in Figure 10.14.

the two-parameter model. Figure 10.15 illustrates the item information function for the
two-parameter model (item 11, Figure 10.14).
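The sketch below (Python) evaluates the item information for item 11 under the two-parameter model. Equation 10.22 is written without the D scaling factor; the sketch includes D = 1.7 so that the maximum agrees with the value of roughly .70 plotted in Figure 10.15 for a = .989. Whether D appears explicitly depends on the metric in which the slope is reported, so treat this as an illustration rather than a definitive parameterization.

import math

def p_2pl(theta, a, b, D=1.7):
    """Two-parameter logistic probability of a correct response."""
    z = D * a * (theta - b)
    return math.exp(z) / (1.0 + math.exp(z))

def info_2pl(theta, a, b, D=1.7):
    """Two-parameter item information with the D scaling factor included."""
    p = p_2pl(theta, a, b, D)
    return (D * a) ** 2 * p * (1.0 - p)

a11, b11 = 0.989, -0.071                   # item 11 estimates (Table 10.11)
print(round(info_2pl(b11, a11, b11), 2))   # ~0.71 at the item location (cf. ~.70 in Figure 10.15)
print(round(info_2pl(2.0, a11, b11), 2))   # ~0.08 far from the item location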
In the next section, the three-parameter logistic IRT model is introduced with an
example that allows for comparison with the Rasch, one-, and two-parameter IRT mod-
els. As previously, we focus on item 11 in the crystallized intelligence test 2 data.

10.25 Three-Parameter Logistic Model for Dichotomous Item Responses

The three-parameter logistic (3-PL) IRT model is based on the assumptions presented
earlier in this chapter and is the most general of the IRT models (i.e., imposes the fewest
restrictions during item parameter estimation). In the three-parameter model the item
parameters a-, b-, and c- are simultaneously estimated along with examinee ability. The
c-parameter in the three-parameter model is known as the guessing or pseudoguessing
parameter and allows one to model the probability of an examinee guessing a correct
answer. The c-parameter is labeled pseudoguessing because it provides a mechanism for
accounting for the situation where an examinee correctly responds to an item when the
IRT model predicts that examinee should not. However, this contradictory circumstance
may occur for reasons other than an examinee simply guessing. For example, a person
with very low ability may respond correctly to an item of moderate difficulty because of

cheating or other test-taking behavior such as an ability that has been developed that
enables an examinee to answer correctly based on a keen knowledge of how to take
multiple-choice tests.
Recall that in the one-parameter model (and Rasch model) only the b-parameter is
estimated, with no provision for modeling differential item discrimination or the possibility
of correctly answering a test item owing to chance guessing (or another ability altogether).
In the two-parameter model, provision is allowed for the estimation of discrimination and
difficulty parameters (a- and b-) but no possibility for guessing a correct response. In the
one- and two-parameter models, the lower asymptote of the ICC is zero and the upper
asymptote is 1.0 (e.g., refer back to the top half of Figure 10.3). In the one-and two-­parameter
models, because the lower asymptote is always zero and the upper asymptote is 1.0, the
probability of a correct response at an item’s location (i.e., the difficulty d or b-value) is given
as (1 + 0.0)/2 or .50. In the 3-PL model, the lower asymptote, called the c-parameter, is esti-
mated along with a- and b-parameters. When the probability of guessing a correct response
(or pseudoguessing parameter) is above zero, this is represented by the lower asymptote of
the ICC (i.e., the c-parameter) being greater than zero. The result of the c-parameter being
greater than zero is that the location of the item’s difficulty or b-parameter shifts such that
the probability of a correct response is greater than .50.
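For example, for an item with a c-parameter of .08, the probability of a correct response at the point where ability equals the item's difficulty is (1 + .08)/2 = .54 rather than .50.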
The advantage of using the three-parameter model is its usefulness for test items
or testing situations where guessing is theoretically and practically plausible (e.g., in
multiple-choice item formats). More precisely, the three-parameter model provides a way
to account for a chance response to an item by examinees. Because the c-parameter has
implications for examinees and items in a unique way, the role of the c-parameter merits
discussion. Consider a multiple-choice test item that includes a five-response option.
To account for random guessing, the c-parameter (i.e., lower asymptote of the ICC) for
such an item is set to 1/5 or .20 in the item parameter estimation process. However, the
random guessing approach assumes that all multiple-choice item alternatives are equally
appealing to an examinee, which is not the case in most testing conditions. For exam-
ple, an examinee who does not know the answer to a multiple-choice test item may
always answer the item based on the longest alternative (e.g., a test-taking strategy). In
the three-parameter model, parameters are estimated for persons of varying ability, but
their inclination to guess remains constant (i.e., the c-parameter remains constant for
all examinees), which is not likely to be the case. So, in this sense, the three-parameter
model may or may not accurately account for guessing.
Another artifact of using the three-parameter model is that nonzero c-parameter val-
ues reduce the information available for the item. If you compare the two-parameter item calibration results with the three-parameter calibration results, you will see that the two-parameter model provides no mechanism for modeling or accounting for the probability of a person of very low ability responding correctly to items of medium to high difficulty (or even easy items).
This is the case because in the 2-PL model the lower asymptote is constrained to zero
(i.e., there is no chance guessing when the model does not mathematically provide for it).
The previous scenario, regarding the probability of guessing or of a person with very low ability answering an item correctly, can be explained by the following ideas.

First, personal characteristics or attributes of examinees, such as a personality type that increases their inclination to cheat or simply guess, may account for the situation of low-ability examinees answering a moderate or difficult item correctly. Other person characteristics that may affect guessing, such as test-taking experience or expertise, also warrant the use of the three-parameter model and thus the inclusion of the c-parameter. These examinee- or person-specific issues are known as latent person variables and are a consideration when deciding which IRT model to use.
Returning to our example data, the three-parameter IRT model is provided in Equa-
tion 10.23a.
To illustrate the three-parameter model equation, we continue to use item 11 on crystallized intelligence test 2 and insert the ability, a-, b-, and c-values. To estimate the probability of a correct response for an examinee with ability of 0.0, we insert a person ability of 0.0, a location or difficulty of 0.08, and a discrimination of 1.15 into Equation 10.23a. This step is illustrated in Equation 10.23b. After inserting these values, the probability of
a correct response for an examinee at ability 0.0 on item 11 is .54. Readers should verify
these steps and calculations for themselves.
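Readers who wish to carry out that verification can do so with a few lines such as the following (Python; the function name is ours). Because the parameter values quoted in the text are rounded, the result will agree with Equation 10.23b only approximately.

import math

def p_3pl(theta, a, b, c, D=1.7):
    """Three-parameter logistic probability of a correct response (Equation 10.23a)."""
    z = D * a * (theta - b)
    return c + (1.0 - c) * math.exp(z) / (1.0 + math.exp(z))

# Item 11, using the rounded values quoted in the text
print(p_3pl(theta=0.0, a=1.15, b=0.08, c=0.08))
# At theta = b the probability equals (1 + c) / 2
print(p_3pl(theta=0.08, a=1.15, b=0.08, c=0.08))   # 0.54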
Next, we calibrate (i.e., estimate the parameters) for the 25-item crystallized intel-
ligence test 2 item response data (N = 1,000) using the three-parameter model for the

Equation 10.23a. Three-parameter logistic IRT model for dichotomous items

P(x_j = 1 \mid \theta, a_j, b_j, c_j) = c_j + (1 - c_j)\,\frac{e^{D a_j(\theta - b_j)}}{1 + e^{D a_j(\theta - b_j)}}

• p(x_j = 1|\theta, a_j, b_j, c_j) = probability of a randomly selected examinee with ability theta, item discrimination a_j, and location b_j responding correctly to item j.
• a_j = item discrimination parameter for item j.
• b_j = difficulty parameter for item j.
• c_j = guessing or pseudo-chance parameter for item j.
• D = scaling factor of 1.702 that adjusts the shape of the logistic equation to closely align with the normal ogive.

Equation 10.23b. Application of the three-parameter logistic IRT model

P(x_{11} = 1 \mid \theta, a_{11}, b_{11}, c_{11}) = .08 + (1 - .08)\,\frac{e^{1.7(1.15)(0.0 - .08)}}{1 + e^{1.7(1.15)(0.0 - .08)}} = .08 + .92\,\frac{e^{1.95(-.08)}}{1 + e^{1.95(-.08)}} = .54

BILOG-MG program provided below. We focus on item 11, and the IRF is provided in
Figure 10.16. Notice that the only change in the BILOG-MG program syntax from the
two-parameter model to the three-parameter model is changing the NPARM=2 option
to NPARM=3 (highlighted in gray). Table 10.14 provides the item parameter estimates,
standard errors, and marginal probability fit statistics for the three-parameter analysis.


Three-Parameter Logistic Model.BLM - CRYSTALLIZED INT.TEST 2
ITEMS 1–25

>COMMENTS BILOG-MG EXAMPLE FOR 3-PL Model


>GLOBAL DFNAME='cri2_tot_sample_N1000.dat',NPARM=3, SAVE;
>SAVE CAL='CRI21000.CAL', SCO='CRI21000.SCO',
PARM='CRI21000.PAR';
>LENGTH NITEMS=(25);
>INPUT NTOTAL=25, NIDCHAR=9;
>ITEMS INUMBERS=(1(1)25);
>TEST TNAME=CRIT2;
(9A1,25A1)
>CALIB NQPT=10, CHI=(25,8), CYCLES=15, CRIT=0.005, NEWTON=2,
PLOT=1;
>SCORE INFO=2, RSCTYPE=3, LOCATION=(0.0000), SCALE=(1.0000),
POP, YCOMMON, METHOD=2;

Table 10.13 provides the classical item statistics and the logit scale values for the 25-item
crystallized intelligence test 2. Notice that item 1 is now reported because the maximum
of the log likelihood was obtained, making the estimation of item 1 parameters possible.
However, you see that this item (and item 2 as well) contributes very little to the test
because its point–biserial coefficient is .02 and 99.5% of the examinees answered the item
correctly. Shortly, we provide a way to decide which model—the two- or three-parameter—
is best to use based on the item response data from crystallized intelligence test 2. Table
10.13 is routinely provided in phase I of the BILOG-MG output.
Table 10.14 provides the parameter estimates from the BILOG-MG phase 2 output.
Notice that an additional column labeled “loading” is included. The loading values are

Table 10.13.  Item Statistics for the Three-Parameter Model


Item   Item Name   #Tried   #Right   Percent   Logit/1.7   Item*Test (Point-Biserial)   Biserial
1 ITEM0001 1000 995 99.50 –3.11 0.02 0.11
2 ITEM0002 1000 988 98.80 –2.59 0.09 0.30
3 ITEM0003 1000 872 87.20 –1.13 0.30 0.49
4 ITEM0004 1000 812 81.20 –0.86 0.37 0.54
5 ITEM0005 1000 726 72.60 –0.57 0.54 0.72
6 ITEM0006 1000 720 72.00 –0.56 0.57 0.76
7 ITEM0007 1000 826 82.60 –0.92 0.31 0.45
8 ITEM0008 1000 668 66.80 –0.41 0.48 0.62
9 ITEM0009 1000 611 61.10 –0.27 0.52 0.67
10 ITEM0010 1000 581 58.10 –0.19 0.51 0.64
11 ITEM0011 1000 524 52.40 –0.06 0.55 0.69
12 ITEM0012 1000 522 52.20 –0.05 0.67 0.84
13 ITEM0013 1000 516 51.60 –0.04 0.61 0.77
14 ITEM0014 1000 524 52.40 –0.06 0.53 0.67
15 ITEM0015 1000 482 48.20 0.04 0.56 0.71
16 ITEM0016 1000 444 44.40 0.13 0.60 0.76
17 ITEM0017 1000 327 32.70 0.42 0.57 0.74
18 ITEM0018 1000 261 26.10 0.61 0.49 0.67
19 ITEM0019 1000 241 24.10 0.67 0.47 0.64
20 ITEM0020 1000 212 21.20 0.77 0.53 0.75
21 ITEM0021 1000 193 19.30 0.84 0.47 0.68
22 ITEM0022 1000 164 16.40 0.96 0.46 0.69
23 ITEM0023 1000 122 12.20 1.16 0.37 0.59
24 ITEM0024 1000 65 6.50 1.57 0.34 0.66
25 ITEM0025 1000 30 3.00 2.04 0.27 0.67

Note. Table values are from the phase I output of BILOG-MG.


Table 10.14.  Item Parameter Estimates for the 3-PL Model

        Intercept   Slope (a-parameter)   Threshold (b-parameter)   Loading (item*total test correlation)   Asymptote (c-parameter)   Chi-square
Item    (S.E.)      (S.E.)                (S.E.)                    (S.E.)                                  (S.E.)                    (PROB)
ITEM0001 3.26 0.55 –5.94 0.48 0.20 0.00
0.283* 0.155* 1.627* 0.136* 0.090* –0.98
ITEM0002 2.86 0.68 –4.24 0.56 0.20 1.70
0.245* 0.151* 0.772* 0.125* 0.090* –0.44
ITEM0003 1.22 0.70 –1.74 0.57 0.22 3.20
0.118* 0.087* 0.249* 0.071* 0.093* –0.78
ITEM0004 0.92 0.79 –1.17 0.62 0.21 7.50
0.112* 0.095* 0.211* 0.075* 0.087* –0.48
ITEM0005 0.74 1.39 –0.53 0.81 0.18 3.80
0.100* 0.167* 0.103* 0.097* 0.052* –0.80
ITEM0006 0.83 1.54 –0.54 0.84 0.15 8.70
0.097* 0.183* 0.087* 0.100* 0.047* –0.28
ITEM0007 0.94 0.61 –1.54 0.52 0.18 13.70
0.100* 0.068* 0.242* 0.058* 0.082* –0.09
ITEM0008 0.12 1.26 –0.09 0.78 0.28 5.70
0.136* 0.214* 0.120* 0.133* 0.052* –0.58
ITEM0009 0.26 0.99 –0.26 0.70 0.09 37.10
0.077* 0.101* 0.091* 0.072* 0.038* 0.00
ITEM0010 0.12 0.96 –0.12 0.69 0.10 12.00
0.085* 0.103* 0.096* 0.074* 0.040* –0.15
ITEM0011 –0.09 1.15 0.08 0.75 0.08 12.00
0.089* 0.123* 0.073* 0.081* 0.031* –0.15
ITEM0012 –0.01 1.57 0.01 0.84 0.03 25.90
0.067* 0.131* 0.043* 0.071* 0.014* 0.00
ITEM0013 –0.04 1.28 0.03 0.79 0.04 11.60
0.066* 0.111* 0.051* 0.068* 0.018* –0.12
ITEM0014 0.00 0.93 0.00 0.68 0.04 36.10
0.060* 0.079* 0.065* 0.058* 0.020* 0.00
ITEM0015 –0.14 1.02 0.14 0.71 0.03 38.90
0.060* 0.085* 0.055* 0.059* 0.015* 0.00

ITEM0016 –0.33 1.24 0.27 0.78 0.03 12.00


0.070* 0.108* 0.048* 0.068* 0.014* –0.06
ITEM0017 –0.89 1.34 0.66 0.80 0.03 7.90
0.103* 0.130* 0.048* 0.078* 0.012* –0.34
ITEM0018 –1.03 1.04 0.99 0.72 0.03 13.80
0.101* 0.109* 0.066* 0.075* 0.012* –0.03

0.095* 0.101* 0.070* 0.073* 0.009* 0.00
ITEM0020 –1.58 1.51 1.05 0.83 0.02 7.80
0.150* 0.154* 0.051* 0.085* 0.008* –0.16
ITEM0021 –1.40 1.14 1.23 0.75 0.01 17.60
0.120* 0.121* 0.068* 0.080* 0.007* 0.00
ITEM0022 –1.67 1.26 1.33 0.78 0.01 14.80
0.147* 0.134* 0.066* 0.083* 0.006* –0.01
ITEM0023 –1.80 1.04 1.74 0.72 0.02 8.50
0.172* 0.147* 0.124* 0.102* 0.008* –0.07
ITEM0024 –2.77 1.44 1.92 0.82 0.01 7.90
0.327* 0.234* 0.121* 0.133* 0.005* –0.05
ITEM0025 –3.99 1.84 2.18 0.88 0.01 10.00
  0.669* 0.402* 0.149* 0.192* 0.003* –0.01
Note. Table values are from the phase 2 output of BILOG-MG. The intercept is based on the linear parameterization of the logistic model. The loading column is synonymous
with the results obtained from a factor analysis and reflects the impact or contribution of each item on the latent trait or ability.

[Figure: Item Characteristic Curve for ITEM0011 under the 3-PL model (a = 1.146, b = 0.080, c = 0.079). Y-axis: Probability (0 to 1.0); X-axis: Ability (–3 to 3); the lower asymptote of the curve corresponds to the c-parameter.]

Figure 10.16.  Three-parameter model item response function for item 11.

synonymous with the results from a factor analysis and reflect the strength of association
between an item and the underlying latent trait or attribute.
Figure 10.16 provides the ICC for item 11 based on the three-parameter model. We
see in the figure that the c-parameter is estimated at a constant value of .079 for item 11
for the sample of 1,000 examinees. Notice that at the lower end of the ability scale the
item does not fit so well (e.g., at ability of –1.5, the solid dot is outside the 95% level of
confidence for the predicted ICC).

10.26 Item Information for the Three-Parameter Model

The estimation of item information in the three-parameter model is slightly more com-
plex compared to the one- and two-parameter models. The introduction of the c-param-
eter affects the accuracy of locating examinees along the ability continuum. Specifically,
the c-parameter is manifested as uncertainty and therefore an inestimable source of error.
A test item provides more information when the c-parameter is zero given an item’s dis-
crimination and difficulty. Thus, the two-parameter model offers an advantage. Equation
10.24 illustrates this situation. Figure 10.17 presents the item information for item 11 on
crystallized intelligence test 2.
The maximum level of item information in the three-parameter model differs from
the one- and two-parameter models in that the highest point occurs slightly above an

Equation 10.24. Item information for the three-parameter logistic model

I_j(\theta) = \frac{D^2 a_j^2 (1 - c_j)}{\left[c_j + e^{1.7 a_j(\theta - b_j)}\right]\left[1 + e^{-1.7 a_j(\theta - b_j)}\right]^2}

• I_j(\theta) = information for item j at person ability theta.
• a_j = item discrimination parameter for item j.
• b_j = item location or difficulty.
• c_j = pseudoguessing parameter.
• D = constant of 1.7.

item’s location or difficulty. The slight shift in maximum information is given by Equa-
tion 10.25 (de Ayala, 2009, p. 144). For example, the item 11 location (b) is .08, but the
information function shifts the location to .085 in Figure 10.17. Birnbaum, as described
in Lord and Novick (1968), demonstrated that an item provides its maximum informa-
tion according to Equation 10.25.

[Figure: Item Information Curve for ITEM0011 under the 3-PL model. Y-axis: Information (0 to 2.5); X-axis: Scale Score (–3 to 3); the curve reaches a maximum of approximately .81 at an ability value slightly above the item's location (labeled .085 in the figure).]

Figure 10.17.  Three-parameter model item information function for item 11.
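As a numerical check on Figure 10.17, the sketch below (Python) evaluates Equation 10.24 over a grid of ability values using the item 11 estimates from Table 10.14 (a = 1.146, b = 0.080, c = 0.079). The maximum obtained this way is close to the value of about .81 shown in the figure; the exact height and location depend on rounding of the reported estimates.

import math

def info_3pl(theta, a, b, c, D=1.7):
    """Three-parameter item information (Equation 10.24)."""
    z = D * a * (theta - b)
    numerator = (D * a) ** 2 * (1.0 - c)
    denominator = (c + math.exp(z)) * (1.0 + math.exp(-z)) ** 2
    return numerator / denominator

a11, b11, c11 = 1.146, 0.080, 0.079                       # item 11 (Table 10.14)
grid = [i / 100.0 for i in range(-300, 301)]              # theta from -3.0 to 3.0
max_info = max(info_3pl(t, a11, b11, c11) for t in grid)
print(round(max_info, 2))                                 # ~0.81, consistent with Figure 10.17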

Equation 10.25. Item information scaling adjustment for the three-parameter model

\theta_{\text{maximum}} = b_j + \frac{1}{D a_j}\,\ln\!\left[0.5\left(1 + \sqrt{1 + 8 c_j}\right)\right]

• \theta_{\text{maximum}} = ability at which item information is at its maximum.
• a_j = item discrimination parameter for item j.
• b_j = item location or difficulty.
• c_j = item pseudoguessing parameter.
• D = constant of 1.7.
• ln = natural logarithm.

Equation 10.26a. The likelihood ratio test

\Delta_{LR} = -2\ln(L_{2\text{-PL}}) - \left(-2\ln(L_{3\text{-PL}})\right) = LR_{2\text{-PL}} - LR_{3\text{-PL}}

• \Delta_{LR} = change in the likelihood ratio (deviance) values.
• -2\ln(L_{2\text{-PL}}) = –2 times the logarithm of the maximum likelihood for the 2-PL model.
• -2\ln(L_{3\text{-PL}}) = –2 times the logarithm of the maximum likelihood for the 3-PL model.

Equation 10.26b. The likelihood ratio test

\Delta_{LR} = -2\ln(L_{2\text{-PL}}) - \left(-2\ln(L_{3\text{-PL}})\right) = 20166.30 - 20250.02 = -83.72

Equation 10.27. General equation for relative change between nested models

R^2_{\Delta} = \frac{LR_{\text{REDUCED MODEL}} - LR_{\text{FULL MODEL}}}{LR_{\text{REDUCED MODEL}}} = \frac{20166.30 - 20250.02}{20166.30} = .0042

10.27 Choosing a Model: A Model Comparison Approach

Now that the Rasch, one-, two-, and three-parameter models have been introduced, we turn
to the question of which model to use. As you may now realize, this is a complex question involving the entire test development process, not simply the mathematical aspects of fitting a model to a set of item response data. From a statistical perspective, I present a way to select among the possible models using a model comparison approach. Recall that the three-parameter model is the most general of those presented in this chapter. Working from the most
general three-parameter model (i.e., least restrictive in terms of assumptions or constraints
placed on the item parameters), we can statistically compare the two-, one-parameter, and
Rasch models to it because they are variations on the three-parameter model. For example,
by imposing the restriction that the c-parameter is zero (i.e., there is no possibility of guess-
ing or adverse test-taking behavior), we have the two-parameter model. Likewise, imposing
the restriction that the c-parameter is zero and that the a-parameter is set to a constant, we
have the one-parameter model. Finally, imposing the restriction that the c-parameter is zero
and that the a-parameter is set to a value of 1.0, we have the Rasch model. The adequacy of
each model can be tested against one another by taking the difference between the –2 log
likelihood values available in phase 2 output of BILOG-MG. For the three-parameter model,
the final convergence estimate for the –2 log likelihood is 20250.0185 (highlighted in gray
in the display on page 401). Below is a partial display of the three-parameter model output
from phase 2 BILOG-MG that illustrates the expectation–maximization (E-M) cycles from
the calibration process for our crystallized intelligence test 2 data.

Phase 2 output for 3-PL model illustrating the –2 log likelihood values, interval counts for item
chi-square fit statistics, and average ability (theta) values across eight intervals based on the
empirical item response data

[E-M CYCLES]

-2 LOG LIKELIHOOD = 22516.946

CYCLE 1; LARGEST CHANGE= 4.08434


-2 LOG LIKELIHOOD = 20388.056

CYCLE 2; LARGEST CHANGE= 0.32663


-2 LOG LIKELIHOOD = 20270.231

CYCLE 3; LARGEST CHANGE= 0.07953


-2 LOG LIKELIHOOD = 20253.568

CYCLE 4; LARGEST CHANGE= 0.06396


-2 LOG LIKELIHOOD = 20250.731

CYCLE 5; LARGEST CHANGE= 0.02147


-2 LOG LIKELIHOOD = 20250.082

CYCLE 6; LARGEST CHANGE= 0.03321


-2 LOG LIKELIHOOD = 20250.056

CYCLE 7; LARGEST CHANGE= 0.01379


-2 LOG LIKELIHOOD = 20249.965

CYCLE 8; LARGEST CHANGE= 0.00205

[NEWTON CYCLES]
-2 LOG LIKELIHOOD: 20250.0185

CYCLE 9; LARGEST CHANGE= 0.00262

INTERVAL COUNTS FOR COMPUTATION OF ITEM CHI-SQUARES


-------------------------------------------------------------
158. 52. 73. 154. 68. 233. 74. 188.
-------------------------------------------------------------

INTERVAL AVERAGE THETAS


-------------------------------------------------------------
-1.585 -1.127 -0.704 -0.387 0.047 0.435 0.739 1.388
-------------------------------------------------------------

SUBTEST CRIT2 ; ITEM PARAMETERS AFTER CYCLE 9

Next, I provide the same section of the BILOG-MG output based on calibration of the
data using the two-parameter model.

Phase 2 output for 2-PL model illustrating the –2 log likelihood values, interval counts for item
chi-square fit statistics, and average ability (theta) values across eight intervals based on the
empirical item response data

[E-M CYCLES]

-2 LOG LIKELIHOOD = 20358.361

CYCLE 1; LARGEST CHANGE= 1.65168


-2 LOG LIKELIHOOD = 20171.007

CYCLE 2; LARGEST CHANGE= 0.20020


-2 LOG LIKELIHOOD = 20167.612

CYCLE 3; LARGEST CHANGE= 0.06552


-2 LOG LIKELIHOOD = 20166.801

CYCLE 4; LARGEST CHANGE= 0.03836


-2 LOG LIKELIHOOD = 20166.412

CYCLE 5; LARGEST CHANGE= 0.00624


-2 LOG LIKELIHOOD = 20166.352

CYCLE 6; LARGEST CHANGE= 0.01116


-2 LOG LIKELIHOOD = 20166.340

CYCLE 7; LARGEST CHANGE= 0.00443

[NEWTON CYCLES]

-2 LOG LIKELIHOOD: 20166.3010


CYCLE 8; LARGEST CHANGE= 0.00142

INTERVAL COUNTS FOR COMPUTATION OF ITEM CHI-SQUARES


-------------------------------------------------------------
135. 73. 63. 168. 65. 234. 76. 186.
-------------------------------------------------------------

INTERVAL AVERAGE THETAS


-------------------------------------------------------------
-1.617 -1.172 -0.720 -0.404 -0.011 0.428 0.736 1.406
-------------------------------------------------------------

The –2 log likelihood values are formally called deviance statistics because they are
derived from the fitted (predicted) versus observed item responses to the IRF. Because the
two-parameter model is completely nested within the three-parameter model, and know-
ing the final –2 log likelihood values known for the two- and three-parameter models, we
can conduct a test of the difference between the final deviance (i.e., –2 log likelihoods)
values using the likelihood ratio test (LRT; Kleinbaum & Klein, 2004, p. 132), as illus-
trated in Equation 10.26a. The likelihood ratio statistic is distributed as a chi-square
when the sample size is large (e.g., the sample sizes normally used in IRT qualify).
Inserting the deviance values for the two-parameter and three-parameter models
into Equation 10.26a yields the result in Equation 10.26b.
To evaluate the difference between the two models, we need to know the degrees
of freedom for the two models. The degrees of freedom for each model are derived as the number of parameters in the model (e.g., for the three-parameter model there are three) times the number of items in the test (in our crystallized intelligence test 2 there are 25 items). Therefore, the degrees of freedom for the two-parameter model are 2*25 = 50, and for the three-parameter model, 3*25 = 75. Next, we subtract the two-parameter model's degrees of freedom (50) from the three-parameter model's degrees of freedom (75), yielding a result of 25. Next, we use the chi-square distribution to test whether
the change between the two models is significant; by consulting a chi-square table of
critical values with 25 degrees of freedom (testing at a = .05) we find a critical value of
37.65. Recall that our value of the difference between the two model deviance statistics

is –83.72. The absolute value of this difference, 83.72, exceeds the chi-square critical value of 37.65, so we reject the hypothesis that the two models fit the data equally well.
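The arithmetic in Equations 10.26a and 10.26b, together with the degrees-of-freedom comparison just described, can be reproduced with a short sketch such as the following (Python, using scipy only to obtain the chi-square critical value).

from scipy.stats import chi2

deviance_2pl = 20166.30    # -2 log likelihood for the 2-PL (reduced) model
deviance_3pl = 20250.02    # -2 log likelihood for the 3-PL (full) model

delta_lr = deviance_2pl - deviance_3pl    # -83.72 (Equation 10.26b)
df = 3 * 25 - 2 * 25                      # 75 - 50 = 25
critical = chi2.ppf(0.95, df)             # ~37.65

print(round(delta_lr, 2), df, round(critical, 2))
print(abs(delta_lr) > critical)           # True: reject the hypothesis of equal fit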
The deviance value for the two-parameter model (20166.30) is smaller than the deviance for the three-parameter model (20250.02). Because the difference between the two values is statistically significant, the two-parameter model appears to be the best choice given our data, unless there is an overwhelming need to employ a three-parameter model
for reasons previously discussed. Similarly, one can conduct a model comparison between
the two-parameter and one-parameter model to examine the statistical difference between
the two models (e.g., as in Table 10.15). However, the decision between using the one- or
two-parameter model may require more than a statistical test because a goal of the test
may be to estimate how the items discriminate differently for examinees of different ability
levels. Table 10.15 provides a summary of the three IRT models using the 25-item crystal-
lized intelligence test data with N = 1,000 examinees. The relative change values are derived
using Equation 10.27. In Equation 10.27 the deviance statistics are inserted into the equa-
tion to illustrate the relative change between the two- and three-parameter IRT models.
The column labeled “relative change” in Table 10.15 provides a comparison strategy
similar to that used in comparing multiple linear regression models. For example, in regres-
sion analysis a key issue is identifying the proportion of variance (R2) that a model accounts
for (i.e., how well a regression model explains the empirical data). The larger the R2, the bet-
ter the model explains the empirical data. Using this idea, we can compare our models by
examining the relative change in terms of proportion or percent change (or improvement)
across our competing models. Inspection of Table 10.15 shows that the relative change
from the one- to two-parameter model is very large (i.e., 97%). Next, we see that the change
between the three-parameter and two-parameter models is less than 1% (although the LRT
detected a statistically significant difference). Evaluating our three IRT models this way tells
us that the one-parameter model is the most parsimonious and that the difference between
the two- and three-parameter models, though statistically significant, is of little practical
importance from the perspective of how much variance each model explains. Based on the
model comparison results, it appears that the two-parameter model is the best to use if item
discrimination is an important parameter to be estimated for testing purposes. If item dif-
ficulty is the only parameter deemed as being important with regard to the goals of the test,
the one-parameter model provides an acceptable alternative to use.

Table 10.15.  Model Summary Statistics


Model   –2 lnL     Number of parameters   Relative change   AIC        BIC
1-PL    10229.75   25.00                                     10279.75   10402.44
2-PL    20166.30   50.00                  0.97 (97%)         20266.30   20511.69
3-PL    20250.02   75.00                  0.0042 (<1%)       20400.02   20768.10
Note. Relative change is calculated using Equation 10.27. Akaike information criterion (AIC; Akaike, 1973) and
Bayesian information criterion (BIC; Schwartz, 1978) are measures of model parsimony that consider the complexity
of a model, given the number of parameters estimated and the deviance statistic.
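The AIC, BIC, and relative-change entries in Table 10.15 can be reproduced under the usual definitions AIC = deviance + 2k and BIC = deviance + k·ln(N), where k is the number of estimated item parameters and N = 1,000 examinees. These formulas are our assumption about how the table was constructed, but they return the tabled values, as the sketch below shows.

import math

N = 1000
models = {                      # deviance (-2 lnL) and number of item parameters
    "1-PL": (10229.75, 25),
    "2-PL": (20166.30, 50),
    "3-PL": (20250.02, 75),
}

for name, (deviance, k) in models.items():
    aic = deviance + 2 * k
    bic = deviance + k * math.log(N)
    print(name, round(aic, 2), round(bic, 2))

# Relative change (Equation 10.27) between the 2-PL (reduced) and 3-PL (full) models
reduced, full = models["2-PL"][0], models["3-PL"][0]
print(round(abs(reduced - full) / reduced, 4))   # ~0.0042, i.e., less than 1%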

10.28 Summary and Conclusions

This chapter highlighted some key theoretical differences between CTT and Rasch and IRT
modeling. The ideas of weak and strong true score test theory were introduced. Next, the
assumptions of unidimensionality, local item independence, and invariance were covered,
and examples were provided regarding how these assumptions are evaluated with empiri-
cal data. We then turned to applied examples of the Rasch, one-, two-, and three-parameter
models for dichotomous item responses. Importantly, this chapter serves as a primer to
understanding other types of IRT models. For example, other types of Rasch and IRT mod-
els that are extensively used include (1) Rasch and IRT models for test items that yield item
responses scored on a partial credit basis (e.g., on problems in mathematics that require
steps in arriving at an answer); (2) Rasch and IRT models for attitude or rating scales that
yield item responses that are scored using Likert-type items (i.e., polytomous); (3) models
for exclusively nominal response data (e.g., in personality assessment); and (4) tests that are
multidimensional in structure (e.g., tests that measure two or more attributes or constructs
simultaneously). Excellent resources are available for learning about and implementing these
models including de Ayala (2009), Baker and Kim (2004), and Ostini and Nering (2006).
The foundational material in this chapter will serve you well in preparation for the transition
to using other Rasch and IRT models for addressing practical measurement problems.

Key Terms and Definitions


Computer adaptive testing. An interactive, computer-administered test-taking process
where the items presented to examinees are partially based on their responses to
previous items.
Dimensionality. A term used to describe the presence of one or more underlying abilities
or latent traits being measured by a set of test items.
Eigenvalue. A single value that represents the amount of variance in all test items that can
be explained by a particular principal component or factor.
Falsifiability. The tenet that an item response model cannot be shown to be correct
or incorrect in an absolute sense; rather, the appropriateness of a particular IRT
model relative to a particular set of observed data can be established by conducting
goodness-of-fit testing.
Goodness of fit. The congruence between observed item responses on a test or instru-
ment and those responses that are predicted according to a model.
Item characteristic curve. A trace line produced by a Rasch or IRT model.

Item information function (IIF). The contribution that an item makes to estimation of
person ability.
Item response function. The mathematical function that produces a trace line or ICC.

Item response theory. A system of modeling procedures that uses latent characteristics
of persons and test items as predictors of observed responses.
Joint maximum likelihood estimation. An early approach to the simultaneous estima-
tion of item parameters and person ability.
Latent class analysis. An analysis technique used when there are homogeneous sub-
populations of examinees within a sample. LCA can be used within the IRT framework
when multidimensionality is present in a measurement model.
Latent trait. A person’s underlying ability that is only observed indirectly.

Local independence. An axiom of IRT that states there is no statistical relationship (i.e.,
no correlation) between persons’ item responses to pairs of items once the primary
trait or attribute being measured is held constant or is accounted for.
Logistic function. An S-curve incorporating an exponent (2.718) where the initial
rate is exponential, slowing through the middle section, and finally reaching a
plateau.
Marginal maximum likelihood estimation. An optimal IRT parameter estimation tech-
nique in terms of consistent large-sample asymptotic properties of item parameter
estimates. This asymptotic consistency property exists for short and long tests.
Maximum likelihood estimate. An iterative numerical method in which the likelihood function of person ability (equivalently, the log likelihood function) is maximized, to a desired degree of precision, at the point where the slope of the tangent line equals 0.
Mixture model. A measurement or psychometric model that includes a heterogeneous
subpopulation of persons or examinees. Mixture IRT models also include item types
with heterogeneous response formats.
MLE[θ̂]. The maximum of the likelihood estimate of person ability.

Objective measurement. In the Rasch model, in order for measurement to be objective,


the property of invariant comparison must exist (a characteristic of interval or ratio-
level measurement in physics).
Principal axis factor analysis. A confirmatory factor-analytic technique that includes the
squared multiple correlation (R2) of each item with all other items on the diagonal of
the correlation matrix.
Rasch measurement. Model that serves as a standard or criterion by which data can
be judged to exhibit fit of the measurement and statistical requirements of the model.
The Rasch approach to measurement includes the process of using the mathematical
properties of the model to inform the construction of items and tests.
Scree plot. An X-Y graph of eigenvalues used to identify distinct breaks between the slope
of larger eigenvalues and smaller ones.
Specific objectivity. Allows for direct comparisons across different samples of persons
and items; a property known in the Rasch literature as sample-free measurement.

Strong true score theory. A theory that involves applying mathematical models to data
obtained on tests or other social and behavioral measuring instruments. The assump-
tions involved in applying the model correctly to real data are substantial as com-
pared to classical test theory. The true relationship between observed variables (i.e.,
item responses) and unobserved variables or latent traits formally classifies IRT as a
strong true score theory.
Test information function. The sum of the item information functions; represents the con-
tribution that a set of items comprising a test makes to estimation of ability.
Testlets. A set of related test items that is purposively designed to function as a unit; responses to the items within a testlet are therefore correlated. In IRT, this type of item set violates the assumption of local independence.
Unidimensionality. An assumption of unidimensional IRT models whereby responses
to a set of items are represented by a single underlying latent trait, dimension, or
continuum.
11

Norms and Test Equating

This chapter introduces norms and test score equating and the role each plays in psycho-
metrics. First, standard scores are introduced along with the role they play in testing. Next,
the development and use of standard scores is introduced along with techniques for creat-
ing normative scores. Examples of linear, equipercentile, and item response theory–based
methods of equating observed and true scores are provided with intelligence test data.

11.1 Introduction

A primary difficulty in interpreting test scores stems from the variety of scales that exist
in psychological measurement. Furthermore, there are a variety of examinee groups on
which the scales (and test items) are defined during the process of test development.
These circumstances reveal that it is nearly impossible for users of psychological tests
to develop practical familiarity with a number of scales and/or tests they may use. In
Chapter 2, foundational principles and concepts of psychological measurement were
introduced, and comparisons were made with familiar physical measurement scales (e.g.,
temperature, weight, and length). Interpreting numerical values acquired from using these
well-established standard physical measurement scales, which are so common to our daily
lives, requires no reference manual to describe their characteristics (e.g., the precision,
accuracy, and reliability of numbers the scales provide). Finally, there is no need for nor-
mative information about the measurements acquired on standard measurement scales
for temperature, weight, and length in order to ensure the correct use and interpretation
of the scores or numerical values obtained since direct experience with their use provides
standard guidance in most situations where these standard measurement scales are used.
In psychological (and educational) measurement and testing, we face substan-
tially more challenges than those encountered in standard physical measurements. In


fact, psychological measurement and testing is a complex endeavor requiring detailed


information and guidance on the proper use of test scores. Psychological testing is com-
plex because of (1) the multidimensional nature of constructs, attributes, and behaviors
being measured and (2) the multitude of tests available to measure a particular construct,
attribute, or behavior. Finally, we acknowledge that psychological measurement is more
imprecise relative to measurement in the physical sciences. Given the challenges men-
tioned, we turn next to the topic of norms, the process of norming, and norm-referenced
testing.
In this chapter, norms are defined along with a rationale for their use. The process
of norming is described with examples using the GfGc data. A description of norm­-
referenced testing is provided along with its proper use. Test equating is introduced,
and examples are provided specific to how equating of test scores works using the GfGc
data.

11.2 Norms, Norming, and Norm-Referenced Testing

Most standardized tests of achievement and ability in psychology and education use
norms such as percentiles, age, or grade equivalents and standard scores. A standard
score is a raw score converted from one scale to another where the latter employs an
arbitrary mean and standard deviation. Standard scores are more easily interpreted than
raw scores, and the position of an examinee’s performance relative to other examinees
is clearly indexed. The term norm is used in the scholarly literature to refer to a behav-
ior that is usual, average, normal, standard, expected, or typical (Cohen & Swerdlik,
2010, p. 111). Norms are defined as test performance data on a group of examinees used
as a reference for evaluating, interpreting, or placing in context of persons’ test scores
(Cohen & Swerdlik, 2010, p. 111). Norming is the process of creating norms based on
a normative sample—a sample of examinees whose performance is analyzed and then
used as a reference for other individual persons taking the test. Norm-referenced testing
and assessment is defined as a method for evaluating and interpreting an examinee’s score
by comparing it to the scores of other examinees on the same test.

11.3 Planning a Norming Study

The following points provide guidelines for planning and conducting a norming study or
project. The information is general enough to be adapted to the specific needs and goals
of a particular study. We use the characteristics of subjects in the GfGc dataset to make
the examples concrete.
1.  Decide on the (target) population to be used to derive the norms. Example: A
large sample is obtained from the U.S. population for the purpose of calculating norma-
tive values on the crystallized intelligence, fluid intelligence, and short-term memory

subtests. The sample is stratified by age (ages 15–90 years, grouped into eight age bands),
sex, and region, and the total sample size is at least 1,600 (e.g., 200 subjects per age
band). The sample includes a minimum of 25–100 individuals in each targeted demo-
graphic and language subgroup.
2.  Select the sampling strategy. Example: A probability sampling strategy will
be employed. Probability (random) sampling ensures that every person in a defined
target population has a known probability of being selected into the sample. Knowing
this probability of selection, we can compute estimates of sampling error, leading to
information about the precision of the statistics computed from the raw scores. Fur-
thermore, in selecting a probability sample, stratification (the partitioning of a popula-
tion into homogeneous subgroups or strata and sampling independently from each)
will increase the precision (via representativeness) of the population estimates based
on the sample data. Based on the information in point 1 above, our sampling strategy
will be stratified random. Other random sampling strategies include simple random
sampling, systematic sampling, and cluster random sampling. See Groves et al. (2009)
for details of various sampling strategies that are applicable to a variety of norming
study designs.
Note on nonrandom sampling: Sometimes norms are developed using samples of
convenience or samples acquired for a specific purpose (i.e., convenience or purpose-
ful sampling). For example, consider the increased use of the Internet for web-based
survey sampling or test administration. Although convenience or purposive samples are
sometimes used in situations where normative information is calculated and reported,
the drawback to developing and using norms based on these types of samples is the pos-
sibility of systematic bias that influences respondents’ or examinees’ data (responses).
Additionally, the composition of the sample is unlikely to represent the target population
of interest accurately. In this situation, poststratification and weighting adjustments are
possible to better align the sample with the population of interest (e.g., see Groves et al.,
2009).
3.  Select the statistics that will be calculated in preparation for the standard or
scale score creation using the norming sample. Example: The mean, standard devia-
tion, variance, skewness, kurtosis, and percentile ranks are based on raw scores from the
sample. Assuming we are using a random or probability-based sampling protocol, the
mean (or any other sample statistic) computed for the norming sample is an estimate of
the population parameter. For example, classical (long-run) probability theory tells us that, under random sampling, we can construct a frequency distribution of sample means around the single population mean (i.e., we assume that we take a large number of repeated independent random samples of a particular size and construct a sampling distribution of the means, and there is an estimate of error associated with that sampling distribution). Because we are using a random (probability-based) sample, the
error distribution is distributed normally (approximates the normal curve). The afore-
mentioned points provide us with a way to quantify the degree of sampling error in our
norming sample. This step in turn allows us to report the amount of error attributable to
our sampling protocol, which affects the accuracy of any interpretation made from using
the normative data.
4.  Decide on the level of sampling error that is acceptable. Example: Previously,
sampling error was introduced in the context of probability theory. Sampling error is the
discrepancy between the sample estimate and the population parameter. The acceptable
margin of sampling error depends on the goals of the norming study and how the scores
are to be used.
5.  Acquire the sample and review any anomalies that will influence the devel-
opment of the norms. Example: Conduct thorough data screening to identify outliers,
missing or out-of-range values. Carefully inspect the sample data for numerical errors or
out-of-range values; make corrections as necessary. Decide on and then apply decision
rules for replacing or imputing missing data points.
6.  Compute the values of group-level statistics such as the mean, variance,
skewness, kurtosis, and standard error of the mean (see the Appendix for a
review); an SPSS sketch illustrating this step follows the list.
7.  Select the type of normative scores that are most useful and develop raw
score to standard score conversion tables. Example: Percentiles or linear transformed
z-scores with a standard score metric of, for example, mean = 10/SD = 3; normalized scale
scores with, for example, an IQ scale score metric – mean = 100/SD = 15 for composite
scores (e.g., verbal IQ).
8. Develop detailed written documentation of the norming procedures.
9.  Draft guidelines or a technical manual for the purpose, development, use,
and interpretation of the norms.
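To make step 6 concrete, the following SPSS sketch produces the group-level statistics and a set of percentiles for one subtest in a norming sample. It is only an illustration; it assumes the raw total score variable cri1_tot used in the examples later in this chapter.

SPSS syntax for computing norming-sample statistics (a sketch)

FREQUENCIES VARIABLES=cri1_tot
  /FORMAT=NOTABLE
  /PERCENTILES=5 10 25 50 75 90 95
  /STATISTICS=MEAN MEDIAN STDDEV VARIANCE SKEWNESS SESKEW KURTOSIS SEKURT SEMEAN.

In an actual norming study, the same command would typically be run separately within each stratum (e.g., within each age band, by first sorting on and then splitting the file by the age-band variable), so that the statistics feed directly into the raw score-to-standard score conversion tables described in step 7.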

11.4 Scaling and Scale Scores

In Chapter 2, the fundamental properties of measurement were presented. Levels of
measurement were introduced and examples provided. Units of measurement were dis-
cussed, with implications for psychological measurement and testing. In this section, we
discuss a variety of scale scores that have been proposed for use with psychological tests.
Chapter 5 introduced scaling and provided several approaches to scaling psychological
data. You may want to review Chapters 2 and 5 for a review of measurement and scaling
after reading the following section of this chapter.
In the following section, we describe different types of derived or scale scores. Some
scale scores are defined as exhibiting approximately equal units of measurement (e.g.,
an interval level of measurement). Other scale scores are created in such a way that they
exhibit qualities or meaning in terms of the performance of some well-defined group of
people. Also, some scale scores exhibit meaning in terms of the quality of performance
on a test by persons based on judgments.
Raw Score Scale


The raw score scale has no meaning without supporting data that translate into meaning-
ful information. For example, supporting data may include a minimum score level that is
acceptable for matriculating to the next grade level in school or for receiving certification or
licensure. Another example of supporting data is normative information (e.g., raw scores allow us to
describe the performance of a set of persons with known characteristics). The raw score
scale is the sum of the items answered correctly. In this type of score scale, one score point
is considered to be the same amount of ability or achievement wherever it occurs along the
score scale. Based on the previous explanation, the difference between any particular num-
ber of score points is equal anywhere on the score scale (e.g., a raw score point difference
of 10 means the same thing whether the 10-point difference is observed in the middle or
the lower portion of the score scale). The raw score scale is restrictive in its utility because
the scores are a product of the items on the test—and the items that comprise a test exhibit
certain properties (e.g., item discrimination and difficulty based on the sample of examin-
ees responding to the items). One positive aspect of the raw score scale is that it lends itself
to identifying items that are performing poorly for a specific group of persons being tested.
Such information is useful for revising the test to improve its utility.
The lack of generality of raw scores typically limits their use beyond certain types of
testing (e.g., criterion-referenced testing or licensure testing). Perhaps one of the most prob-
lematic aspects of raw scores occurs when we want to create a second or alternate form of the
same test. For example, there is variation from form to form based on different items on each
form. This situation often creates confusion for test users and examinees. To circumvent this
problem, we can convert raw scores to an arbitrary scale metric that is different from any of the
original raw score scales. Typically, we convert raw scores to derived scores or scale scores
with a specified metric that facilitates explanation of examinee performance relative to a refer-
ence group. Once the scores for the two test forms are on the same scale (e.g., set to a standard
metric such as mean of 50 and standard deviation of 10), we can link or equate the standard
scores for the two forms of the test so that score interpretation is meaningful. The distinction
between linking and equating scores is presented later in the chapter.

11.5 Standard Scores under Linear Transformation

The unadjusted linear transformation is the most basic of the formal scaling methods.
To apply this method, a standard reference or normative sample is selected. Recall that the
method of sample selection depends on the goals of the norming study. For example, the
sample may be acquired randomly from a defined population with specific characteristics of
the sample or may be purposive (i.e., selected for the specific purpose of developing standard
scores for a particular group). Once the sample is defined and raw score data are acquired,
application of the unadjusted linear transformation method involves (1) relocating the raw
score mean at the desired scale score location and (2) ensuring a uniform change in the size of
the score units to yield the desired scale score standard deviation. Under the unadjusted linear
transformation method, only the mean and standard deviation of the raw score distribution
is changed (i.e., the first two moments of the distribution of scores). Therefore, the skewness
and kurtosis (i.e., the third and fourth moments of the distribution of scores) of the original
raw distribution is unaffected by the transformation to the standard scale score metric. Exces-
sive kurtosis or leptokurtosis in the new scale score distribution will mirror the characteristics
of the original distribution to the same degree. Finally, the linear transformation method does
not transform the raw score units of measurement to a scale where equal units of measure-
ment are obtained (i.e., ensuring an interval level of measurement).
Once the data are acquired from the reference or normative sample, the goal in
the unadjusted linear transformation method is to create standard score deviates (i.e.,
z-scores) for each corresponding raw score in the original distribution. Conceptually, this
is illustrated in Equation 11.1.
Creating the new standard or scale scores involves a transformation using Equation
11.2 where we calculate the slope and intercept of a straight line. Using the constants for
the slope and intercept, we can derive the new standard scores for each raw score in the
original distribution. Equation 11.2 illustrates how to derive the new scale scores using
the slope and intercept equation.
To provide a working example of Equations 11.1 and 11.2, we use a subset of the
GfGc data consisting of examinees 15 to 20 years of age. The total sample size for this
group is N = 231. Our goal is to create a scale score with a mean of 10.0 and a standard
deviation of 3.0 from the distribution of the raw number correct scores for the crystallized
intelligence test 1 (vocabulary subtest). We will use the unadjusted linear transformation
method to accomplish our goal. The transformation of the original raw scores to the scale
scores can be accomplished using the COMPUTE command in SPSS. However, first we

Equation 11.1. Linear scaling equation

z_T = (T − M_T)/s_T = (X − M_X)/s_X = z_X

• z_T = normal curve deviate for target scale score.
• z_X = normal curve deviate for original raw score.
• T = target scale being created.
• M_T = mean for target scale scores.
• s_T = standard deviation for target scale scores.
• X = raw score scale.
• M_X = mean for the raw score scale.
• s_X = standard deviation for original raw scores.
Equation 11.2. Intercept and slope for converting raw scores to scale scores

T = AX + B

• T = target scale scores being created.
• A = slope, defined as the ratio of the standard deviation of the target distribution to the standard deviation of the raw score distribution: A = s_T/s_X.
• X = raw score on original distribution.
• B = intercept, defined as the mean of the target distribution minus the slope (A) times the mean of the original raw score distribution: B = M_T − A(M_X).

need to calculate the constants for the slope and intercept so that we can use these in the
SPSS COMPUTE command. Next, we need to obtain the mean and standard deviation
of the original raw score distribution for crystallized intelligence test 1. Below are the
descriptive statistics for the original raw score distribution for crystallized intelligence test 1.
Next, using the information provided in Table 11.1, we can calculate the slope (A)
and intercept (B) constants required for creating our new scale scores as follows.

Table 11.1.  Descriptive Statistics for Raw Scores on Crystallized Intelligence Test 1
Gc meas of vocabulary
N Valid 231
Missing 0
Mean 33.5281
Median 35.0000
Std. Deviation 8.38673
Skewness -.382
Std. Error of Skewness .160
Kurtosis -.551
Std. Error of Kurtosis .319
Minimum 10.00
Maximum 49.00
Equation 11.3. Calculation of slope and intercept for linear transformation

A = 3.0/8.386 = .357

B = 10.0 − .357(33.528)
  = 10.0 − 11.969
  = −1.969

• 3.0 = standard deviation for the target or new scale score distribution.
• 8.386 = standard deviation for the original raw score distribution.
• 10.0 = mean for the target or new scale score distribution.
• 33.528 = mean for the original raw score distribution.
Note. A mean of 10.0 and a standard deviation of 3.0 were selected arbitrarily. Other transformations such as mean of 50 and standard deviation of 10 or mean of 100 and standard deviation of 15 are commonly used.

Finally, using the constants in Equation 11.3 we can calculate the new scale scores
using the following SPSS COMPUTE command syntax.

SPSS COMPUTE syntax for creating linearly transformed scale scores

COMPUTE cri1_tot_SS=cri1_tot*.357-1.969.
EXECUTE.

To verify that our linear transformation was successful according to theory, we can
inspect the descriptive statistics of the raw and scale score distributions. For example, the
mean of the scale score distribution should be 10.0, and the standard deviation should
be 3.0. Recall that in the unadjusted linear transformation method only the mean and
standard deviation are used in deriving the slope and intercept constants. Therefore, the
skewness and kurtosis for the new scale score distribution should be unchanged from
the original raw score distribution. We can check to see if this is true by inspecting
the descriptive statistics (Table 11.2) of both distributions. Reviewing the statistics in
Table 11.2, we see that the unadjusted linear transformation worked as anticipated. By
using the COMPUTE command in SPSS, we have created a new variable in our dataset
Table 11.2.  Descriptive Statistics for Raw and Transformed Scale Scores

                           Crystallized             Scale score using linear
                           Intelligence Test 1      transformation to 10/3 metric
N (valid)                  231                      231
N (missing)                0                        0
Mean                       33.5281                  10.0005
Median                     35.0000                  10.5260
Std. deviation             8.38673                  2.99406
Skewness                   -.382                    -.382
Std. error of skewness     .160                     .160
Kurtosis                   -.551                    -.551
Std. error of kurtosis     .319                     .319
Minimum                    10.00                    1.60
Maximum                    49.00                    15.52

labeled “cri1_tot_SS,” where each original raw score now has an associated scale score on
the mean of 10.0 and a standard deviation of 3.0 metric (within rounding error).
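The statistics in Table 11.2 can be reproduced with a FREQUENCIES command along the following lines (a sketch; the exact commands used to build the table are not shown in the chapter):

FREQUENCIES VARIABLES=cri1_tot cri1_tot_SS
  /FORMAT=NOTABLE
  /STATISTICS=MEAN MEDIAN STDDEV SKEWNESS SESKEW KURTOSIS SEKURT MINIMUM MAXIMUM.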

11.6 Percentile Rank Scale

A percentile rank corresponds to a specific raw score where the percentage of examinees in
the norm group scored below the score of interest (Crocker & Algina, 1986, p. 439). Percen-
tile ranks are useful for making relative or normative evaluations of examinee’s performance
within a specific group. The percentile rank scale is a type of normative scale that provides the
percentage of examinees in a specific group scoring below the midpoint of each score or score
interval. The percentile rank is defined in Equation 11.4 (Crocker & Algina, 1986, p. 439).
In calculating percentile ranks, we assume that the underlying construct of ability
or achievement is continuous (even though raw scores are discrete variables). Given this
assumption, each raw score point represents a score interval on the ability or achievement
continuum. To properly account for raw score intervals on the ability or achievement con-
tinuum, theoretically one-half of the examinees are expected to score below the midpoint
and one-half are expected to score above the midpoint. Therefore, in the numerator of
Equation 11.4, the value of .5 is used to index the midpoint of a particular class interval.
Equation 11.4. Percentile rank

PR = 100[cfl + .5(fi)]/N

• PR = percentile rank of the score (midpoint of the score interval) of interest.
• cfl = cumulative frequency for all scores lower than the score of interest.
• fi = frequency of scores in the interval of interest.
• N = sample size in the norming study.

Using percentile ranks as normative information is sometimes based on a random
sample of examinees from a more general population. Alternatively, percentile ranks
may be used with specific groups of people selected for their particular social or demo-
graphic characteristics. Finally, the percentile rank scale is ordinal, thereby making the
assumption of equal score units untenable. For example, the units are inherently unequal
because they yield proportions of scores at score intervals for a group along the score
scale rather than equal intervals on an ability or achievement scale expressed in score
point units. Table 11.3 illustrates the relationship among the raw scores, percentile ranks,
linear transformed z-scores, and normalized z-scores for the N = 231 sample of persons
on crystallized intelligence test 1.
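As a hand check on Equation 11.4, consider the raw score of 36 in Table 11.3: the cumulative frequency below 36 is cfl = 124, the frequency at 36 is fi = 17, and N = 231, so PR = 100[124 + .5(17)]/231 = 100(132.5/231) ≈ 57, which matches the tabled percentile rank.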

11.7 Interpreting Percentile Ranks

Percentile ranks are easy to compute and appealing to laypersons and professionals alike.
They are used extensively in reporting or communicating the results of standardized or
norm-referenced tests. For example, provided that characteristics of the testing scenario
such as time of year and administration procedures were the same as those experienced
by the group originally used to develop the norms, certain statements can be made about
a person’s performance. For example, a person with a raw score of 44 scored higher
than 90% of the examinees in the norm group. Percentile ranks have one primary short-
coming: they distort the measurement scale, and this is particularly true at the extreme
regions of the scale (Gregory, 2000, p. 64).
Consider the case where four persons take an exam and person 1 scores at the 50th
percentile, person 2 at the 60th percentile, person 3 at the 90th percentile, and person 4 at the
99th percentile. Is the difference in raw score points between person 1 and 2 the same as
the difference in raw score points for persons 3 and 4? It seems that the difference may be
the same. However, inspection of Figure 11.1 reveals that the distance between the 50th
and 60th percentile ranks (e.g., 10 percentile points—not raw score points) is the same
Table 11.3.  Raw Scores, Frequencies, Percentile Ranks, and Linear z-Scores

Raw score    f    cf    Percentile rank    Linear z-score    Normalized z-score
10 1 1 1 −2.80 −2.33
13 1 2 1 −2.40 −2.33
14 1 3 1 −2.30 −2.33
15 1 4 2 −2.20 −2.06
16 1 5 2 −2.10 −2.06
17 3 8 3 −1.90 −1.89
18 4 12 4 −1.80 −1.75
19 2 14 6 −1.70 −1.56
20 4 18 7 −1.60 −1.48
21 7 25 9 −1.50 −1.34
22 4 29 12 −1.40 −1.18
23 3 32 13 −1.20 −1.13
24 6 38 15 −1.10 −1.04
25 6 44 18 −1.00 −0.92
26 7 51 21 −0.90 −0.81
27 6 57 23 −0.77 −0.74
28 6 63 26 −0.66 −0.65
29 11 74 30 −0.54 −0.53
30 7 81 34 −0.42 −0.42
31 4 85 36 −0.30 −0.36
32 10 95 39 −0.18 −0.28
33 11 106 44 −0.06 −0.16
34 7 113 47 0.05 0.02
35 11 124 51 0.17 0.03
36 17 141 57 0.29 0.08
37 8 149 63 0.41 0.23
38 8 157 66 0.53 0.39
39 11 168 70 0.65 0.50
40 8 176 74 0.77 0.62
41 12 188 79 0.89 0.71
42 10 198 84 1.00 0.88
43 7 205 87 1.10 1.00
44 4 209 90 1.12 1.13
45 8 217 92 1.40 1.23
46 7 224 95 1.50 1.41
47 3 227 98 1.60 1.65
48 2 229 99 1.70 1.89
49 2 231 100 1.80 2.06
Note. Percentile ranks are rounded to the nearest one-hundredth. Normalized z-scores are
obtained from the standard normal deviate table (z-table).
Figure 11.1.  Percentile ranks in a normal distribution. Adapted from Gregory (2000, p. 64).
Copyright 2000. Reprinted by permission of Pearson Education, Inc. New York, New York.

as the distance between the 90th and 99th percentile ranks (i.e., 10 percentile points).
However, the difference in raw score points between the 50th and 60th percentiles is
much less than the difference in raw score points between the 90th and 99th
percentiles. To summarize, there is not a one-to-one correspondence between percentile ranks and
raw scores throughout the score continuum. This is because the conversion of raw scores to
percentile ranks is nonlinear rather than linear.

11.8 Normalized z- or Scale Scores

The decision to create normalized standard or scale scores during the process of creating
norms varies according to the goal(s) of how scores on the test will be used. For exam-
ple, for tests of ability (i.e., intelligence) or achievement (i.e., education or scholastically
based), the practice of using normalized scale scores is defensible for at least two reasons.
First, the items on such tests are developed or written in such a way that the distribution of
responses provided by an examinee group will approximate the normal distribution. Second,
the development of normalized scale scores usually involves large samples during the
norms development process. Creating norms using the normalized scale score approach on
large-scale tests of ability and achievement involves meticulous steps to ensure that (1) the
items are written in such a way as to yield a distribution of scores that is approximately
normally distributed and (2) large, representative samples are acquired for the norming
process. Under this scenario, creating and using the normalized scale scores makes sense
and is appropriate. However, in the situation where local norms (i.e., norms for a specific
group of examinees where generalization to outside groups is not conducted) are being
developed and the distribution of scores is not normally distributed (and is not expected
to be), creating normalized scale scores is arguably the incorrect decision. Instead, alterna-
tive norms (e.g., linear z-scores transformed to a useful metric or percentile scores) can be
developed that are more appropriate for the intended use of the test scores.
Normalized scale scores are created by applying a nonlinear transformation of the
original raw score distribution. Specifically, the inverse of the standard normal cumula-
tive distribution relative to the proportion of each raw score estimate in a distribution of
scores is applied to create the normalized scale scores. Although most raw score distribu-
tions do not meet the criteria for being classified as “normally distributed,” at times it
is reasonable to create normalized scale scores using the inverse of the standard normal
cumulative distribution. Normalized scale scores are typically created in a manner that
provides representation of (1) an examinee’s score relative to the norm group and (2) the
location of that norm group’s distribution in relation to that of other group distributions
(Crocker & Algina, 1986, p. 453). When applying this approach, the raw score distribu-
tion is changed or transformed into a metric that meets the normal distribution criteria.
The advantage of using normalized scale scores based on a z-score scale metric is that
regardless of the sample respondents involved, for each score point on the scale, a fixed
percentage of cases (i.e., persons) fall above and below that point. Normalized scale scores
differ from linear z-scores (transformed from raw scores) in that the normalized scores adhere
to the properties of the normal distribution (e.g., the mean, standard deviation, skewness,
and kurtosis values adhere to the characteristics of the normal distribution). For this rea-
son, linear z-scores and normalized scale scores (or z-scores) will differ depending on how
the raw score distribution originally departed from the normal distribution.
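In symbols, the normalization just described amounts to z(normalized) = Φ⁻¹[(r − .5)/w], where Φ⁻¹ is the inverse of the standard normal cumulative distribution function, r is the rank of a raw score in the sample (the mean rank when scores are tied), and w is the number of observations; this is the RANKIT fraction used in the SPSS syntax that follows.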
The following SPSS syntax produces the normalized scale score values in the last col-
umn of Table 11.4. Normalized scale scores are obtained using the inverse of the standard
normal cumulative distribution of the proportion estimate in the sample.

SPSS syntax for producing normalized scale scores

RANK VARIABLES=cri1_tot (A)
/NORMAL INTO cri1_NSS
/RANK
/PRINT=YES
/TIES=MEAN
/FRACTION=RANKIT.
Table 11.4.  Raw Scores, Frequencies, Percentile Ranks, and Normalized Scale Scores

Raw score    f    cf    Percentile rank    Linear z-score    Normalized scale score
10 1 1 1 −2.80 −2.85
13 1 2 1 −2.40 −2.48
14 1 3 1 −2.30 −2.29
15 1 4 2 −2.20 −2.16
16 1 5 2 −2.10 −2.06
17 3 8 3 −1.90 −1.91
18 4 12 4 −1.80 −1.71
19 2 14 6 −1.70 −1.59
20 4 18 7 −1.60 −1.48
21 7 25 9 −1.50 −1.32
22 4 29 12 −1.40 −1.19
23 3 32 13 −1.20 −1.12
24 6 38 15 −1.10 −1.03
25 6 44 18 −1.00 −0.92
26 7 51 21 −0.90 −0.82
27 6 57 23 −0.77 −0.73
28 6 63 26 −0.66 −0.64
29 11 74 30 −0.54 −0.53
30 7 81 34 −0.42 −0.42
31 4 85 36 −0.30 −0.36
32 10 95 39 −0.18 −0.28
33 11 106 44 −0.06 −0.16
34 7 113 47 0.05 0.06
35 11 124 51 0.17 0.03
36 17 141 57 0.29 0.18
37 8 149 63 0.41 0.33
38 8 157 66 0.53 0.42
39 11 168 70 0.65 0.53
40 8 176 74 0.77 0.66
41 12 188 79 0.89 0.79
42 10 198 84 1.00 0.97
43 7 205 87 1.10 1.14
44 4 209 90 1.12 1.26
45 8 217 92 1.40 1.41
46 7 224 95 1.50 1.69
47 3 227 98 1.60 1.97
48 2 229 99 1.70 2.21
49 2 231 100 1.80 2.58
Note. Percentile ranks are rounded to the nearest one-hundredth. Normalized scale scores are obtained using
the inverse of the standard normal cumulative distribution of the proportion estimate. To derive the ranks in
the distribution, use the formula (r − 1/2)/w, where w is the number of observations and r is the rank, ranging
from 1 to w.
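As a hand check of this computation (assuming ascending ranks with ties averaged, per the /TIES=MEAN specification above): the 11 examinees with a raw score of 29 occupy ranks 64 through 74, so the mean rank is r = 69, the fraction is (69 − .5)/231 = .2965, and the normalized value is the z-score cutting off the lower .2965 of the normal curve, approximately −0.53, the value shown in Table 11.4.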
Next, we can evaluate and/or compare the observed (original raw score) versus
expected shape (i.e., normal) of the distribution of the variable cri1_tot using the syn-
tax below (Figures 11.2 and 11.3). We do this for the original raw score followed by
the normalized scale score variable to evaluate if the normalized scale score in fact fits
the normal distribution. Inspection of Figure 11.3 reveals this to indeed be the case.

SPSS syntax for producing quantile–quantile plots of raw and normalized scale scores

*Raw score syntax:
PPLOT cri1_tot
/FRACTION=rankit.

*Normalized score syntax:
PPLOT cri1_NSS
/FRACTION=rankit.

Note. The "rankit" option in SPSS applies the formula (r-1/2)/w, where w is the number of observa-
tions and r is the rank, ranging from 1 to w. Other options are available for deriving percentile ranks
depending on the goal of the norming study.

Figure 11.2. Normal Q–Q plot of cri1_tot raw score variable. A departure from normality occurs at the upper end of the raw score distribution. This is depicted by the dots moving away from the diagonal line.
Figure 11.3.  Normal Q–Q plot of cri1_tot normalized scale score variable. No departure from
normality occurs in the normalized scale score distribution. This is depicted by the solid dots fall-
ing on the diagonal line through the range of z = −3.0 to +3.0.

11.9 Common Standard Score Transformations or Conversions

One problem with normalized scale scores, such as those displayed in Table 11.4, is that
negative scores exist in the distribution, making interpretation and reporting results to
nontechnical audiences a challenge. We present two techniques for converting the scale to
a more useful metric.
The first example involves transforming the linear z-scores created and displayed in
Table 11.4 to a derived score. We can linearly transform the linear scale scores to another
metric. For example, the metric for subtests on the Wechsler test of intelligence and mem-
ory is mean = 10 and standard deviation = 3. To transform the linear scale scores previously
created to this metric, we can use the SPSS syntax below. Using the syntax, a new variable
is created in the dataset from the linear scale score variable "cri1_tot_LSS." Note that this
transformation places the scores on a more useful metric defined by the new (1) location (mean)
and (2) scale (standard deviation), while the (3) skewness and (4) kurtosis of the original raw
score distribution are retained. The SPSS dataset
“GfGc_Ageband_01.sav” includes the results of applying the syntax below.
*Derived score conversion syntax:
COMPUTE cri1_tot_subtest_metric1=10+3*cri1_tot_LSS.
EXECUTE.
The second example involves transforming the normalized z-scores created and dis-
played in Table 11.4 to normalized scale scores. Again for our example, we use the metric
for subtests on the Wechsler test of intelligence and memory where the mean = 10 and the
standard deviation = 3. To transform the normalized z-scores previously created to this
metric, we can use the SPSS syntax below. Using the syntax, a new variable is created in the dataset
from the normalized score variable "cri1_NSS." Note that although the metric changes, this conver-
sion retains the properties of the normal distribution created by applying the inverse of
the standard normal cumulative distribution. The SPSS dataset “GfGc_Ageband_01.sav”
includes the results of applying the syntax below.

*Normalized scale score conversion syntax:
COMPUTE cri1_tot_subtest_metric=10+3*cri1_NSS.
EXECUTE.

Creating Derived or Scale Scores for Composites


Sometimes a collection of subtests yield scores that work in unison to make up a com-
posite score based on a theory. For example, in the GfGc dataset, four subtests, each
measuring a specific part of crystallized intelligence, work together to reflect one type
of crystallized intelligence (e.g., each subtest measuring a different but related part of
crystallized intelligence). Examinee responses to each of these subtests may be summed
to create a composite score for each person for crystallized intelligence. If the goal of a
norming study is to create composite score norms, the procedure to achieve this goal
involves the following steps.

1. Create the linearly transformed z-scores or normalized scale scores that align
with the goal(s) of the test. In large-scale testing programs, these are typi-
cally normalized scale scores derived from the inverse cumulative normal density
function.
2. Smooth the newly created standard scores with the help of either a curve-fitting
program or a function. You can use programs such as SPSS or SAS or Origin
to facilitate the smoothing process, although some manual adjustments may be
required for the final norms.
3. Sum the smoothed standard scores for each subtest to create the composite score
for all examinees.
4. Modify the “normalized scale score conversion syntax” previously presented as:
COMPUTE new_IQ_composite_variable_name=100+15*variable name
for the sum of the four subtests that were normalized.
EXECUTE.

Following the steps above results in normalized composite scores for all examinees.
Typically, these scores will not need additional smoothing since the smoothing process
was conducted at the level of each subtest.
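A minimal SPSS sketch of steps 3 and 4 follows, using hypothetical variable names (gc1_NSS through gc4_NSS) for the smoothed, normalized subtest scores from steps 1 and 2. The summed score is restandardized here before conversion so that the resulting composite actually has the intended mean of 100 and standard deviation of 15 in the norming sample.

*Composite score sketch (hypothetical variable names).
COMPUTE gc_sum=gc1_NSS + gc2_NSS + gc3_NSS + gc4_NSS.
EXECUTE.
DESCRIPTIVES VARIABLES=gc_sum
  /SAVE.
COMPUTE gc_IQ_composite=100 + 15*Zgc_sum.
EXECUTE.

The /SAVE subcommand on DESCRIPTIVES writes the z-score version of gc_sum to the dataset under the name Zgc_sum, which is then placed on the IQ metric.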
11.10 Age- and Grade-Equivalent Scores

The goal of creating an age- (or grade-) equivalent scale is to communicate the meaning
of a child’s or person’s test performance in terms of what is typical of a child or person at a
particular age or grade. Such scores are used primarily at ages (or grades) where ability or
achievement increases rapidly with age (e.g., in developmental studies of reading ability
or growth with young children). The following steps adapted from Angoff (1984, p. 20)
provide a general framework for constructing age (or grade) equivalent scores within the
context of a norming study. Children (i.e., very young through adolescence) serve as our
example since age-equivalent scores are most often applied to this group.

1. A representative sample is acquired over the target age ranges of children to be
scaled. Age bands or groups are formed using children who are typically grouped
within six months of a particular birthday. Test items should include a difficulty
range from very easy for very young children to difficult (even for older children
outside of the age band range).
2. The mean (or median) test score of the children at each age band or range is
identified and plotted against the midpoint of the age band. Using the median
may be preferable depending on the shape of the distribution of raw scores over
the range of age.
3. A smooth curve is applied to the points in step 2 in a manner that minimizes the
distance from each point to the curve while retaining the mathematical/statistical
relationship among the points. Smoothing techniques are used to ensure that
there are no gaps in the scores along the fitted distribution. Computer-assisted
smoothing involves a combination of algorithms (e.g., loglinear or cubic spline;
Kolen & Brennan, 2004) and hand smoothing of norms. Smoothing is a somewhat
subjective exercise that requires a compromise between (1) the distances between
the score points and the curve being fitted to the data and (2) the relationship
among the points (i.e., from a statistical perspective). Often, hand smoothing
and analytic- (computer-) driven smoothing are conducted together.
4. The smoothed value of each mean score is assigned at the age band designation.
The designations are age equivalents (i.e., they are the chronological ages for
which the given test performances are average; Angoff, 1984, p. 20).
5. Year and month values are acquired through interpolation on the curve.
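For example (with hypothetical values), if the smoothed curve gives a mean score of 30 at age 9 years 0 months and 36 at age 10 years 0 months, a raw score of 33 falls halfway between these points, so its age equivalent obtained by linear interpolation is approximately 9 years 6 months.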

Problems with Using Age- and Grade-Equivalent Scores


Although seemingly appealing to use, age- and grade-equivalent scores are problematic
for several reasons. The problems with grade-equivalent scores mirror those of age-equivalent
scores. In the following examples, I refer to age-equivalent scores. The first problem is
the ambiguity in age-equivalent scores when attempting to correlate mental ability (or
achievement measured through mental testing) and chronological age. For example, a
perfect correlation or association does not exist between age and ability (represented by
score performance). This fact is easily demonstrated by regressing (1) age on test score
or regressing (2) test score (performance) on age. These two regression lines will be dif-
ferent; therefore, mental age (as linked to ability) will be different as well. The result is
that the same mental age may be assigned to children with different test scores. Finally,
the lower the correlation between age and test performance (and a low correlation may occur
for several reasons), the greater the challenge in interpreting age-equivalent scores.
The second problem with age-equivalent scores is that the curve estimated in step 3
above does not capture the variability along different score values on the curve. The prob-
lem is that the age-equivalent score can yield a distorted view of a child’s advancement
or lack thereof. Depending on the level of the test’s reliability, this issue may be highly
problematic (e.g., if the score reliability is low). Consider the situation where the correla-
tion between age and test score is low and the variation about the regression line is high.
A child’s score under these circumstances will place him or her more than two years above
his or her actual age. Alternatively, if the correlation between age and test score is high
and the variation about the regression line is low, a child located at the 95th percentile
will be classified as precisely two years advanced beyond his or her age. The point is that
the variation about the regression line for age and test score is not constant across the score
scale, and this distorts any interpretation (and therefore limits utility for users and the public).
The final reason that age-equivalent scores are problematic is that the concept of
mental score (defined as the same intellectual performance regardless of chronological
age) oversimplifies the study of individual differences. For example, although a 5-year-
old may have the same intelligence score as a 10-year-old, important differences in these
children exist. To this end, all that can be stated is that the 5-year-old is bright or that
her score locates her above the 99th percentile in comparison with other children her age.

11.11 Test Score Linking and Equating

The literature on test score scaling, linking, and equating is extensive, and this section of
the chapter provides an overview of these techniques. The primary focus is on test score
equating, supplemented by information on score linking. Many techniques and procedures
have been developed for equating test scores. Holland and Dorans (2006) classify the
procedures and techniques according to (1) common-population versus common-item
data collection designs, (2) observed-score versus true-score equating procedures, and
(3) linear versus nonlinear techniques. In this chapter, we focus on linear and nonlinear
observed score techniques and a true score equating technique based on item response
theory.
Test score linking describes the transformation from a score on one test to a score
on another test; test score equating is a special type of score linking (Dorans, Moses,
& Eignor, 2011). The term link is used to describe the transformation from a score
on one test to a score on another test. In score linking, techniques exist that allow for
(1) predicting scores on one test from other information about examinees, (2) aligning
scales, and (3) equating scores (Holland & Dorans, 2006). Scale aligning and score
equating are often confused because equating is a type of scale aligning that requires
exceptionally strong requirements of the test forms (i.e., scores) being linked. In scale
aligning, the goal is to transform the scores from two different tests onto a common
scale. Figure 11.4 illustrates the different uses of scale linking and the scores produced
from each technique.
This section of the chapter focuses on the strongest form of linking between test
scores—score equating. For a thorough exposition on score linking, equating, and cali-
bration, readers are encouraged to see Kolen and Brennan (2004) and von Davier (2011).
Many testing and assessment programs have different versions of the same test that
produce scores useful in an interchangeable manner, even though the exact items on each
test version differ. The goal of equating is to “produce a linkage between two test forms
such that the scores from each test form can be used as if they had come from the same
test” (Dorans et al., 2011). At the heart of score equating is the idea that in order for scores
to be equivalent, they must be exchangeable. In order for scores on two tests to be truly
exchangeable, the following conditions are required (Dorans et al., 2011).

1. The two tests should measure the same construct, latent trait, or ability.
2. The two tests should exhibit equal estimates of score reliability.
3. The equating transformation function used for mapping the scores of test Y onto
those of test X should be the inverse of the equating transformation for mapping
scores of X to those of Y.
4. It should make no difference to the examinee regarding which of the two tests
he or she takes.
5. The equating function used to link the scores of X and Y should be the same
regardless of choice of population or subpopulation from which it was derived.

Figure 11.4.  Three categories of test score linking methods and associated goals.
An example of exchangeability is as follows. Consider administration of certification
and/or licensure examinations. In this scenario, multiple forms of a test with the same
measurement goal may be administered during the course of a single year. Administering
multiple versions of the same test during the year may result in different score distribu-
tions because different examinees take different versions of the test. If the distributions
of scores are not the same, a common practice is to establish equivalent scores on the
test forms. Deriving equivalent scores on different forms of a test under the scenario just
described is called horizontal equating. In practice, if two tests (e.g., test X and test Y)
have been equated, when a person takes test X, we can derive the equivalent score on test
Y—as if the person had also taken test Y. Using another example, we find that in large-scale
educational testing, different school districts sometimes use different tests or batteries of
tests. At times it is useful or even necessary to equate the scores from the different forms
or test batteries across multiple districts using horizontal equating techniques.
Next, consider the situation where a test is used to measure several developmental
levels of achievement. For example, a test of reading achievement is to be used at differ-
ent levels of a child’s development or possibly at different grade levels. If a child is for-
mally classified or enrolled at a particular grade level (e.g., based on his or her age) but
is tested at the previous grade level because of lagging progress, the child’s score at the
grade or developmental level in which he or she is tested can be equated to scores at the
actual grade level in which he or she is enrolled. This type of score equating is called ver-
tical equating. In this chapter, an introduction to horizontal equating is provided with
examples. Readers interested in applications of vertical linking and equating are referred
to Kolen and Brennan (2004) and von Davier (2011).
The technique of equating involves establishing scores that are equivalent on differ-
ent tests or measurement instruments. In Figure 11.5, we see that linking test scores across
forms in test equating can be conducted using common population-based methods (i.e.,
where random sampling and assignment to test forms is possible) or anchor test meth-
ods. These two categories are further divided into true and observed score techniques
using classical test theory or item response theory. Common population methods include
random assignment of examinees to test forms, whereas the anchor test method incorpo-
rates a common set of items that both examinee groups take and the groups are not nec-
essarily randomly assigned to take the different test forms (in fact, random assignment
is usually not used). Under these circumstances, this approach is called a nonequivalent
anchor test (NEAT) design.

Conduct of an Equating Study


In an equating study, examinees are randomly assigned to take test form X or Y (a.k.a.
form A or B) or both. In this section, we review three designs used in conducting equat-
ing studies. The three designs include:

1. Different tests or instruments administered to groups of examinees formed
according to random assignment.
[Figure 11.5 is a flow diagram. Test equating divides into common group/population methods and anchor/reference test methods; these branches are further divided into observed score and true score techniques based on classical test theory or item response theory. The anchor/reference test branch also lists chain equating, poststratification equating, and the observed score Levine method (common items/nonequivalent groups).]

Figure 11.5.  Types of test score linking methods used in test equating.

2. A counterbalanced approach where group 1 takes form X followed by form Y. To
prevent an order effect, the exams are administered in all possible orders where
order is assigned randomly.
3. Different tests administered to different groups of examinees. Additionally, all
examinees take an anchor test “Z” (e.g., a common set of items consisting of
approximately 30% of the length of each of the two tests; Angoff, 1984). For
example, for a 30-item test, the anchor test would include 10 items. Each test is
administered to one of the two groups, and random assignment is not required.

Table 11.5 (Crocker & Algina, 1986, p. 458) summarizes these three designs.

11.12 Techniques for Conducting Equating: Linear Methods

Test score equating techniques are classified according to three primary categories. For
example, equating is conducted using (1) linear, (2) equipercentile, or (3) item response
Table 11.5.  Three Designs for Equating Studies

              Group
Design        1          2
I             X          Y
II            X:Y(a)     Y:X
III           X,Z(b)     Y,Z
Note. From Crocker and Algina (2006). Copyright 2006 by South-Western, a part of Cengage Learning, Inc. Reprinted by permission. www.cengage.com/permissions.
(a) Signifies that group 1 takes form X followed by form Y.
(b) Signifies that group 1 takes form X and anchor form Z.

theory-based methods. For each technique, specific assumptions are required in order
to produce accurate results from the score equating exercise. An equating function is a
transformation of raw scores on test X to the scale of raw scores on test Y. When equating
is successful, the equating function estimated from any other population is very similar.
This is true even though the equating function was estimated from a random sample of
examinees from the population. The role of random assignment is critical in equating
studies. To understand why, recall that our goal in equating is to compare the perfor-
mance of examinees (or groups) who have taken different tests. To accomplish this goal,
we must make adjustments to the test scores (or group statistics) so that the resulting
differences in scores reflect differences in the examinees or groups. So, the adjustment we
seek in the test scores must only be a function of differences in the tests—unaffected by
the attributes of the group of examinees used to make the adjustment.
In practice, equating proceeds according to two steps. In the first step (known as raw
score–to–raw score equating), the equating function is derived that links raw scores on
the “new” test (X) to those of an “old” test (Y). Step 2 involves conversion of the newly
equated X-scores to the scale to be used for reporting.

11.13 Design I: Random Groups—One Test Administered to Each Group

The linear equating technique assumes that the only differences between the distribution
of scores on test X and test Y are the means and standard deviations. Linear equating
involves identifying equivalent scores by identifying pairs of scores on one form X and
one form Y that have identical z-scores. If the z-scores are identical, the percentile ranks
will also be the same for scores on tests X and Y. In the example that follows, we proceed
according to Design I in Table 11.5, where 200 examinees are randomly assigned to take
test forms X and Y. We use Equation 11.5 to transform X to Y*.
Next we consider an example application of Equation 11.5. The mean of test form
X for the test of crystallized intelligence (for group 1) is M = 34, and the standard devia-
tion is s = 8. Group 2 takes form Y of the crystallized intelligence test, and the summary
statistics are M = 36 and s = 8.5. Now we can apply Equation 11.5 to estimate how
a score of 33 on test X will equate to a score on test Y*.
Table 11.6.  Sample Statistics for Design II Equating Study
Sample statistics
Group Form M s
1 X 50.5 10.0
Y 52.0 9.5
2 X 48.5 10.5
Y 51.0 10.0
Note. Group 1: n = 100; Group 2: n = 100. Test administra-
tion is counterbalanced.

Equation 11.5. Transformation of X to Y*

Y* = (s_Y/s_X)(X − M_X) + M_Y

• Y* = equated Y-score.
• s_Y/s_X = slope of the conversion line (i.e., the standard deviation of Y divided by the standard deviation of X).
• X = score on test X.
• M_Y = mean of test Y.
• M_X = mean of test X.
• s_Y, s_X = standard deviations of tests Y and X.
Equation 11.6. Application of transformation of X to Y*

Y* = (8.5/8)(33 − 34) + 36 = −1.06 + 36 = 34.94

(The standard deviations 8.5 and 8 and the means 36 and 34 are the form Y and form X values given above.)
We see from application of Equation 11.6 that the equated Y*-score for an X-score of
33 is approximately 34.94. Equating procedures are affected by random error, so it is important to know
what the size of this error is to evaluate the accuracy of our equated scores. The standard
error of equating (Lord, 1980) is defined as the standard deviation of converted scores
on the scale of Y corresponding to a fixed value of X, in which each converted Y-score is
taken from a conversion line that results from an independent sampling of groups A and
B from a population that is normally distributed in X and Y. Equation 11.7 gives the
standard error of equating under Design I; Equation 11.8 illustrates its application for a
fixed X-score of 33, based on the 200 examinees randomly assigned to take test forms X and Y.
Finally, in addition to differences in means and standard deviations, if the two distri-
butions for test X and Y differ on their degree of either skewness or kurtosis (or both), the
linear method of equating is not appropriate to use. In this case, equipercentile equating

Equation 11.7. Standard error of equating for Y*: Design I

s²_Y* = 2s²_Y(z²_X + 2)/N_t

• s²_Y* = error variance of the equated Y-scores (its square root is the standard error of equating).
• s²_Y = variance of the Y-scores.
• N_t = total sample size based on both study groups.
• z²_X = square of the z-score corresponding to a particular score X.

Equation 11.8. Application of standard error of equating for Y*: Design I

For X = 33, z_X = (33 − 34)/8 = −.125 and z²_X = .016. Using s_Y = 8.5 (so s²_Y = 72.25) and N_t = 200:

s²_Y* = 2(72.25)(.016 + 2)/200 ≈ 1.46, so s_Y* ≈ 1.21
is more suitable because the technique makes no assumptions regarding the shape of the
score distributions for X and Y (i.e., the equipercentile method makes no assumptions
about the equality of the first four moments of each distribution; mean, standard devia-
tion, skewness, kurtosis).
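As a practical matter, once the conversion constants are known, the Design I equating function in Equation 11.5 can be applied to every examinee's form X score in a single step. A minimal SPSS sketch, using the means and standard deviations from the example above and a hypothetical variable name form_x_score, is:

*Design I linear equating sketch: place form X raw scores on the form Y metric.
COMPUTE form_y_equiv=(8.5/8)*(form_x_score - 34) + 36.
EXECUTE.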

11.14 Design II: Random Groups with Both Tests Administered to Each Group, Counterbalanced (Equally Reliable Tests)

Under Design II, application of the linear equating technique assumes that the only dif-
ferences between the distribution of scores on test X and test Y for the groups are the
means and standard deviations. Additionally, scores on both tests are assumed to be equally
reliable. In Design II (see Table 11.6), different groups of examinees take different forms
of the test in different randomly assigned orders. The goal of a Design II equating study
involves identifying equivalent scores by identifying pairs of scores on one form X and
one form Y that have identical z-scores (and percentile ranks). We use Equation 11.9a

Equation 11.9a. Transformation of X to Y* for counterbalanced equating study

Y* = A(X − M_Xp) + M_Yp,  where  A = √[(s²_Y1 + s²_Y2)/(s²_X1 + s²_X2)]

• Y* = equated Y-score.
• A = slope of the conversion line (i.e., the square root of the ratio of the pooled variance of the Y-scores to the pooled variance of the X-scores).
• X = score on test X.
• M_Yp = (M_Y1 + M_Y2)/2 = pooled mean of test Y for order administrations 1 and 2.
• M_Xp = (M_X1 + M_X2)/2 = pooled mean of test X for order administrations 1 and 2.
• s²_Y1, s²_Y2, s²_X1, s²_X2 = variances of the Y- and X-scores for order administrations 1 and 2.
to transform X to Y*. Note that the slope of the conversion line is different in Design II
than was the case in Design I. Specifically, in Design II, the pooled variance is calculated
for each test form for the two testing occasions. The square root of the ratio of these two
variances is used for the slope estimate.
To illustrate Equation 11.9a in an equating study under Design II, consider the case
where the following score distributions result from the test administrations. The sum-
mary statistics are provided in Table 11.6.
In this example, our goal is to calculate an equated score (Y*) for an X-score of 52.
Equation 11.9a applied in Equation 11.9b using an X-score of 52 accomplishes our goal.
The standard error of equating for Design II (Lord, 1980) is, as in Design I, defined
as the standard deviation of converted scores on the scale of Y corresponding to a fixed
value of X, in which each converted Y-score is taken from a conversion line that results
from an independent sampling of groups A and B from a population that is normally dis-
tributed in X and Y. Equation 11.10 illustrates the standard error of equating for a fixed
X-score of 52 based on Design II, where there are 200 examinees taking both test forms
A and B.
Application of Equation 11.10 is illustrated in Equation 11.11 using a fixed X-score
of 52.
In Equation 11.11, we see that the standard error of equating for Design II is much
smaller than was the case in Design I (although the distributions of X- and Y-scores were
not exactly the same). However, by incorporating counterbalancing as in Design II, a

Equation 11.9b. Transformation of X to Y* for counterbalanced equating study

A = √[(9.5² + 10.0²)/(10.0² + 10.5²)] = √(190.25/210.25) = .951
M_Xp = (50.5 + 48.5)/2 = 49.5    M_Yp = (52.0 + 51.0)/2 = 51.5
Y* = .951(52 − 49.5) + 51.5 = 2.38 + 51.5 = 53.88

Note. Be sure to use the proper order of operations when applying Equation 11.9a. For example, begin by solving the parts of the equation under the radical, then those in parentheses, before using addition and subtraction.
Equation 11.10. Standard error of equating for Y*: Design II

s²_Y* = s²_Y(1 − r_XY)[z²_X(1 + r_XY) + 2]/N_t

• s²_Y* = error variance in the Y equated scores.
• s²_Y = variance of the Y-scores.
• r_XY = correlation between scores on forms X and Y.
• z²_X = square of the z-score corresponding to a particular score X.
• N_t = sample size for both groups combined.

Equation 11.11. Application of standard error of equating for Y*: Design II

For X = 52, z_X = (52 − 49.5)/10.25 = .24 and z²_X = .06. Using the pooled variance of the Y-scores, s²_Y = (90.25 + 100)/2 = 95.125, and N_t = 200:

s²_Y* = 95.125(1 − r_XY)[.06(1 + r_XY) + 2]/200

Note. The correlation between forms X and Y is not reported for this example; if, for illustration, r_XY = .80, then s²_Y* ≈ .20 and s_Y* ≈ .45.

favorable reduction in error variance is usually achieved for equating X- and Y-scores as
compared to using Design I. In fact, if the sample size is the same for Designs I and II,
the standard error of equating will always be smaller. Additionally, the standard error of
equating will be substantially smaller when the two test forms are highly correlated (e.g.,
.80 or higher). From a practical perspective, this means that if you are using Design I,
you will need more examinees than when using Design II. Equation 11.12 (Angoff, 1984;
Crocker & Algina, 1986) illustrates the ratio of sample sizes required for Design I to
achieve the same equating accuracy as Design II when (a) the correlation between test
forms is .80 and (b) score X corresponds to a z-score of zero (i.e., 0).
Equation 11.12. Sample size requirement for creating equal accuracy between Designs I and II

N_A/N_B = 2(z²_X + 2)/{(1 − r_XY)[z²_X(1 + r_XY) + 2]}
        = 2(0 + 2)/{(1 − .80)[0(1 + .8) + 2]}
        = 4/.40
        = 10

That is, under these conditions Design I (sample size N_A) requires about 10 times as many examinees as Design II (sample size N_B) to achieve the same standard error of equating.

11.15 Design III: One Test Administered to Each Study Group, Anchor Test Administered to Both Groups (Equally Reliable Tests)

In equating Design III, each test or instrument to be equated is administered to different
groups of examinees. Design III differs from Designs I and II in that an anchor test is
administered to both groups. An anchor test is one that links the two groups of examin-
ees to the two test forms using a common set of test items. The common item set typically
includes about 30% of the number of items on each test form (e.g., on a 30-item test, the
anchor test will consist of 10 items). Anchor test U is included to adjust for differences
that may be found to exist between the two examinee groups taking forms A and B. The
anchor test is typically shorter than the length of the primary test forms A and B. Given
the introduction of an anchor test (U), random assignment of examinees to groups tak-
ing each test form is a requirement. Random assignment of examinees is critical under
Design III because under the randomized experimental design:

1. The slope, intercept, and standard error of estimate for the regression of X on U
in subgroup 1 are equal to the slope, intercept, and standard error of estimate for
the regression of X on U in the population.
2. The slope, intercept, and standard error of estimate for the regression of Y on U
in subgroup 1 are equal to the slope, intercept, and standard error of estimate
for the regression of Y on U in the population (Crocker & Algina, 1986, p. 460).

If random assignment of groups to test forms A and B is not possible, Design III
may still be used. However, the results obtained from applying the regression equations
under Assumptions 1 and 2 above must be evaluated prior to applying the method. For
example, the larger the discrepancy between the groups on the anchor test score (U), the
less likely the assumptions will hold. The results from such discrepancy will be inaccu-
rate score equating. Next, an example for Design III is provided.
Table 11.7.  Sample Statistics for Design III Equating Study

Group     Statistic    Form X    Form Y    Anchor U
1         M            50.5                51.00
          s            10.0                 9.00
          bXU           0.85
2         M                      48.5      49.00
          s                      10.5       9.50
          bYU                     1.25
Total     M                                50.00
          s                                 9.25
Note. M = mean; s = standard deviation; bXU = regression slope of X on U; bYU = regression slope of Y on U. Total represents the mean and standard deviation on the anchor test for both groups.

To illustrate equating Design III, consider the summary statistics for two tests of
short-term memory (Table 11.7).
Next we use Equation 11.13a (modified from Crocker & Algina, 1986, pp. 460–461)
to estimate a Y*-score for an X-score of 55.
Now we use the sample statistics in Table 11.7 and Equation 11.13a to solve for Y*
for a score of 55 in Equation 11.13b.

11.16 Equipercentile Equating

Equipercentile equating is an observed score technique that produces equivalent scores on
tests X and Y if their respective percentile ranks in any given group are equal. Because the
method is based only on aligning scores by their percentile ranks, equipercentile equating is

Equation 11.13a. Transformation of X to Y* for anchor test equating study

Y* = a(X − c) + d

• Y* = equated Y-score for score X.
• X = selected X-score.
• a = [s²_Y2 + b²_YU(s²_U − s²_U2)]/[s²_X1 + b²_XU(s²_U − s²_U1)], where the subscripts 1 and 2 denote groups 1 and 2, U denotes the anchor test, and s²_U (with no group subscript) is the total-group anchor variance.
• c = M_X1 + b_XU(M_U − M_U1), the group 1 mean on form X adjusted, via the regression of X on U, toward the total-group anchor mean M_U.
• d = M_Y2 + b_YU(M_U − M_U2), the group 2 mean on form Y adjusted, via the regression of Y on U, toward the total-group anchor mean M_U.
Equation 11.13b. Transformation of X to Y* for anchor test equating study

a = [s²_Y2 + b²_YU(s²_U − s²_U2)]/[s²_X1 + b²_XU(s²_U − s²_U1)]
  = [110.25 + 1.56(85.56 − 90.25)]/[100 + .722(85.56 − 81)]
  = (110.25 − 7.31)/(100 + 3.28) = 102.94/103.28 = .996

c = 50.5 + .85(50 − 51) = 49.65
d = 48.5 + 1.25(50 − 49) = 49.75

and

Y* = a(X − c) + d = .996(55 − 49.65) + 49.75 = 55.07

highly flexible. Since the technique does not require the assumptions for linear equating, the
equipercentile method is classified as a more general, nonlinear technique. For example, the
equipercentile method makes no assumptions about the equality of the means and standard
deviations of the two test score distributions. The equipercentile method is “more general”
and can accommodate score distributions that are nonlinear. When the assumptions of the
linear method are met, linear equating is a special case of the equipercentile method. The
primary shortcoming of the equipercentile method is that the standard error of equating is
larger than the standard errors based on the linear equating techniques previously presented.
Nevertheless, in some situations linear equating methods are inappropriate when the assump-
tions are untenable. In such cases, the equipercentile method provides a useful alternative.
To illustrate equipercentile equating, we use two test forms (Y and X) for crystallized
intelligence test 1. Each group of examinees takes only one form—Y or X. The sample
size is 500 examinees in each study group. The score distributions for each form are
normally distributed but have different means, though approximately the same standard
deviations. Specifically, the mean of test form X is 12, and the standard deviation is 5.4.
The mean of test form Y is 13 and the standard deviation is 5.5. Although the means and
standard deviations are similar, Table 11.8 reveals that the percentile ranks for the raw
scores are quite different at certain locations along the score scale, so we use this data to
illustrate the logic and steps in conducting equipercentile equating.
The first step in equipercentile equating is to determine the percentile ranks for the
score distributions for each of the two forms. Table 11.8 provides the distributions and
midpercentile ranks for forms X and Y.
Percentile rank raw score curves are illustrated for each test form in Figure 11.6.
Figure 11.7 illustrates the smoothed Y- and X-scores resulting from the equipercentile
equating method. Results were obtained using the RAGE_RGEQUATE program
(Kolen & Brennan, 2004). Equipercentile equating can also be conducted using the program

Table 11.8.  Midpercentile Ranks on Crystallized Intelligence Test 1, Forms Y and X

Score    Form Y    Form X
2 1 1
3 3 3
4 8 8
5 15 16
6 18 21
7 23 26
8 27 31
9 32 36
10 38 43
11 44 49
12 49 54
13 55 60
14 63 66
15 67 72
16 71 76
17 78 81
18 84 86
19 88 90
20 92 92
21 95 96
22 98 98
23 99 99
24 99 99
25 99 99

Figure 11.6.  Plot of percentile ranks for two 25-item tests of crystallized intelligence.


Figure 11.7.  Equated and smoothed Y- and X-scores. Pre- (loglinear) and postsmoothing (cubic-
spline) techniques were applied. The RAGE_RGEQUATE program is available from Dr. Michael
Kolen (www.uiowa.edu/˜c07p358).

EQUIPERCENT (Price, Lurie, & Wilkins, 2001). The program is available on the com-
panion website for this book (www.guilford.com/price2-materials) and is provided in both
the SAS and SPSS languages. The program uses the SAS and SPSS macro languages and can
process multiple versions of tests simultaneously. The program does not incorporate any
type of smoothing (presmoothing or postsmoothing). Smoothing algorithms are often used
to refine the equipercentile technique: in practice, either the raw score distributions are
presmoothed before equating or the equated scores are postsmoothed after equating has
been completed. In raw score presmoothing, the goal is to reduce some of the sampling
variability that raw score frequency distributions display. Common techniques include
loglinear presmoothing (von Davier, 2011) and cubic-spline postsmoothing (Kolen &
Brennan, 2004).
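As a complement to the programs mentioned above, the following minimal Python sketch (an illustration under simplifying assumptions, not the EQUIPERCENT or RAGE_RGEQUATE code) shows the unsmoothed logic of the method: compute midpercentile ranks for each form and then map each form X score to the form Y score holding the same rank by linear interpolation.

import numpy as np

def mid_percentile_ranks(freqs):
    # Midpercentile rank at each score point: 100 * (frequency below + half the
    # frequency at the score point) divided by the total number of examinees.
    f = np.asarray(freqs, dtype=float)
    below = np.cumsum(f) - f
    return 100.0 * (below + 0.5 * f) / f.sum()

def equipercentile_equate(scores, freq_x, freq_y):
    # Form Y equivalents of the form X scores (the same score points are assumed
    # on both forms); no pre- or postsmoothing is applied.
    pr_x = mid_percentile_ranks(freq_x)
    pr_y = mid_percentile_ranks(freq_y)
    # Invert the form Y percentile-rank curve: for each X rank, interpolate a Y score.
    return np.interp(pr_x, pr_y, scores)

# Hypothetical usage for two 25-item forms scored 0-25:
# scores = np.arange(0, 26)
# y_equivalents = equipercentile_equate(scores, freq_x, freq_y)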

11.17 Test Equating Using IRT

This section introduces you to test score equating using IRT. The information is intended
as an introduction to the IRT approach to equating and focuses on enabling you to
understand the concepts and advantages of using IRT for test score equating. For those
interested in a comprehensive treatment of the topics of IRT-based scaling, linking, and
equating of test scores, see Kolen and Brennan (2004) and von Davier (2011).
IRT posits that an underlying latent trait (e.g., a proxy for a person’s ability) can
be explained by the responses to a set of test items used to capture measurements on
some social, behavioral, or psychological attribute. The latent trait is represented as a
continuum (i.e., a continuous distribution) along a measurement scale. The Rasch and
one-, two-, and three-parameter logistic IRT models are frequently in use today. Equat-
ing can be conducted within any of these scaling models. Unidimensional IRT models

incorporate the working assumption of unidimensionality, meaning that responses to a
set of items are represented by a single underlying latent trait or dimension (i.e., the
items explain different parts of a single dimension). A second assumption of standard
IRT models is local independence, meaning that there is no statistical relationship (i.e., no
correlation) between persons’ or examinees’ responses to pairs of items on a test once the
primary trait or attribute being measured is held constant (or accounted for).
IRT provides a natural framework for scaling test responses and for equating scores on
different test forms. Equating test scores using IRT is possible for equating Designs I, II, and
III. However, IRT is particularly useful for Design III when random assignment to test forms is
not possible. When random assignment is employed (and the test forms are equally reliable),
linear and equipercentile equating techniques under Designs I, II, and III often provide accu-
rate and reliable results. Yet, certain shortcomings remain for the linear and equipercentile
methods. At this point, you may want to review Chapter 10 on IRT prior to reading this section to
refresh your knowledge of the assumptions, mechanics, and application of IRT.

IRT Equating and Scaling


When an IRT model fits the data, direct comparison of the ability parameters of examinees
who take different tests is possible because of the invariance property. Technically, equating
test scores is unnecessary in IRT because the item responses can be scaled so that the scores
on test forms are linked in a single step. However, it is essential that item and ability param-
eters derived from the two test forms are on the same scale. Therefore, in IRT, the task is to
properly scale the test responses rather than conduct an equating study (as described earlier
using the linear and equipercentile methods). However, to remain consistent with the ter-
minology in this chapter, we retain the term equating within an IRT context. Although there
are circumstances where the linear and equipercentile equating methods work very well, IRT
improves on these techniques in several ways. Central to understanding the advantages of IRT
in equating is the issue of equity. Lord (1980; Hambleton et al., 1991, p. 125) detailed the cen-
tral issues in equating relative to equity and the equating exercise within an IRT framework.

1. Tests that measure different traits cannot be equated.
2. Raw scores on unequally reliable tests cannot be equated (e.g., because equating
scores from a reliable test with scores from an unreliable test obviates the need
for equating at all).
3. Raw scores on tests with varying levels of difficulty cannot be equated because the
tests will not be equally reliable at different levels of ability.
4. Fallible scores on tests X and Y cannot be equated unless the tests are strictly parallel.
5. Perfectly reliable tests can be equated.
6. It should not matter whether test X or test Y is the reference test (i.e., the two tests
are symmetrical).
7. The property of invariance requires that the equating process be sample independent.

You can see from the points above that the linear and equipercentile methods often fall
short on several of these points. IRT offers a framework for improving the equating exer-
cise by addressing many of the above issues.
In Chapter 10 comparisons of CTT and IRT were presented. Arguably, the most
important difference between the two theories and the results they produce is the prop-
erty of invariance. In IRT, invariance means that the characteristics of item parameters
(e.g., difficulty and discrimination) do not depend on the ability distribution of exam-
inees, and conversely, the ability distribution of examinees does not depend on the item
parameters. The CTT item indexes introduced in Chapter 7 included the proportion of
examinees responding correctly to an item (i.e., proportion-correct) and the discrimination
of an item (i.e., the degree to which an item separates low- and high-ability examinees).
In CTT, these indexes change in relation to the group of examinees taking the test (i.e.,
they are sample dependent). However, when the assumptions of IRT hold and the model
adequately fits a set of item responses (i.e., either exactly or as a close approximation), the
same IRF/ICC (item response function/item characteristic curve) for the test items is observed
regardless of the distribution of ability of the groups used to estimate the item parameters.
For this reason, the IRF is invariant across populations of examinees. This situation is
illustrated in Figure 11.8.
Practically speaking, the invariance property ensures that examinees who respond to
different items on different test forms for which the item parameters are known will have abil-
ity estimates on the same scale (i.e., person ability estimates are linked). As presented in
Chapter 10, person ability and item parameter estimates are unknown and must be esti-
mated when using IRT. The property of invariance of the item response function relative
to linear transformations introduces an indeterminacy of the scale of ability. Indeterminacy
of the scale of ability occurs because estimation of the person ability parameter involves
an equation with two unknowns (i.e., a and b, the slope and intercept of a line). The result
is that, during parameter estimation using a set of item response data, the
item and ability parameter estimates cannot be uniquely determined using maximum
likelihood estimation (see Chapter 10 and the Appendix for more detail). A solu-
tion to this challenge is to set (i.e., standardize) either (a) the person ability estimates
(θ̂), for example, to a mean of 0 and a standard deviation of 1, or (b) the item difficulty
parameters to a mean of 0 and a standard deviation of 1. As previously stated, item and
ability parameters are invariant only up to a linear transformation (i.e., item and ability
parameter estimates of the same items and the same examinees will be linearly related in
two groups; Hambleton et al., 1991).
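The meaning of "invariant up to a linear transformation" can be seen in a minimal Python sketch (a simplified illustration using the one-parameter logistic model and arbitrary values). Because the 1PL response probability depends only on the difference between ability and difficulty, shifting both by the same constant leaves the item response function unchanged.

import math

def p_correct(theta, b):
    # 1PL item response function: probability of a correct response.
    return 1.0 / (1.0 + math.exp(-(theta - b)))

theta, b = 0.75, -0.40
shift = 0.50                                # an arbitrary relocation of the scale origin
print(p_correct(theta, b))                  # probability on the original scale
print(p_correct(theta + shift, b + shift))  # identical probability on the shifted scale

For the two- and three-parameter models, a general linear transformation of the ability scale (multiplying by a constant A and adding a constant B) must be accompanied by dividing the discrimination parameter by A for the response probabilities to remain unchanged.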
Next, we turn to an example application of IRT equating using a linear transfor-
mation for relating person ability estimates from two groups of examinees taking two
test forms (X and Y). The sample size for the two examinee groups is N = 500. For
simplicity of explanation, the following illustration assumes that item parameters and
person ability estimates are based on the one-parameter logistic model. For details
about placing the item parameters and person ability estimates on the same scale
for the two-parameter IRT model, see de Ayala (2009) and Hambleton et al. (1991,
p. 127).


Figure 11.8.  Invariance of item response function across different ability distributions. A test
item has the same IRF/ICC regardless of the ability distribution of the group. For an item location/
difficulty of 0.0, the low-ability group will be less likely to respond correctly because a person in
the low-ability group is located at −1.16 on the ability scale whereas a person in the high-ability
group is located at 0.0 on the ability scale.

Equating Ability Score Estimates: Comparing Different Examinees Who Take Different Forms of the Same Test
Consider the situation where two groups of examinees take two different forms of a test
composed of different items but measuring the same construct (e.g., forms X and Y).
Furthermore, an anchor test or common set of items is embedded in each test form, so that
all examinees take these common items. In conducting our IRT scaling, if we standardize
on ability (as described in the previous section), two examinees (each taking a different
test form) will have the same ability, but their scores will be on different scales. Next,
say examinees respond to the same test item of the same difficulty (recall that the item

difficulty parameter in IRT was expressed as b). In this case, two examinees of the same
ability have responded to the same item, but the scale of measurement is different for
each examinee group (i.e., because we chose to standardize ability at the outset of our
IRT analysis rather than standardize on item difficulty). We can compute the difference
between the item difficulty parameter estimates for group 1 and group 2 (i.e., b1 − b2 for
each anchor item) to obtain a scaling or adjustment factor for placing the scores from the
two groups on the same scale (Hambleton et al., 1991; Crocker & Algina, 1986). The
adjustment or scaling factor m is derived by averaging these differences over all of the
anchor items (or all of the common items on the two tests). Table 11.9 illustrates this
scenario for a test composed of 15 items (with 5 anchor items) taken by two groups of examinees.
Next, Table 11.10 shows the estimated ability scores (standardized to a μ = 0,
σ = 1 metric) for examinees taking test forms X and Y.
Finally, Table 11.11 provides the equated ability scores for forms X and Y after apply-
ing the adjustment calculated in Table 11.9 (the averaged difference of 0.112). For
example, consider an ability score on form X of 1.48 (for a raw score of 14). The equated
ability score for the same raw score of 14 is 1.41, obtained by applying the adjustment to
the form Y ability score (i.e., 1.30 + .112 = 1.41).
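The adjustment can be sketched in a few lines of Python (illustrative only; the anchor-item difficulties are taken from Table 11.9 and the ability estimates from the first rows of Table 11.10).

import numpy as np

# Anchor-item difficulty estimates from the two separate calibrations (Table 11.9, items 6-10).
b_group1 = np.array([0.05, 0.08, 0.10, -0.02, 0.17])
b_group2 = np.array([-0.07, -0.02, -0.02, -0.10, 0.03])

# Adjustment (scaling) factor: the average difference in anchor-item difficulties.
m = np.mean(b_group1 - b_group2)                 # 0.112

# Place form Y (group 2) ability estimates on the form X (group 1) scale by adding m.
theta_form_y = np.array([-2.00, -1.79, -1.36])   # first three entries of Table 11.10
theta_equated = theta_form_y + m                 # -1.89, -1.68, -1.25, as in Table 11.11
print(round(m, 3), np.round(theta_equated, 2))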

11.18 IRT True Score Equating

Recall that our goal in equating test scores using IRT is to equate scores on two uni-
dimensional tests that measure the same person ability (θ). In IRT, an examinee's true

Table 11.9.  Item Difficulty Estimates for Groups 1 and 2

Item    Group 1    Group 2    Difference (b1 − b2)
1       −3.13      −6.91
2       −2.39      −1.48
3       −0.68      −0.66
4       −1.54      −0.63
5       −0.32      −0.21
6        0.05      −0.07      0.12
7        0.08      −0.02      0.10
8        0.10      −0.02      0.12
9       −0.02      −0.10      0.08
10       0.17       0.03      0.14
11       0.66       0.24
12       1.16       0.90
13       1.21       0.97
14       1.68       1.18
15       2.00       1.87
                               Average difference: 0.112
Note. Items 6–10 (shaded rows in the original) are the anchor items; the difference is computed for anchor items only.

Table 11.10.  Ability (Latent Trait) Scores by Group

Raw score    Group 1 (form X)    Group 2 (form Y)
1 −3.00 −2.00
2 −1.84 −1.79
3 −1.46 −1.36
4 −1.15 −1.01
5 −0.76 −0.73
6 −0.48 −0.49
7 −0.22 −0.27
8 0.00 −0.07
9 0.23 0.13
10 0.45 0.32
11 0.68 0.53
12 0.92 0.76
13 1.19 1.00
14 1.48 1.30
15 1.84 1.60

Table 11.11.  Raw Score to Equated Latent Trait (Ability) Score Conversion Table

Raw score    Group 1 (form X)    Group 2 (form Y, equated)
1 −3.00 −1.89
2 −1.84 −1.68
3 −1.46 −1.25
4 −1.15 −0.90
5 −0.76 −0.62
6 −0.48 −0.38
7 −0.22 −0.16
8 0.00 0.04
9 0.23 0.24
10 0.45 0.43
11 0.68 0.64
12 0.92 0.87
13 1.19 1.11
14 1.48 1.41
15 1.84 1.71

score is estimated based on a mathematical model reflecting the relationship between an
examinee's ability and his or her response to a test item or set of items. Recall that fitting
an IRT model to a set of test data yields an observed score pattern of item responses of
examinees. To achieve our goal in true score equating, we must first transform exam-
inees’ observed scores to true scores. Once we derive true scores, equating using IRT
provides a defensible framework for meeting the stringent requirements of (1) equity,
(2) invariance, and (3) symmetry. In this section, an introduction to true score equating
is provided using the test characteristic curve (TCC) method (Stocking & Lord, 1983).
True score equating based on Stocking and Lord’s method can be implemented using the
program EQUATE (Baker, 1990) or IRTEQ (Han, 2008). The IRTEQ program can also
conduct IRT true score equating using the method attributed
to Haebara (1980). The TCC method uses information from item parameter estimates
(i.e., item discrimination and difficulty) to equate true scores. Other methods of true
score equating are possible. Interested readers should see Hambleton et al. (1991) for
details.

11.19 Observed Score, True Score, and Ability

Two true scores are considered equivalent if they correspond to the same ability score (θ).
For example, a number-correct or raw score of 7 on test form X may be equivalent to a score
of 8 on test form Y. The observed score (or number-correct score) is an unbiased estimator of
true score (i.e., e(X) = T). Because the test characteristic curve is the nonlinear regression of
true score on ability, ability and true score are related by a monotonically increasing function
(i.e., the sum of the item response functions or item characteristic curves). For this reason,
true score can be mapped onto the number-correct scale. Also, transformation of the ability
scale from a μ = 0, σ = 1 metric to a number-correct score facilitates the interpretability of
results. If so desired, the number-correct score can be divided by the number of items on a
test to yield a proportion-correct score (e.g., sometimes used on criterion-referenced tests).
Alternatively, for normative scores, other score transformation metrics are often employed
(e.g., an IQ metric of μ = 100, σ = 15 or a GRE score metric of μ = 500, σ = 100).
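For example, converting an ability estimate from the μ = 0, σ = 1 metric to a reporting metric is a simple linear rescaling, as in the following minimal sketch (the target means and standard deviations are the conventional values just mentioned):

theta = 1.0                       # ability estimate on the mu = 0, sigma = 1 metric

iq_style = 100 + 15 * theta       # IQ-type metric (mu = 100, sigma = 15) -> 115.0
gre_style = 500 + 100 * theta     # GRE-type metric (mu = 500, sigma = 100) -> 600.0
print(iq_style, gre_style)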
In IRT, the probability of a correct response is given by the item response function;
using this information, we can insert estimates of these probabilities obtained by fitting an
IRT model to a set of examinee response data. The implication is that we can apply a linear
transformation to person ability (θ) and the item parameters without changing the probability
of a correct response to an item. Transformation of the ability scale in IRT to the true score
scale is based on the sum of the item characteristic curves (Equation 11.10). For a stepwise
presentation of the transformation of the ability scale to the true score scale beginning with
the number-correct raw score, see Hambleton et al. (1991, p. 84).
Continuing with our true score equating example, we can use the item difficulty
parameter estimates in Table 11.9 in Equation 11.14 to calculate the number-correct true
score on tests Y and X for a person ability (θ) of 1.0. Equation 11.15 illustrates this step.

Equation 11.14. True score scale reflected as a test characteristic curve

T = Σ Pj(θ), summed over the j = 1, . . . , n items on the test

• T = true score (the test characteristic curve): the sum of the item response functions/item characteristic curves.
• Pj(θ) = probability of a correct response to item j given person ability θ.
• θ = person ability.

Equation 11.15. Number-correct true scores on tests X and Y for a person ability of 1.0

TX = e^(1.0 − b1)/[1 + e^(1.0 − b1)] + . . . + e^(1.0 − b15)/[1 + e^(1.0 − b15)] = 6.14,

substituting the group 1 (form X) item difficulty estimates from Table 11.9, and

TY = e^(1.0 − b1)/[1 + e^(1.0 − b1)] + . . . + e^(1.0 − b15)/[1 + e^(1.0 − b15)] = 8.32,

substituting the group 2 (form Y) item difficulty estimates from Table 11.9.

Figure 11.9.  Relationship between ability and the true scores on two tests.

Figure 11.9 depicts the equated true scores (X = 6.14 and Y = 8.32) based on a
person ability (θ) of 1.0 using the TCC method of score equating.
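The TCC computation in Equations 11.14 and 11.15 can be sketched in a few lines of Python (a simplified illustration assuming the one-parameter logistic model with no scaling constant; the numerical result depends on the parameterization and on the calibrated difficulties supplied).

import numpy as np

def true_score(theta, difficulties):
    # Test characteristic curve: the expected number-correct (true) score at ability theta,
    # computed as the sum of the 1PL item response functions.
    b = np.asarray(difficulties, dtype=float)
    p = np.exp(theta - b) / (1.0 + np.exp(theta - b))   # P_j(theta) for each item
    return p.sum()

# Hypothetical usage with the form-specific difficulty estimates (e.g., from Table 11.9):
# t_x = true_score(1.0, b_form_x)
# t_y = true_score(1.0, b_form_y)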

11.20 Summary and Conclusions

This chapter introduced two types of scores used in psychological measurement and test-
ing. Examples were provided regarding the transformation of raw scores to standard
scores (including scale scores) and the advantages standard scores provide in commu-
nicating the results of test scores within the context of testing in general. Next, norms
were defined, and the specifics of planning a norming study were provided. The role of
the normal distribution was explained in relation to deriving and using normalized scale
scores.
Test score equating involves establishing scores that are equivalent (based on the
condition of exchangeability) on different tests or measurement instruments. Three types
of equating were discussed: (1) linear, (2) equipercentile, and (3) IRT-based or latent
trait. The distinction between score linking and equating was described. Score linking
was described as the transformation from a score on one test to a score on another test,
whereas score equating was defined as a special type of score linking with additional req-
uisite assumptions about equity, symmetry, and invariance of scores.
This chapter is foundational to understanding (1) types of score distributions and
their transformations to standard score metrics in a way that aids in communicating test
score results, (2) planning a norming study in a way that yields meaningful normative

scores, and (3) different designs and score transformation (linking) techniques for test
score equating. The exercise of developing standard scores or standardized scale scores
(e.g., norms) involves careful planning and execution of a norming study. Similarly, plan-
ning and conducting an equating study involves selecting the appropriate design and
score transformation technique to ensure that the resulting equated scores are appropri-
ate for their intended use. The material and examples in this chapter are good preparation
for designing and implementing a norming or horizontal equating study.

Key Terms and Definitions


Anchor test. A test that links the two groups of examinees to the two test forms using a
common set of test items.
Composite score norms. Norms created by summing the total scores (not individual
items) for several subtests. For example, subtest total scores may be summed to create
a composite score for each person on a test.
Equating. Establishing scores that are equivalent on different tests or measurement
instruments.
Equating function. A transformation of raw scores on test X to the scale of raw scores on
test Y (or Y to X).
Equipercentile equating. A nonlinear, observed score equating technique that produces
equivalent scores on tests X and Y if their respective percentile ranks in any given
group are equal.
Horizontal equating. Occurs when multiple forms of a test with the same measurement
goal are administered during the course of a single year. Administering multiple
versions of the same test during the year may result in different score distributions
because different examinees take different versions of the test. If the distributions of
scores are not the same, a common practice is to establish equivalent scores on the
test forms.
Local norms. Norms for a specific group of examinees where generalization to outside
groups is not conducted.
Nonequivalent anchor test design. Common population methods include random
assignment of examinees to test forms, whereas the anchor test method incorporates
a common set of items that both examinee groups take. The groups are not necessar-
ily randomly assigned to take the different test forms (in fact, random assignment is
usually not used).
Normative sample. A sample of examinees whose performance is analyzed and then
used as a reference for other individual persons taking the test.
Norming. The process of creating norms based on a normative sample.

Norm-referenced testing. A method for evaluating and interpreting an examinee's score
by comparing it to the scores of other examinees on the same test.

Norms. Test performance data of a group of examinees used as a reference for evaluat-
ing, interpreting, or placing in context individual persons' test scores (Cohen & Swerdlik,
2010, p. 111).
Percentile rank scale. A type of normative scale that provides the percentage of exam-
inees in a specific group scoring below the midpoint of each score or score interval.
Raw score scale. A score metric that has no meaning without supporting data that trans-
lates into meaningful information.
Scale aligning. The goal is to transform the scores from two different tests onto a common
scale.
Scale scores. A score scale with a specified metric that facilitates explanation of exam-
inee performance relative to a reference group (also known as derived scores).
Standard score. A raw score converted from one scale to another scale, where the latter
scale employs an arbitrary mean and standard deviation.
Test score linking. The transformation from a score on one test to a score on another test.

Unadjusted linear transformation. Relocating the raw score mean at the desired scale
score location in a way that ensures a uniform change in the size of the score units to
yield the desired scale score standard deviation. Only the mean and standard devia-
tion of the raw score distribution are changed.
Vertical equating. A test of reading achievement used at different levels of a child’s
development or possibly at different grade levels. If a child is formally classified or
enrolled in a particular grade level (e.g., based on his or her age) but is tested at
the previous grade level because of lagging progress, the child’s score at the grade
or developmental level in which he or she is tested can be equated to scores at the
actual grade level at which he or she is enrolled.
Appendix

Mathematical and Statistical Foundations

A.1 Contemporary Goals of Psychological Measurement

Measurement refers to rules for assigning numbers to objects. Researchers are able to
represent quantities of attributes numerically (through scaling) or to determine whether
objects fall into the same or different categories given a particular attribute (classifica-
tion). Although numerical scaling has dominated psychometric methods in the past,
innovations in software and increased computing technology (speed and power) have
opened analytic possibilities previously unrealizable. However, before capitalizing on the
new analytic possibilities afforded by computers and software, a review and update of
the mathematical and statistical foundations related to psychometric methods is essential
and is, therefore, the motivation for this Appendix.
A primary goal of psychometric methods is the measurement and scaling of attri-
butes. Attributes are identifiable qualities or characteristics represented by either numer-
ical elements or categorical classifications of objects of interest that can be measured.
During the process of scale or instrument development, careful consideration regarding
what terms define or constitute an attribute of an object is a crucial step. For example,
different words may mean different things to different people within or between different
cultures. Consider the case of the construct of intelligence and the manner in which it
has been defined and extensively used within a particular theoretical framework in the
United States. This theoretical framework is often inaccurate or lacks evidence of valid-
ity for people residing in other nations or even in the same nation! Nevertheless, given a
particular theoretical framework, the attributes that represent intelligence are evaluated
by examining the relationships among variables. The variables are mapped onto a specific
theoretical dimension using measurement operations and/or protocols. In this way, mea-
surements obtained theoretically reflect one unitary attribute.


A second goal of psychometrics focuses on the scaling of objects (e.g., people) into
classification schemes related to their preferences. Such outcomes are often based on a
person’s preference for certain products or services. A third goal is the measurement and
scaling of a person’s physiological–psychological response or threshold to a stimulus as
in a sensory perception measured by psychophysical scaling. Louis Thurstone’s law of
comparative judgment constituted the seminal work in this area by linking the stimulus of
objects (the psychophysical tradition) onto linear scales that tap such areas as sociability,
affective values, and the quality of written constructed responses to questions. Thurstone
is also credited with originating the mental testing tradition within psychometric methods.
Figure A.1 provides a taxonomy of psychometric methods from the 18th century forward.

A.2 Precision, Objectivity, and Communication

The elements precision, objectivity, and communication provide a continuous framework
for linking the measurement process to measurement methods (Figure A.2). Taken
together, precision, objectivity, and communication provide an interrelated framework
relative to the overall goals of psychological measurement.
Precision characterizes the degree of mutual agreement among a series of individual
measurements on things such as traits, values, or attributes. The degree to which preci-
sion exists in a measurement is verified empirically through evidence of reproducibility or
repeatability—that is, the degree to which further measurements on the same attribute

are the same or highly similar. Repeatability is one characteristic assessed by indexes of
measurement reliability (a topic covered in Chapter 7). Measurement precision (or reliability)
is not to be confused with accuracy, which is the degree of conformity or agreement a
quantity exhibits in relation to its actual (true) value. Accuracy of measurement is assessed
by quantitative evidence that is summarized by indexes of validity (the topic covered in
Chapters 3 and 4).

Figure A.1.  Taxonomy of psychometric methods: the statistical (biometry and sociometry; Quetelet), psychophysical (measurement of sensory experience; Weber and Fechner), mental measurement (statistics and test/scale development; Galton, Pearson, and Thurstone), and experimental psychology (individual differences in stimulus and response) traditions share common ground in psychological scaling methods. Data from Hald (1998).

Figure A.2.  Integral components in psychometric methods.
As an example, consider a set of scores obtained from a sample of 12th-grade students
on two parallel forms of a test designed to measure knowledge of mathematical concepts. An
examination of the responses on the two forms yielded a relationship such that those students
scoring high on form A also scored high on form B. Similarly, those students scoring low on
form A also scored low on form B. Thus, repeatability or consistency is exhibited between
forms A and B. Such repeatability is also known as score reliability. Similarly, consider the case
where a researcher is interested in whether two different instruments of the same length and
format measure severe clinical depression in a consistent and repeatable manner. A sample of
persons exhibiting severe clinical depression responds to the items comprising the two instru-
ments. An analysis of the responses on the two instruments demonstrated that those patients
scoring in the top quartile on the first instrument scored below the 50th percentile on the
second instrument. In this case, repeatability or consistency was not exhibited between the
two instruments designed to measure severe clinical depression.
Importantly, if a set of scores lacks repeatability or reliability, they provide no use-
ful information because there is no way for researchers to make inferences related to an
individual’s ability, achievement, attitude, or other attribute. The results of a measure-
ment can exhibit accuracy but lack precision, or they may lack precision but be accurate.
Evidence for accuracy and precision of obtained scores or classifications exists when the
outcomes of a measurement method or process demonstrate that the numerical score or
classification represents what it was theoretically intended to represent. Further, empiri-
cal evidence should be available to support the elements of accuracy and precision. When
accuracy and precision exist in the measurement process, at least one piece of evidence
for validity of the scores is substantiated (AERA, APA, & NCME, 1999; Kane, 2006).
Evidence for the objectivity of a particular scaling method (and the resulting scores
obtained) is demonstrated by the independent replication of results using a specific

measurement method by different researchers. For example, the consistency between
results obtained from two researchers working independently but using the same mea-
surement protocol under the same conditions can be statistically tested and then evalu-
ated. Specifically, objectivity within a psychometric context is a property of measurement
that can be tested independently from the individual researcher who proposes or develops
them. For the results of a measurement method to be objective, results must be commu-
nicated from person to person, and then be demonstrated for third parties. These charac-
teristics of communication advance understanding the nature of the world as objectively
as possible regardless of the research context. Therefore, a lack of independent replication
indicates a lack of objectivity. Indeed, establishing objectivity through measurement is a
major, if not the major, challenge for quantitative psychologists, psychometricians, and
behavioral scientists regarding their contribution to the advancement of knowledge.
Closely related to objectivity is communication. Two key elements of communica-
tion are efficiency and clarity. Efficiency in communication is defined as simple but com-
plete transfer of information, whereas clarity is defined as accuracy in the information.
Efficiency and clarity provide researchers a common set of linguistic tools so that impor-
tant discussions may occur when they are working on the same problem. For example,
a principal objective of any field of study is to establish, through theories, general prin-
ciples by means of which empirical phenomena can be explained or predicted.

A.3 Rules of Correspondence

A systematic way to enhance communication is to use rules of correspondence (Carnap,
1950). Rules of correspondence connect theory to empirical data by defining theoretical
constructs (e.g., intelligence) in terms of observable data (Figure A.3). Rules of corre-
spondence, which allow for the assignment of numbers to quantities, provide a natural
framework for using operational or epistemological definitions. Communication is fur-
ther enhanced by using standardized measures and scaling and/or classification proce-
dures that allow for clear and concise comparisons between different research and/or
measurement problems. For example, in large-scale research of ability and achievement,
derivation of normative scores for each respective grade level is essential owing to the
rapid developmental changes that occur in children. The development and use of stan-
dardized scale scores is an excellent example of how measurement and mathematics are
used to meet this challenge. The process of using scale scores is called standardization
and involves prescribed rules such as clarity, practicality, ease of administration (i.e., not
necessarily requiring a high level of training or skill), and score or classification results
that are independent of the examiner.
Figure A.3 (introduced in Chapter 1) illustrates how rules of correspondence are
applied to map three subcomponents (e.g., fluid, crystallized, and short-term memory) of
generalized intelligence (Flanagan et al., 2000; Carroll, 1993; Cattell, 1943; Hebb, 1942)
onto measurement space. The literature on the generalized theory of intelligence is sub-
stantial and is not covered here; instead, only a brief overview is provided. In G theory,
there are two major subcomponents labeled as fluid intelligence (Gf) and crystallized
intelligence (Gc). Fluid intelligence is defined as process oriented and crystallized intelligence
as knowledge or content oriented. Additionally, a general memory component is recognized
as part of the generalized theory of intelligence. In Figure A.3, the GfGc classification
scheme is represented by a set of cognitive, affective, and conative (i.e., the connection of
cognition and affect) trait complexes, along with the development of domain knowledge.

Figure A.3.  Rules of correspondence applied to GfGc intelligence theory. General intelligence (G) comprises fluid intelligence (Gf), crystallized intelligence (Gc), and short-term memory (Stm); each is measured by its respective tests (e.g., fluid intelligence tests 1–3), which are in turn composed of individual items.

A.4 Theoretical Model and Data for this Text

Three components of the theory of generalized intelligence—fluid (Gf), crystallized (Gc),
and short-term memory (Gsm)—are used in examples throughout the book to provide
connections between a theoretical model and actual data. The related dataset includes
a randomly generated set of item responses based on a sample size N = 1,000 persons.
The data file is available in SPSS (GfGc.sav), SAS (GfGc.sd7), or delimited (GfGc.dat)
file formats and is downloadable from the companion website (www.guilford.com/
price2-materials).
In GfGc theory, fluid intelligence is operationalized as process oriented and crys-
tallized intelligence as knowledge or content oriented. Short-term memory is com-
posed of recall of information, auditory processing, and mathematical knowledge (see
Table A.1). In Figure A.3, the GfGc model is represented by a set of cognitive, affec-
tive, and conative (i.e., the connection of cognition and affect) trait complexes, along
with the development of domain knowledge. In Figure A.3, the small rectangles on
the far right represent individual items, which are summed to create linear composites
represented as the second, larger set of rectangles. The ovals in the diagram represent
latent constructs as measured by the second- and first-level observed variables. Table
A.1 (introduced in Chapter 1) provides an overview of the subtests, level of measure-
ment, and descriptions of the variables for a sample of 1,000 persons or examinees in
Figure A.3.

A.5 Variables and Their Application

The tasks specific to psychological measurement are varied and often challenging.
Examples of tasks in psychological measurement include but are not limited to (1)
developing normative scale scores for measuring intelligence and short-term memory

Table A.1.  Subtests in the GfGc Dataset

Subtest                                     Name of subtest                   Number of items    Scoring
Fluid intelligence (Gf)
Quantitative reasoning—sequential Fluid intelligence test 1 10 0/1/2
Quantitative reasoning—abstract Fluid intelligence test 2 20 0/1
Quantitative reasoning—induction and deduction Fluid intelligence test 3 20 0/1
Crystallized intelligence (Gc)      
Language development Crystallized intelligence test 1 25 0/1/2
Lexical knowledge Crystallized intelligence test 2 25 0/1
Listening ability Crystallized intelligence test 3 15 0/1/2
Communication ability Crystallized intelligence test 4 15 0/1/2
Short-term memory (Gsm)      
Recall memory Short-term memory test 1 20 0/1/2
Auditory learning Short-term memory test 2 10 0/1/2
Arithmetic Short-term memory test 3 15 0/1
Note. Scaling key: 0 = no points awarded; 1 = 1 point awarded; 2 = 2 points awarded. Sample size is N = 1,000.

ability across the lifespan, (2) developing a scale accurately reflecting a child’s reading
ability in relation to his or her socialization process, and (3) developing scaling mod-
els useful for evaluating mathematical achievement. Often these tasks are complex and
involve multiple variables interacting with one another. This section provides the defini-
tion of a variable, including the different types and the role they play in measurement
and probability.
To begin, consider how individual test items in each respective subtest in Figure A.3
and Table A.1 are used to acquire specific information from persons. Person responses
to individual items are summed to create a total test (also called a subtest in models
such as Figure A.3) score for each person in a sample of persons. The resulting sum of
a collection of items is known as a total score or linear composite. Linear composites
are depicted as the large rectangles in Figure A.3, labeled by test or subtest name. Any
scaling model that produces reliable and accurate scores on measures representing con-
structs can be used to study causal relationships with different constructs defined in
other theoretical models. For example, Figure A.3 can be expanded to include posited
relationships of the Gf and Gc components with other constructs such as personality or
educational achievement.
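The idea of a linear composite can be made concrete with a short Python sketch (the item matrix below is randomly generated for illustration and is not the GfGc dataset; its 10-item, 0/1/2 structure mirrors fluid intelligence test 1).

import numpy as np

rng = np.random.default_rng(seed=1)
items = rng.integers(0, 3, size=(1000, 10))   # 1,000 persons by 10 items scored 0/1/2

# Linear composite (total or subtest score): the sum of each person's item scores.
total_score = items.sum(axis=1)
print(total_score[:5], total_score.min(), total_score.max())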
A variable is a measurable factor, characteristic, or attribute of an individual, system,
or process. Variables represent something that varies between individuals on a quality or
characteristic that can assume two or more different values. For example, the individual
test items (i.e., small rectangles) comprising a subtest in Figure A.3 are variables. Simi-
larly, the linear composites (i.e., subtests) in the figure are also variables. In mathematical
statistics, random variables are measurable functions that are classified as discrete or
continuous. A discrete random variable takes values from a countable set of specific val-
ues, each with some probability greater than zero (Probstat, n.d.). For example, in Figure
A.3 a discrete random variable is the sum of a set of individual item scores acquired from
a randomly sampled group of persons. Note that these are the scores actually observed
rather than all scores that are theoretically possible based on the model in Figure A.3. The
sum of items labeled fi1 item 1 through fi1 item 10 yields the subtest or test total score
labeled fluid intelligence test 1. For discrete variables, each score within the set of test
scores takes on a value from a countable set of actually observed values or scores—each
with some probability greater than zero. Conversely, a continuous random variable takes
on values from a theoretically uncountable or unlimited set. Therefore, the probability of
a single value is zero, but the probability of a set of values is greater than zero. Using a set
of scores, we can model the probability of scores with continuous distributions based
on random samples of persons from populations.
Random variables are measurable functions obtained from probability spaces (i.e., a
theoretical distribution or density) that are then mapped onto a measurable space. The
measurable space is composed of the actual observations (i.e., sample space) of interest in
a study. The observations are assigned a probability distribution based on their behavior
or shape. In this way, probability theory provides a crucial link in the development of
statistical and psychometric models under varying conditions. For example, a researcher
may be interested in the fluid intelligence scores on the quantitative reasoning subtest

2 in Figure A.3 based on a sample of persons exhibiting clinical depression disorder at
a single point in time. In this case, the researcher is using measurement and probability
within a cross-sectional research design. Another example is dynamic or longitudinal,
meaning that the focus of the investigation is the behavioral change of depressed persons
over time based on a behavioral intervention. In the longitudinal case, scores on the quan-
titative reasoning subtest 2 in Figure A.3 can be measured over time and then analyzed
for change. A constant is a specific unchanging number. Once defined, the value that a
constant assumes within a particular psychometric model or analysis does not change.
Examples of constants include sex/gender of subjects, the intercept (i.e., the mean) in a
linear regression equation, the grade level of a group of subjects, ethnicity, occupation
level, and region of the state or world. In the case where a variable is continuous (e.g., on
at least an ordinal level of measurement), commonly used examples of constants include
linear and nonlinear transformations to change score distributions.
Traditionally, in experimental research the independent (e.g., a predictor) vari-
able (X) is one that is under some form of direct manipulation by the researcher. In
nonexperimental research this is not the case. The dependent (i.e., the criterion or
outcome) variable (Y) is influenced by the independent variable. Dependent variables
are either directly observable or indirect (i.e., unobservable or latent) characteristics of
human behavior or a response to stimulus of some type. Using derived variables within
the context of psychometric scaling models involves examining associational patterns
and distributional characteristics of variables using both observable and latent unidi-
mensional (i.e., a single construct) and multidimensional (i.e., multiple constructs in a
single model) approaches.

A.6 Types of Data

Categorical (Discrete) Data


As mentioned in Chapter 1, data represented by exclusive categories are discrete. For
example, the numbers of cases within variables such as ethnicity, gender, or geographic
region are defined as frequency counts of events existing within certain classes or cat-
egories. Other examples include the number of infectious disease outbreaks within a
certain region of the world, classification categories of judgment or preference, or the
number of military personnel exhibiting posttraumatic stress disorder after returning
from combat. In education and social science, examples include the number of children
within a particular social class who do not successfully move from one grade to the next,
or the occurrence of attention-deficit disorder or autism in children living in certain
regions of a particular country. Finally, one well-known example of classification from
biology involves organizing plant and animal life into species, genus, and order.
Developing measurement and analytic models for categorical data is highly
useful and yet is a topic that, until recently, has been underutilized in the field of
psychometrics.

Continuous Data
Data that can take on any values within a particular mathematical range are continuous.
In measuring length, for example, it is possible for an object (e.g., a board, wire, or rod)
to be 6 feet, 1 inch, or 6 feet, 2 inches, long or any conceivable length in between these
two points on the scale. Therefore, continuous data have no gaps in their units of scale.
Additional examples include weight, chronological age, and temperature. However, even
though a variable is continuous in theory, the process of measurement always reduces it to a
discrete level due to the accuracy and precision of the instrumentation used and the integrity
of the data acquisition/collection method. Therefore, continuous scales are in fact discrete
ones with varying degrees of precision or accuracy. Returning to Figure A.3, any of the
linear composites (i.e., variables representing total test or subtest score) may appear to
be continuous but are actually discrete because a person can only obtain a numerical
value based on the sum of his or her responses across to each item, then summed to a
total score for the subtest (e.g., it is not possible for a person to obtain a score of 15.5 on
a total test score). In this case, total test scores are often treated as continuous measures
with a certain level of precision. To this end, although continuous measures are only
approximate, such a level of approximation provides sufficient precision to be useful for
the application of psychometric methods.

A.7 Statistical Notation and Operations Used in This Appendix

In preparation for the remaining parts of this Appendix, the requisite symbols and opera-
tions are presented. The following symbolic notation and operations are used throughout
this text.

• N = size of a population.
• n = size of a sample from a population.
• Σ = summation of variables, where an example of variables is items or tests.
• Xi = variable X indexed by a lowercase i, where i represents an individual score for
a person, on variables such as items or tests.
• Xij = doubly scripted variable.
• Σ Xi, i = 1 to 5 = limits of summation; for example, X1 + X2 + X3 + X4 + X5. Application
involves starting with i = 1 and proceeding by 1 until the fifth variable or number is reached.
• Σ (Xi² − Yi + 6), i = 1 to 5 = complex terms can be included in summation operations;
in this example, the expression in parentheses is evaluated once for each pair of X and Y
scores within the limits of summation.
• Σ Σ Xij, i = 1 to N and j = 1 to N = here, i represents the ith person and j the jth test or
subtest; the double summation signifies the sum over all persons on all tests or subtests.
The limits of summation indicate the last person and the last test, respectively.
• Σ c, i = 1 to N, = Nc = summation of a constant equals the product of N times the
constant (c = constant).
• Σ cXi, i = 1 to N, = c Σ Xi = summation of scores or variables multiplied by a constant
equals the constant times the sum of the scores (c = constant).
• Σ (Xi + Yi) = Σ Xi + Σ Yi = when applying summation to more than one term,
summation can be distributed to each term.
• μ = mean of a population.
• X̄ = mean of a sample.
• σ = standard deviation of a population.
• s = standard deviation of a sample.
• σ² = variance of a population.
• s² = variance of a sample, also represented as var(X).
• p = proportion or percentage.
• q = 1 – p.
• P = probability.
• E = event or outcome in probability theory.
• F(X) = function of X; also the integral.
• f(X) = frequency of event or score X, also expressed as a frequency-based prob-
ability function of X.
• ∫ f(X), from −∞ to x = frequency of event or score x, expressed as an indefinite integral
for a continuous random variable with limits x and negative infinity.
• ∫ f(X)dX, from −∞ to x = frequency of event or score x expressed as a definite integral
for a continuous random variable with limits x and negative infinity.
• Lx = likelihood of the observed data or score based on the height of a distribution
function (e.g., normal) at a specific score.
• e = expectation operator.
• θ = random variable theta, representative of a parameter in Bayesian probability
and item response theory.

A.8 Counting and Measuring

Using the number system for counting, measuring, and summarizing is so common today
that it seems hardly worth mentioning. In early cultures, the number system was created
for use as a symbolic and systematic way to describe or communicate about the real world
in an objective, precise, and consistent manner. The branch of mathematics that focuses

on the study of numbers or integers has a long history and remains important. However,
theoretical research on the number system contrasted with its application to counting
and measuring constitute two very different foci. Although using the number system is
familiar to most people, a systematic introduction, including forms of counting, and the
relationship between collections of numbers and probability theory are important.
A logical starting point is to define the term data (as opposed to a datum, a
single number). The term data is defined as a collection of numbers, words, images, and
so on; this usage began with early philosophers and is now accepted as convention within the
social, behavioral, physical, and biological sciences. Descriptions of numerical data can be
communicated with a degree of precision and objectivity in two primary categories. First,
events or things (e.g., physical, psychological, or sociological attributes) that are counted
are summarized based on frequency of occurrence or the number of times an event (e.g.,
an attribute) is observed as having occurred. Second, events or things measured through
some scaling procedure yield scale (or scalar) values on a particular metric relevant to
the measurement task of interest. Psychometric and statistical modeling deals with both
forms of numerical data—frequency counts of events (both ordered and unordered), and
interval-level scale values such as normative data on psychological tests. Frequency dis-
tributions provide a tabular summary of how many times values (e.g., real numbers) on
a discrete variable occur for a set of subjects or examinees. The proportion of examinees
receiving a particular score is defined as the relative frequency of a score. The term rela-
tive represents the position that a subject(s) occupies within the placement of the cumula-
tive (total) frequency distribution. Relative frequencies, discussed next, are foundational
to the classical, frequentist, or sampling theory approach to probability theory.

A.9 Elements of Probability and Random Variables

A discrete random variable is represented by a countable number of values. The relative
frequency represents the probability that a discrete random variable X takes on a particular
countable value with probability p at each trial of an experiment. For example, a single trial
is defined as one flip of a balanced coin with only two possible outcomes—a head (a value of
1) or a tail (a value of 0). Typically, there are many trials within a particular experiment. The
outcomes of the trials are considered to be mutually exclusive (one trial does not depend on
or preclude in any way another or successive trial). To every event Ei, a probability between
0 and 1 is assigned to the outcome of a trial. A value of 0 is assigned to impossible events,
and 1 is assigned to events that are certain to occur.
the following two rules of probability are applicable. First, the multiplicative theorem of
probability in Equation A.1 implies that the probability of several events occurring succes-
sively is the product of their separate probabilities. The multiplicative rule assumes that the
generating events (or trials) are independent or uncorrelated with one another.
To illustrate the multiplication rule, consider the question, “What is the probability
of obtaining a heads on the first and second flips of a fair coin in succession?” Answering
this question requires applying the multiplication rule of probability within the context

Equation A.1. Probability multiplication theorem

P(Ei and Ej) = P(Ei)*P(Ej)

• P = probability.
• Ei = outcome of an event indexed by i.
• Ej = outcome of an event indexed by j.
• The probability of several independent events occurring
jointly is the product of their separate probabilities.

of an infinitely repeated number of trials (i.e., long-run or classical probability). In the
long run, heads will come up one-half of the time (i.e., .5). However,
the second toss will yield heads in only one-half of those trials in which the first toss
yielded heads. So, the proportion of all trials in which both tosses yield heads is
(1/2)*(1/2) = 1/4, or .25. The same reasoning applies in the case of simultaneous (i.e.,
joint) events.
Second, the additive theorem of probability in Equation A.2 states that the prob-
ability of occurrence of any one of several particular events is the sum of their individual
probabilities, provided that the events are mutually exclusive.
For example, the probability of drawing an ace of hearts from a standard deck of
playing cards is 1/52 or .019. The same is true for drawing an ace of spades, clubs, or
diamonds. However, the probability of drawing an ace of hearts or an ace of diamonds or
an ace of clubs or an ace of spades is 1/52 + 1/52 + 1/52 + 1/52 = 4/52 or .077. Similarly,
the probability of obtaining a 3 or a 6 on a single toss of a fair die is 1/6 + 1/6 = 2/6 or
.33. Finally, because an outcome exists on every trial, the sum of the probabilities for all

Equation A.2. Probability addition theorem

P(Ei or Ej) = P(Ei) + P(Ej) = pi + pj

• P = probability.
• Ei = outcome of an event indexed by i.
• Ej = outcome of an event indexed by j.
• The probability of occurrence of any one of several par-
ticular events is the sum of their individual probabilities,
provided the events are mutually exclusive.

possible outcomes must be exactly 1. Also, the probability that an event does not occur
is 1 minus the probability that the event does occur.
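Both rules can be checked by simulation. The short Python sketch below (illustrative only) approximates the multiplication-rule result for two coin flips and the addition-rule result for drawing any one of the four aces.

import random

random.seed(1)
n_trials = 100_000

# Multiplication rule: P(heads on flip 1 and heads on flip 2) = (1/2)*(1/2) = .25
both_heads = sum(random.random() < 0.5 and random.random() < 0.5 for _ in range(n_trials))
print(both_heads / n_trials)      # approximately .25

# Addition rule: P(any one of the four aces) = 1/52 + 1/52 + 1/52 + 1/52 = 4/52
deck = list(range(52))            # cards 0 through 3 stand for the four aces
any_ace = sum(random.choice(deck) < 4 for _ in range(n_trials))
print(any_ace / n_trials)         # approximately .077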
Next, consider a simple coin toss experiment (also known as a Bernoulli trial experi-
ment) with outcome 0 = tails; 1 = heads. This experiment can be described by the proba-
bility function f(xi) specifying the probabilities with which X can assume only the values
0 or 1. We assign a value xi to the ith outcome, and then we order the xi in ascending fash-
ion. The discrete random variable X is defined as that quantity which takes on the value
xi with probability p at each trial. To illustrate, if x1 = 0, and x2 = 1, and p1 = 1 − p, p2 = p,
and if p is not assigned a value, then over the long run of many independent trials p = .5.
Accordingly, this means that one-half of the time the outcome will be heads and one-half
of the time tails. Assumptions required for this type of experiment in order to have valid
outcomes are that the conditions of the coin tossing are represented by the intrinsic val-
ues of the coin being fair or balanced and the manner in which it is repeatedly tossed. The
total probability is unity (i.e., value of 1.0), and irrelevant events such as the coin rolling
out of sight or falling off of the surface are assigned probabilities of 0. Now consider the
following example using the intelligence test data from Figure A.3 and Table A.1. Next
we can use the frequency of scores provided in Table A.2 to answer the question, “What
is the probability that a score of 40 is obtained on crystallized intelligence test 1 based
on this sample?” The relative frequency (i.e., based on long-run frequency probability
theory) distribution for this subtest and sample is provided in Table A.2.
If we treat the intelligence test data as interval-level or continuous data, applying Equation A.4 to the data in Table A.2 yields a probability of .037 of obtaining a score of 40 (i.e., a score of 40 occurred 37 times out of the 1,000 examinees, or 37/1,000 ≈ .037).
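These relative frequencies can be reproduced from the raw data with a few lines of SAS. The sketch below is illustrative only; it assumes that the crystallized intelligence total score is stored in the variable cri_tot of the GfGc data set (the same variable and data set used in the descriptive-statistics program later in this appendix). Dividing the Percent column of the PROC FREQ output by 100 gives the long-run (relative frequency) probability estimate for each score.

LIBNAME X 'K:\Guilford_Data_2011';
DATA temp; set X.GfGc;
RUN;

PROC FREQ DATA=temp;
   TABLES cri_tot;    /* frequency and percent for each observed score */
   TITLE 'RELATIVE FREQUENCIES FOR THE CRYSTALLIZED INTELLIGENCE TOTAL SCORE';
RUN;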
A continuous random variable is represented by values over some continuous region.
Also, a continuous random variable X as defined on the domain of real numbers is char-
acterized in Equation A.3 by its probability distribution function.
The symbols −∞ and ∞ represent the limits of the lower and upper bounds of the func-
tion of the variable. A familiar example is the standard normal (i.e., Gaussian) distribu-
tion. For further explanation of the differential and integral calculus applied to statistical

Equation A.3. Probability distribution function

F(x) = P(X ≤ x),  −∞ < x < ∞

• F = the function of x.
• P(X ≤ x) = the probability that a continuous random variable X is less than or equal to some value of x in its domain.
• −∞ < x < ∞ = the range of the function of x.

Table A.2.  Frequency Distribution of Crystallized Intelligence Test 1 from Figure A.3

Score   Frequency   Percent
4 1 0.1
7 1 0.1
8 2 0.2
9 1 0.1
10 2 0.2
11 1 0.1
12 3 0.3
13 6 0.6
14 3 0.3
15 4 0.4
16 3 0.3
17 7 0.7
18 9 0.9
19 15 1.5
20 10 1
21 18 1.8
22 11 1.1
23 6 0.6
24 25 2.5
25 20 2
26 24 2.4
27 28 2.8
28 16 1.6
29 28 2.8
30 28 2.8
31 28 2.8
32 32 3.2
33 39 3.9
34 29 2.9
35 42 4.2
36 51 5.1
37 38 3.8
38 47 4.7
39 51 5.1
40 37 3.7
41 49 4.9
42 69 6.9
43 46 4.6
44 45 4.5
45 40 4
46 35 3.5
47 24 2.4
48 17 1.7
49 6 0.6
50 3 0.3
Note. N = 1,000.

methods, see Calculus and Statistics by Michael Gemignani (1998) and Advanced Calculus
with Applications in Statistics by Andre Khuri (2003).
Equation A.3 illustrates that X is less than or equal to some value x of its domain.
If F(x) is an absolutely continuous function, the continuous analog of the discrete prob-
ability function is the density function in Equation A.4.
Also, by the absolute-continuity property, the continuous function can be represented
by a cumulative probability distribution (density) function for F(x) in Equation A.5.
The integral symbol ∫ is defined as the summation of all quantities and differs from
the symbol Σ in that ∫ represents the summation of a vast number of small quantities
(i.e., dx) of infinitely small magnitudes—as in calculating the total area under the normal
(Gaussian) curve (see Figure A.3). The process of numerical integration allows us to
calculate totals that otherwise we would be unable to estimate. In contrast, the symbol Σ
represents the summation of a number of finite or discrete quantities. These two methods
of numerical summation have implications for how psychometric scales are developed
and how analytic methods are applied. The actual area of the curve may be calculated

Equation A.4. Continuous probability (density) function

f(x) = dF(x)/dx

• F(x) = function of x or the definite integral expressed as a number.
• dF(x) = all of the incremental elements composing the function of x.
• dx = all of the incremental elements of x.

Equation A.5. Continuous probability distribution function

F(x) = ∫_{−∞}^{x} f(x) dx

• F(x) = sum of the function f(x)dx over the range of x from negative infinity up to x.
• ∫_{−∞}^{x} f(x) dx = area under the curve f(x) cut off on the X axis by the limits −∞ and x.

by making the intervals infinitely small (no distance between the intervals) and then
computing the area using calculus methods such as Simpson’s rule or the trapezoid rule
(Gemignani, 1998).
To make the idea of integration more concrete, consider an example from Figure A.3
and Table A.1 using a score of 40 obtained on crystallized intelligence test 1 based on the
sample of 1,000 persons. Phrased in a probabilistic way, we want to know, “What is the
probability that at least one person will score between 39 and 41, given that the range of
scores is between 4 and 50?” Using 3/66 (i.e., an area of .04545) as the cumulative prob-
ability distribution function for f(x) in our example, the definite integral or the probability
that a random variable (1 score) will fall within the interval 39 and 41 is derived in Equa-
tion A.6 and illustrated in Figure A.4.
Finally, if a random variable is defined only on some interval of the real line, then
values outside that interval represented by the cumulative probability distribution func-
tion for f(x) to either the left or right are defined as being either 0 or 1, respectively.

Equation A.6. Integration of a continuous function

P([39, 41]) = ∫_{39}^{41} f(x) dx
            = ∫_{39}^{41} (3/66)x² dx
            = (3/66)[x³/3] evaluated from 39 to 41
            = (3/66)(41³ − 39³)
            = 3520
3520/1000 ≅ 3.52
3.52/100 ≅ .0352

[Figure A.4 is a histogram of the crystallized intelligence test 1 scores (mean = 35.23, SD = 8.609, N = 1,000), with frequency on the Y-axis, score on the X-axis, and a normal curve overlay.]

Figure A.4.  A histogram of F(x) = ∫_{−∞}^{x} f(x) dx based on score data from crystallized intelligence test 1. The horizontal line illustrates the intersection of a frequency of 37 on the Y-axis and a score of 40 on the X-axis. Probability of 1 person obtaining a score of 40 is ~.385.

A.10 Maximum Likelihood Estimation

Now that probability density (distribution) functions have been introduced, we can turn to maximum likelihood estimation (MLE). The method of maximum likelihood provides a general approach to estimating parameters based on the general linear model and leads to the ordinary least squares function in the linear regression model (assuming normally distributed errors; see Chapter 2 for linear regression basics). MLE is used extensively in many statistical and psychometric techniques. Maximum likelihood is used so widely because, under many circumstances, it produces parameter estimates that exhibit smaller bias (i.e., the expected value of all possible estimates equals the population parameter) and smaller variance (i.e., values obtained from different random samples vary little) than other estimation methods (e.g., ordinary least squares or generalized least squares). An exception is that in some scenarios maximum likelihood is not necessarily the optimal method to use (e.g., very small sample sizes or non-normal distributions). Therefore, the distributional characteristics unique to a specific set of data must be evaluated prior to deciding on a particular method of parameter estimation.

Maximum likelihood is useful for a wide array of statistical problems such as estimat-
ing parameters in IRT (introduced in Chapter 10) and logistic regression (see Chapter 4).
For example, MLE (or slight modifications of it) is useful in situations where the goal is to
estimate an unobservable or latent trait or attribute from sample data as in IRT. The goal
of MLE is to locate population parameters that will most probably generate a particular
sample estimate (under certain assumptions such as those in the normal distribution). For
example, the likelihood is conceptualized as the relative probability of drawing a certain
score from a distribution with known mean and variance. The distribution may be univari-
ate or multivariate normal or any other distribution. Equation A.7a provides the compo-
nents necessary for estimating the likelihood of a score with a known population mean and
variance. Next, an example is provided using population information.
To understand how Equation A.7a works, let’s assume that we have intelligence test
data that are normally distributed with a population mean of 100 and variance of 225. Next,
suppose we want to know the likelihood of obtaining a score of 115 on the intelligence test.
Inserting the mean and variance into Equation A.7a and carrying out the operations yields a
likelihood of .04402. Figure A.5 illustrates this result of applying Equation A.7a.
In Figure A.5, the likelihood is represented by the Y-axis and indexes the height of
the normal curve at a particular score.
Recall that the goal of MLE is to locate population parameters that have the greatest
probability of yielding a set of sample data. Recall from Equation A.1 that independent
events (the examinee scores in the present case) can be multiplied to ascertain a measure
of the joint probability. To accomplish this goal, Equation A.7a is expanded as in Equa-
tion A.7b. Application of Equation A.7b yields a single summary likelihood value that
represents a summary index of fit based on individual scores comprising a sample.

Equation A.7a. Likelihood expressed as a density function of a continuous normal distribution

L_i = (1/√(2πσ²)) e^{−.5(y_i − μ)²/σ²}

• L_i = likelihood of score i.
• μ = population mean.
• σ² = population variance.
• π = 3.14159265.
• 1/√(2πσ²) = scaling term that allows the area under the curve to integrate (sum) to 1.
• (y_i − μ)²/σ² = squared distance of an individual score from the population mean.

Equation A.7b. Multiplication of the examinee-specific score-likelihood values resulting in a summary likelihood value for a sample

L = ∏_{i=1}^{N} { (1/√(2πσ²)) e^{−.5(y_i − μ)²/σ²} }

• L = likelihood based on a sample of examinee scores.
• ∏_{i=1}^{N} = multiplication operator indexing each examinee in the sample.
• μ = population mean.
• σ² = population variance.
• π = 3.14159265.
• 1/√(2πσ²) = scaling term that allows the area under the curve to integrate (sum) to 1.
• (y_i − μ)²/σ² = squared distance of an individual score from the population mean.

In practice, the values of the likelihood are small and difficult to work with. For this
reason, the logarithm of the likelihood is used in practice. For example, the logarithm of
the value of .04402 in Figure A.5 is −1.35. An additional benefit of using the logarithm
scale is that the logarithm of each examinee's likelihood can be summed to yield a composite log
likelihood value. For example, Equation A.2 (the additive model of probability for
independent events) provides a framework for summing individual likelihoods, resulting in
an additive (linear) model. Equation A.7c illustrates how Equation A.7b is changed to
include the logarithm.
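To make Equations A.7a through A.7c concrete, the minimal SAS sketch below evaluates the likelihood of each score in a small hypothetical sample (the scores 95, 110, and 115 are used purely for illustration) under a normal distribution with mean 100 and variance 225, and then sums the natural logarithms of those likelihoods as in Equation A.7c. The natural logarithm is used here; a base-10 logarithm would differ only by a constant factor.

DATA _NULL_;
   mu  = 100;                                /* population mean            */
   var = 225;                                /* population variance        */
   pi  = CONSTANT('PI');
   ARRAY y{3} _TEMPORARY_ (95 110 115);      /* hypothetical sample scores */
   logL = 0;
   DO i = 1 TO 3;
      /* Equation A.7a: likelihood (normal density) of score i */
      L_i  = (1 / SQRT(2*pi*var)) * EXP(-0.5 * ((y{i} - mu)**2) / var);
      /* Equation A.7c: accumulate the natural log of each likelihood */
      logL = logL + LOG(L_i);
   END;
   PUT logL=;
RUN;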

A.11 Bayesian Probability

In some cases, measurement and statistical problems are very difficult to address
within the frequentist or sampling theory probability framework. Under such circum-
stances, Bayesian probability and inference provide a powerful alternative. The
history and development of Bayesian statistical methods (Hald, 1998; Bayes, 1763)
are substantial and are closely related to frequentist statistical methods. In fact, Gill
(2002) notes that the fundamentals of Bayesian statistics are older than the current
(i.e., classical or frequentist) paradigm. In some ways, Bayesian statistical thinking
can be viewed as an extension of the traditional (i.e., frequentist) approach in that

Equation A.7c. Summation of the examinee-specific log likelihood values resulting in a summary log likelihood value for a sample

log L = Σ_{i=1}^{N} log{ (1/√(2πσ²)) e^{−.5(y_i − μ)²/σ²} }

• log L = logarithm of the likelihood based on a sample of examinee scores.
• Σ_{i=1}^{N} log = summation operator indexing the logarithm for each examinee score in the sample.
• μ = population mean.
• σ² = population variance.
• π = 3.14159265.
• 1/√(2πσ²) = scaling term that allows the area under the curve to integrate (sum) to 1.
• (y_i − μ)²/σ² = squared distance of an individual score from the population mean.

[Figure A.5 plots the likelihood (Y-axis, 0 to .080) against IQ (X-axis, 55 to 145) and marks the likelihood L_i = .04402.]

Figure A.5.  The likelihood for an IQ score of 115 based on a normal distribution with mean = 100 and variance = 225.

it formalizes aspects of the statistical analysis that are left to uninformed judgment
by researchers in classical statistical analyses (Press, 2003). The formal relationship
between Bayesian (subjective) and classical (direct) probability theory is provided in
Equation A.8.
The goal of parametric statistical inference is to make statements about unknown
parameters that are not directly observable from observable random variables—the
behavior of which is influenced by these unknown parameters. In the Bayesian statisti-
cal approach, researchers view any unknown quantity (e.g., a population parameter)
as random and these quantities are assigned a probability distribution (e.g., normal,
Poisson, gamma, multinomial, binomial). The analytic focus is on the probability dis-
tribution that gives rise to or generates the observed data. In this way, population param-
eters are modeled as being random and then assigned a joint probability distribution
with the observed data thereby allowing researchers to summarize their current state
of knowledge about the model parameters. The result obtained in a Bayesian analysis
is a full probability model for population parameters and observed data. The utility
of the Bayesian approach is emphasized in Chapter 10 on IRT, where a probabilistic
model for responses to test items is presented. In comparison, in frequentist, or direct
probability and statistical theory, population parameters are assumed to be fixed (non-
random) and the data are viewed as being random—provided that random sampling has
occurred.
In the Bayesian framework, the sampling-based approach to estimation provides a
solution for the random parameter vector θ by estimating the posterior density (distribu-
tion) of a parameter. This posterior distribution is defined as the product of the likeli-
hood function (accumulated over all possible values of θ) and the prior density (i.e.,
distribution) of θ (Press, 2003; Gelman, Carlin, Stern, & Rubin, 2004).
To illustrate Bayes’s theorem graphically, suppose that you are interested in the pro-
portion of people in the United States who have been diagnosed with bipolar disorder.
You denote this proportion as θ, and it can take on any value between 0 and 1. Next,
using information from a national database, 30 out of 100 people are identified as having
bipolar disorder. Two pieces of information are required—a range for the prior distribu-
tion and the likelihood, which is derived from the actual frequency distribution of the
observed data. Using Equation A.9, Bayes’s theorem multiplies the prior density and the
likelihood to obtain the posterior distribution.
The process of Bayesian statistical estimation approximates the posterior density
or distribution of θ given y, p(θ|y) ∝ p(θ)L(θ|y), where p(θ) is the prior distribution of θ,
and p(θ|y) is the posterior density of θ given y. Continuing with our bipolar example,
the prior density or belief (i.e., the solid curve) is for θ to lie between .35 and .45 and is
unlikely to lie outside the range of .3 to .5 (Figure A.6).
The dashed line represents the likelihood, with θ being at its maximum at approxi-
mately .3, given the observed frequency distribution of the data. Applying Bayes's theorem
involves multiplying the prior density by the likelihood. If either of these two values
is near zero, the resulting posterior density will also be negligible (i.e., near zero, for
example, for θ < .2 or θ > .6). Finally, the posterior density (i.e., the dotted-dashed line)

Equation A.8. Relationship between Bayesian and direct or classical probability

p(θ|x) ∝ p(x|θ) ∝ L_x(θ)

• ∝ = "proportional to"; meaning that the object to the left of the symbol differs only by a multiplicative constant in relation to the object to the right.
• p = probability.
• θ = random variable theta.
• L_x = likelihood of observed data x.
• x = observed data x.
• p(θ|x) = probability of the parameter (a random variable) given the observed data (not random but fixed).
• p(x|θ) = probability of the observed (fixed) data given the parameter (a random variable).
• L_x(θ) = likelihood of the observed data viewed as a function of the parameter (random variable).

covers a much narrower range and is more informative than either the prior or the likeli-
hood alone.
The proportionality symbol in Equation A.9 is interpreted as follows: If the posterior
density (distribution) is proportional to the likelihood of the observed data times the prior
imposed upon the data, the posterior density differs from the product of the likelihood times
the prior by a multiplicative constant. When the prior density for the data is multiplied

[Figure A.6 plots the prior (solid), likelihood (dashed), and posterior (dotted-dashed) densities for θ over the range 0 to .7.]

Figure A.6.  Bayesian example of bipolar incidence.



times the likelihood function, the result is improper, or “off” by a scaling constant. A
normalizing constant only rescales the density function and does not change the relative
frequency of the values on the random variable. Equation A.9 exemplifies the principle
that updated knowledge results from or is maximized by combining prior knowledge
with the actual data at hand. Finally, Bayesian sampling methods do not rely on asymp-
totic distributional theory and therefore are ideally suited for investigations where small
sample sizes are common (Price, Laird, Fox, & Ingham, 2009; Lee, 2004; Dunson, 2000;
Scheines, Hoijtink, & Boomsma, 1999).
An illustration of Bayes’s theorem is now provided to estimate a single-point prob-
ability with actual data using Equation A.10.

Equation A.9. Bayes's theorem

Posterior ∝ Likelihood × Prior

• ∝ = "proportional to"; meaning that the object to the left of the symbol differs only by a multiplicative constant in relation to the object to the right.
• Proportionality is required in order to ensure that the posterior density integrates properly (i.e., that the area under the curve equals 1).
• Simply multiplying the likelihood and the prior does not ensure that the result will integrate to 1.
• Therefore, to obtain the posterior density, the right-hand side must be scaled by a suitable constant to ensure integration to 1.

Equation A.10. Bayes's theorem

P(B|A) = P(A|B)P(B) / [P(A|B)P(B) + P(A|B̄)P(B̄)]

• P(B|A) = probability of event B occurring given A.
• P(B) = probability of event B occurring.
• P(A|B) = probability of event A occurring given B.
• B̄ = complementary event to B.

Consider the scenario where the proportion of practicing psychologists in the U.S.
population is .02, the proportion of practicing psychologists in the United States who
are female is .40, and the proportion of females among nonpracticing psychologists in
the United States is .60. Then P(female | practicing psychologist) = .40, P(practicing
psychologist) = .02, P(female | nonpracticing psychologist) = .60, and P(nonpracticing
psychologist) = .98. Given these probabilities and applying Equation A.10 as below in
Equation A.11, the probability that a psychologist is in current practice in the United
States given that she is female is .0134.
Notice that the result obtained (P = .0134) is very different from the proportion of
practicing psychologists in the United States who are female (i.e., P(female | practicing
psychologist) = .40). In Bayesian terminology, the unconditional probabilities P(B) and P(B̄) in
Equation A.10 are proportions (probabilities) and represent prior probabilities (i.e., what
is currently known about the situation of interest). The conditional probabilities P(A|B) and P(A|B̄) are
the probabilities actually observed in the sample, the product P(A|B) * P(B) is the likeli-
hood, and P(B|A) is the posterior probability. Alternatively, from a frequentist perspective,
the probability that a psychologist is female and is currently practicing is calculated
using the multiplication probability rule: P(A|B) * P(B) = (.40) * (.02) = .008. Notice
that this is the likelihood given the observed frequency (probability) distribution.

Equation A.11. Estimating a point probability using Bayes's theorem

P(practicing psychologist | female) = (.40)(.02) / [(.40)(.02) + (.60)(.98)] = .0134

• P(A|B) = proportion of practicing psychologists in the United States who are female (.40).
• P(B) = proportion of practicing psychologists in the United States (.02).
• P(A|B̄) = proportion of nonpracticing psychologists in the United States who are female (.60).
• P(B̄) = proportion of nonpracticing psychologists in the United States (.98).
• P(B|A) = posterior probability that a psychologist is practicing given that she is female.
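The arithmetic in Equation A.11 can be verified with a short SAS sketch that uses only the proportions given above:

DATA _NULL_;
   p_f_prac    = 0.40;    /* P(female | practicing psychologist)    */
   p_prac      = 0.02;    /* P(practicing psychologist)             */
   p_f_nonprac = 0.60;    /* P(female | nonpracticing psychologist) */
   p_nonprac   = 0.98;    /* P(nonpracticing psychologist)          */
   /* Equations A.10 and A.11: posterior probability of practicing given female */
   posterior = (p_f_prac * p_prac) /
               (p_f_prac * p_prac + p_f_nonprac * p_nonprac);
   PUT posterior=;        /* approximately .0134 */
RUN;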

A.12 Bayesian Applications to Psychometrics

Bayesian ideas have been incorporated into psychometric methods as a means of model-
ing the distribution that gives rise to a particular set of observed scores among individuals
who have differing levels of identical true scores. Bayesian methods have been particularly
useful in statistical estimation, decision theory, and item response theory. Regarding
test theory and development, this probabilistic approach is very different from classical
or empirical probabilistic methods where the distribution of observed scores represents
an empirical probability distribution. In the Bayesian approach, the process of estimating
a person’s true score proceeds by making a priori assumptions about subjects’ unknown
true score distribution based on sampling distribution theory. For example, a function
such as the normal distribution function can be used as prior or subjective information in
the model and the probability of an observed score given the true score as the likelihood
distribution. Finally, the posterior distribution is derived across differing levels of sub-
jects’ true scores through open-form iterative numerical maximization procedures such as
MLE (introduced in Section A.10), iteratively reweighted least squares (IRLS—for ordinal
data), and restricted maximum likelihood (REML) and quasi-maximum likelihood or mar-
ginal maximum likelihood (MML). Using IRT as an illustration, we find that the method
of maximum likelihood estimation leads to parameter estimates (i.e., in IRT for items
and persons) that maximize the probability of having obtained a set of scores used in the
estimation process. Specifically, the MLE method (and variants of it) uses the observed
score data as the starting point for the iterative parameter estimation/maximization process. The resulting item parameter and person ability estimates have optimal item or score weights and are asymptotically unbiased (i.e., θ̂). Chapter 10, on IRT, provides more detail on the process of open-form iterative numerical estimation procedures.

A.13 Density (Distribution) Functions and Associated Parameters

The type of distribution that describes the way a variable maps onto a coordinate system
(i.e., 2-D or 3-D) has implications for the development and application of psychometric
scaling models and methods. The following section provides an overview of some distri-
butions commonly encountered in psychometrics and psychophysics.
Properties of random variables are numerically derived in terms of a density func-
tion. Five distribution functions of random variables common to psychometrics are (1) rect-
angular, (2) logistic, (3) logarithmic, (4) gamma, and (5) normal (Figure A.7).
These distributions are determined by two parameters: location (i.e., either the
arithmetic, geometric, or harmonic mean) and scale (i.e., the variance). The location
parameter positions the density function on the real number line X-axis, whereas the
dispersion (variance) parameter maps the spread or variation of the random variable. The
arithmetic mean (i.e., expectation or expected long-run value) of a random variable is

represented by the continuous density function in Equation A.12 and for the discrete case
in Equation A.13. Although Equations A.12 and A.13 appear to be essentially the same,
Equation A.12 is helpful in understanding the principle of continuity underlying a set of
measurements or scores taking on a range of real numbers. For example, although Equation A.13 is used extensively for the calculation of the mean of a set of scores, Equation
A.12 reminds us that, theoretically, a continuous underlying process is usually assumed
to give rise to the observed score values.
When the random variable is discrete, then Equation A.13 applies.

[Figure A.7 displays, for each of six distributions (rectangular, logistic, log-normal, gamma, normal, and Poisson), the probability density (or probability mass) function together with the corresponding distribution function.]

Figure A.7.  Types of distributions.

To provide an example, we return to the frequency distribution for the crystallized
intelligence test 1 in Table A.2. Application of Equation A.13 yields an expected or mean
value of 35.23. The cumulative frequency distribution for these data is provided in Figure
A.3 as a histogram. The expected value (i.e., expectation) is the mean (i.e., integral) of a
random variable with respect to its probability density. For discrete variables, the mean
is the weighted sum of the observed real numbers. An expected value is best understood

Equation A.12. Expected value of a continuous random variable

μ_X = E(X) = ∫_{−∞}^{∞} x f_X(x) dx

• μ_X = mean of the population.
• E = expectation operator.
• E(X) = expected value of X or the mean.
• ∫_{−∞}^{∞} x f_X(x) dx = function of the random variable X over the interval of real numbers, taken as the area under the curve from negative to positive infinity.

Equation A.13. Expected value of a discrete random variable

μ_X = E(X) = Σ_k x_k p_X(x_k)

• μ_X = mean of the population.
• E = expectation operator.
• E(X) = expected value of X.
• x_k = any observed real number in the distribution.
• p_X(x_k) = probability mass function; it is ≥ 0 and ≤ 1.
• Σ_k p_X(x_k) = 1 = the sum of the probability mass functions for a set of real numbers equals 1.
• Σ_k x_k p_X(x_k) = sum of each real number in a set of data weighted by its relative frequency (probability).

within the context of the law of large numbers. For example, the expected value may or
may not occur in a set of empirical data. So, it is helpful to interpret the expected value
of a random variable as the long-run average value of the variable over many indepen-
dent repetitions of an experiment. Next, some properties are provided that are useful
when working with expectations. Specifically, Equations A.14 and A.15 illustrate alge-
braic properties used in conjunction with the expectation operator when manipulating
scores or variables.

Equation A.14. Algebraic property 1 for expectation operator

E(aX + bY) = aE(X) + bE(Y)

Equation A.15. Algebraic property 2 for expectation operator

E(XY) = E(X)E(Y)

• E(XY) = expectation of the product of variables X times Y.
• E(X)E(Y) = expectation of X times the expectation of Y.

In the case where variables X and Y are independent, Equation A.15 applies.
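A small numerical sketch of Equation A.13, together with a special case of the linearity property in Equation A.14, follows; the hypothetical example is a fair six-sided die, so each outcome has probability 1/6.

DATA _NULL_;
   e_x  = 0;
   e_ax = 0;
   DO x = 1 TO 6;
      p    = 1/6;                    /* probability mass for each face      */
      e_x  = e_x  + x * p;           /* Equation A.13: E(X) = sum of x*p(x) */
      e_ax = e_ax + (2*x + 3) * p;   /* E(2X + 3)                           */
   END;
   PUT e_x= e_ax=;   /* E(X) = 3.5 and E(2X + 3) = 10, which equals 2*E(X) + 3 */
RUN;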
As an extended example, we use a rigid rod analogy (from the topic of statics in phys-
ics) to conceptualize the properties of a variable. Using Equation A.16, the function f(x)
measures the density of a continuously measured rod mapped onto the X-axis. The kth
moment of the rod about the origin of its axis is provided in Equation A.16.
If actual relative frequencies of a variable are used (as is most always the case)
totaling N, then a = 0. This relationship means that the value of the arithmetic mean
depends on the value of a, the point from which it is measured. In Equation A.16, µ′ k
is the first moment or the mean, a term first used by A. Quetelet (1796–1874; Hald,
1998). Subsequently, Karl Pearson adopted this terminology for use in his work on the
coefficient of correlation. The term moment describes a deviation about the mean of a
distribution of measurements or scores. Similarly, a deviate is a single deviation about
the mean, and as such, deviates are defined as the first moments about the mean of a
distribution. The variance is the second moment of a real-valued random variable. The

Equation A.16. Expected value of the first central moment for a continuous random variable

μ′_k = ∫_{−∞}^{∞} (x − a)^k f(x) dx,  where a = 0 when moments are taken about the origin

• μ′_k = the kth moment; μ′_1 is the first moment, or the mean.
• ∫_{−∞}^{∞} (x − a)^k f(x) dx = function of the random variable X integrated over the range from negative to positive infinity.

variance is defined as the average of the square of the distance of each data point from
the mean. For this reason, another common term for the variance is the mean squared
deviation. The skewness of a distribution of scores is the third moment and is a mea-
sure of asymmetry (i.e., left or right shift in the shape of the distribution) of a random
variable. Finally, the fourth moment, kurtosis, is a measure of the degree to which scores
in a distribution display an excessively tall and peaked (or excessively flat) shape. Once the first through
fourth moments are known, the shape of a distribution for any set of scores can be
determined.

A.14 Variation and Covariation

At the outset of this appendix (and in Chapter 1), attributes were described as iden-
tifiable qualities or characteristics represented by either numerical elements or clas-
sifications. Studying differences between persons on attributes of interest constitutes a
diverse set of research problems. Whether studying individual differences on an indi-
vidual or group level, variation among attributes plays a central role in understanding
differential effects. In experimental studies, variability about group means is often the
preference. Whether a study is based on individuals or groups, research problems are
of interest only to the extent that a particular set of attributes (variables) exhibit joint
variation or covariation. If no covariation exists among a set of variables, conduct-
ing a study of such variables would be useless. To this end, the goal of theoretical and
applied psychometric research is to develop models that extract the maximum amount
of covariation among a set of variables. Subsequently, covariation is explained in light
of theories of social or psychological phenomena. Ultimately, this information is used to
develop scales that can extract an optimum level of variability between people related
to a construct of interest.
The variance of a random variable is formally known as the second moment about
the distribution of a variable and represents the dispersion about the mean. The variance
is defined as the expected value of the squared deviations about the mean of a random
variable and is represented as var(X) or s2X. The variance of a continuous random variable
is given in Equation A.17.
In the case where constants are applied to the variance, we have the properties shown
in Equation A.18.
Alternatively, Equation A.19 provides a formula for the variance of a distribution
of raw scores. In Equation A.19, each participant’s score is subtracted from the mean of
all scores in the group, squared, and then summed over all participants, yielding a sum
of squared deviations about the mean (i.e., sum of squares—the fundamental unit of
manipulation in the analysis of variance).
The variance is obtained by dividing the sum of squares by the sample size for the
group (N), yielding a measure of the average squared deviation of the set of scores. The
square root of the variance is the standard deviation, a measure of dispersion represented
in the original raw score units of the scale. When calculating the standard deviation for

Equation A.17. The variance of random variable X

var(X) = ∫_{−∞}^{∞} [x − E(X)]² f(x) dx
       = ∫_{−∞}^{∞} x² f(x) dx − [E(X)]²
       = E(X²) − [E(X)]²

• var(X) = variance of X, or the expected value of the squared deviations about its mean.
• ∫_{−∞}^{∞} [x − E(X)]² f(x) dx = the value x minus the expected value of X (the mean of X), squared, integrated over the interval ranging from negative to positive infinity.
• E(X²) − [E(X)]² = the expected value of X² (mean square of X) minus the expected value (mean) of X, squared.

a sample, the denominator in Equation A.19 is changed to reflect the degrees of freedom
(i.e., N – 1) rather than N and is symbolized as s rather than s. The reason for using N – 1
in calculating the variance for a set of scores sampled from a population compared to N
is because of chance factors in sampling (i.e., sampling error). Specifically, we do not
expect the variance of a sample to be equal to the population variance (a parameter vs. a
statistic). In fact, the sample variance tends to underestimate the population variance. As
it turns out, dividing the sum of squares by N – 1 (in Equation A.19) provides the neces-
sary correction for the sample variance to become an unbiased estimate of the population variance. An unbiased estimate is one whose average (expected) value over repeated random samples equals the population variance. Finally, as the sample size grows large, the sample variance (s²) converges to the population variance.
When variables are scored dichotomously (i.e., 0 = incorrect/1 = correct), computa-
tion of the variance is slightly different. For example, the item-level responses on our
example test of crystallized intelligence 1 are scored correct as a 1 and incorrect as a 0
for each of the 25 items. Computation of the variance for dichotomous variables (i.e.,
proportion of persons correctly and incorrectly responding to a test item) is given in
Equation A.20.
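Before turning to the equations themselves, the sketch below illustrates two of the points just made using hypothetical numbers (the five scores and the item proportion are purely illustrative): the sample variance uses N − 1 in its denominator, and the variance of a dichotomously scored item is p(1 − p), as in Equation A.20.

DATA _NULL_;
   /* Sum of squared deviations for five hypothetical raw scores */
   ARRAY y{5} _TEMPORARY_ (2 4 4 6 9);
   n = 5;  mean = 0;  ss = 0;
   DO i = 1 TO n;  mean = mean + y{i} / n;  END;
   DO i = 1 TO n;  ss   = ss + (y{i} - mean)**2;  END;
   var_pop    = ss / n;         /* population form (Equation A.19)            */
   var_sample = ss / (n - 1);   /* sample form: unbiased estimate             */
   /* Equation A.20: variance of a dichotomous item answered correctly by 60% */
   p = 0.60;
   var_item = p * (1 - p);      /* = .24                                      */
   PUT var_pop= var_sample= var_item=;
RUN;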
The standard deviation and variance are useful for describing or communicating the
dispersion of a distribution of scores for a set of observations. Both statistics are also useful
in conducting linear score transformations by using the linear equation Y = a(X) + b

Equation A.18. The variance of random variable X

var(c) = 0
var(cX) = c²var(X)
var(X + c) = var(X)
σ = √var(X)

• var(c) = 0 = the variance of any constant is zero since a constant is nonrandom.
• var(cX) = c²var(X) = the variance of X times a constant equals the constant squared times the variance of X; a change in the scale of X by c units changes the variance by the constant squared.
• var(X + c) = var(X) = the variance of X plus a constant equals the variance of X; when the origin of the X-axis changes, the variance is unchanged.
• σ = √var(X) = the square root of the variance of X equals the standard deviation.

Equation A.19. The variance of a set of raw scores

σ² = Σx²/N = Σ(X − X̄)²/N

• Σx²/N = the sum of the squared deviation scores (x = X − X̄) divided by the number of measurements in the population (N) or sample (N − 1).
• Σ(X − X̄)²/N = the sum of each raw score minus the mean of the raw-score distribution, squared, divided by the number of measurements in the population (N) or sample (N − 1).

Equation A.20. The variance of a random dichotomous/discrete variable X

σ² = p(1 − p)

• σ² = variance (standard deviation squared).
• p(1 − p) = variance of a proportion based on frequencies of responses for an item or variable.

(e.g., see Chapter 11 on norming). Linear transformations are those in which each raw
score changes only by the addition, subtraction, multiplication, or division of a constant.
The original raw-score metric is changed to a standard score metric such as Z(m = 0,
s = 1), T(m = 50, s = 10), IQ(m = 100, s = 15). Such transformations are useful when
creating normative scores for describing a person’s relative position to the mean of a
distribution (i.e., norms tables). Common forms of transformed scores used in psycho-
metrics include (1) normalized scores, (2) percentiles, (3) equal-interval scales, and
(4) age and/or grade scores. For example, a researcher may want to transform a raw
score of 50 from an original distribution exhibiting a mean of 70 and a standard devia-
tion of 8 to an IQ-scale metric (i.e., mean of 100/standard deviation of 15). Equation
A.21 can be used to accomplish this task. Using data on the crystallized intelligence
test 1 in Table A.1 and Figure A.3, conversion of a raw score of 40 to a standard (i.e.,
z-score) in the distribution with a mean 35.23 and standard deviation of 8.60 is given
by Equation A.21.
Next, Equation A.22 illustrates a linear score transformation that changes the orig-
inal raw score of 40 to an IQ score metric with a mean of 100 and standard deviation
of 15.

Equation A.21. A raw to standard score transformation for a population

z = (X − μ)/σ = (40 − 35.23)/8.60 = .55

• X = raw score of interest.
• σ = standard deviation of the raw-score distribution.
• z = standard score.
• μ = mean of the raw-score distribution.
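The two transformations in Equations A.21 and A.22 can be chained in a few lines of SAS. This minimal sketch uses the values from the worked example; its result differs trivially from 108.25 only because the z-score is not rounded to .55 before the second step.

DATA _NULL_;
   x      = 40;     mu_raw = 35.23;   sd_raw = 8.60;   /* raw-score distribution */
   mu_new = 100;    sd_new = 15;                       /* target IQ metric       */
   z   = (x - mu_raw) / sd_raw;    /* Equation A.21: z = (X - mu)/sigma          */
   x_t = sd_new * z + mu_new;      /* Equation A.22: linear score transformation */
   PUT z= x_t=;                    /* z is approximately .55; x_t is about 108   */
RUN;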

A.15 Skewness and Kurtosis

The third moment about a distribution of scores is the coefficient or index of skewness.
The measure of skewness indexes the degree of asymmetry (degree of left/right shift on the
X-axis) of a distribution of scores. Equation A.23 provides an index of skewness useful
for inferential purposes (Glass & Hopkins, 1996; Pearson & Hartley, 1966). Note that the
index in Equation A.23 can be adjusted for samples or populations in the manner that
the z-score is calculated. For example, one can use the sample standard deviation or the
population standard deviation depending on the research or psychometric task.
The fourth moment is the final moment about a distribution of scores providing the
ability to describe the shape in its complete form. The fourth moment about a distribu-
tion of scores is the coefficient or index of kurtosis. Kurtosis indexes the degree of peakedness or flatness, as reflected in a distribution's platykurtic (flat), mesokurtic (intermediate), or leptokurtic (tall and narrow) shape on the Y-axis of a distribution of
scores. Equation A.24 provides an index of kurtosis useful for inferential purposes (Glass
& Hopkins, 1996; Pearson & Hartley, 1966).

Equation A.22. A linear score transformation for a population

X_t = s_t(z_o) + X̄_t = 15(.55) + 100 = 108.25

• X_t = transformed score.
• s_t = standard deviation of the transformed score metric.
• z_o = z-score transformation of the original observed score based on the mean and standard deviation of the original raw-score distribution.
• X̄_t = mean of the transformed score distribution.

Equation A.23. Measure of skewness index

g₁ = Σᵢ zᵢ³ / N

• g₁ = measure of skewness described as the mean of cubed z-scores for a set of scores.
• Σᵢ zᵢ³ = the sum of the original scores transformed to z-scores and cubed.
• N = sample size.

Equation A.24. Measure of kurtosis index

g₂ = (Σᵢ zᵢ⁴ / N) − 3

• g₂ = measure of kurtosis described as the mean of z-scores raised to the fourth power, minus 3.
• Σᵢ zᵢ⁴ = the sum of the original scores transformed to z-scores and raised to the fourth power.
• N = sample size.
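A direct implementation of Equations A.23 and A.24 is sketched below for a small set of hypothetical scores. Note that the SKEWNESS and KURTOSIS statistics reported by PROC MEANS in the program that follows are computed with sample-adjusted formulas, so their values will generally differ somewhat from these simple moment-based indexes.

DATA _NULL_;
   ARRAY y{6} _TEMPORARY_ (10 12 13 15 18 25);   /* hypothetical scores */
   n = 6;  mean = 0;  ss = 0;
   DO i = 1 TO n;  mean = mean + y{i} / n;  END;
   DO i = 1 TO n;  ss   = ss + (y{i} - mean)**2;  END;
   sd = SQRT(ss / n);        /* population form of the standard deviation */
   g1 = 0;  g2 = 0;
   DO i = 1 TO n;
      z  = (y{i} - mean) / sd;
      g1 = g1 + z**3 / n;    /* Equation A.23: mean of cubed z-scores */
      g2 = g2 + z**4 / n;    /* Equation A.24, before subtracting 3   */
   END;
   g2 = g2 - 3;
   PUT g1= g2=;
RUN;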

The program below provides the SAS source code for computing assorted descrip-
tive statistics for fluid intelligence, crystallized intelligence, and short-term memory total
scores. The program also produces two output datasets that include the summary statis-
tics that can be used in additional calculation if desired.

SAS program for computing assorted descriptive statistics

LIBNAME X 'K:\Guilford_Data_2011';
DATA TEMP; set X.GfGc;
RUN;

PROC MEANS maxdec=3 NMISS RANGE USS CSS T SKEWNESS KURTOSIS;


VAR cri_tot fi_tot stm_tot;
OUTPUT OUT=X.descriptive_out1;
OUTPUT OUT=X.descriptive_out2 mean=mcri_tot mfi_tot mstm_tot
n=ncri_tot nfi_tot nstm_tot skewness kurtosis ;

TITLE1 'ASSORTED DESCRIPTIVE STATS FOR GfGc DATA';


TITLE2 'USS IS RAW SUM OF SQUARES/CSS IS SUM OF SQUARES ADJ FOR
THE MEAN';
run;
PROC PRINT DATA=X.descriptive_out1;
PROC PRINT DATA=X.descriptive_out2;
RUN;
QUIT;

A.16 Multiple Independent Random Variables

The previous section provides an explanation for the process whereby functions are used
to derive the density or distribution of a single random variable. These elements can be

extended to the case of multiple independent variables, each with its own respective den-
sity functions. To illustrate, the joint density function for several independent variables
is provided in Equation A.25.
In Equation A.25, F(x1) represents the density function of a single variable. Deriving
the variance of the sum of independent random variables is required when, for example,
the reliability of a sum of variables (i.e., a composite) is the goal. Equation A.26a pro-
vides the components for calculating the variance of a linear composite (e.g., the sum of
several variables or subtests) in order to derive an estimate of the variance of a composite.
Equations A.26b and A.26c provide an example using data from variables fluid intelli-
gence tests 1 and 2 from Figure A.1.

Equation A.25. A joint density function for several independent variables

F(x₁, ..., x_p) = ∫_{−∞}^{x_p} ··· ∫_{−∞}^{x₁} f(u₁, ..., u_p) du₁ ··· du_p

• F(x₁, ..., x_p) = function of independent variables meeting the independence assumption that F(x₁, ..., x_p) = F₁(x₁)···F_p(x_p).
• f(u₁, ..., u_p) = joint density function over the range of the integral.

Equation A.26a. Variance based on the sum of several independent variables

σ²_Y = Σσᵢ² + 2Σρᵢⱼσᵢσⱼ,  i ≠ j

• σ²_Y = variance of the composite based on variables i through j.
• Σσᵢ² = sum of the variances for item, test, or variable i.
• ρᵢⱼσᵢσⱼ = covariance of items, tests, or variables i and j.
• 2Σρᵢⱼσᵢσⱼ = two times the sum of the covariances of variables i and j.

Equation A.26b. Composite variance based on fluid intelligence items 1–5 (Test 1) and fluid intelligence items 1–5 (Test 2)

σ²_Y = 4.66 + 2(5.86) = 16.38

• .88 = variance of test 1.
• 3.78 = variance of test 2.
• 5.86 = covariance between tests 1 and 2.

Equation A.26c. Composite variance based on fluid intelligence tests 1 and 2

σ²_Y = 44.29 + 2(26.4) = 97.09

• 27.96 = variance of test 1.
• 16.33 = variance of test 2.
• 13.20 = covariance between tests 1 and 2.
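Equation A.26a can be checked directly with the quantities reported in Equation A.26b, that is, the sum of the ten item variances (the diagonal of Table A.4) and the sum of the item covariances (the off-diagonal entries of Table A.4, each counted once). A minimal SAS sketch follows.

DATA _NULL_;
   sum_var = 4.66;                        /* sum of the item variances     */
   sum_cov = 5.86;                        /* sum of the item covariances   */
   var_composite = sum_var + 2*sum_cov;   /* Equation A.26a                */
   PUT var_composite=;                    /* = 16.38, as in Equation A.26b */
RUN;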

A.17 Correlation and Covariance

Central to psychometric methods is the idea of mathematically expressing the relation-


ship between two or more variables. In fact, most methods in statistics and psychometrics
originate from the mathematical relationship between two or more variables. The coef-
ficients of correlation and covariance provide researchers a flexible and powerful way to
examine and test bivariate and multivariate relationships. A comprehensive understand-
ing and appreciation of correlation and covariance is so basic to psychometrics and sta-
tistics that any textbook would be highly inadequate without it.
The most widely used correlation coefficient is the Pearson product–moment coef-
ficient of correlation (Pearson, 1902; Hald, 1998). The Pearson r is the coefficient of
choice when the relationship between X and Y is linear and both variables are measured
on an interval or ratio scale. This coefficient is foundational to many advanced ana-
lytic methods such as multiple correlation, multiple linear regression, partial correlation,
principal components, and factor analysis. The correlation coefficient r is an index that
expresses the magnitude and direction of association between two variables (e.g., vari-
ables as measurements of attributes or scores). In the bivariate case (i.e., only one X and
one Y), r represents the amount of concomitant variation between X and Y. The Pearson
r derived using deviation scores is given in Equation A.27a and the corresponding

Equation A.27a. Pearson correlation coefficient

r = Σxy / √[(Σx²)(Σy²)]

• Σxy = sum of the products of the paired x and y deviation scores.
• Σx² = sum of the squared x deviation scores.
• Σy² = sum of the squared y deviation scores.

correlation matrix for the first five items on fluid intelligence tests 1 and 2 from Figure A.3 is provided in Table A.3.
The covariance between any pair of items is given in Equation A.27b and is expressed
as the correlation between two items times their respective standard deviations. The
matrix presented in Table A.4 is a variance–covariance matrix because the item variances
are included along the diagonal of the matrix.
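Equation A.27b can be illustrated with the first two items of fluid intelligence test 2: their correlation is .24 (Table A.3) and their variances are .10 and .16 (Table A.4), so the implied covariance is .24 × √.10 × √.16 ≈ .03, matching the corresponding entry of Table A.4. A minimal SAS sketch of this computation follows.

DATA _NULL_;
   r12 = 0.24;              /* correlation between FI test 2, items 1 and 2 (Table A.3) */
   s1  = SQRT(0.10);        /* standard deviation of item 1 (variance from Table A.4)   */
   s2  = SQRT(0.16);        /* standard deviation of item 2                             */
   cov12 = r12 * s1 * s2;   /* Equation A.27b: cov = r times the two standard deviations */
   PUT cov12=;              /* approximately .03 */
RUN;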

Computer Program and Example Data


The SPSS syntax and SAS source code that produce output datasets as matrices in Tables
A.3 and A.4 using the data files GfGc.sav/GfGc.sas are provided below. The dataset may be
downloaded from the companion website (www.guilford.com/price2-materials).

SPSS program syntax for Tables A.3 and A.4

CORRELATIONS VARIABLES= fi1_01 fi1_02 fi1_03 fi1_04 fi1_05 fi2_01


fi2_02 fi2_03 fi2_04 fi2_05
/MATRIX=OUT(*).
MCONVERT /MATRIX=IN(*) OUT("K:\eq_A.23_4_covb.sav").

Equation A.27b. Covariance of items (i and j)

covij = rijsisj

• covij = covariance based on i and j scores.


• rij = correlation between each item.
• si = standard deviation of item i.
• sj = standard deviation of item j.

Table A.3.  Pearson Correlation Matrix for Items 1–5 on Fluid Intelligence Tests 1
and 2
FI FI FI FI FI FI FI FI FI FI
test 2, test 2, test 2, test 2, test 2, test 1, test 1, test 1, test 1, test 1,
  item 1 item 2 item 3 item 4 item 5 item 1 item 2 item 3 item 4 item 5
FI test 2, item 1 1 0.24 0.22 0.25 0.29 0.22 0.19 0.26 0.20 0.19
FI test 2, item 2 — 1 0.36 0.43 0.39 0.28 0.27 0.30 0.24 0.28
FI test 2, item 3 — — 1 0.37 0.36 0.25 0.27 0.28 0.24 0.26
FI test 2, item 4 — — — 1 0.47 0.31 0.29 0.39 0.33 0.32
FI test 2, item 5 — — — — 1 0.29 0.31 0.32 0.26 0.34
FI test 1, item 1 — — — — — 1 0.35 0.45 0.32 0.40
FI test 1, item 2 — — — — — — 1 0.42 0.27 0.33
FI test 1, item 3 — — — — — — — 1 0.30 0.37
FI test 1, item 4 — — — — — — — — 1 0.38
FI test 1, item 5 — — — — — — — — — 1
Note. Standard deviation values are equal to 1 in a correlation matrix and are provided along the diagonal of the
matrix.

Table A.4.  Covariance Matrix for Items 1–5 on Fluid Intelligence Tests 1 and 2
FI FI FI FI FI FI FI FI FI FI
test 2 test 2 test 2 test 2 test 2 test 1 test 1 test 1 test 1 test 1
  item 1 item 2 item 3 item 4 item 5 item 1 item 2 item 3 item 4 item 5
FI test 2 item 1 0.10 0.03 0.03 0.04 0.04 0.06 0.05 0.07 0.06 0.05
FI test 2 item 2 — 0.16 0.06 0.08 0.07 0.09 0.10 0.11 0.08 0.10
FI test 2 item 3 — — 0.16 0.07 0.07 0.08 0.10 0.10 0.08 0.09
FI test 2 item 4 — — — 0.23 0.11 0.12 0.12 0.16 0.14 0.14
FI test 2 item 5 — — — — 0.23 0.12 0.14 0.13 0.11 0.15
FI test 1 item 1 — — — — — 0.69 0.26 0.33 0.23 0.29
FI test 1 item 2 — — — — — — 0.80 0.33 0.20 0.26
FI test 1 item 3 — — — — — — — 0.77 0.23 0.29
FI test 1 item 4 — — — — — — — — 0.73 0.29
FI test 1 item 5 — — — — — — — — — 0.78
Note. Bold numbers are variances of an item and are provided along the diagonal of the matrix.

SAS program source code for Tables A.3 and A.4

LIBNAME X 'K:\Guilford_Data_2011';
DATA temp; set X.GfGc;
RUN;

PROC CORR NOMISS COV ALPHA OUTP=X.corr_cov_out;

VAR fi1_01 fi1_02 fi1_03 fi1_04 fi1_05 fi2_01 fi2_02 fi2_03 fi2_04
fi2_05;
TITLE 'COVARIANCES AND CORRELATIONS';
RUN;

PROC PRINT DATA=X.corr_cov_out;


RUN;
QUIT;

As introduced earlier, the term moment describes deviations about the mean of a
distribution of scores. Similarly, a deviate is a single deviation about the mean, and such
deviates are defined as the first moments about the mean of a distribution. The second
moments of a distribution are the moments squared, whereas the third moments are the
moments cubed. Because standard scores (such as z-scores) are deviates with a mean of
zero, standard scores are actually first moments about a distribution, and therefore the
multiplication of two variables, say X and Y, results in the calculation of the product-
moment correlation coefficient.

Covariance
The covariance is defined as the average cross product of two sets of deviation scores and
therefore can also be thought of as an unstandardized correlation. The equation for the
covariance using raw scores is provided in Equation A.28.
An important link between the correlation coefficient r and the covariance is illus-
trated in Equation A.29.

A.18 Assumptions Related to r

The Pearson r is not well suited for describing a nonlinear relationship (i.e., a joint dis-
tributional shape that does not follow a straight line of best fit) between two variables.
Using r in these situations can produce misleading estimates and tests of significance.
Figure A.8 illustrates this nonlinearity of regression using the fluid intelligence test total
score data. Note in the figure how across-the-age-span scores on fluid intelligence are

Equation A.28. Covariance

s_XY = Σxy/N = Σ(X − X̄)(Y − Ȳ)/N

• x = deviation score on measure X.
• y = deviation score on measure Y.
• X, Y = raw scores on any two measures.
• X̄ = mean on measure X.
• Ȳ = mean on measure Y.
• s_XY = covariance.

Equation A.29. Relationship between the correlation and covariance

r_XY = s_XY / (s_X s_Y)

• s_X = square root of the variance (standard deviation) for score X.
• s_Y = square root of the variance (standard deviation) for score Y.
• s_XY = covariance.

slightly curvilinear, and as a person’s age increases their score plateaus. In Figure A.9, a
polynomial regression line (r-square = .46) describes or fits the data better than a straight
line (r-square = .42).
The SPSS syntax for producing the graph in Figure A.8 is provided below using the
dataset GfGc.SAV.

CURVEFIT
/VARIABLES=fi_tot WITH AGEBAND
/CONSTANT
/MODEL=LINEAR CUBIC
/PLOT FIT.

[Figure A.8 plots fluid intelligence total scores (Y-axis, 0 to 60) against age in years (X-axis, 0 to 100), showing the observed values and a fitted cubic curve.]

Figure A.8.  Nonlinear regression of fluid intelligence total score (Y) on age (X).

[Figure A.9 plots fluid intelligence total scores (Y-axis, 0 to 60) against age in 10-year bands (X-axis, 1 to 8), showing the observed values and fitted linear and cubic curves.]

Figure A.9.  Comparison of linear versus nonlinear trend of fluid intelligence total score (Y) on age (X).

In IRT, nonlinear regression is a central component of the model. For example, in


IRT, a set of scores follows a monotonically increasing curve (e.g., in Figure A.7, the logis-
tic curve). A monotonic curve or function moves in only one direction as a score value
increases (or decreases) along the x-axis. Practically speaking, this means that the rank
order of subjects’ placement based on how they score is unaffected by the shape of the
regression line—even if their scores are transformed from one scale metric onto another.
In this case, the magnitude of r is only slightly influenced.

A.19 Homoscedastic Errors of Estimation

Another way to evaluate the relationship between two variables is to examine the pat-
tern of the errors of estimation. Errors of estimation between X and Y should be approxi-
mately equal across the range of X and Y. Using the intelligence test example, we find that
unevenly distributed errors may arise when the estimation (or prediction) error between
ability scores (X) and actual scores (Y) is not constant across the continuum of X and
Y. Ultimately, heteroscedastic (i.e., unequal variability across the score range) errors of estimation
are often due to differences among subjects on the underlying latent trait or construct
representing X or Y. Such differences among subjects (and therefore measurements on

variables) are manifested through the actual score distributions, which in turn affect
the accuracy of the correlation coefficient. Again using our fluid intelligence test total
score data, Figure A.10 illustrates that the errors of regression are constant and normally
distributed. For example, notice that points in the graph (i.e., errors) are consistently
dispersed throughout the range of age and score for the subjects.

SPSS REGRESSION syntax for producing the plot in Figure A.10

REGRESSION
/MISSING LISTWISE
/STATISTICS COEFF OUTS R ANOVA
/CRITERIA=PIN(.05) POUT(.10)
/NOORIGIN
/DEPENDENT fi_tot
/METHOD=ENTER AGEYRS
/SCATTERPLOT=(*ZRESID ,*ZPRED)
/RESIDUALS NORMPROB(ZRESID).

A.20 Normality of Errors

The normality of the distribution of a set of scores is an assumption central to tests of sta-
tistical significance and confidence interval estimation. Both X and Y variables should be
evaluated for normality (i.e., excessive univariate skewness and kurtosis) using standard
data screening methods. Recommended cutoff values for excessive univariate skewness
and kurtosis are provided in Tabachnick and Fidell (2007, pp. 79–81). These authors

[Figure A.10 is a scatterplot of regression standardized residuals (Y-axis) against regression standardized predicted values (X-axis).]

Figure A.10.  Standardized residuals plotted against standardized predicted values for the regression of fluid intelligence total score on age.

recommend using conventional and conservative alpha levels of .01 and .001 to evaluate
skewness and kurtosis with small to moderate samples. When the sample size is large (i.e.,
> 100), the shape of the distribution should be examined graphically since with a large
sample size the null hypothesis of normality will usually be rejected. Should the assump-
tion of normality be untenable, options available to researchers include transforming the
variable(s) or applying nonparametric or nonlinear analytic techniques. The primary con-
cern in conducting score transformations, however, is the issue of interpreting the results
of an analysis after the analysis is complete. Transformations may lead to difficult inter-
pretation and often do not lead to any improvement in meeting the assumption of normal-
ity. Another option is to consider using a nonparametric (i.e., assumption-free) analytic
method for the analysis. Choosing the best analytic model and technique given the data is
perhaps the wisest choice, particularly with the statistical software now available.

A.21 Other Measures of Correlation and Association

When two variables do not meet the linearity assumption and equal-interval level of
measurement requirement, the Pearson r is mathematically expressed by three special
formulas: Spearman’s rank order correlation rS, the point–biserial correlation rpbis, and the
phi coefficient rf.

Spearman’s Rank Order Correlation rS


When variables do not meet the assumptions of linearity and an equal interval level of
measurement, other indexes of correlation are available for use. In the case of data scaled
on an ordinal level with very uneven intervals and small sample size, the Spearman’s rank
order correlation coefficient is appropriate and is illustrated in Equation A.30.

Equation A.30. Spearman correlation coefficient

r_S = Σ(Rᵢ − R̄)(Sᵢ − S̄) / √[Σ(Rᵢ − R̄)² Σ(Sᵢ − S̄)²]

• Rᵢ = rank of the ith x value.
• Sᵢ = rank of the ith y value.
• R̄ = mean rank of the R values.
• S̄ = mean rank of the S values.
Note. Since the Spearman r is based on correlating ranks of scores, averaged ranks are used in the case of ties.

Table A.5.  SPSS Output for Spearman’s Correlation


STM_TOT_CAT AGE IN YEARS
Spearman’s rho STM_TOT_CAT Correlation Coefficient 1.000 -.189**
(low, med, high) Sig. (2-tailed) . .000
N 1000 1000
AGE IN YEARS Correlation Coefficient -.189** 1.000
Sig. (2-tailed) .000 .
N 1000 1000
**. Correlation is significant at the 0.01 level (2-tailed).

To provide an example using SPSS, the syntax below is used to derive the Spearman
correlation coefficient using the short-term memory total score categorized into low,
medium, and high categories, with a person’s age in years. Here age is treated as an ordinal
rather than interval measure to illustrate that as age increases short-term memory decreases.
Table A.5 provides the Spearman Correlations for relationship between memory and age.

SPSS syntax for Spearman correlation coefficient using data file GfGc.SAV

NONPAR CORR
/VARIABLES=STM_TOT_CAT AGEYRS
/PRINT=SPEARMAN TWOTAIL NOSIG
/MISSING=PAIRWISE.

SAS program for Spearman correlation coefficient using data file GfGc.SD7

LIBNAME X 'K:\Guilford_Data_2011';
DATA TEMP; set X.GfGc;
RUN;

PROC CORR NOMISS SPEARMAN OUTS=X.spear_corr_out;


VAR stm_tot_cat ageyrs;
TITLE 'SPEARMAN CORRELATION';
RUN;

PROC PRINT DATA=X.spear_corr_out;


RUN;
QUIT;

Point–Biserial Correlation rpbis

The point–biserial correlation is used to assess the correlation between a dichotomous vari-
able (e.g., a test item with a 1 = correct/0 = incorrect outcome) and a continuous variable
(e.g., the total score on a test or another criterion score). The point–biserial coefficient
does not require the distribution underlying either level of the dichotomous variable

Equation A.31. Point–biserial correlation coefficient

r_{pbis} = \frac{\bar{X}_S - \bar{X}_m}{s_Y}\sqrt{pq}

• \bar{X}_S = mean score on the continuous variable for the group that is
  successful on the dichotomous variable.
• \bar{X}_m = mean score on the continuous variable for the group that is
  unsuccessful on the dichotomous variable.
• s_Y = overall standard deviation of the scores on the continuous variable.
• p = proportion of individuals in the successful group.
• q = proportion of individuals in the unsuccessful group, 1 – p.

or test item to be normal. Therefore, it is more useful than the biserial coefficient (pre-
sented next) where a coefficient assumes a normal distribution underlying both levels of
the dichotomous variable. In test development and revision, the point–biserial is useful
for examining the contribution of a test item to the total test score. Recommendations for
using the point–biserial correlation in item evaluation are provided in Allen and Yen (1979,
pp. 118–127). The formula for the point–biserial correlation is illustrated in Equation A.31.
The corresponding standard error of rpbis is given in Equation A.32.
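
A minimal Python sketch of the point–biserial correlation is given below; it is illustrative only,
and the simulated item and total score are assumptions rather than the book's data. Because the
point–biserial is simply the Pearson r between a 0/1 variable and a continuous variable, the sketch
computes it that way and then verifies the mean-difference form of Equation A.31.

Illustrative Python sketch for the point–biserial correlation

import numpy as np
from scipy import stats

# Hypothetical data: a dichotomously scored item and a continuous total score
rng = np.random.default_rng(2)
total = rng.normal(50, 10, size=500)
item = (total + rng.normal(0, 10, size=500) > 50).astype(int)   # 1 = correct, 0 = incorrect

# The point-biserial is the Pearson r between the 0/1 item and the total score
r_pb, p_value = stats.pointbiserialr(item, total)

# Mean-difference form of Equation A.31 (exact when s_Y is the population standard deviation)
p = item.mean()                      # proportion "successful"
q = 1 - p
s_y = total.std(ddof=0)              # overall standard deviation
r_pb_means = (total[item == 1].mean() - total[item == 0].mean()) / s_y * np.sqrt(p * q)
print(f"r_pbis = {r_pb:.3f}  (mean-difference form = {r_pb_means:.3f}), p = {p_value:.4f}")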

Biserial Correlation rbis


The biserial correlation coefficient is used when both variables are theoretically con-
tinuous and normally distributed but one has been artificially reduced to two discrete

Equation A.32. Standard error of the point–biserial correlation coefficient

s_{r_{pbis}} = \frac{\dfrac{\sqrt{pq}}{y} - r_{pbis}^2}{\sqrt{N}}

• y = ordinate of the standard normal curve corresponding to the point of
  division (i.e., cutoff) between segments containing p and q proportions.

Equation A.33. Biserial correlation coefficient

r_{bis} = \frac{\bar{x}_s - \bar{x}_m}{s_Y} \cdot \frac{pq}{z}

• \bar{x}_s = mean score on the continuous variable for the group that is
  successful on the dichotomous variable.
• \bar{x}_m = mean score on the continuous variable for the group that is
  unsuccessful on the dichotomous variable.
• s_Y = overall standard deviation of the scores on the continuous variable.
• pq = proportion of individuals in the successful group times the proportion
  of individuals in the unsuccessful group.
• z = ordinate of the standard normal curve corresponding to p.

Equation A.34. Standard error of the biserial correlation coefficient

s_{r_{bis}} = \frac{1}{\sqrt{N}} \cdot \frac{\sqrt{pq}}{y}

• y = ordinate of the standard normal curve corresponding to p.

categories. For example, the situation may occur where a cutoff score or criterion is used
to separate or classify groups of people on certain attributes. Mathematical corrections are
made for the dichotomization of the one variable, thereby estimating the Pearson correlation
that would have been obtained had the variable not been dichotomized. Equation A.33 provides
the formula for the biserial correlation.
The corresponding standard error of rbis is given in Equation A.34.
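
The biserial coefficient in Equation A.33 is easily computed once the group means, the
proportions p and q, and the normal ordinate at the cut point are in hand. The Python sketch
below is an illustrative example with simulated data (an assumption, not the book's dataset).

Illustrative Python sketch for the biserial correlation and its standard error

import numpy as np
from scipy import stats

# Hypothetical data: a continuous criterion and a latent variable dichotomized at a cutoff
rng = np.random.default_rng(3)
y = rng.normal(100, 15, size=800)                    # continuous criterion
latent = 0.6 * (y - 100) / 15 + rng.normal(0, 0.8, size=800)
x = (latent > 0.25).astype(int)                      # artificial dichotomization at a cutoff

p = x.mean()                                         # proportion in the "successful" group
q = 1 - p
ordinate = stats.norm.pdf(stats.norm.ppf(q))         # height of N(0,1) at the p/q cut point
s_y = y.std(ddof=0)

# Equation A.33: biserial correlation from the two group means
r_bis = (y[x == 1].mean() - y[x == 0].mean()) / s_y * (p * q / ordinate)

# Equation A.34: approximate standard error of r_bis
se_rbis = np.sqrt(p * q) / (ordinate * np.sqrt(len(y)))
print(f"r_bis = {r_bis:.3f}, SE = {se_rbis:.3f}")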
The BILOG syntax below provides the output presented in Table A.6 (introduced in
Chapter 6). The results in Table A.6 are from phase I output of the program (Du Toit, 2003).

POINT BISERIAL AND BISERIAL.BLM - CRYSTALLIZED INTELLIGENCE


TEST 2 ITEMS 1-25
>COMMENTS
>GLOBAL NPARM=2, LOGISTIC, DFNAME='C:\rpbispoly.DAT';
>LENGTH NITEMS=25;

Table A.6.  BILOG-MG Point–Biserial and Biserial Coefficients for the 25-Item
Crystallized Intelligence Test 2
Name         N    # Right     PCT    LOGIT    Pearson r (pt.–biserial)    Biserial r
ITEM0001 1000 0.00 0.00 99.99 0.00 0.00
ITEM0002 1000 995.00 99.50 -5.29 0.02 0.11
ITEM0003 1000 988.00 98.80 -4.41 0.09 0.30
ITEM0004 1000 872.00 87.20 -1.92 0.31 0.49
ITEM0005 1000 812.00 81.20 -1.46 0.37 0.54
ITEM0006 1000 726.00 72.60 -0.97 0.54 0.72
ITEM0007 1000 720.00 72.00 -0.94 0.57 0.76
ITEM0008 1000 826.00 82.60 -1.56 0.31 0.45
ITEM0009 1000 668.00 66.80 -0.70 0.48 0.62
ITEM0010 1000 611.00 61.10 -0.45 0.52 0.67
ITEM0011 1000 581.00 58.10 -0.33 0.51 0.64
ITEM0012 1000 524.00 52.40 -0.10 0.55 0.69
ITEM0013 1000 522.00 52.20 -0.09 0.67 0.85
ITEM0014 1000 516.00 51.60 -0.06 0.62 0.77
ITEM0015 1000 524.00 52.40 -0.10 0.53 0.67
ITEM0016 1000 482.00 48.20 0.07 0.56 0.71
ITEM0017 1000 444.00 44.40 0.22 0.60 0.76
ITEM0018 1000 327.00 32.70 0.72 0.57 0.74
ITEM0019 1000 261.00 26.10 1.04 0.49 0.66
ITEM0020 1000 241.00 24.10 1.15 0.46 0.64
ITEM0021 1000 212.00 21.20 1.31 0.53 0.75
ITEM0022 1000 193.00 19.30 1.43 0.47 0.68
ITEM0023 1000 164.00 16.40 1.63 0.46 0.69
ITEM0024 1000 122.00 12.20 1.97 0.37 0.59
ITEM0025 1000 65.00 6.50 2.67 0.34 0.65
Note. This table is a portion of BILOG-MG phase I output.

>INPUT NTOTAL=25, NGROUPS=1, NIDCHAR=9;


>ITEMS INUMBERS=(1(1)25);
>TEST TNAME=CRIT2;
(9A1,25A1)
>CALIB NQPT=10, CYCLES=15, CRIT=0.005, NEWTON=2, PLOT=1;

Phi Coefficient Φ
The phi coefficient is appropriate for use when two variables are qualitative (i.e., cat-
egorical) and/or dichotomous (as in test items scored 1 = correct/0 = incorrect). As
an example of how the phi coefficient may be useful, consider the situation where a
researcher is interested in whether there is statistical dependency between the variables
sex and short-term memory (categorized as low, medium, and high). To examine this
relationship, the cell frequency counts within categories are required.

Table A.7.  Frequency Counts of Sex by Short-Term Memory Category

               Low         Medium         Total
Male            99            368            467
               (a)            (b)        (a + b)
Female         168            365            533
               (c)            (d)        (c + d)
Total          267            733      N = 1,000
           (a + c)        (b + d)
Note. The coefficient may be calculated using
r_\Phi = \frac{bc - ad}{\sqrt{(a + c)(b + d)(a + b)(c + d)}}, yielding
\Phi = \frac{(368)(168) - (99)(365)}{\sqrt{(267)(733)(467)(533)}} = .116.
Effect size interpretations: 0.1 = small, 0.3 = medium, 0.5 = large (Cohen, 1988).

Table A.7 illustrates how the phi coefficient is used to examine the asso-
ciation between the variables sex and short-term memory using actual cell frequency
counts within categories from the dataset PMPT.SAV. The phi coefficient is given in
Equation A.35.

SPSS syntax and partial output for phi coefficient using data file GfGc.SAV

CROSSTABS
/TABLES=SEX BY STM_LOW_HIGH_CAT
/FORMAT= AVALUE TABLES
/STATISTIC=CHISQ CC PHI UC CORR
/CELLS= COUNT EXPECTED ROW COLUMN SRESID
/COUNT ROUND CELL.

Symmetric Measures(c)
                                          Value    Approx. Sig.
Nominal by Nominal   Phi                  -.116            .000
                     Cramer's V            .116            .000
                     Contingency
                     Coefficient           .116            .000
N of Valid Cases                            1000
a Not assuming the null hypothesis.
b Using the asymptotic standard error assuming the null hypothesis.
c Correlation statistics are available for numeric data only.

SAS program and partial output for phi coefficient and related coefficients using data
file GfGc.SAV

LIBNAME X 'K:\Guilford_Data_2011';
DATA TEMP; set X.GfGc;
RUN;

Equation A.35. Phi correlation coefficient

r_\Phi = \frac{p_{xy} - p_x p_y}{\sqrt{p_x q_x p_y q_y}}

• p_x = number of "yes" counts in the x variable expressed as a proportion
  of the total.
• p_y = number of "yes" counts in the y variable expressed as a proportion
  of the total.
• p_xy = number of cases counted as "yes" on both the x and y variables
  expressed as a proportion of the total.
• q_x = number of "no" counts in the x variable expressed as a proportion
  of the total.
• q_y = number of "no" counts in the y variable expressed as a proportion
  of the total.

PROC FREQ;
TABLES stm_low_high_cat*sex
/CHISQ ALL OUT=X.nparm_corr_output;
run;

PROC PRINT DATA=X.nparm_corr_output;RUN;


QUIT;

The FREQ Procedure

Table of STM_LOW_HIGH_CAT by SEX

STM_LOW_HIGH_CAT SEX(GENDER)

Frequency |
Percent   |
Row Pct   |
Col Pct   |       1 |       2 |  Total
----------+---------+---------+
        1 |      99 |     168 |    267
          |    9.90 |   16.80 |  26.70
          |   37.08 |   62.92 |
          |   21.20 |   31.52 |
----------+---------+---------+
        2 |     368 |     365 |    733
          |   36.80 |   36.50 |  73.30
          |   50.20 |   49.80 |
          |   78.80 |   68.48 |
----------+---------+---------+
Total           467       533      1000
              46.70     53.30    100.00

Statistics for Table of STM_LOW_HIGH_CAT by SEX

Statistic                        DF      Value      Prob
---------------------------------------------------------
Chi-Square                        1    13.5467    0.0002
Likelihood Ratio Chi-Square       1    13.6885    0.0002
Continuity Adj. Chi-Square        1    13.0245    0.0003
Mantel-Haenszel Chi-Square        1    13.5332    0.0002
Phi Coefficient                        -0.1164
Contingency Coefficient                 0.1156
Cramer's V                             -0.1164

Finally, when the goal is to statistically test the association based on a cross-tabulation
analysis of two variables, a 2 × 2 contingency table can be created. Equation A.36 provides
a way to conduct a statistical test of association using the important functional connection
between Φ and χ².
By using Equation A.36, a researcher can test the phi coefficient against the null
hypothesis of no association using the chi-square distribution. The degrees of freedom
for the chi-square test is df = (r – 1)(k – 1), where r is the number of rows and k is the
number of columns. Finally, when cell size is less than 10, Yates’s correction for conti-
nuity should be applied. Applying Yates’s correction is recommended in the case of small
cell size because the chi-square statistic is based on frequencies of whole numbers and is
represented in discrete increments, whereas the chi-square table is based on a continu-
ous distribution. Yates’s correction is applied by adding a value of .5 to each obtained fre­
quency that is greater than the expected frequency and increasing by .5 the frequencies

Equation A.36. Connection between Φ and χ²

χ² = NΦ²

• N = sample size.
• Φ² = square of the phi coefficient from Equation A.35.

that are less than expected. The net effect is a reduction of each difference between the
obtained and expected frequencies by .5.
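
A short Python sketch tying these pieces together is given below; it is illustrative only and
simply reuses the cell counts from Table A.7. It computes the phi coefficient from the cell
frequencies and then applies Equation A.36 to obtain the chi-square test of association with df = 1.

Illustrative Python sketch for the phi coefficient and the chi-square test of association

import numpy as np
from scipy import stats

# 2 x 2 table of sex by short-term memory category (counts from Table A.7)
#             low   medium
table = np.array([[ 99, 368],    # male
                  [168, 365]])   # female
a, b = table[0]
c, d = table[1]
n = table.sum()

# Phi from the cell counts (Table A.7 note); the sign depends on how rows and columns are coded
phi = (b * c - a * d) / np.sqrt((a + c) * (b + d) * (a + b) * (c + d))

# Equation A.36: chi-square test of the association, df = (r - 1)(k - 1) = 1
chi_square = n * phi ** 2
p_value = stats.chi2.sf(chi_square, df=1)
print(f"phi = {phi:.3f}, chi-square = {chi_square:.2f}, p = {p_value:.4f}")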

A.22 Coefficient of Contingency C

In the case of larger contingency tables, as presented in Table A.8, Cramér's contingency
coefficient (Conover, 1999) is used as a measure of association. The contingency coefficient C
and the closely related Cramér's V are the statistics of choice when the two variables consist of
three or more categories and have no particular underlying distributional continuum.
The SPSS syntax below produces estimates of Cramer’s V and the contingency coef-
ficient from Table A.8. A partial listing of the output follows the syntax.

CROSSTABS
/TABLES=SEX BY STM_TOT_CAT
/FORMAT= AVALUE TABLES
/STATISTIC=CHISQ CC PHI CORR
/CELLS= COUNT EXPECTED ROW COLUMN SRESID
/COUNT ROUND CELL.

Symmetric Measures(c)
                                          Value    Approx. Sig.
Nominal by Nominal   Phi                   .106            .003
                     Cramer's V            .106            .003
                     Contingency
                     Coefficient           .106            .003
N of Valid Cases                            1000
a Not assuming the null hypothesis.
b Using the asymptotic standard error assuming the null hypothesis.
c Correlation statistics are available for numeric data only.

Table A.8.  Frequency Counts of Sex by Short-Term Memory Category

               Low        Medium          High          Both
Male            38           388            41           467
               (a)           (b)           (c)   (a + b + c)
Female          68           440            25           533
               (d)           (e)           (f)   (d + e + f)
Both           106           828            66     N = 1,000
           (a + d)       (b + e)       (c + f)
Note. The coefficient may be calculated using Cramér's C = \sqrt{\frac{T}{N(q - 1)}}
(Conover, 1999), where T = \sum_{i=1}^{R}\sum_{j=1}^{C}\frac{(O_{ij} - E_{ij})^2}{E_{ij}},
O_{ij} = an observed cell count, E_{ij} = an expected cell count defined as \frac{F_R F_C}{N}
(the product of the corresponding row and column totals divided by N), and q is the smaller
of the number of rows or columns used for the degrees of freedom. Effect size interpretations:
0.1 = small, 0.3 = medium, 0.5 = large (Cohen, 1988).
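
The following Python sketch is illustrative only; it reuses the counts from Table A.8 and
computes T, Cramér's V, and the contingency coefficient C for the 2 × 3 table.

Illustrative Python sketch for Cramér's V and the contingency coefficient

import numpy as np
from scipy import stats

# 2 x 3 table of sex by short-term memory category (counts from Table A.8)
observed = np.array([[38, 388, 41],    # male
                     [68, 440, 25]])   # female
n = observed.sum()
q = min(observed.shape)                # smaller of the number of rows or columns

# T is the ordinary chi-square statistic computed from observed and expected counts
expected = np.outer(observed.sum(axis=1), observed.sum(axis=0)) / n
T = ((observed - expected) ** 2 / expected).sum()

cramers_v = np.sqrt(T / (n * (q - 1)))            # Conover (1999)
contingency_c = np.sqrt(T / (T + n))              # Pearson's contingency coefficient
p_value = stats.chi2.sf(T, df=(observed.shape[0] - 1) * (observed.shape[1] - 1))
print(f"T (chi-square) = {T:.2f}, Cramer's V = {cramers_v:.3f}, "
      f"C = {contingency_c:.3f}, p = {p_value:.4f}")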

A.23 Polyserial and Polychoric rpoly

The polyserial r is a generalization of the biserial r and is used when one variable is
continuous and the other is categorical with more than two categories. The aim
when using rpoly is to estimate what the correlation would be if the two variables were
continuous and normally distributed. For example, a continuous variable such as a stan-
dardized test score might be correlated with a categorical outcome such as socioeconomic
status or an external criterion such as a national ranking having three or more discrete
levels. The point estimate versions of these statistics are special cases of the Pearson r
that attempt to overcome the artificial restriction of range created by categorizing vari-
ables that are assumed to be continuous and normally distributed. Also, two variables
may exist where one is composed of three categories but has been artificially reduced to
two categories and the other exists in three or more categories. This reduction may arise
when a cutoff score or criterion is used to separate or classify groups of people on certain
attributes. Equation A.37 provides the formula for polyserial r (Du Toit, 2003, p. 563).
Table A.9 provides an example of the polyserial correlation coefficient.
The PARSCALE program syntax that provides the contents of Table A.9 is provided
below.

Appendix A PGM.PSL - Crystallized Intelligence Test 3


>COMMENTS
>FILE DFNAME='c:\rpoly.dat';
>INPUT NIDCH=9, NTOTAL=14, NTEST=1, LENGTH=(14), NFMT=1;
(9A1,5X,14A1)
>TEST TNAME=cri3;
>BLOCK1 BNAME=SBLOCK1, NITEMS=14, NCAT=3,
ORIGINAL=(0,1,2), MODIFIED=(1,2,3),CADJUST=0.0;
>CALIB GRADED, LOGISTIC, SCALE=1.7, NQPTS=30,
CYCLES=(25,2,2,2,2),
NEWTON=5, CRIT=0.005, ITEMFIT=10;
>SCORE EAP, NQPTS=30, SMEAN=0.0, SSD=1.0, NAME=EAP, PFQ=5;

Equation A.37. Polyserial correlation coefficient

r_{poly,j} = \frac{r_{P,j}\, s_j}{\sum_{k=0}^{m_j - 1} h(z_{jk})(T_{j,k+1} - T_{jk})}

• T_{jk} = scoring function for item j and category k, s_j is the standard
  deviation of item scores y for item j, r_{P,j} is the Pearson (point)
  correlation between item j and the continuous variable, z_{jk} is the
  z score corresponding to the cumulative proportion, p_{jk}, of the kth
  response category to item j, and h(z_{jk}) is the normal ordinate at z_{jk}.

Table A.9.  PARSCALE Program Phase I Output for 14 Items on the Crystallized
Intelligence Test 3
Item   Response Mean/SD   Total Score Mean/SD   Pearson & Polyserial Correlation   Initial Slope   Initial Location
1 2.91 30.10 0.41 1.05 -3.08
0.314* 5.550* 0.73
2 2.95 30.10 0.35 1.20 -2.82
0.265* 5.550* 0.77
3 2.32 30.10 0.51 0.73 -1.07
0.600* 5.550* 0.59
4 2.80 30.10 0.48 1.14 -1.76
0.534* 5.550* 0.75
5 2.50 30.10 0.52 0.89 -1.02
0.792* 5.550* 0.66
6 1.87 30.10 0.51 0.72 1.28
0.576* 5.550* 0.58
7 2.28 30.10 0.58 0.95 -0.22
0.873* 5.550* 0.69
8 2.02 30.10 0.65 1.04 0.49
0.773* 5.550* 0.72
9 2.21 30.10 0.72 1.71 0.24
0.919* 5.550* 0.86
10 2.07 30.10 0.69 1.29 0.41
0.883* 5.550* 0.79
11 1.60 30.10 0.62 1.05 1.65
0.741* 5.550* 0.73
12 1.66 30.10 0.58 0.95 1.49
0.847* 5.550* 0.69
13 1.59 30.10 0.66 1.25 1.48
0.763* 5.550* 0.78
14 1.34 30.10 0.53 1.02 1.24
  0.666* 5.550* 0.72    

The polychoric correlation is used when both variables are dichotomous or ordinal,
or both, but both are assumed to have a continuous underlying metric (i.e., theoretically
in the population). The polychoric correlation is based on the optimal scoring (or canoni-
cal correlation) of the standard Pearson correlation coefficient (Jörskog & Sörbom, 1999a,
p. 22; Kendall & Stuart, 1961, pp. 568–573). Equation A.38 illustrates the polychoric
correlation coefficient (Du Toit, 2003, pp. 563–564).
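
For readers without access to PARSCALE, the Python sketch below illustrates the two-step logic
behind the polyserial estimate in Equation A.37: compute the Pearson correlation, estimate the
category thresholds from the cumulative proportions, and rescale by the item standard deviation
over the sum of normal ordinates. The simulated data are an assumption, and the sketch is a
simplified approximation rather than the PARSCALE implementation.

Illustrative Python sketch for a two-step polyserial correlation estimate

import numpy as np
from scipy import stats

# Hypothetical data: a continuous total score and an ordinal item with three categories (1, 2, 3)
rng = np.random.default_rng(4)
total = rng.normal(30, 5.5, size=1000)
latent = 0.6 * (total - 30) / 5.5 + rng.normal(0, 0.8, size=1000)
item = np.digitize(latent, bins=[-0.5, 0.7]) + 1           # consecutive integer scoring

r_pearson = np.corrcoef(total, item)[0, 1]                 # point estimate (attenuated)
s_item = item.std(ddof=0)

# Thresholds z_jk from the cumulative category proportions, and the normal ordinates h(z_jk)
cum_props = np.cumsum(np.bincount(item)[1:]) / len(item)   # cumulative proportions per category
thresholds = stats.norm.ppf(cum_props[:-1])                # drop the final proportion of 1.0
ordinates = stats.norm.pdf(thresholds)

# Disattenuate the Pearson r by the factor s_item / sum of ordinates
r_polyserial = r_pearson * s_item / ordinates.sum()
print(f"Pearson r = {r_pearson:.3f}, polyserial estimate = {r_polyserial:.3f}")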

A.24 Tetrachoric Correlation rtet

Often in test development, the underlying construct that a set of items with response
outcomes of correct = 1/incorrect = 0 is designed to measure is assumed to be normally

Equation A.38. Polychoric correlation coefficient with consecutive
integer scoring

r_{polychoric,j} = \frac{r_{P,j}\, s_j}{\sum_{k=0}^{m_j - 1} h(z_{jk})}

• T_{jk} = the scoring function for item j and category k (here the
  consecutive integers, so T_{j,k+1} - T_{jk} = 1), s_j is the standard
  deviation of item scores y for item j, and z_{jk} is the z score
  corresponding to the cumulative proportion, p_{jk}, of the kth response
  category to item j.

distributed in the population of examinees. When this is the case, it is desirable to use a
correlation coefficient that exhibits the property of invariance (remains consistent) for
groups of examinees that have different levels of average ability (Lord & Novick, 1968,
p. 348). The tetrachoric correlation is appropriate in this case and is preferable to using the
phi coefficient. Tetrachoric correlation coefficients exhibit invariance properties that phi
coefficients do not. Specifically, the tetrachoric correlation is designed to remain invariant
for scores obtained from groups of participants of different levels of ability but that oth-
erwise have the same bivariate normal distribution for the two different test items. The
property of equality of bivariate distributional relationships between groups of examinees
is highly desirable. The correct use of the tetrachoric correlation assumes that the latent
distribution underlying each of the pair of variables in the analysis is continuous (Divgi,
1979). The tetrachoric correlation is used frequently in item-level factor analysis and
IRT to ensure the appropriate error structure of the underlying distribution is estimated.
Failure to correctly estimate the error structure has been shown to produce incorrect
standard errors and therefore incorrect test statistics (Muthen & Hofacker, 1988).
The equation for computing tetrachoric correlation is lengthy because of the inclu-
sion of various powers of r (Kendall & Stuart, 1961). Fortunately, several statistical com-
puting programs can perform the calculations, such as TESTFACT (Scientific Software
International, 2003a; specifically designed for conducting binary item factor analysis),
BILOG (Scientific Software International, 2003b), and Mplus (Muthen & Muthen,
2010), to name a few. For users unfamiliar with TESTFACT, BILOG, and Mplus, an
SPSS routine is available that uses the output matrix obtained from using the program
TETCORR (Enzmann, 2005). Also, one can use the Linear Structural Relations Program
(LISREL) to produce a polychoric correlation matrix that is very similar to the tetrachoric
correlation, only differing in the restriction that the means are 0 and the variances are 1
(Kendall & Stuart, 1961, pp. 563–573). Situations that call for avoiding the tetrachoric r
include (a) a very one-sided split in the frequencies of cases on either X or Y (e.g., 95–5 or
90–10), because the standard error is substantially inflated in these instances, and (b) any
cell with a frequency of zero, which should preclude the use of this statistic. Equation A.39
provides the tetrachoric correlation.

Equation A.39. Tetrachoric correlation coefficient

L(h, k, r) = \frac{1}{2\pi\sqrt{1 - r^2}} \int_k^{\infty}\!\int_h^{\infty}
\exp\!\left(-\frac{x^2 + y^2 - 2rxy}{2(1 - r^2)}\right) dx\, dy

• F(k) = p1, where F(z) is the area under the normal curve from z to ∞.
• F(h) = p2, where F(z) is the area under the normal curve from z to ∞.
• L(h, k, r) = likelihood or probability value set equal to p11, the proportion
  of persons with correct responses on both items; r_tet is the value of r
  that satisfies this equality.
• Equation A.39 is solved using numerical integration through iterative
  procedures (Divgi, 1979).
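
Although the computation is usually left to programs such as TESTFACT, BILOG, or Mplus,
Equation A.39 can also be solved with general-purpose numerical routines. The Python sketch
below is one such illustration; the proportions p1, p2, and p11 are hypothetical values, and the
approach shown (root finding on the bivariate normal probability) is an assumption about one
reasonable way to carry out the iterative solution.

Illustrative Python sketch for solving Equation A.39 numerically

import numpy as np
from scipy import stats
from scipy.optimize import brentq

# Hypothetical 2 x 2 proportions for two dichotomous items (1 = correct, 0 = incorrect)
p1 = 0.726        # proportion correct on item 1
p2 = 0.720        # proportion correct on item 2
p11 = 0.565       # proportion correct on both items

# Thresholds h and k on the standard normal metric
h = stats.norm.ppf(1 - p1)
k = stats.norm.ppf(1 - p2)

def both_incorrect(r):
    """P(Z1 <= h, Z2 <= k) under a bivariate normal with correlation r."""
    cov = [[1.0, r], [r, 1.0]]
    return stats.multivariate_normal.cdf([h, k], mean=[0.0, 0.0], cov=cov)

# Solving L(h, k, r) = p11 is equivalent to matching the "both incorrect" proportion
p00 = 1 - p1 - p2 + p11
r_tet = brentq(lambda r: both_incorrect(r) - p00, -0.999, 0.999)
print(f"tetrachoric r = {r_tet:.3f}")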

To illustrate the differences that are produced between the tetrachoric, polychoric,
and Pearson correlation coefficients, Table A.10 compares the tetrachoric correlation, poly-
choric, and Pearson correlation for items 6 through 10 of the crystallized intelligence test 2.

TESTFACT program example syntax for Equation A.39 producing the matrix in Table A.10

>TITLE
>EQUATION2_34.TSF - CRYSTALLIZED INTELLIGENCE SUBTEST 2,
ITEMS 6-10 FULL-INFORMATION ITEM FACTOR ANALYSIS
WITH TETRACHORIC CORRELATION COEFFICIENT
>PROBLEM NITEMS=5, RESPONSE=2;
>COMMENTS

Data layout:
COLUMNS 1 TO 5 --- ITEM RESPONSES

>NAMES ITEM1, ITEM2, ITEM3, ITEM4, ITEM5;


>RESPONSE '0','1';
>KEY 11111;
>TETRACHORIC NDEC=3, RECODE, LIST;
>FACTOR NFAC=2, NROOT=3, ROTATE=PROMAX, RESIDUAL,
SMOOTH;
>FULL CYCLES=20;
>TECHNICAL NOADAPT;
>SAVE SMOOTH, ROTATED, PARM, CORRELAT;
>INPUT SCORES, FILE='D:\tetcorr\example.dat';
(9A1,T1,5A1)
>STOP;

PRE LISREL polychoric program example syntax used to produce the polychoric matrix in
Table A.10

PRELIS SYNTAX: Can be edited


SY='K:\table_for_eq_2.34_data.PSF'
SE 1 2 3 4 5
OFA NOR
OU MA=CM XT XM

Table A.10.  Tetrachoric, Polychoric, and Pearson Correlation Matrices


from Various Programs
Tetrachoric correlation matrix—TESTFACT, v.4.0
CRI2_06 CRI2_07 CRI2_08 CRI2_09 CRI2_10
CRI2_06 1.000 - - - -
CRI2_07 0.127 1.000 - - -
CRI2_08 0.375 0.146 1.000 - -
CRI2_09 0.400 0.314 0.416 1.000 -
CRI2_10 0.484 0.186 0.415 0.412 1.000
Note. Full information binary item factor analysis algorithm with adaptive quadrature.

Tetrachoric correlation matrix—Enzmann (2005) TETCORR program


CRI2_06 CRI2_07 CRI2_08 CRI2_09 CRI2_10
CRI2_06 1.000 - - - -
CRI2_07 0.320 1.000 - - -
CRI2_08 0.460 0.313 1.000 - -
CRI2_09 0.472 0.432 0.478 1.000 -
CRI2_10 0.543 0.319 0.473 0.464 1.000

Polychoric correlation matrix—LISREL, v.8.8 program


CRI2_06 CRI2_07 CRI2_08 CRI2_09 CRI2_10
CRI2_06 1.000 - - - -
CRI2_07 0.320 1.000 - - -
CRI2_08 0.460 0.313 1.000 - -
CRI2_09 0.472 0.432 0.478 1.000 -
CRI2_10 0.543 0.319 0.473 0.464 1.000

Pearson correlation matrix—SPSS


CRI2_06 CRI2_07 CRI2_08 CRI2_09 CRI2_10
CRI2_06 1.000 - - - -
CRI2_07 0.178 1.000 - - -
CRI2_08 0.289 0.175 1.000 - -
CRI2_09 0.297 0.245 0.309 1.000 -
CRI2_10 0.346 0.177 0.305 0.304 1.000

The results from TESTFACT are different from the other matrices due to the advanced
methods of multidimensional numerical integration estimation included in the program.
Also, TESTFACT provides important linkages to item response theory and Bayes estimation
(for small item sets) and is therefore particularly useful in producing correlation matrices
for factor analysis of dichotomous items where an underlying normal distribution of a con-
struct is assumed to exist. For the computational details of TESTFACT, see Du Toit (2003).

A.25 Correlation Ratio h

The correlation ratio, eta (h), is applicable in describing the relationship between X and
Y in situations where there is a curvilinear relationship between two interval-level or
continuous quantitative variables (i.e., curvilinear regression). A classic example is the
regression of chronological age between ages 3 and 15 on a performance or ability score.
The correlation ratio of y on x is provided in Equation A.40a.
The standard error of the correlation ratio is given in Equation A.40b.

Equation A.40a. Correlation ratio

\eta^2_{Y.X} = 1 - \frac{SS_{\text{residual}}}{SS_{\text{total}}} = \frac{SS_{\text{regression}}}{SS_{\text{total}}}

• SS_regression = amount of variability of (Y′) predicted from X.
• SS_residual = the error of prediction (sum of squared prediction errors).
• SS_total = sum of the error of prediction and the amount of variability
  of (Y′) predicted from X.

Equation A.40b. Standard error of the correlation ratio

s_\eta = \frac{1 - \eta^2}{\sqrt{N - 1}}

A.26 Using Eta Square to Assess Linearity of Regression

As mentioned previously, departures from linearity between Y and X can have detrimental
effects in theoretical and applied research. A useful test for assessing the degree of
nonlinearity in an X and Y relationship is the F-test provided in Equation A.41a.

Equation A.41a. F-ratio for testing nonlinearity of regression

F = \frac{(\eta^2_{Y.X} - r^2)/(J - 2)}{(1 - \eta^2_{Y.X})/(N - J)}

• \eta^2_{Y.X} = correlation ratio for the regression of Y on X.
• r^2 = r-square for the linear regression of Y on X.
• J = number of groups or categories into which the predictor (X) is divided.
• N = sample size.

If the F-test is statistically significant beyond α = .05, this is interpreted as meaning


that the departure from linearity is of statistical and practical concern. An application
of Equation A.41a to the example in Figure A.6, where a nonlinear relationship is illus-
trated between age in years and fluid intelligence, is provided in Equation A.41b using the
results of a regression analysis based on the data file PMPT.SAV.

Equation A.41b. F-ratio for testing nonlinearity of regression

F = \frac{(.594 - .434)/(8 - 2)}{(1 - .594)/(1000 - 8)} = \frac{.02667}{.00043} = 62.79

An F-ratio of 62.79 exceeds F-critical (readers can verify this by referencing an F-table);
therefore the hypothesis that the regression of fluid intelligence on age is linear is rejected.
This result leads one to apply a nonlinear form of regression to estimate the relationship.
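
The Python sketch below is illustrative only; the simulated age groups and scores are
assumptions. It computes eta-squared, the linear r-squared, and the F-ratio of Equation A.41a
for a deliberately curvilinear relationship.

Illustrative Python sketch for eta-squared and the nonlinearity F-test

import numpy as np
from scipy import stats

# Hypothetical data: scores for J = 8 age groups, with a curvilinear trend
rng = np.random.default_rng(5)
age_group = np.repeat(np.arange(1, 9), 125)                          # 8 groups, N = 1000
y = 40 + 6 * age_group - 0.9 * age_group ** 2 + rng.normal(0, 4, size=1000)

n, j = len(y), 8

# Eta-squared (Equation A.40a): between-group (predicted) sum of squares over total
group_means = np.array([y[age_group == g].mean() for g in range(1, 9)])
ss_total = ((y - y.mean()) ** 2).sum()
ss_between = sum((age_group == g).sum() * (group_means[g - 1] - y.mean()) ** 2
                 for g in range(1, 9))
eta_sq = ss_between / ss_total

# r-squared from the linear regression of y on the group codes
r_sq = stats.pearsonr(age_group, y)[0] ** 2

# Equation A.41a: F-test for departure from linearity
F = ((eta_sq - r_sq) / (j - 2)) / ((1 - eta_sq) / (n - j))
p_value = stats.f.sf(F, j - 2, n - j)
print(f"eta^2 = {eta_sq:.3f}, r^2 = {r_sq:.3f}, F({j - 2}, {n - j}) = {F:.2f}, p = {p_value:.4f}")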

A.27 Multiple Correlation R

Extending the simple linear regression model to accommodate multiple predictor vari-
ables to estimate a criterion variable is straightforward. Furthermore, this extension is
applicable when the criterion is either continuous or categorical. The multiple predictor
equation in standard score (Z) form is provided in Equation A.42.
In the raw score case, β is replaced with the unstandardized weight b. In either form, each
weight gives the expected change in the criterion per one-unit change in its predictor, while

Equation A.42. Multiple prediction equation

\hat{Z}_Y = \beta_1 Z_1 + \beta_2 Z_2 + \beta_3 Z_3

• \hat{Z}_Y = predicted estimate of z_Y.
• Z_1, Z_2, Z_3 = standardized predictors.
• \beta_1, \beta_2, \beta_3 = standardized weights for the predictors.

holding all other predictors constant as in the partial correlation explanations presented
next.
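
The Python sketch below is illustrative only; two predictors are used for brevity and the
simulated data are assumptions. It obtains the standardized weights of Equation A.42 by
regressing the standardized criterion on standardized predictors, and then converts them to
raw-score weights.

Illustrative Python sketch for standardized and raw-score regression weights

import numpy as np

# Hypothetical data: two predictors and a criterion
rng = np.random.default_rng(6)
x1 = rng.normal(size=300)
x2 = 0.5 * x1 + rng.normal(scale=0.9, size=300)
y = 0.4 * x1 + 0.3 * x2 + rng.normal(scale=0.8, size=300)

def standardize(v):
    return (v - v.mean()) / v.std(ddof=0)

# Equation A.42 in matrix form: regress the standardized criterion on the standardized predictors
Z = np.column_stack([standardize(x1), standardize(x2)])
zy = standardize(y)
betas, *_ = np.linalg.lstsq(Z, zy, rcond=None)

# Raw-score weights b follow by rescaling each beta by sd(y)/sd(x)
bs = betas * y.std(ddof=0) / np.array([x1.std(ddof=0), x2.std(ddof=0)])
print("standardized betas:", np.round(betas, 3), " raw-score bs:", np.round(bs, 3))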

A.28 Partial Correlation: First Order

The partial correlation between two variables partitions out or cancels the effect of a third
variable upon the ones being evaluated. For example, the correlation between weight and
height of males where age is allowed to vary would be higher than if age were not allowed
to vary (i.e., held constant or partitioned out of the relationship). Another example is the
correlation between immediate or short-term memory and fluid intelligence where age is
permitted to vary. The first-order partial correlation is given in Equation A.43.

A.29 Partial Correlation: Second Order and Higher

Equation A.43 can be extended, as illustrated in Equation A.44, to calculate partial cor-
relations of any order. Notice that in Equation A.44, the combined effect of two variables
on the correlation of another set of variables is of interest. For example, a researcher may
want to examine the correlation between short-term memory and fluid intelligence while
controlling for the effect of crystallized intelligence and age.

Equation A.43. First-order partial correlation

r_{12.3} = \frac{r_{12} - r_{13}r_{23}}{\sqrt{(1 - r_{13}^2)(1 - r_{23}^2)}}

• r_{12.3} = correlation of variables 1 and 2 partialing out variable 3.
• r^2 = coefficient of determination or correlation squared.

Equation A.44. Second- and higher-order partial correlation

r_{12.34} = \frac{r_{12.3} - r_{14.3}r_{24.3}}{\sqrt{(1 - r_{14.3}^2)(1 - r_{24.3}^2)}}

• r_{12.3} = correlation between variables 1 and 2 partialing out variable 3.
• r_{14.3} = correlation between variables 1 and 4 partialing out variable 3.
• r_{24.3} = correlation between variables 2 and 4 partialing out variable 3.
• r^2 = coefficient of determination or correlation squared.

A.30 Semipartial Correlation

Equation A.43 can be modified to express yet another version of partial correlation that
is often used in multivariate analyses such as multiple linear regression. Equation A.45
expresses the unique contribution of adding successive predictors into a regression
equation.

Equation A.45. Semipartial correlation

r_{1(2.3)} = \frac{r_{12} - r_{13}r_{23}}{\sqrt{1 - r_{23}^2}}

Note the difference between Equations A.43 and A.45, notably the elimination of
the first half of the term in the denominator. Because of this change, the partial correla-
tion is always larger than the semipartial correlation. In regression problems where the
specific amount of influence each predictor variable in a set of variables exhibits on an
outcome is of interest, the semipartial correlation (as opposed to the partial correlation coefficient) is
the preferred statistic. Using the semipartial correlation allows a researcher to determine
the precise amount of unique variance each predictor accounts for in the outcome vari-
able (i.e., y). Table A.11 illustrates Pearson, partial, and semipartial coefficients based
on a regression analysis using total scores for fluid intelligence and short-term memory
as predictors of crystallized intelligence. The SPSS syntax that produced this output follows
Table A.11.

Table A.11.  Regression Output That Includes Pearson, Partial, and
Semipartial Correlations

Coefficients(a)
                                Unstandardized     Standardized
                                 coefficients      coefficients                       Correlations
Model                            B     Std. Error      Beta        t      Sig.   Zero-order  Partial  Part
(Constant)                     22.301     2.480                   8.991   .000
sum of short-term memory
  tests 1–3                     1.575      .097         .483     16.27    .000      .592      .458    .406
sum of fluid intelligence
  tests 1–3                      .398      .058         .202      6.821   .000      .463      .211    .170
a. Dependent variable: sum of crystallized intelligence tests 1–4
Note. Zero order = Pearson; Partial = first-order partial; Part = semipartial correlation.

SPSS REGRESSION syntax that produced Table A.11

REGRESSION
/DESCRIPTIVES MEAN STDDEV CORR SIG N
/MISSING LISTWISE
/STATISTICS COEFF OUTS R ANOVA ZPP
/CRITERIA=PIN(.05) POUT(.10)
/NOORIGIN
/DEPENDENT cri_tot
/METHOD=ENTER stm_tot fi_tot.

Below is a SAS program using PROC REG that produces several estimates of partial
correlation coefficients presented as squared partial and semipartial correlations (i.e., the
estimates are the squares of the corresponding coefficients in Table A.11 above).

SAS program source code that produced partial correlation presented as squared
partial correlations

LIBNAME X 'K:\Guilford_Data_2011';
DATA TEMP; set X.GfGc;
RUN;

PROC REG;
MODEL cri_tot=stm_tot fi_tot/PCORR1 PCORR2 SCORR1 SCORR2;
TITLE 'SQUARED PARTIAL & SEMI PARTIAL CORRELATION';
RUN;

PROC PRINT DATA=X.part_corr_out;


RUN;
QUIT;
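
Equations A.43 and A.45 can also be evaluated directly from a set of zero-order correlations.
The Python sketch below is illustrative only; the three correlations are hypothetical values
chosen for the example and are not the Table A.11 estimates.

Illustrative Python sketch for partial and semipartial correlations

import numpy as np

def partial_r(r12, r13, r23):
    """First-order partial correlation r_12.3 (Equation A.43)."""
    return (r12 - r13 * r23) / np.sqrt((1 - r13 ** 2) * (1 - r23 ** 2))

def semipartial_r(r12, r13, r23):
    """Semipartial correlation r_1(2.3) (Equation A.45)."""
    return (r12 - r13 * r23) / np.sqrt(1 - r23 ** 2)

# Hypothetical zero-order correlations among a criterion (1) and two predictors (2, 3)
r12, r13, r23 = 0.592, 0.463, 0.45

print(f"r_12.3   = {partial_r(r12, r13, r23):.3f}")
print(f"r_1(2.3) = {semipartial_r(r12, r13, r23):.3f}")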

Recall that the correlation between two measures is their covariance divided by the product
of their respective standard deviations; alternatively stated, the correlation is a standardized
covariance. Rearranging this relationship as s_XY = r_XY s_X s_Y shows that the covariance is
the product of the correlation coefficient r_XY and the two respective standard deviations
s_X and s_Y.

A.31 Summary and Conclusions

This Appendix presented the mathematical and statistical foundations necessary for a thor-
ough understanding of how psychometric methods work. First, three goals for researchers
developing and using psychometric methods were presented. The three goals were then
considered in light of three important components related to developing and using psycho-
metric methods: precision, communication, and objectivity. Importantly, an illustration was
presented regarding how concepts can be represented within a conceptual model by using
operational and/or epistemological definitions and rules of correspondence. Figure A.3
illustrates a conceptual model integrating concepts and rules of correspondence that pro-
vide a framework for applying mathematical rules and operations onto a measurable space.
Examples of tasks in psychological measurement include but are not limited to (1) devel-
oping normative scale scores for measuring short-term memory ability across the lifespan,
(2) developing a scale to accurately reflect a child’s reading ability in relation to his or her
socialization process, and (3) developing scaling models useful for evaluating mathemati-
cal achievement. Often these tasks are complex and involve multiple variables interacting
with one another. In this section, the definition of a variable was provided, including the
different types and the role they play in measurement and probability. Finally, some distri-
butions commonly encountered in psychometric methods were provided.
Attributes were described as identifiable qualities or characteristics represented by
either numerical elements or classifications. Studying individual differences among peo-
ple on their attributes plays a central role in understanding differential effects. In experi-
mental studies, variability about group means is often the preference. Whether a study is
based on individuals or groups, research problems are of interest only to the extent that a
particular set of attributes (variables) exhibit joint variation or covariation. If no covaria-
tion exists among a set of variables, conducting a study of such variables would be use-
less. Importantly, the goal of theoretical and applied psychometric research is to develop
models that extract the maximum amount of covariation among a set of variables.
Central to psychometric methods is the idea of mathematically expressing the rela-
tionship between two or more variables. Most analytic methods in psychometrics and
statistics involve the mathematical relationship between two or more variables. The coef-
ficient of correlation provides a mathematical and statistical basis for researchers to be
able to estimate and test bivariate and multivariate relationships. A considerable portion
of this Appendix provided a treatment of the various coefficients of correlation and when
their use is appropriate.

Key Terms and Definitions


Accuracy. Degree of conformity or closeness of a quantity to its true or actual value.

Additive theorem of probability. The probability of occurrence of any one of several


particular events is the sum of their individual probabilities, provided that the events
are mutually exclusive.
Analysis of variance. A statistical model where the observed variance is portioned into
components based on explanatory variables.
Arithmetic mean. The average of a set of values or scores in a distribution.

Attributes. Identifiable qualities or characteristics represented by either numerical ele-


ments or categorical classifications of objects that can be measured.
Bayesian probability. A statistical model where probability is viewed as a measure of
a state of knowledge. Bayesian statistical methods are justified by rationality and
consistency and interpreted within the context of logic.
Communication. The process of transferring information from one entity to another or the
interchange of thoughts by speech, mathematical symbols, or writing.
Constant. A specific, unchanging number.

Continuous. Data values from a theoretically uncountable or infinite set having no gaps
in its unit of scale.
Covariation. The degree to which two variables vary together.

Cumulative probability distribution (density) function. A distribution by which a con-


tinuous function can be represented.
Datum. A single numerical value.

Decision theory. Identification of uncertainty relevant to a particular decision in relation


to an optimal decision.
Dependent variable. The value of a variable (Y ) that depends on the value of an inde-
pendent variable (X). Also known as a criterion or outcome variable.
Discrete. A specific set of values obtained from a countable or noninfinite set of specific
values.
Event. An observable outcome or set of outcomes to which a probability is assigned.

First moment. The mean or average of X.

Fourth moment. The kurtosis of a distribution of scores.

Frequency. The number of times an event or attribute is empirically observed as having


occurred.
Frequency distribution. A tabular summary of how many times values on a discrete vari-
able occur for a set of subjects or examinees.
Frequentist probability. Defines an event’s probability as the limit of its relative fre-
quency in a large number of trials.

Improper solution. The occurrence of zero or negative error variances in matrix algebra
and simultaneous equations estimation.
Independent events. Given two events A and B, A does not affect the probability of B.
Independent trial. In probability theory, a trial whose outcome does not affect, and is not
affected by, the outcomes of other trials in the same sample space.
Independent variable. A predictor or moderator variable (X) that is under some form of
direct manipulation by the researcher.
Item response theory. Application of mathematical models to empirical data for measur-
ing attitudes, abilities, and other attributes. Also known as latent trait theory, strong
true score theory, or modern test theory.
Joint density function. Multiplication of the conditional distributions for two variables (X
and Y ), resulting in marginal distributions for X and Y, respectively.
Kurtosis. A characteristic of a distribution where the tails are either excessively flat or
narrow, resulting in excessive “peakedness” or “flatness.” Also known as the fourth
moment or cumulant of a distribution.
Latent. Variables that are unobservable characteristics of human behavior such as a
response to stimulus of some type.
Linear score transformation. A change in a raw score by multiplying the score by a
multiplicative component (b) and then adding an additive component (a) to it.
Mean squared deviation. The average of the sum of the squared deviations for a ran-
dom variable.
Measurable space. A space comprised of the actual observations (i.e., sample space)
of interest in a study.
Metric. A standard of measurement or a geometric function that describes the distances
between pairs of points in space.
Moment. The value of a function of a real variable about a value such as c, where c is
usually zero.
Multiplicative theorem of probability. The probability of several particular events occur-
ring successively or jointly is the product of their separate probabilities.
Objectivity. A property of the measurement process demonstrated by the independent
replication of results using a specific measurement method by different researchers.
Pearson product–moment coefficient of correlation. A measure of strength of linear
dependence between two variables, X and Y.
Posterior distribution. In Bayesian statistics, the product of the prior distribution times
the likelihood.
Precision. The degree of mutual agreement among a series of individual measurements
on things such as traits, values, or attributes.
Probability distribution function. An equation that defines a continuous random vari-
able X.

Probability function. The probabilities with which X can assume only the value 0 or 1.

Probability space. A space from which random variables or functions are obtained.

Product-moment correlation coefficient. A measure of the linear dependence between


two variables X and Y.
Proportionality. In Bayesian probability, if the posterior density (distribution) is propor-
tional to the likelihood of the observed data times the prior imposed upon the data,
the posterior density differs from the product of the likelihood times the prior by a
multiplicative constant.
Random variable. A function that assigns unique numerical values to all possible outcomes
of a random experiment under prescribed conditions. Technically, it is not a variable but
a function that maps observable events to numbers.
Relative frequency. The proportion of examinees receiving a particular score.

Reliability. Refers to the consistency of measurements based on repeated sampling of a


sample or population.
Repeatability. The degree to which further measurements on the same attribute are the
same or highly similar.
Sampling theory. Theory of obtaining estimates of certain properties of a population.

Second moment. The variance of a distribution of scores.

Skewness. A measure of asymmetry of a probability distribution of a random variable.

Standard deviation. A measure of dispersion of a sample, population, or probability


distribution.
Statistical estimation. Way of determining a population parameter based on a model
that is fit to data.
Sum of squares. Sum of the squared deviations from the mean of a random variable.

Third moment. The skewness of a distribution of scores.

Unbiased estimate. An estimator exhibiting the property that the difference between its
expected value and the true value is zero.
Variable. A measurable factor, characteristic, or attribute of an individual, system, or
process.
Variance. A measure of dispersion of a random variable achieved by averaging the
deviations of its possible values from its expected value.
Yates’s correction for continuity. (Yates’s chi-square test). Adjusts the Pearson chi-square
test to prevent overestimation of statistical significance when analyzing data based
on samples with small cell sizes (< 10).
References

Adams, R. J., Wilson, M. R., & Wang, W. C. (1997). The multidimensional random coefficients
multinomial logit. Applied Psychological Measurement, 21, 1–24.
Aiken, L. R. (2002). Attitudes and related psychosocial constructs: Theories, assessment and research.
Thousand Oaks, CA: Sage.
Akaike, H. (1973). Information theory and an extension of the maximum likelihood principle. In
B. N. Petrov & F. Csaki (Eds.), Proceedings of the 2nd International Symposium on Information
Theory (pp. 267–281). Budapest: Akademiai.
Albert, J. H. (1992). Bayesian estimation of normal ogive item response curves using Gibbs sam-
pling. Journal of Educational Statistics, 17, 261–269.
Albert, J. H., & Chib, S. (1993). Bayesian analysis of binary and polychotomous response data.
Journal of the American Statistical Association, 88, 669–679.
Allen, M. J., & Yen, W. M. (1979). Introduction to measurement theory. Belmont, CA: Wadsworth.
American Educational Research Association, American Psychological Association, & National
Council on Measurement in Education. (1985). Standards for educational and psychological
testing. Washington, DC: Authors.
American Educational Research Association, American Psychological Association, & National
Council on Measurement in Education. (1999). Standards for educational and psychological
testing (2nd ed.). Washington, DC: Authors.
American Educational Research Association, American Psychological Association, & National
Council on Measurement in Education. (2014). Standards for educational and psychological
testing (3rd ed.). Washington, DC: Authors.
Anastasi, A. (1986). Emerging concepts of test validation. Annual Review of Psychology, 37, 1–15.
Andrich, D. (1988). Rasch models for measurement. Newbury Park, CA: Sage.
Andrich, D. (2004). Controversy and the Rasch model: A characteristic of incompatible para-
digms? Medical Care, 42, 1–16.
Angoff, W. H. (1984). Scales, norms and equivalent scores. Princeton, NJ: Educational Testing
Service.
Atkins v. Virginia, 536 U.S. 304.
Baker, F. (1990). EQUATE computer program for linking two metrics in item response theory. Madison:
University of Wisconsin, Laboratory of Experimental Design.
Baker, F. B., & Kim, S.-H. (2004). Item response theory: Parameter estimation technique (2nd ed.).
New York: Marcel Dekker.


Bayes, T. (1763). An essay towards solving a problem in the doctrine of chance. Philosophical
Transactions of the Royal Society of London, 53, 370–418.
Bennett, J. F., & Hayes, W. I. (1960). Multidimensional unfolding: Determining the dimensionality
of ranked preference data. Psychometrika, 25, 27–43.
Benson, J. (1988). Developing a strong program of construct validation: A test anxiety example.
Educational Measurement: Issues and Practice, 17, 10–17.
Berk, R. A. (1984). A guide to criterion-referenced test construction. Baltimore: Johns Hopkins Uni-
versity Press.
Birnbaum, A. (1957). Efficient design and use of tests of mental ability for various decision making
problems (Series Report No. 58-16, Project No. 7755-23). Randolph Air Force Base, TX: USAF
School of Aviation Medicine.
Birnbaum, A. (1958a). On the estimation of mental ability for various decision making problems
(Series Report No. 15, Project No. 7755-23). Randolph Air Force Base, TX: USAF School of
Aviation Medicine.
Birnbaum, A. (1958b). Further considerations efficiency in tests of mental ability (Technical Report
No. 17, Project No. 7755-23). Randolph Air Force Base, TX: USAF School of Aviation
Medicine.
Birnbaum, A. (1968). Some latent trait models. In F. M. Lord & M. R. Novick (Eds.), Statistical
theories of mental test scores (pp. 397–479). Reading, MA: Addison-Wesley.
Birnbaum, M. H. (Ed.). (1998). Measurement, judgment, and decision making (2nd ed.). San Diego,
CA: Academic Press.
Bloom, B. S., Engelhart, M. D., Furst, E. J., Hill, W. H., & Krathwohl, D. R. (1956). Taxonomy of
educational objectives: The classification of educational goals: Handbook I. Cognitive domain.
New York: Longmans, Green.
Bock, R. D. (1972). Estimating item parameters and latent ability when responses are scored in two
or more nominal categories. Psychometrika, 37, 29–51.
Bock, D., Gibbons, R., & Muraki, E. (1988). Full information item factor analysis. Applied Psycho-
logical Measurement, 12(3), 261–280.
Bock, D., Gibbons, R., & Muraki, E. (1996). TESTFACT computer program. Chicago: Scientific
Software International.
Bock, R. D., & Aitkin, M. (1982). Marginal maximum likelihood estimation of item parameters:
Application of the EM algorithm. Psychometrika, 46, 443–445.
Bock, R. D., & Jones, L. V. (1968). The measurement and prediction of judgment and choice. San
Francisco: Holden-Day.
Bond, T. G., & Fox, C. M. (2001). Applying the Rasch model. Mahwah, NJ: Erlbaum.
Boring, E. G. (1950). A history of experimental psychology. New York: Appleton-Century-Crofts.
Bradlow, E. T., Wainer, H., & Wang, X. (1999). A Bayesian random effects model for testlets.
Psychometrika, 64, 153–168.
Brennan, R. L. (1983). Elements of generalizability theory. Iowa City, IA: American College Testing.
Brennan, R. L. (1998). Misconceptions at the intersection of measurement theory and practice.
Educational Measurement: Issues and Practice, 17(1), 5–29.
Brennan, R. L. (2010). Generalizability theory. New York: Springer.
Brown, T. A. (2006). Confirmatory factor analysis for applied research. New York: Guilford Press.
Browne, M. W., & Zhang, G. (2007). Developments in the factor analysis of individual time series.
In R. Cudeck & R. C. MacCallum (Eds.), Factor analysis at 100: Historical developments and
future directions (pp. 265–292). Mahwah, NJ: Erlbaum.
Bruce, V., Green, P. R., & Georgeson, M. A. (1996). Visual perception (3rd ed.). Mahwah, NJ:
Erlbaum.
Bush, R. R., & Mosteller, F. (1955). Stochastic models for learning. New York: Wiley.
Camilli, G. (1994). Origin of the scaling constant d = 1.7 in item response theory. Journal of Edu-
cational and Behavioral Statistics, 19, 293–295.

Campbell, D. T., & Fiske, D. W. (1959). Convergent and discriminant validation by the multitrait–
multimethod matrix. Psychological Bulletin, 56, 81–105.
Card, N. A., & Little, T. D. (2007). Longitudinal modeling of developmental processes. Interna-
tional Journal of Behavioral Development, 31(4), 297–302.
Carnap, R. (1950). Logical foundations of probability. Chicago: University of Chicago Press.
Carroll, J. B. (1993). Human cognitive abilities: A survey of factor analytic studies. Cambridge, UK:
Cambridge University Press.
Cattell, R. B. (1943). The description of personality: Basic traits resolved into clusters. Journal of
Abnormal and Social Psychology, 38, 476–506.
Cattell, R. B. (1966). The scree test for the number of factors. Multivariate Behavioral Research, 1,
245–276.
Cattell, R. B. (1971). Abilities: Their structure, growth and action. Boston: Houghton Mifflin.
Cizek, G. J., & Bunch, M. B. (2006). Standard setting: A guide to establishing and evaluating perfor-
mance standards on tests. Thousand Oaks, CA: Sage.
Cohen, J. (1988). Statistical power analysis for the behavioral sciences. Mahwah, NJ: Erlbaum.
Cohen, J., Cohen, P., West, S. G., & Aiken, L. S. (2003). Applied multiple regression/correlation
analysis for the behavioral sciences (3rd ed.). Mahwah, NJ: Erlbaum.
Cohen, R. J., & Swerdlik, M. (2010). Psychological testing and assessment: An introduction to test and
measurements (7th ed.). New York: McGraw-Hill.
Comrey, A. L., & Lee, H. B. (1992). A first course in factor analysis (2nd ed.). Mahwah,
NJ: Erlbaum.
Conover, W. J. (1999). Practical nonparametric statistics (3rd ed.). New York: Wiley.
Coombs, C. (1964). A theory of data. New York: Wiley.
Coombs, C. H. (1950). The concepts of reliability and homogeneity. Educational and Psychological
Measurement, 10, 43.
Costa, P. T., & McCrae, R. R. (1992). The revised NEO Personality Inventory (NEO-PI-R) and NEO
Five-Factor Inventory (NEO-FFI) professional manual. Odessa, FL: Psychological Assessment
Resources.
Crocker, L., & Algina, J. (1986). Introduction to classical and modern test theory. Boston: Harcourt
Brace Jovanovich.
Crocker, L., & Algina, J. (2006). Introduction to classical and modern test theory. Belmont, CA:
Wadsworth.
Cronbach, L. (1951). Coefficient alpha and the internal structure of tests. Psychometrika, 6,
297–334.
Cronbach, L. (1970). Essentials of psychological testing (3rd ed.). New York: Harper.
Cronbach, L. J. (1971). Test validation. In R. L. Linn (Ed.), Educational measurement (2nd ed.,
pp. 443–507). Washington, DC: Macmillan.
Cronbach, L. J. (1980). Selection theory for a political world. Public Personnel Management, 9(1),
37–50.
Cronbach, L. J., & Gleser, G. C. (1965). Psychological tests and personnel decisions. Urbana: Uni-
versity of Illinois Press.
Cronbach, L. J., Gleser, G. C., Nanda, H., & Rajaratnam, N. (1972). The dependability of behavioral
measurements: Theory of generalizability for scores and profiles. New York: Wiley.
Cronbach, L. J., & Meehl, P. E. (1955). Construct validity in psychological tests. Psychological Bul-
letin, 52, 281–302.
Cudeck, R. (2000). Exploratory factor analysis. In H. Tinsley & H. Brown (Eds.), Applied multivar-
iate statistical modeling and mathematical modeling (pp. 265–295). San Diego, CA: Academic
Press.
Darwin, C. (1859). On the origin of species by means of natural selection. London: Murray.
Dawes, R. M. (1972). Fundamentals of attitude measurement. New York: Wiley.
de Ayala, R. (2009). The theory and practice of item response theory. New York: Guilford Press.

Divgi, D. R. (1979). Calculation of the tetrachoric correlation coefficient. Psychometrika, 44(2),


169–172.
Dorans, N. J., Moses, T. P., & Eignor, D. R. (2011) Equating test scores: Toward best practices.
In A. A. von Davier (Ed.), Statistical models for test equating, scaling and linking (pp. 21–58).
New York: Springer.
Draper, N. R., & Smith, H. (1998). Applied regression analysis (3rd ed.). New York: Wiley
Interscience.
Dunn-Rankin, P., Knezek, G. A., Wallace, S., & Zhang, S. (2004). Scaling methods (2nd ed.).
Mahwah, NJ: Erlbaum.
Dunson, D. B. (2000). Bayesian latent variable models for clustered mixed outcome. Journal of the
Royal Statistical Society B, 6, 355–366.
Du Toit, M. (2003). IRT from Scientific Software International. Chicago: Scientific Software
International.
Ebel, R. L., & Frisbie, C. (1991). Essentials of educational measurement (5th ed.). Englewood Cliffs,
NJ: Prentice-Hall.
Enders, C. K. (2011). Applied missing data analysis. New York: Guilford Press.
Enzmann, D. (2005). Retrieved from www2.jura.uni-hamburg.de/instkrim/kriminologie/Mitarbeiter/
Enzmann/Software/Enzmann_Software.html.
Fabrigar, L. R., & Wegner, D. T. (2012). Exploratory factor analysis. New York: Oxford University
Press.
Feldt, L. S., & Brennan, R. L. (1989). Reliability. In R. L. Brennan (Ed.), Educational measurement
(3rd ed., pp. 105–146). Washington, DC: American Council on Education.
Fischer, G. H. (1983). Logistic latent trait models with linear constraints. Psychometrika, 48, 3–26.
Fisher, R. A. (1935). The design of experiments. Oxford, UK: Oxford University Press.
Fiske, D. W. (1986). The trait concept and the personality questionnaire. In A. Angleitner & J. S.
Wiggins (Eds.), Personality assessment via questionnaires: Current issues in theory and measure-
ment (pp. 35–46). Berlin: Springer-Verlag.
Fiske, D. W. (2002). Validity for what? In Braun, H. I., Jackson, D. N., & Wiley, D. E. (Eds.), The
role of constructs in psychological and educational measurement (pp. 169–178). Mahwah, NJ:
Erlbaum.
Flanagan, D. P., McGrew, K. S., & Ortiz, S. O. (2000). The Wechsler scales and Gf–Gc theory. Needham
Heights, MA: Allyn & Bacon.
Flynn, J. R. (2007). What is intelligence? New York: Cambridge University Press.
Forrest, D. W. (1974). Francis Galton: The life and work of a Victorian genius. New York: Taplinger.
Fox, J.-P. (2010). Bayesian item response modeling: Theory and applications. New York: Springer.
Fraser, C., & McDonald, R. P. (2003). NOHARM: Normal Ogive Harmonic Analysis Robust Method
[Computer program]. Welland, ON: Niagra College. Available at www.niagrac.on.ca/~cfraser/
download.
Gable, R. K., & Wolfe, M. B. (1993). Instrument development in the affective domain: Measuring atti-
tudes and values in corporate and school settings (2nd ed.). Boston: Kluwer.
Gable, R. K., & Wolfe, M. B. (1998). Instrument in the affective domain (2nd ed.). Kluwer Academic
Publishers.
Gagné, R. M., & Driscoll, M. P. (1988). Essentials of learning for instruction (2nd ed.). Englewood
Cliffs, NJ: Prentice-Hall.
Galton, F. (1869). Hereditary genius. London: Macmillan.
Galton, F. (1883). Inquiries into human faculty and its development. London: Macmillan.
Gelman, A., Carlin, J. B., Stern, H. S., & Rubin, D. B. (2004). Bayesian data analysis (2nd ed.). Boca
Raton, FL: Chapman & Hall/CRC.
Gemignani, M. C. (1998). Calculus and statistics. Mineola, NY: Dover.
Ghiselli, E. E. (1964). Theory of psychological measurement. New York: McGraw-Hill.
Gill, J. (2002). Bayesian methods: A social and behavioral sciences approach. Boca Raton, FL:
Chapman & Hall/CRC.

Glass, G. V., & Hopkins, K. D. (1996). Statistical methods in education and psychology (3rd ed.).
Needham Heights, MA: Allyn & Bacon.
Glenberg, A. M., & Andrzejewski, M. E. (2008). Learning from data: An introduction to statistical
reasoning (3rd ed.). Hillsdale, NJ: Erlbaum.
Glutting, J., McDermott, P., & Stanley, J. C. (1987). Resolving differences among methods of
establishing confidence limits for test scores. Educational and Psychological Measurement,
47, 607.
Gregory, R. J. (2000). Psychological testing: History, Principles and Applications (3rd ed.). Needham
Heights, MA: Allyn & Bacon.
Groves, R. M., Fowler, F. J., Couper, M. P., Lepkowski, J. M., Singer, E., & Tourangeau, R. (2009).
Survey methodology (2nd ed.). New York: Wiley.
Guilford, J. P. (1954). Psychometric methods (2nd ed.). New York: McGraw-Hill.
Guilford, J. P. (1978). Fundamental statistics in psychology and education (4th ed.). New York:
McGraw-Hill.
Guion, R. (1977). Content validity: The source of my discontent. Applied Psychological Measurement,
1, 1–10.
Guion, R. (1998). Assessment, measurement and prediction for personnel decisions. Mahwah, NJ:
Erlbaum.
Gulliksen, H. (1950a). Intrinsic validity. American Psychologist, 5, 511–517.
Gulliksen, H. (1950b). The theory of mental tests. New York: Wiley.
Gulliksen, H. (1987). Theory of Mental Tests. Hillsdale, NJ: Erlbaum.
Guttman, L. (1941). The quantification of a class of attributes: A theory and method for scale con-
struction. In P. Horst (Ed.), The prediction of personal adjustment (pp. 321–348). New York:
Social Science Research Council.
Guttman, L. A. (1944). A basis for scaling qualitative data. American Sociological Review, 9,
139–150.
Guttman, L. (1946). An approach for quantifying paired comparisons and rank order. Annals of
Mathematical Statistics, 17, 144–163.
Haebara, T. (1980). Equating logistic ability scales by a weighted least squares method. Japanese
Psychological Research, 22, 144–149.
Hair, J. F., Anderson, R. E., Tatham, R. L., & Black, W. C. (1998). Multivariate data analysis
(5th ed.). Upper Saddle River, NJ: Prentice-Hall.
Haladyna, T. M. (2004). Developing and validating multiple-choice test items. Mahwah, NJ: Erlbaum.
Hald, A. (1998). A history of mathematical statistics from 1750 to 1930. New York: Wiley.
Hambleton, R. K., & Pitoniak, M. J. (2006). Setting performance standards. In R. L. Brennan
(Ed.), Educational measurement (4th ed., pp. 433–470). Westport, CT: American Council on
Education/Praeger.
Hambleton, R. K., & Plake, B. S. (1995). Using an extended Angoff procedure to set standards on
complex performance assessments. Applied Measurement in Education, 8, 41–56.
Hambleton, R. K., & Swaminathan, H. (1985). Item response theory: Principles and practice. Boston:
Kluwer.
Hambleton, R. K., Swaminathan, H., & Rogers, H. J. (1991). Fundamentals of item response theory
(Vol. 2). Newbury Park, CA: Sage.
Han, C. (2008). IRTEQ computer program, version 1.2.21.55. www.umass.edu/remp/software/irteqt.
Hattie, J. A. (1985). A methodological review: Assessing unidimensionality of tests and items.
Applied Psychological Measurement, 9, 139–164.
Hebb, D. O. (1942). The effects of early and late brain injury upon test scores, and the nature of
normal adult intelligence. Proceedings of the American Philosophical Society, 85, 275–292.
Heise, D. R. (1970). Chapter 14, The semantic differential and attitude research. In G. F. Summers
(Ed.), Attitude measurement (pp. 235–253). Chicago: Rand McNally.
Hocking, R. R. (1976). The analysis and selection of variables in linear regression. Biometrics, 32,
1–49.

Holland, P. W., & Dorans, N. J. (2006). Linking and equating. In R. L. Brennan (Ed.), Educational
measurement (4th ed., pp. 189–220). Westport, CT: Praeger.
Holland, P. W., & Hoskins, M. (2003). Classical test theory as a first-order item response theory: Appli-
cation to true-score prediction from a possibly non-parallel test. Psychometrika, 68, 123–149.
Horn, J. L. (1998). A basis for research on age differences in cognitive abilities. In J. J. McArdle &
R. W. Woodcock (Eds.), Human cognitive abilities in theory and practice (pp. 8–20). Mahwah,
NJ: Erlbaum.
Hosmer, D. W., & Lemeshow, S. (2000). Applied logistic regression (2nd ed.). New York: Wiley.
Hotelling, H. (1933). Analysis of a complex of statistical variables into principal components. Journal
of Educational Psychology, 24, 417–441, 498–520.
Hotelling, H. (1936). Relations between two sets of variates. Biometrika, 28, 321–377.
Hoyt, C. (1941). Test reliability obtained by analysis of variance. Psychometrika, 6, 153–160.
Huberty, C. J. (1994). Applied discriminant analysis. New York: Wiley.
Jannarone, R. J. (1997). Models for locally dependent responses: Conjunctive item response
theory. In W. J. van der Linden & R. K. Hambleton (Eds.), Handbook of modern item response theory
(pp. 465–480). New York: Springer.
Jöreskog, K., & Sörbom, D. (1996). LISREL8: User’s reference guide. Chicago: Scientific Software
International.
Jöreskog, K., & Sörbom, D. (1999a). LISREL8: New statistical features. Chicago: Scientific Software
International.
Jöreskog, K., & Sörbom, D. (1999b). PRELIS2: User’s reference guide. Chicago: Scientific Software
International.
Kane, M. (2006). Validity. In R. L. Brennan (Ed.), Educational measurement (4th ed., pp. 17–64).
Westport, CT: Praeger.
Kane, M. T. (1982). A sampling model for validity. Applied Psychological Measurement, 6, 125–160.
Katz, R. C., Santman, J., & Lonero, P. (1994). Findings on the Revised Morally Debatable Behaviors
Scale. Journal of Psychology, 128, 15–21.
Kelderman, H. (1992). Computing maximum likelihood estimates of loglinear IRT models from
marginal sums. Psychometrika, 57, 437–450.
Kelderman, H. (1997). Loglinear multidimensional item response models for polytomously scored
items. In W. J. van der Linden & R. K. Hambleton (Eds.), Handbook of modern item response theory
(pp. 287–303). New York: Springer.
Kelley, T. L. (1927). The interpretation of educational measurements. New York: World Book.
Kendall, M. G., & Stuart, A. (1961). The advanced theory of statistics: Vol. 2. Inference and relation-
ship. London: Charles Griffin.
Kerlinger, F. N., & Lee, H. (2000). Foundations of behavioral research (4th ed.). Belmont, CA: Cen-
gage Learning.
Khuri, A. (2003). Advanced calculus with applications in statistics (2nd ed.). New York: Wiley.
Kim, D., de Ayala, R. J., Ferdous, A. A., & Nering, M. L. (2007). Assessing relative performance of
local item independence (LID) indexes. Paper presented at the annual meeting of the National
Council on Measurement in Education, Chicago.
King, B., & Minium, E. (2003). Statistical reasoning in psychology and education (4th ed.). New
York: Wiley.
Kleinbaum, D. G., & Klein, M. (2004). Logistic regression (2nd ed.). New York: Springer-Verlag.
Kline, P. (1986). A handbook of test construction. New York: Methuen.
Kolen, M. J., & Brennan, R. L. (2004). Test equating, scaling and linking: Methods and practices (2nd
ed.). New York: Springer-Verlag.
Kolen, M. J., Hanson, B. A., & Brennan, R. L. (1992). Conditional standard errors of measurement
for scale scores. Journal of Educational Measurement, 29, 285–307.
Kothari, C. R. (2006). Research methodology: Methods and techniques (3rd ed.). New Delhi, India:
New Age International.
Lattin, J., Carroll, D. J., & Green, P. E. (2003). Analyzing multivariate data. Pacific Grove, CA:
Brooks/Cole.
Lawshe, C. H. (1975). A quantitative approach to content validity. Personnel Psychology, 28,
563–575.
Lazarsfeld, P. F., & Henry, N. W. (1968). Latent structure analysis. Boston: Houghton Mifflin.
Lee, P. M. (2004). Bayesian statistics: An introduction (3rd ed.). New York: Wiley.
Levy, P. S., & Lemeshow, S. (1991). Sampling of populations. New York: Wiley.
Likert, R. (1932). A technique for the measurement of attitudes. Archives of Psychology, 22(140),
1–55.
Linn, R. L., & Slinde, J. (1977). The determination of the significance of change between pre- and
post-testing periods. Review of Educational Research, 47, 121–150.
Lomax, R. (2001). Statistical concepts: A second course for education and the behavioral sciences (2nd
ed.). Mahwah, NJ: Erlbaum.
Lord, F. M. (1952). A theory of test scores [Monograph]. Psychometrika, 7(7), 1–84.
Lord, F. M. (1980). Applications of item response theory to practical testing problems. Hillsdale, NJ:
Erlbaum.
Lord, F. M., & Novick, M. (1968). Statistical theories of mental test scores. New York: Addison-Wesley.
Magnusson, D. (1967). Test theory. Reading, MA: Addison-Wesley.
Masters, G. N. (1982). A Rasch model for partial credit scoring. Psychometrika, 47(2), 149–174.
McAdams, D. P., & Pals, J. L. (2007). The role of theory in personality research. In R. W. Robins,
R. C. Fraley, & R. F. Krueger (Eds.), Handbook of research methods in personality psychology
(pp. 3–20). New York: Guilford Press.
McArdle, J. J. (2007). Five steps in the structural factor analysis of longitudinal data. In R. Cudeck &
R. C. MacCallum (Eds.), Factor analysis at 100: Historical developments and future directions.
Mahwah, NJ: Erlbaum.
McDonald, R. P. (1967). Non-linear factor analysis. [Psychometric Monograph No. 15]. Iowa City,
IA: Psychometric Society.
McDonald, R. P. (1982). Linear versus nonlinear models in item response theory. Applied Psycho-
logical Measurement, 6, 379–396.
McDonald, R. P. (1985). Factor analysis and related methods. Hillsdale, NJ: Erlbaum.
McDonald, R. P. (1999). Multidimensional item response models. In Test theory: A unified treatment (pp. 309–324).
Mahwah, NJ: Erlbaum.
McDonald, R. P., & Ahlawat, K. S. (1974). Difficulty factors in binary data. British Journal of Math-
ematical and Statistical Psychology, 27, 82–99.
Mertler, C. A., & Vannatta, R. A. (2010). Advanced and multivariate statistical methods (4th ed.).
Glendale, CA: Pyrczak.
Messick, S. (1988). The once and future issues of validity: Assessing the meaning and consequences
of measurement. In H. Wainer & H. I. Braun (Eds.), Test validity (pp. 33–45). Hillsdale,
NJ: Erlbaum.
Messick, S. (1989). Validity. In R. L. Linn (Ed.), Educational measurement (3rd ed., pp. 13–103).
New York: Macmillan.
Messick, S. (1995). Standards of validity and the validity of standards in performance assessment.
Educational Measurement: Issues and Practice, 14(4), 5–8.
Miller, M. B. (1995). Coefficient alpha: A basic introduction from the perspectives of classical test
theory and structural equation modeling. Structural Equation Modeling: A Multidisciplinary
Journal, 2, 255–273.
Millman, J., & Greene, J. (1989). The specification and development of tests of achievement and
ability. In R. L. Linn (Ed.), Educational measurement (3rd ed.). New York: American Council
on Education and Macmillan.
Mills, C. N., & Melican, G. J. (1988). Estimating and adjusting cutoff scores: Features of selected
methods. Applied Measurement in Education, 1, 261–275.
Mislevy, R. J., & Bock, R. D. (2003). BILOG-MG 3: Item and scoring of binary items and one-, two-,
and three-parameter logistic models. Chicago: Scientific Software International.
Mokken, R. J., & Lewis, C. (1982). A nonparametric approach to the analysis of dichotomous item
responses. Applied Psychological Measurement, 6, 417–430.
Molenaar, I. W. (2002). Introduction to nonparametric item response theory (vol. 5). Thousand Oaks,
CA: Sage.
Molenaar, P. C. M. (2004). Five steps in the structural factor analysis of longitudinal data. In R.
Cudeck & R. C. MacCallum (Eds.), Factor analysis at 100: Historical developments and future
directions (pp. 99–130). Mahwah, NJ: Erlbaum.
Mosier, C. I. (1940). A modification of the method of successive intervals. Psychometrika, 5, 101–107.
Mulaik, S. A. (1987). A brief history of the foundations of exploratory factor analysis. Multivariate
Behavioral Research, 22, 267–305.
Muthén, B. O. (2007). Mplus computer program version 5.2. Los Angeles: Muthén & Muthén.
Muthén, B. O., & Hofacker, C. (1988). Testing the assumptions underlying tetrachoric correla-
tions. Psychometrika, 53(4), 563–578.
Muthén, B. O., & Muthén, L. (2010). Mplus computer program version 6.2. Los Angeles: Muthén &
Muthén.
Nandakumar, R., & Stout, W. (1993). Refinement of Stout’s procedure for assessing latent trait
unidimensionality. Journal of Educational Statistics, 18, 41–68.
Nedelsky, L. (1954). Absolute grading standards for objective tests. Educational and Psychological
Measurement, 14, 3–19.
Nunnally, J. C., & Bernstein, I. (1994). Psychometric theory (3rd ed.). New York: McGraw-Hill.
Osgood, C. E., Suci, G. J., & Tannenbaum, P. H. (1957). The measurement of meaning. Urbana:
University of Illinois Press.
Ostini, R., & Nering, M. L. (2006). Polytomous item response theory models. Thousand Oaks, CA: Sage.
Paxton, P. M., Curran, P., Bollen, K. A., Kirby, J. A., & Chen, F. (2001). Monte Carlo simulations in
structural equation models. Structural Equation Modeling, 8, 287–312.
Pearson, K. (1902). On the systematic fitting of curves to observations and measurements.
Biometrika, 1, 265–303.
Pearson Education, Inc. (2015). Stanford Achievement Test (10th ed.). San Antonio, TX: Author.
Pearson, E. S., & Hartley, H. O. (1966). Biometrika tables for statisticians. Cambridge, UK: Cambridge
University Press.
Pedhazur, E. J. (1982). Multiple regression in behavioral research: Explanation and prediction
(2nd ed.). Fort Worth, TX: Harcourt Brace Jovanovich.
Pedhazur, E. J., & Schmelkin, L. P. (1991). Measurement, design and analysis: An integrated approach.
Mahwah, NJ: Erlbaum.
Peters, C. L. O., & Enders, C. (2002). A primer for the estimation of structural equation models
with missing data. Journal of Targeting, Measurement and Analysis for Marketing, 11, 81–95.
Peterson, N. G., Kolen, M. J., & Hoover, H. D. (1989). Scaling, norming and equating. In R. L. Linn
(Ed.), Educational measurement (3rd ed.). New York: American Council on Education/Macmillan.
Press, J. (2003). Subjective and objective Bayesian statistics: Principles, models, and applications. New
York: Wiley.
Price, L. R., Laird, A. R., Fox, P. T., & Ingham, R. (2009). Modeling dynamic functional neuroimag-
ing data using structural equation modeling. Structural Equation Modeling: A Multidisciplinary
Journal, 16, 146–172.
Price, L. R., Lurie, A., & Wilkins, C. (2001). EQUIPERCENT Computer Program. Applied Psycho-
logical Measurement, 25(4), 332.
Price, L. R., Raju, N. S., & Lurie, A. (2006). Conditional standard errors of measurement for com-
posite scores. Psychological Reports, 98, 237–252.
Price, L. R., Tulsky, D., Millis, S., & Weiss, L. (2002). Redefining the factor structure of the Wechsler
Memory Scale–III: Confirmatory factor analysis with cross-validation. Journal of Clinical and
Experimental Neuropsychology, 24(5), 574–585.
Probstat. (n.d.). Retrieved from http://pirun.ku.ac.th/~b5054069.
Raju, N. S., Price, L. R., Oshima, T. C., & Nering, M. (2007). Standardized conditional SEM: A case
for conditional reliability. Applied Psychological Measurement, 31(3), 169–180.
Rasch, G. (1960). Probabilistic models for some intelligence and attainment tests. Copenhagen: Dan-
ish Institute of Educational Research.
Raudenbush, S. W. (2001). Toward a coherent framework for comparing trajectories of individual
change. In L. Collins & A. Sayer (Eds.), Best methods for studying change (pp. 33–64).
Washington, DC: American Psychological Association.
Raykov, T. (1997). Estimation of composite reliability for congeneric measures. Applied Psychologi-
cal Measurement, 21, 173–184.
Raykov, T. (1998). Coefficient alpha and composite reliability with interrelated nonhomogeneous
items. Applied Psychological Measurement, 22(4), 375–385.
Raykov, T., & Marcoulides, G. A. (2011). Introduction to psychometric theory. New York: Routledge.
Reckase, M. D. (1985). The difficulty of test items that measure more than one ability. Applied
Psychological Measurement, 9(4), 401–412.
Reckase, M. D. (2009). Multidimensional item response theory. New York: Springer.
Rogosa, D. R., Brandt, D., & Zimowski, M. (1982). A growth curve approach to the measurement
of change. Psychological Bulletin, 92, 726–748.
Roskam, E. E. (1997). Models for speeded and time-limited tests. In W. J. van der Linden & R. K.
Hambleton (Eds.), Handbook of modern item response theory (pp. 187–208). New York: Springer.
Rulon, P. J. (1939). A simplified procedure for determining the reliability of a test by split-halves.
Harvard Educational Review, 9, 99–103.
Rudas, T. (2008). Handbook of probability: Theory and applications. Thousand Oaks, CA: Sage.
Samejima, F. (1969). Estimation of latent ability using a response pattern of graded scores. Psy-
chometrika Monograph, No. 17, pp. 1–97.
Samejima, F. (1972). A general model for free-response data. Psychometrika Monograph, No. 18.
Sax, G. (1989). Principles of educational and psychological measurement (3rd ed.). Belmont, CA:
Wadsworth.
Scheines, R., Hoijtink, H., & Boomsma, A. (1999). Bayesian estimation and testing of structural
equation models. Psychometrika, 64, 37–52.
Schmidt, F. L., Hunter, J. E., & Urry, V. W. (1976). Statistical power in criterion-related validity
studies. Journal of Applied Psychology, 61, 473–485.
Schumacker, R. E., & Lomax, R. G. (2010). A beginner’s guide to structural equation modeling (3rd
ed.). New York: Routledge.
Schwarz, G. (1978). Estimating the dimension of a model. Annals of Statistics, 6, 461–464.
Scientific Software International. (2003a). TESTFACT version 2.0 computer program. Chicago:
Author.
Scientific Software International. (2003b). BILOG version 3.0 computer program. Chicago: Author.
Shadish, W. R., Cook, T. D., & Campbell, D. T. (2002). Experimental and quasi-experimental designs
for generalized causal inference. New York: Houghton Mifflin.
Shavelson, R. J., & Webb, N. M. (1991). Generalizability theory: A primer. Newbury Park, CA:
Sage.
Spearman, C. (1904). General intelligence: Objectively determined and measured. American Jour-
nal of Psychology, 15, 201–293.
Spearman, C. (1907). Demonstration of formulae for true measurement of correlation. American
Journal of Psychology, 18, 161–169.
Stanley, J. C. (1970). Reliability. In R. L. Thorndike (Ed.), Educational measurement (2nd ed.,
pp. 359–442). Washington, DC: American Council on Education.
Stevens, J. P. (2003). Applied multivariate statistics for the social sciences (4th ed.). Mahwah, NJ:
Erlbaum.
Stevens, S. S. (1946). On the theory of scales of measurement. Science, 103, 677–680.
Stevens, S. S. (Ed.). (1951a). Handbook of experimental psychology. New York: Wiley.
Stevens, S. S. (1951b). Mathematics, measurement, and psychophysics. In S. S. Stevens (Ed.),
Handbook of experimental psychology (pp. 1–49). New York: Wiley.
Stocking, M., & Lord, F. M. (1983). Developing a common metric in item response theory. Applied
Psychological Measurement, 7, 201–210.
Stout, W. (1987). A nonparametric approach for assessing latent trait unidimensionality. Psy-
chometrika, 52, 589–617.
Stout, W. (2006). DIMTEST: Nonparametric dimensionality assessment, version 2.1. Minneapolis,
MN: Assessment Systems Corporation.
Tabachnick, B., & Fidell, L. (2007). Using multivariate statistics (5th ed.). Boston: Allyn & Bacon.
Taylor, H. C., & Russell, J. T. (1939). The relationship of validity coefficients to the practical effec-
tiveness of tests in selection. Journal of Applied Psychology, 23, 565–578.
Thissen, D., & Wainer, H. (2001). Test scoring. Mahwah, NJ: Erlbaum.
Thompson, B. (2000). Q-technique factor analysis: One variation on the two-mode factor analysis
of variables. In L. G. Grimm & P. R. Yarnold (Eds.), Reading and understanding more multi-
variate statistics (pp. 207–226). Washington, DC: American Psychological Association.
Thurstone, L. L. (1927). Three psychophysical laws. Psychological Review, 34, 424–432.
Torgerson, W. (1958). Theory and methods of scaling. New York: Wiley.
Verhelst, N. D., Verstralen, H. H. F. M., & Jansen, M. G. H. (1997). A logistic model for time-
limited tests. In W. J. van der Linden & R. K. Hambleton (Eds.), Handbook of modern item response
theory (pp. 169–186). New York: Springer.
von Davier, A. (2011). Statistical models for test equating, scaling and linking. New York: Springer.
Wainer, H., Bradlow, E. T., & Du, Z. (2000). Testlet response theory: An analog for the 3-PL useful
in testlet-based adaptive testing. In W. J. van der Linden & C. A. W. Glas (Eds.), Computer-
ized adaptive testing: Theory and practice (pp. 245–270). Boston, MA: Kluwer-Nijhoff.
Wainer, H., Bradlow, E., & Wang, X. (2007). Testlet response theory and its applications. New York:
Cambridge University Press.
Wainer, H., & Kiely, G. (1987). Item clusters and computer adaptive testing: A case for testlets.
Journal of Educational Measurement, 24, 185–202.
Waller, N. (2006). Construct validity in psychological tests. In N. Waller, L. Younce, W. Grove,
D. Faust, & M. Lenzenweger (Eds.), A Paul Meehl reader: Essays on the practice of scientific
psychology (pp. 9–30). Mahwah, NJ: Erlbaum.
Wechsler, D. (1997a). The WAIS-III/WMS-III technical manual. San Antonio, TX: Psychological
Corporation, Harcourt, Brace & Co.
Wechsler, D. (1997b). Wechsler Adult Intelligence Scale—Third edition. San Antonio, TX: Psycho-
logical Corporation.
Wechsler, D. (2008). Wechsler Adult Intelligence Scale—Fourth edition. San Antonio, TX: Psycho-
logical Corporation.
Whitely (Embretson), S. E. (1980). Multicomponent latent trait models for ability tests. Psychometrika,
45, 479–494.
Wilson, M. (2005). Constructing measures: An item response modeling approach. Mahwah, NJ:
Erlbaum.
Wright, B. D., & Masters, G. N. (1982). Rating scale analysis. Chicago: MESA Press.
Wright, B. D., & Stone, M. H. (1979). Best test design. Chicago: MESA Press.
Yen, W. (1984). Effects of local item dependence on the fit and equating performance of the three-
parameter logistic model. Applied Psychological Measurement, 8, 125–145.
Yen, W. (1993). Scaling performance assessments: Strategies for managing local item depen-
dence. Journal of Educational Measurement, 30, 187–213.
Zieky, M. J., Perie, M., & Livingston, S. A. (2008). Cutscores: A manual for setting standards of
performance on educational and occupational tests. Princeton, NJ: Educational Testing Service.
Zimmerman, D. W., & Williams, R. H. (1982). Gain scores in research can be highly reliable. Jour-
nal of Educational Measurement, 19, 149–154.
Zimmerman, D. W., Williams, R. H., & Zumbo, B. (1993). Gain scores in research can be highly
reliable. Journal of Educational Measurement, 19(2), 149–154.
Author Index

Note. f, n, or t following a page number indicates a figure, note, or a table.

Adams, R. J., 344 Bock, R. D., 149, 335t, 343, 344, 363, 369
Ahlawat, K. S., 338 Bollen, K. A., 300
Aiken, L. R., 152, 153 Bond, T. G., 334, 366
Aiken, L. S., 76 Boomsma, A., 473
Aitkin, M., 335t, 363 Boring, E. G., 4
Akaike, H., 403t Bradlow, E. T., 333, 335t, 364
Albert, J. H., 335t Brandt, D., 251
Algina, J., 9, 146, 156, 187, 212, 214, 216, 218, 221, 263, Brennan, R. L., 22, 136, 139, 252, 258, 261f, 262, 263,
277n, 279n, 280n, 290, 299, 300, 308, 415, 419, 269t, 270, 287, 331, 424, 426, 427, 437, 439
428, 434–436, 443 Brown, T. A., 291, 292, 302, 319, 321
Allen, M. J., 22, 114, 497 Browne, M. W., 292
Anastasi, A., 127 Bruce, V., 147
Anderson, R. E., 102, 293f, 316f Bunch, M. B., 193, 194, 199
Andrich, D., 366 Bush, R. R., 333
Andrzejewski, M. E., 19f, 38f, 40f
Angoff, W. H., 180, 181, 186, 424, 428, 434
C

Camilli, G., 382


B
Campbell, D. T., 134, 140, 174
Baker, F. B., 331, 363, 364, 371, 382, 404, 445 Card, N. A., 243
Bayes, T., 469 Carlin, J. B., 471
Bennett, J. F., 156 Carnap, R., 454
Benson, J., 128, 129t Carroll, D. J., 151
Berk, R. A., 3 Carroll, J. B., 454
Bernstein, I., 146, 147, 166, 228, 245 Cattell, R. B., 291, 337, 454
Birnbaum, A., 335t Chen, F., 300
Birnbaum, M. H., 9, 144, 150 Chib, S., 335t
Black, W. C., 102, 293f, 316f Cizek, G. J., 193, 194, 199
Bloom, B. S., 170, 172t Cohen, J., 76, 102, 500t, 503t


Cohen, P., 76 Fox, P. T., 473


Cohen, R. J., 63, 102, 103, 126, 138, 158, 194, 408, 449 Fraser, C., 343
Comrey, A. L., 292, 299, 322 Frisbie, C., 65, 170, 172t, 185, 185t, 196, 199, 200
Conover, W. J., 503, 503t Furst, E. J., 170
Cook, T. D., 174
Coombs, C. H., 153, 230
G
Costa, P. T., 1
Crocker, L., 9, 146, 156, 187, 212, 214, 216, 218, 221, Gable, R. K., 3, 150, 178
263, 277n, 279n, 280n, 281t, 290, 299, 300, 308, Gagné, R. M., 170, 172t
415, 419, 428, 429t, 434–436, 443 Galton, F., 1, 4, 10, 452f
Cronbach, L. J., 63, 64, 104, 125–127, 136, 139, 205t, Gelman, A., 471
233, 251, 258 Gemignani, M. C., 465, 466
Cudeck, R., 315 Georgeson, M. A., 147
Curran, P., 300 Ghiselli, E. E., 104
Gibbons, R., 343
Gill, J., 469
D
Glass, G. V., 106, 138, 485
Darwin, C., 4 Glenberg, A. M., 19f, 38f, 40f
Dawes, R. M., 153, 154, 156 Gleser, G. C., 63, 136
de Ayala, R., 17, 330, 331, 344, 347, 348, 363, 364, 375, Glutting, J., 245, 247
382, 398, 404, 441 Green, P. E., 151
Divgi, D. R., 506, 507n Green, P. R., 147
Dorans, N. J., 425, 426 Greene, J., 170
Draper, N. R., 80, 85, 102 Gregory, R. J., 2t, 65, 103, 416, 418f
Driscoll, M. P., 170 Groves, R. M., 409
Du Toit, M., 187, 364, 370, 371, 377, 383, 498, 504, 505, Guilford, J. P., 68, 69, 144, 151
509 Guion, R., 64, 126
Dunn-Rankin, P., 151 Gulliksen, H., 63, 68, 125, 204, 206, 214, 244
Dunson D. B., 473 Guttman, L. A., 152, 153, 231, 335t

E H

Ebel, R. L., 65, 170, 172t, 185, 185t, 196, 199, 200 Haebara, T., 445
Eignor, D. R., 425 Hair, J. F., 102–104, 106, 138–140, 293f, 316f, 327
Enders, C. K., 162 Haladyna, T. M., 175, 176
Engelhart, M. D., 170 Hald, A., 452f, 469, 480, 488
Enzman, D., 506 Hambleton, R. K., 193, 197, 330, 331, 337, 345, 351,
354, 371, 440, 441, 443, 445
Han, C., 445
F
Hanson, B. A., 252
Fabrigar, L. R., 290, 291, 297, 301, 314, 316, 317, 322 Hartley, H. O., 485
Fechner, G. T., 147, 452f Hattie, J. A., 337
Feldt, L. S., 331 Hayes, W. I., 156
Ferdous, A. A., 347 Hebb, D. O., 454
Fidell, L., 85, 99, 107, 122, 123t, 125, 307t, 315, 494 Heise, D. R., 159
Fischer, G. H., 335t Henry, N. W., 335t
Fisher, R. A., 107 Hill, W. H., 170
Fiske, D. W., 60, 63, 134, 140 Hocking, R. R., 102
Flanagan, D. P., 6, 454 Hofacker, C., 506
Flynn, J. R., 6 Hoijtink, H., 473
Forrest, D. W., 4 Holland, P. W., 334, 425, 426
Fox, C. M., 334, 366 Hoover, H. D., 3
Fox, J.-P., 335t Hopkins, K. D., 106, 138, 485
Horn, J. L., 6 Likert, R., 158


Hoskins, M., 334 Linn, R. L., 241
Hosmer, D. W., 122, 123t, 124 Little, T. D., 243
Hotelling, H., 315 Livingston, S. A., 194
Hoyt, C., 240 Lomax, R. G., 319, 321, 349
Huberty, C. J., 106f, 107, 108, 138, 140 Lonero, P., 158
Hunter, J. E., 65 Lord, F. M., 101, 146, 160, 186, 206, 208, 212, 214, 216,
217, 219, 221, 244, 245, 248, 249, 330, 331, 334,
335t, 337, 338, 363, 431, 433, 440, 445
I
Lurie, A., 331, 439
Ingham, R., 473
M
J
Magnusson, D., 209f
Jannarone, R. J., 365 Marcoulides. G. A., 315
Jansen, M. G. H., 365 Masters, G. N., 335t, 366
Jones, L. V., 149 McAdams, D. P., 63
Jöreskog, K., 338, 339, 341, 505 McArdle, J. J., 291
McCrae, R. R., 1
McDermott, P., 245
K
McDonald, R. P., 291, 312, 314, 335t, 338, 343, 344
Kane, M. T., 126, 136, 453 McGrew, K. S., 6
Katz, R. C., 158 Meehl, P. E., 125, 126
Kelderman, H., 335t, 344 Melican, G. J., 196
Kelley, T. L., 221 Mertler, C. A., 325
Kendall, M. G., 505, 506 Messick, S., 60, 62, 104, 126, 127, 127f, 128, 134, 137,
Kerlinger, F. N., 178, 301, 326 140
Khuri, A., 465 Millis, S., 132
Kiely, G., 364 Millman, J., 170
Kim, D., 347 Mills, C. N., 196
Kim, S. H., 331, 363, 364, 371, 382, 404 Minium, E., 20f
King, B., 20f Mislevy, R. J., 369
Kirby, J. A., 300 Mokken, R. J., 335t
Klein, M., 402 Molenaar, I. W., 335t
Kleinbaum, D. G., 402 Molenaar, P. C. M., 292
Kline, P., 130 Moses, T. P., 425
Knezek, G. A., 151 Mosier, C. I., 148
Kolen, M. J., 3, 252, 331, 362, 424, 426, 427, 437, 439, Mosteller, F., 333
439f Mulaik, S. A., 291
Kothari, C. R., 292 Muraki, E., 343
Krathwohl, D. R., 170 Muthén, B. O., 243, 506
Muthén, L., 506

L
N
Laird, A. R., 473
Lattin, J., 151, 156, 302 Nanda, H., 136
Lawshe, C. H., 126, 138 Nandakumar, R., 341
Lazarsfeld, P. F., 335t Nedelsky, L., 195
Lee, H. B., 178, 292, 299, 301, 322, 326 Nering, M. L., 252, 347, 404
Lee, P. M., 473 Novick, M., 101, 146, 160, 206, 208, 212, 214, 216, 217,
Lemeshow, S., 122, 123t, 124, 125, 174 219, 221, 244, 245, 248, 249, 334, 335t, 338, 398,
Levy, P. S., 174 506
Lewis, C., 335t Nunnally, J. C., 146, 147, 166, 228, 245
O Shadish, W. R., 174


Shavelson, R. J., 262
Ortiz, S. O., 6
Slinde, J., 241
Osgood, C. E., 159
Smith, H., 80, 85, 102
Oshima, T. C., 252
Sörbom, D., 338, 339, 341, 505
Ostini, R., 404
Spearman, C., 68, 206, 291
Stanley, J. C., 245, 247
P Stern, H. S., 471
Stevens, J. P., 76, 107, 108
Pals, J. L., 63
Stevens, S. S., 14, 15t, 16, 20, 22, 143, 148, 153
Paxton, P. M., 300
Stocking, M., 445
Pearson, E. S., 485
Stone, M. H., 333, 334, 335t
Pearson, K., 452f, 480, 488
Stout, W., 341
Pedhazur, E. J., 99, 107, 109, 113, 114t, 139, 349
Stuart, A., 505, 506
Perie, M., 194
Suci, G. J., 159
Peters, C. L. O., 162
Swaminathan, H., 330, 331
Peterson, N. G., 3
Swerdlik, M., 63, 102, 103, 126, 138, 158, 194, 408,
Pitoniak, M. J., 193
449
Plake, B. S., 197
Press, J., 471
Price, L. R., 132, 252, 331, 319, 362, 439, 473 T

Tabachnick, B., 85, 99, 107, 122, 123t, 125, 307t, 315,
Q 494
Tannenbaum, P. H., 159
Quetelet, A., 452f, 480
Tatham, R. L., 102, 293f, 316f
Taylor, H. C., 113
R Thissen, D., 68, 333
Thompson, B., 292
Rajaratnam, N., 136
Thurstone, L. L., 148, 452f
Raju, N. S., 252, 331, 362
Torgerson, W., 142, 144, 146, 153, 160
Rasch, G., 333, 334, 335t, 366
Tulsky, D., 132
Raudenbush, S. W., 243
Raykov, T., 219, 315
Reckase, M. D., 335t, 344 U
Rogers, H. J., 331
Rogosa, D. R., 241 Urry, V. W., 65
Roskam, E. E., 365
Rubin, D. B., 471 V
Rudas, T., 212
Rulon, P. J., 231 Vannatta, R. A., 325
Russell, J. T., 113 Verhelst, N. D., 365
Verstralen, H. H. F. M., 365
von Davier, A., 426, 427, 439
S

Samejima, F., 335t


W
Santman, J., 158
Sax, G., 184, 185t, 186t Wainer, H., 68, 333, 335t, 364, 365
Scheines, R., 473 Wallace, S., 151
Schmelkin, L. P., 110, 113, 114t, 139 Waller, N., 60
Schmidt, F. L., 65 Wang, W. C., 344
Schumacker, R. E., 319, 321 Wang, X., 333, 335t
Schwarz, G., 403t Webb, N. M., 262
Weber, E. H., 147, 162, 452f Y


Wechsler, D., 67, 132, 193, 229
Yen, W. M., 22, 114, 347, 348, 497
Wegner, D. T., 290, 291, 297, 301, 314, 316, 317,
322
Weiss, L., 132
Z
West, S. G., 76
Whitely (Embretson), S. E., 335t Zhang, G., 292
Wilkins, C., 439 Zhang, S., 151
Williams, R. H., 241, 243 Zieky, M. J., 194, 199, 200
Wilson, M. R., 344, 366 Zimmerman, D. W., 241, 243
Wolfe, M. B., 3, 150, 178 Zimowski, M., 241
Wright, B. D., 333, 334, 335t, 366 Zumbo, B., 243
Subject Index

Note. f or t following a page number indicates a figure or a table.

Ability, 337, 445–447, 447f Assumptions


Ability estimation, 362–364, 387t–388t, 442–443, 443t, factor analysis and, 324
444t item response theory and, 336–337
Absolute decisions, 260, 287 multiple linear regression and, 86t
Absolute terms, 19–20, 20f Pearson r and, 491–493, 492f, 493f
Absolute threshold, 147, 162 Attenuation, correction for. See Correction
Absolute zero, 19–20, 20f, 55 for attenuation
Accuracy, 453, 515 Attitudes, 178–179, 404
Achievement tests, 2t, 193. See also Psychological test Attributes
Additive theorem of probability, 462, 515 definition, 253, 515
Advanced test theory, 10 differences between ordinal and interval levels
Age-equivalent scores, 424–425 of measurement and, 19–20, 19f, 20f
Alternate choice format, 176t. See also Test items overview, 2, 451, 514
American Educational Research Association (AERA), test development and, 172–173
59 true score model and, 206–207
American Psychological Association (APA), 59
Analysis of variance (ANOVA)
B
definition, 55, 102, 287, 515
facets of measurement and universe scores and, 259 Backward selection, 125
generalizability theory and, 260–262, 261f, 266–271, Base rate, 113, 138
268t, 269f, 270t Bayesian methods, 335t, 475
overview, 82–83, 82t, 481 Bayesian probability, 469–474, 470f, 472f, 515
regression equation and, 90t, 100t Behavior, 3, 5–7, 7t, 8f
reliability and, 240, 241t, 253 Best line of fit, 51–52
single-facet crossed design and, 274–278, 275t, 276t Bias, 127t
sum of squares and, 96, 96t Bimodal distribution, 55
two-facet designs and, 282, 284t Biserial correlation, 188–189, 199, 497–498, 499t,
Anchor test, 435–436, 436t, 448 504
Angoff method, 196–197, 197t, 198–199 Bivariate relationships, 488–491, 490t, 506
Arithmetic mean, 475–476, 515 Bookmark method, 198–199
Association, 495–503, 496t, 499t, 500t Borderline examinee, 193, 195–196, 199


C Coefficients, 261f
Common factor model, 291, 309–312, 313f, 325
Canonical function, 111t, 114t, 116f. See also
Common factors, 291, 325
Discriminant analysis
Communality, 309–312, 313f, 324, 325
Categorical data, 173, 458
Communication, 452–454, 453f, 454, 515
Categorization, 14, 14f, 61f, 458
Comparative judgment, 148–150
Ceiling effects, 66, 102
Complex multiple-choice format, 176t. See also Test items
Central limit theorem, 363–364
Components, 312, 314, 314t, 326. See also Principal
Central tendency, 32, 33–34. See also Mean; Median; Mode
components analysis (PCA)
Chi-square statistics, 344, 346–347, 347t, 370, 402
Composite score
Choices, 150
coefficient alpha and, 238–239
Classes, 458
common standard score transformations or
Classical approach, 461
conversions, 423
Classical probability theory, 345
definition, 10, 253
Classical test theory (CTT)
norms and, 423, 448
compared to item response theory (IRT), 330–331, 441
overview, 7, 208
definition, 253, 287
reliability and, 223–228, 224t, 227t
factor analysis and, 296, 312, 314
Computer adaptive testing (CAT), 331, 404
generalizability coefficient and, 273–274
Concepts, 129t, 261f
generalizability theory and, 260, 261f, 273
Conditional distribution, 94
invariance property, 349–351, 350f, 351t
Conditional probability theory, 345
item response theory and, 404
Confidence interval
overview, 10, 257–258, 329
definition, 256, 287
reliability and, 67, 204
generalizability theory and, 281
standard error of measurement and, 281
overview, 245
strong true score theory and, 332–333
reliability and, 244, 246–248
Classical true score model, 204, 253
Confidence limits, 245, 248, 254
Classification
Confirmatory bias, 126
definition, 10
Confirmatory factor analysis (CFA). See also
discriminant analysis and, 106–114, 110t, 111t, 112t,
Factor analysis
113f, 114t
construct validity and, 132
overview, 2
definition, 138, 325
purpose of a test and, 169t
overview, 290, 293f, 319, 325
scaling models and, 162
principal components analysis and, 315–316
statistics and, 112t, 115t
structural equation modeling and, 319–322, 320f, 321f,
techniques for, 105–106, 106f
322f, 323f
Classification table
Congeneric tests, 219, 220t, 254
definition, 138
Consequential basis, 127t
logistic regression and, 122t, 124t
Consistency. See also Reliability
overview, 109–110, 112t, 116t, 257
Constant, 23, 55, 458, 515
Cluster analysis, 289–290
Constant error, 204–205, 254
Coefficient alpha
Construct validity. See also Constructs; Validity
composite scores based on, 238–239
correlational evidence of, 130–131
definition, 253
definition, 102, 138
estimating criterion validity and, 234–236, 235t, 236t,
evidence of, 127–130, 127f, 129t
237t
factor analysis and, 131–134, 133t, 134f
overview, 233, 233–235, 234t, 253
generalizability theory and, 136–137
Coefficient of contingency, 503, 503t
group differentiation studies of, 131
Coefficient of determination, 52–53, 53t, 55
overview, 10, 60, 126–127, 137, 141
Coefficient of equivalence, 229, 253
reliability and, 206
Coefficient of generalizability, 272–273, 287
Constructs. See also Construct validity; Individual
Coefficient of multiple determination, 80–83, 82t, 83t, 102
differences
Coefficient of reliability, 228–229, 240, 241t, 253. See
covariance and, 42
also Reliability
definition, 10
Coefficient of stability, 228–229, 253
overview, 5–6
test development and, 172–173 partial correlation and, 70–77, 73t, 75f
units of measurement and, 18–19 regression equation and, 85, 86t
validity continuum and, 61f standard-setting approaches and, 194
Content analysis, 61f, 173, 199 statistical estimation of, 66–68
Content validity. See also Validity Criterion-referenced test, 3, 10, 169t, 200. See also
definition, 103, 138 Norm-referenced test
limitations of, 126 Cross tabulation, 346t
overview, 63, 125–126, 137, 141 Cross validation, 85, 103, 138
Content validity ratio (CVR), 126, 138 Crossed designs, 260, 266, 287
Continuous data, 335t, 459 Cross-products matrices, 140
Continuous probability, 465–466 Cross-validation, 114
Continuous variable, 23–24, 55, 457, 515. See also Variance Crystallized intelligence. See also GfGc theory;
Convenience sampling, 409. See also Sampling Intel­lectual constructs
Convergent validity evidence, 134–135 correlation and, 45f
Conversions, 422–423 criterion validity, 66–67
Correction for attenuation, 68–70, 76–77, 103 factor analysis and, 292, 294, 294t, 295f, 296–301,
Correlated factors, 306–308 296t, 297t, 298t
Correlation. See also Correlation coefficients; Multiple item response theory and, 346, 346f
correlation; Partial correlation; Semipartial overview, 6–7, 7t, 8f, 455–456, 456t
correlation partitioning sums of squares, 54t
item discrimination and, 186 reliability and, 204
measures of, 495–503, 496t, 499t, 500t rules of correspondence and, 454–455, 455f
overview, 42–43, 44t, 45f, 488–491, 490t, 492, 513t, 514 scatterplot and, 45f
partial regression slopes, 90–92 standard error of estimate, 53f
Correlation coefficients. See also Correlation; Pearson structural equation modeling and, 319–322, 320f, 321f,
correlation coefficient 322f, 323f
correction for attenuation and, 76 subject-centered scaling and, 156–160, 157f, 158f, 159f
estimating criterion validity and, 83t subtests in the GfGc dataset, 23t
factor analysis and, 324 test development and, 166–167, 168–172, 168f, 169t,
semipartial correlation, 73–74, 75f 170t, 171t, 172t, 177, 191–192, 191t, 192f
Correlation matrix, 294, 296–301, 296t, 297t, 298t true score model and, 210t
Correlation ratio, 509 validity continuum and, 61–62
Correlational evidence, 130–131. See also Evidence Cumulative probability distribution (density) function,
Correlational studies, 127 465–466, 515
Counterbalancing, 432–435 Cumulative relative frequency distribution, 26, 36–37,
Counting, 460–461 55
Covariance Cumulative scaling model, 156, 162. See also
definition, 55 Scaling models
overview, 42, 45–47, 488–491, 490t, 492 Cutoff score, 193, 198–199, 200
Covariance matrix, 314, 490t
Covariance structural modeling, 46, 55, 133. See also
D
Structural equation modeling (SEM)
Covariation, 481–484, 515 Data, 461
Cramer’s contingency coefficient, 503, 503t Data analysis, 61f
Criterion, 61f, 166, 199 Data collection, 9, 322
Criterion contamination, 64, 66, 103 Data layout, 373–374, 374f
Criterion content, 60 Data matrix, 161, 161t, 163, 373–374, 374f
Criterion measure, 69–70 Data organization, 160–162, 161t
Criterion validity. See also Validity Data summary, 9
classification and selection and, 105–106, 106f Data types, 458–459
definition, 103 Data-driven approach, 333–334. See also Sampling
higher-order partial correlations and, 77–80, 79t Datum, 461, 515
high-quality criterion and, 63–66 Decision studies. See D-study
multiple linear regression and, 84, 84f, 85f Decision theory, 105, 138, 475, 515
overview, 63, 141 Decision-making process, 193–194
Degrees of freedom Domain of content, 166, 200


definition, 103 D-study
item response theory and, 344, 402 classical test theory and, 260
overview, 90 definition, 287
standard deviation and, 482 generalizability theory and, 261–262, 261f
Density functions, 475–481, 476f–477f, 478f, 486–487 single-facet crossed design and, 274–278, 275t, 276t
Dependability of measurement, 258. See also standard error of measurement and, 281
Generalizability theory steps in conducting, 263
Dependent variable, 458, 515. See also Variable universe score and, 259
Descriptive discriminant analysis (DDA), 107, 138
Descriptive statistics
E
definition, 56
overview, 22–23, 37t Ebel method, 196
reliability and, 241–243, 243t Educational achievement testing, 64–65
standard scores and, 413–415, 415t Eigenvalue
Deviance value, 402–403 definition, 139, 325, 404
Deviation scores discriminant analysis and, 110t, 114t
covariance and, 45–47 factor analysis and, 312, 314, 314t
definition, 254 item response theory and, 337
generalizability theory and, 264 overview, 108
overview, 219, 220t principal components analysis and, 317, 318t
Diagnostic purpose of a test, 169t Eigenvectors, 312, 314, 314t, 325
Dichotomous data Element, 161t, 163
item response theory and, 335t, 338 Equating. See also Test equating
one-parameter logistic IRT model and, 374–381, 376f, definition, 448
378f, 380t–381t, 381f equipercentile equating, 436–439, 438f, 439f
three-parameter logistic IRT model and, 389–399, linear methods, 428–429
393t–396t, 397f, 398f one test administered to each study group, anchor test
two-parameter logistic IRT model and, 381–389, administered to both groups (equally reliable
384f, 385t–386t, 387t–388t, 389f tests), 435–436, 436t
Difference limen (DL), 147, 163 overview, 427, 447–448
Difference scores, 241–243, 243t random groups with both tests administered to each
Differential item functioning (DIF), 331 group, counterbalanced, 432–435
Dimensionality random groups—one test administered to each group,
correlation matrix and, 337–341, 338f, 339f, 340f 429–432, 430t
definition, 404 test score linking and equating, 425–428, 426f, 428f,
item response theory and, 337, 341–344 429t
overview, 336–337 true score equating, 443, 445
Direct rankings, 150–151, 151t, 152t, 163 Equating function, 429, 448
Discrete data, 458, 515 Equipercentile, 431–432, 436–439, 438f, 439f
Discrete variable, 56, 457, 476–477. See also Variance Equipercentile equating, 448
Discriminal process, 163 Error, normality of, 494–495
Discriminant analysis Error of prediction (or residual), 50–51, 51f, 56, 347
definition, 138 Error of reproducibility, 153, 163
logistic regression and, 117–122, 117f, 118f, 119f, 121f, Error scores, 209–210, 210t, 214–216, 215f
122t Error variances, 261f, 311–312
multiple-group discriminant analysis, 114–116, 115t, Errors of estimation, 493–494, 494f
116f Essential tau-equivalence, 219, 220t, 254
overview, 106–114, 110t, 111t, 112t, 113f, 114t Events, 458, 461, 515
Discriminant function, 107, 116f, 138 Evidence
Discriminant z-score, 107, 138. See also z-score construct validity and, 127–130, 127f, 129t, 134–135
Distributions correlational evidence of construct validity, 130–131
Bayesian probability and, 472–473 factor analysis and, 132–133
factor analysis and, 322–323 overview, 59
shape, central tendency, and variability of, 31–42, 32f, validity continuum and, 61f, 62
36t, 37t, 40t, 41f Evidential bias, 127t
Examinee population, 173–174. See also Sampling Fixed facets of measurement, 260, 266, 287
Expectation (mean) error, 212 Floor effects, 66, 103
Expected a posteriori (EAP), 364 Fluid intelligence. See also GfGc theory;
Explication, 142 Intellectual constructs
Exploratory factor analysis (EFA). See also Factor analysis correlation and, 45f
construct validity and, 131 estimating criterion validity and, 72
definition, 139, 326 factor analysis and, 292, 294, 294t, 295f, 296–301,
overview, 290, 293f 296t, 297t, 298t
principal components analysis and, 315–316 overview, 6–7, 7t, 8f, 455–456, 456t
Extended matching format, 176t partitioning sums of squares, 54t
External stage, 129t regression and, 49, 50f
reliability and, 204
rules of correspondence and, 454–455, 455f
F
scatterplot and, 45f
Facets standard error of estimate, 53f
definition, 287 structural equation modeling and, 319–322, 321f,
generalizability theory and, 266–271, 268t, 269f, 270t 322f, 323f
of measurement and universe scores, 259–260 subject-centered scaling and, 156–160, 157f, 158f, 159f
overview, 258 subtests in the GfGc dataset, 23t
two-facet designs, 281–284, 282t, 283t, 284t, 285t, 286t test development and, 166–167, 168–172, 168f, 169t,
Factor, 296, 326 170t, 171t, 172t, 177
Factor analysis Forward selection, 125
applied example, 292, 294, 294t, 295f Fourth moment, 481, 485, 515. See also Kurtosis
communality and uniqueness and, 309–312, 313f Frequency, 417t, 420t, 461, 515
compared to principal components analysis, 315– Frequency distributions
318, 316f, 317t, 318t definition, 515
components, eigenvalues, and eigenvectors, 312, 314, graphing, 26–30, 27f, 28f, 40f
314t overview, 24–26, 24t, 25t, 27f, 28f, 461, 464t
construct validity and, 131–134, 133t, 134f Frequency polygon, 26, 28–29, 40f, 56. See also Relative
correlated factors and simple structure, 306–308 frequency polygon
correlation matrix and, 337–341, 338f Frequentist approach, 461
errors to avoid, 322–325 Frequentist probability, 515
factor loadings and, 294, 296–301, 296t, 297t, 298t F-test, 89, 509–510
factor rotation and, 301–306, 302f, 303f, 304f, 305t,
306t, 307t
G
history of, 291–292, 293f
overview, 10, 289–291, 325 G coefficient. See Coefficient of generalizability
structural equation modeling and, 319–322, 320f, Galton, Francis, 4, 10
321f, 322f, 323f General theory of intelligence (GfGc theory). See GfGc
test development and, 180 theory
Factor extraction, 297 Generalizability coefficient, 136–137, 258, 273–274, 287
Factor indeterminacy, 300, 326 Generalizability study, 139, 263. See also G-study
Factor loading Generalizability theory. See also D-study; G-study
construct validity and, 133t analysis of variance and, 260–262, 261f
definition, 139, 326 classical test theory and, 260, 273–274
overview, 133, 294, 296–301, 296t, 297t, 298t construct validity and, 136–137
Factor matrix, 293f, 301–302 definition, 254, 287
Factor rotation, 301–306, 302f, 303f, 304f, 305t, 306t, facets of measurement and universe scores, 259–260
307t, 326 overview, 10, 257–258, 286
Factor-analytic studies, 127 proportion of variance for the person effect and,
False negative, 110, 113–114, 139 271–273
False positive, 110, 113–114, 139 purpose of, 258
Falsifiability, 333, 404 reliability and, 251–252
First moment, 480–481, 515 single-facet crossed design and, 274–278, 275t, 276t
First-order partial correlation, 71, 76–77, 103. See also single-facet design with multiple raters rating on two
Partial correlation occasions, 280, 281t
Generalizability theory (continued) higher-order partial correlations and, 79t


single-facet design with the same raters standard error of estimate and, 95, 95f
on multiple occasions, 278–279 standardized regression equation, 94
single-facet nested design with multiple raters, 279–280 High-quality criterion, 63, 63–66. See also Criterion
single-facet person by item analysis, 266–271, 268t, validity
269f, 270t Histogram, 26, 56
standard error of measurement and, 281 Homogeneous scale, 130–131, 139
statistical model of, 263–265, 265t, 266t Homoscedastic errors of estimation, 493–494, 494f
two-facet designs, 281–284, 282t, 283t, 284t, 285t, 286t Horizontal equating, 427, 448
GfGc theory. See also Intellectual constructs Hypothesis, 9, 88, 92
factor analysis and, 289–290, 292, 294, 294t, 295f
overview, 6–7, 7t, 455–456, 456t
I
reliability and, 204
role validity and, 61–62 Identity, 16–17, 56
rules of correspondence and, 454–455, 455f Improper solution, 473, 516
subject-centered scaling and, 156–160, 157f, 158f, 159f Incomplete data, 162
test development and, 166–167, 168f, 191–192, 191t Independent events, 461–462, 516
Goodness-of-fit test, 404 Independent trial, 516
item response theory and, 370 Independent variable, 458, 516. See also Variable
logistic regression and, 121f, 123t Index measurement, 156, 163, 261f
overview, 333 Individual differences, 3, 5–6, 16–17, 158. See also
reliability and, 231t Constructs
Grade-equivalent scores, 424–425 Inferential statistical techniques, 22–23, 56, 485
Graphing frequency distributions, 26–30, 27f, 28f. See Information function, 358–362, 360t–361t, 361f
also Frequency distributions Instructional value or success purpose of a test, 169t
Group difference studies, 127 Instrument development, 166–181, 167f, 168f, 169t, 170t,
Group differentiation studies, 131 171t, 172t, 176t, 178f. See also Test development
Grouped frequency distribution, 27f, 56
Group-level statistics, 410 Intellectual constructs. See also Constructs; GfGc theory
G-study. See also Generalizability study overview, 6
classical test theory and, 260 subject-centered scaling and, 156–160, 157f, 158f, 159f
definition, 287 test development and, 166–167, 168f, 177
generalizability theory and, 261–262, 261f units of measurement and, 18–19
overview, 258 Intelligence tests. See also Crystallized intelligence; Fluid
single-facet crossed design and, 274–278, 275t, 276t intelligence; GfGc theory; Psychological test; Short-
single-facet design with multiple raters rating on two term memory
occasions, 280, 281t criterion validity, 64–65
single-facet design with the same raters overview, 2t
on multiple occasions, 278–279 real number line and, 14f
standard error of measurement and, 281 test development and, 168–172, 169t, 170t, 171t, 172t
steps in conducting, 263 Intercept, 47–49, 56, 125
two-facet designs, 281–284, 282t, 283t, 284t, 285t, 286t Interindividual differences, 42
universe score and, 259 Interlinear item set format, 176t. See also Test items
Guttman reliability model, 232, 232t Internal consistency
Guttman scaling model, 151–153. See also Scaling models definition, 254
Guttman’s equation, 231–232, 254 overview, 226, 233
reliability and, 204, 233–235, 234t, 253
true score model and, 214
H
Interpretation
Heteroscedastic errors, 245, 254 percentile ranks and, 416, 418, 418f
Heterotrait–heteromethod, 135, 139 reliability and, 248, 249, 251
Heterotrait–monomethod, 135, 139 two-facet designs and, 286t
Higher-order partial correlation, 77–80, 79t, 103 Interpretative scores, 180–181. See also Scoring
Highly Valid Scale of Crystallized Intelligence (HVSCI) Interval scale. See also Measurement; Scaling
criterion validity and, 67 compared to ordinal levels of measurement, 19–20, 19f
estimating criterion validity and, 71–73, 84 definition, 56
overview, 14–17, 15t, 16f, 21, 146 three-parameter logistic IRT model and, 389–399,
subject-centered scaling and, 160 393t–396t, 397f, 398f
unfolding technique and, 153 true score equating, 443, 445
Intraindividual differences, 3, 42 two-parameter logistic IRT model and, 381–389,
Invariance property, 349–351, 350f, 351t, 441, 442f 381f, 385t–386t, 387t–388t, 389f
Invariant comparison, 366 when traditional models of are inappropriate to use,
Item. See Test items 364–365
Item analysis, 180, 182, 183t, 184t, 191–192 Item validity index, 191–192, 191t, 192f, 200. See also
Item characteristic curve (ICC), 332, 404 Test items; Validity
Item difficulty, 182, 183t, 184t, 257–258
Item discrimination, 184–186, 185t, 186t
J
Item facet, 262, 282, 287
Item format, 175, 200. See also Test items Joint density function, 487, 516
Item homogeneity, 130–131, 139, 254 Joint maximum likelihood estimation (JMLE), 363, 405.
Item information, 373, 388–389, 389f See also Maximum likelihood estimation (MLE)
Item information function (IIF) Joint probability, 351–358, 352f, 354t, 355t, 357f
definition, 404 Judgment scaling, 163
item response theory and, 358–362, 360t–361t, 361f Judgments, 148–150, 150
three-parameter logistic IRT model and, 397–399, 398f Just noticeable difference (JND), 147, 163
Item parameter estimates, 358–362, 360t–361t, 361f,
362–364
K
Item reliability index, 190–192, 191t, 192f, 200. See also
Test items Küder–Richardson 20, 233, 238–239, 253, 254
Item response function (IRF), 332 Küder–Richardson 21, 233, 238–239, 253, 254
Item response theory (IRT) Kurtosis, 410, 481, 485–486, 516
assumptions of, 336–337
Bayesian methods and, 475
L
bookmark method and, 198–199
compared to classical test theory (CTT), 330–331 Language development, 66–67
conceptual explanation of, 334, 336, 336f Latent class analysis (LCA), 344, 405
correlation matrix and, 337–341, 338f, 339f, 340f Latent factor, 291, 326
data layout, 373–374, 374f Latent trait. See also Item response theory (IRT)
definition, 163, 405, 516 definition, 405, 516
dimensionality assessment specific to, 341–344 item response theory and, 338, 439–440
invariance property, 349–351, 350f, 351t overview, 148, 331
item parameter and ability estimation and, 362–364 Latent variable, 336f, 458. See also Variable
item response theory and, 344 Least-squares criterion, 51–52, 56, 139
joint probability of based on ability, 351–358, 352f, Likelihood ratio tests, 123t, 402
354t, 355t, 357f Likelihood value, 118, 139
linear models and, 366–371, 368f, 369f, 370f Likert-type items, 178f, 404. See also Test items
local independence of items, 345–348, 346t, 347f Linear equation, 109, 264
logistic regression and, 366–371, 368f, 369f, 370f Linear models, 366–371, 368f, 369f, 370f, 428–429
maximum likelihood estimation (MLE) and, 468 Linear regression. See also Regression; Simple linear
model comparison approach and, 400–403, 403t regression
observed score, true score, and ability, 445–447, 447f assessing, 509–510
one-parameter logistic IRT model and, 374–381, 376f, generalizability theory and, 263–265
378f, 380t–381t, 381f overview, 47
overview, 10, 148, 329–330, 331–332, 404 Pearson r and, 492–493, 492f, 493f
philosophical views on, 333–334, 335t Linear scaling equation, 412
Rasch model and, 366–373, 368f, 369f, 370f, 372t Linear transformation, 411–415, 413t, 415t, 482, 484,
reliability and, 243, 252 485, 516
scaling and, 160 Linear z-scores, 416, 417t. See also z-score
standard error of ability, 358–362, 360t–361t, 361f Local independence, 331, 345–348, 346t, 347f, 405
strong true score theory and, 332–333 Local norms, 419, 448
test dimensionality and, 337 Location, 475
test score equating and, 439–443, 442f, 443t, 444t Log likelihood, 351–354
Logistic curve, 117, 117f, 139 Measurement theory, 5, 10


Logistic equation, 356–357 Measures, 61f
Logistic function, 366–367, 405 Measuring, 460–461
Logistic multiple discriminant analysis, 122–124, 123t, 124t Median, 33–34, 56, 424. See also Central tendency
Logistic regression, 107. See also Regression Memory, 62. See also Short-term memory; Working
definition, 139 memory
item response theory and, 366–371, 368f, 369f, 370f Methods, 129t
maximum likelihood estimation (MLE) and, 468 Metric, 461, 516
model fit in, 125 Metric multidimensional scaling (MDS), 156
multinomial logistic regression, 122–124, 123t, 124t Missing data, 160–162, 161t
overview, 117–122, 117f, 118f, 119f, 121f, 122t Mixed-facet generalizability theory, 262
Logits, 334 Mixture modeling, 344, 405
Long-run probability theory, 26 Modality, 32
Mode, 34, 56. See also Central tendency
Model fit, 121f, 123t, 125
M
Model summary, 100t
Marginal maximum likelihood estimation (MMLE), 363, Modern test theory, 329. See also Classical test theory
405, 475. See also Maximum likelihood estimation (CTT)
(MLE) Modification of instruction purpose of a test, 169t
Marginal probability, 387t–388t Moment, 480–481, 516
Matching format, 176t. See also Test items Morally Debatable Behavior Scale—Revised (MDBS-R),
Maximum a posteriori (MAP), 364 158
Maximum likelihood estimation (MLE) Multicategory data, 335t
Bayesian methods and, 475 Multicomponent response, 335t
definition, 405 Multidimensional (compensatory) model, 335t
item response theory and, 354–355, 358, 363 Multidimensional (noncompensatory) model, 335t
overview, 118, 467–469, 470f Multidimensional map, 142, 163
Maximum likelihood method, 139 Multidimensional scaling (MDS), 289–290, 404
Mean. See also Central tendency Multinomial logistic regression, 122–124, 123t, 124t
age- and grade-equivalent scores, 424 Multinomial regression, 122, 140
definition, 56 Multiple correlation. See also Correlation
discrete variables and, 476–477 coefficient of multiple determination and, 80–83, 82t, 83t
estimating criterion validity and, 79t definition, 103
overview, 33 overview, 90, 90t, 510–511
planning a norming study and, 410 Multiple discriminant analysis (MDA)
Mean of ratings, 284t definition, 140
Mean of the squared deviations, 36 multinomial logistic regression, 122–124, 123t, 124t
Mean squared deviation, 481, 516 overview, 107, 116, 116f
Measurable space, 457–458, 516 Multiple independent random variables, 486–488
Measurement. See also Psychometrics Multiple linear regression, 103
behavior and, 5–7, 7t, 8f Multiple linear regression (MLR)
definition, 5, 10, 56 assumptions of, 86t
facets of, 259–260 logistic regression and, 117–122, 117f, 118f, 121f, 122t
factor analysis and, 289–290 overview, 82, 82t, 84, 85f, 108
goals of, 451–452, 452f Multiple predictors, 70–77, 73t, 75f, 77–80, 79t
history of, 143 Multiple raters. See also Raters
levels of, 20–22 single-facet crossed design and, 274–278, 275t, 276t
normal distribution and, 41 single-facet design with multiple raters rating on two
origins of psychometrics and, 4 occasions, 280, 281t
overview, 2, 9–10, 13–14, 14f, 55, 514 single-facet nested design with multiple raters,
properties of, 14–20, 15t, 16f, 17f, 19f 279–280
research studies and, 7, 9 Multiple regression equation, 87–88, 94. See also
variables and their application and, 456–458 Regression equation
Measurement model, 320, 326 Multiple true–false format, 176t. See also Test items
Measurement observations, 24. See also Observations Multiple-choice format, 175–176, 176t, 185–186. See
Measurement precision, 217, 254, 257, 257–258, 287 also Test items
Multiple-group discriminant analysis, 114–116, 115t, 116f. See also Discriminant analysis
Multiplication theorem of probability, 461–462, 516
Multitrait–multimethod (MTMM) studies
  construct validity and, 127
  definition, 140
  overview, 134–135, 134t, 135t
Multivariate analysis of variance (MANOVA), 107, 140
Multivariate normality, 107
Multivariate relationships, 488–491, 490t

N

National Council on Measurement in Education (NCME), 59
Nedelsky method, 195–196
Nested designs, 260, 279–280, 287
Nominal scale. See also Measurement; Scaling
  definition, 56
  item response theory and, 404
  overview, 14–17, 15t, 16f, 17f, 21
Nonequivalent anchor test (NEAT) design, 427, 448
Nonlinear regression, 492–493, 492f, 493f
Nonmetric measurement, 153, 156, 163
Nonparametric model, 335t, 341
Nonprobability sampling, 174, 200. See also Sampling
Normal distribution, 39–42, 41f, 56, 148–150. See also Score distributions; Standard normal distribution
Normality of errors, 494–495
Normalized scale scores, 418–421, 420t, 421f, 422–423, 422f
Normalized standard scores
  definition, 448
  overview, 418–421, 420t, 421f, 422f
Normative population, 180–181, 200
Normative sample, 408, 448
Normative scores, 410, 415–416, 417t
Norming, 408, 408–410, 449. See also Norms
Norm-referenced test. See also Criterion-referenced test; Norms
  definition, 10, 200, 449
  overview, 3, 408
  standard-setting approaches and, 194
  test development and, 169t
Norms. See also Norming; Norm-referenced test
  definition, 449
  normalized standard or scale scores, 418–421, 420t, 421f, 422f
  overview, 1–2, 10, 407–408
  planning a norming study, 408–410
  test development and, 180–181
Numbers, 14–17, 15t, 16f, 17f

O

Object of measurement, 262, 288
Objectivity, 366, 405, 452–454, 453f, 516
Oblique rotational matrix. See also Rotational method
  definition, 326
  factor analysis and, 293f, 324
  overview, 302–306, 304f, 305t, 307t
Observations, 13–14, 14f, 24. See also Measurement observations
Observed score
  overview, 445–447, 447f
  true score model and, 209–210, 210t, 211f, 219–221, 220t
Observer (rater) facet, 282, 282t
Obtained score units, 248
Occasion facet, 262, 288
Odds ratio, 120, 122t, 140
One-facet design, 266–271, 268t, 269f, 270t
One-factor models, 344
One-parameter logistic IRT model
  for dichotomous item responses, 374–381, 376f, 378f, 380t–381t, 381f
  model comparison approach and, 400–403, 403t
  test score equating and, 439–440
Open-ended questions, 173
Order, 150–151
Ordered categorical scaling methods, 158. See also Scaling models
Ordinal, 56
Ordinal scale. See also Measurement; Scaling
  compared to interval levels of measurement, 19–20, 19f
  definition, 57
  overview, 14–17, 15t, 16f, 17f, 21, 146, 150–151
  subject-centered scaling and, 160
  Thurstone’s law of comparative judgment and, 148–150
  unfolding technique and, 153
Orthogonal rotational matrix. See also Rotational method
  definition, 326
  factor analysis and, 293f, 324
  overview, 302–306, 303f, 306t, 307t

P

Paired comparisons, 150–151, 151t, 152t, 163
Parallel forms method, 229
Parallel test, 214, 216–219, 254
Parameter, 33
Parameter estimates, 57, 124t, 394t–396t
Parametric factor-analytic methods, 341
Parametric statistical inference, 471
Partial correlation. See also Correlation; First-order partial correlation
  correction for attenuation and, 76–77
  estimating criterion validity and, 70–80, 73t, 75f, 79t, 83t
  overview, 511–512, 513t
Partial regression slopes, 90–92
Partially nested facet, 262, 288
Partitioning sums of squares, 54, 54t
Pattern matrix, 305, 326
Pearson correlation coefficient. See also Correlation; Correlation coefficients
  biserial correlation and, 188
  definition, 57
  estimating criterion validity and, 72, 73t
  overview, 43, 45f, 491–493, 492f, 493f, 499t, 505, 507–509, 508t, 513t
  semipartial correlation, 73–74, 75f
Pearson correlation matrix, 490t
Pearson product–moment coefficient of correlation, 488–489, 516
Percentile ranks
  definition, 449
  normalized standard or scale scores, 420t
  overview, 415–416, 416, 417t, 418, 418f
  test score equating and, 436–439, 438f, 439f
Percentiles, 36–37, 57, 415–416, 417t
Person effect, 153, 257–258, 271–273
Person response profiles, 163
Personality tests, 2t, 178–179. See also Psychological test
Phi coefficient
  definition, 200
  factor analysis and, 324
  overview, 499–503, 500t
  test development and, 189, 189f
Philosophical foundation of a test or instrument, 166–168
Pictorial item set format, 176t. See also Test items
Pilot test, 179–180
Placement purpose of a test, 169t
Point–biserial correlation, 186–187, 188t, 200, 496–497, 499t
Polychoric r, 504–505, 505t, 507–509, 508t
Polygons, 28–29. See also Frequency polygon; Relative frequency polygon
Polyserial r, 504–505, 505t
Polytomous data, 335t
Population standard deviation, 35–36, 36t
Positively skewed distribution, 57
Posterior distribution, 475, 516
Precision, 452–454, 453f, 516
Prediction, 93–94, 250–251
Prediction equation, 87, 88, 96–98, 97t
Prediction equation (linear), 103
Predictive accuracy, 94–101, 95f, 96t, 97t, 100t, 114
Predictive discriminant analysis (PDA), 107, 140
Predictive efficiency, 110, 140
Predictive validity, 113–114, 114t
Predictor
  criterion validity, 66
  higher-order partial correlations and, 77–80, 79t
  logistic regression and, 122t, 123t, 125
  multiple linear regression and, 84, 84f, 85f
  partial correlation and, 70–77, 73t, 75f
  regression equation and, 88
Predictor subset selection, 101–102
Preference, 153, 163
Principal axis factor (PAF), 297–298, 337, 338f, 405
Principal components analysis (PCA). See also Components
  compared to factor analysis, 315–318, 316f, 317t, 318t
  components, eigenvalues, and eigenvectors, 312, 314, 314t
  factor analysis and, 289–290
Probability, 461–467, 464t, 467f
Probability distribution function, 463, 516
Probability function, 463, 517
Probability sampling, 174, 200. See also Sampling
Probability spacing, 457–458, 517
Probability theory, 207–208, 207t
Problem-solving item set format, 176t. See also Test items
Product–moment correlation coefficient, 491, 517
Program value purpose of a test, 169t
Progress purpose of a test, 169t
Property of invariance. See Invariance property
Proportion of variance for the person effect, 271–273
Proportionality, 472–473, 517
Proportionally stratified sampling, 174, 200–201. See also Sampling
Pseudo R-square, 123t
Psychological objects, 142, 163
Psychological scaling, 144–145, 145t, 163. See also Scaling
Psychological test, 1–2, 2t, 3–4, 9–10
Psychometricians, 5, 11
Psychometrics. See also Measurement
  Bayesian methods and, 475
  definition, 11, 163
  factor analysis and, 131
  goals of, 451–452, 452f
  history of, 143–144, 143f, 144f
  normal distribution and, 41
  origins of, 4
  overview, 3–4, 9–10, 13, 55, 143, 143f, 144f, 514
  research studies and, 7, 9
  statistical foundations for, 22–23
  taxonomy of, 452, 452f
Psychometry, 4
Psychophysical scaling, 144–145, 145t, 147–150, 164. See also Scaling
Psychophysics, 147, 164
P-type functional analysis, 292, 293f, 326
Purpose
  construct validity and, 129t
  test development and, 168–172, 169t, 170t, 171t, 172t
Purposeful sampling, 409. See also Sampling

Q

Q-type functional analysis, 292, 293f, 326
Qualitative variables, 23
Quantitative variables, 5, 23. See also Measurement
R

Random error
  definition, 254
  overview, 204, 257, 257–258, 288
  reliability and, 205t
  true score model and, 209–210
Random facets of measurement, 260, 288
Random sample, 57
Random variables. See also Variance
  definition, 517
  elements of, 461–467, 464t, 467f
  overview, 457–458
  reliability and, 207–208, 207t
Range, 65–66
Rank-ordering approach, 150–151, 495–496, 496t
Rasch measurement
  conceptual explanation of, 334, 336, 336f
  definition, 405
  item response theory and, 334, 337, 359
  overview, 366
Rasch model
  data layout, 373–374, 374f
  item information for, 373
  item response theory and, 355–357, 357f, 365, 366–373, 368f, 369f, 370f, 372t, 404
  model comparison approach and, 400–403, 403t
  one-parameter logistic IRT model and, 374–381, 376f, 378f, 380t–381t, 381f
  overview, 366
  properties and results of, 371–373, 372t
  test score equating and, 439–440
  three-parameter logistic IRT model and, 389–399, 393t–396t, 397f, 398f
  two-parameter logistic IRT model and, 381–389, 381f, 385t–386t, 387t–388t, 389f
Raters
  single-facet crossed design and, 274–278, 275t, 276t
  single-facet design with multiple raters rating on two occasions, 280, 281t
  single-facet design with the same raters on multiple occasions, 278–279
  single-facet nested design with multiple raters, 279–280
  two-facet designs and, 282
Rating scales, 404. See also Summated rating scales
Ratio scale, 14–17, 15t, 16f, 21–22, 57. See also Measurement; Scaling
Raw score scale, 411, 449. See also Scale scores
Raw scores, 417t, 420t
Real numbers, 14, 14f, 57
Reduced correlation matrix, 316, 326
Regression. See also Logistic regression; Regression analysis; Regression line
  estimating criterion validity and, 83t
  factor analysis and, 322
  overview, 42, 47–50, 50f
  partial regression slopes, 90–92
  partitioning sums of squares, 54, 54t
  Pearson r and, 491–493, 492f, 493f
  predictor subset selection in, 101–102
Regression analysis. See also Regression
  estimating criterion validity and, 85, 86t
  predictive accuracy of, 94–101, 95f, 96t, 97t, 100t
  predictor subset selection in, 101–102
Regression coefficients, 96–97, 97t, 100t
Regression equation
  estimating criterion validity and, 85, 86t
  overview, 84
  standardized regression equation, 93–94
  testing for significance, 87–90, 89t, 90t
  unstandardized multiple regression equation, 87, 88
Regression equation (linear), 103
Regression line. See also Regression
  estimating criterion validity and, 84, 84f, 85f
  least-squares criterion, 51–52
  overview, 49, 50f, 84, 85f
  true score model and, 211f
Relational structures, 290
Relative decisions, 260, 288
Relative frequency, 461, 517
Relative frequency polygon, 28–29, 28f. See also Frequency polygon
Relative terms, 19–20, 20f
Reliability
  analysis of variance and, 240, 241t
  coefficient alpha and, 233–236, 234t, 235t, 236t, 237t
  coefficient of, 228–229
  of a composite, 223–228, 224t, 227t
  of composite scores based on coefficient alpha, 238–239
  conceptual overview, 204–206, 205t
  correction for attenuation and, 76
  criterion validity and, 67
  definition, 254, 288, 517
  of difference scores, 241–243, 243t
  errors of measurement and, 244–249
  generalizability theory and, 260
  overview, 203–204, 221–223, 252–253, 257, 453
  probability theory and, 207–208, 207t
  random variables and, 207–208, 207t
  relationship between observed and true scores, 219–221, 220t
  single testing occasion, 230–234, 230t, 231t, 232t, 234t
  standard error of measurement and, 244–249
  standard error of prediction and, 250–251
  summarizing and reporting information and, 251–252
  true score model and, 206–208, 207t, 209–219, 210t, 211f, 215f
Reliability coefficient, 214, 221–223, 254, 273
Reliability indexes, 67, 221–223, 254
Reliability of the predictor, 66
Repeatability, 205–206, 452–453, 517
Reporting information, 251–252
Representational measurement, 153
Representative sample, 171, 179–180, 424. See also Sampling
Research, 7, 9, 129t
Residual. See Error of prediction (or residual)
Response-centered scaling method. See also Scaling models
  definition, 164
  overview, 145t, 146, 150, 162
  test development and, 165
Role validity, 61–62. See also Validity
Rotational method
  correlated factors and simple structure and, 306–308
  factor analysis and, 293f, 324
  overview, 301–306, 302f, 303f, 304f, 305t, 306t, 307t
R-type functional analysis, 291–292, 293f
Rules of correspondence, 454–455, 455f
Rules of measurement, 22
Rulon’s formula, 231–232, 255

S

Sample, 57, 408–409. See also Sampling
Sample size, 65, 363–364, 371
Sample standard deviation, 35–36
Sampling. See also Representative sample
  age- and grade-equivalent scores, 424
  Bayesian probability and, 471
  definition, 201
  factor analysis and, 324
  item response theory and, 333–334
  planning a norming study and, 408–409
  test development and, 173–174, 179–180
Sampling distribution, 41, 57
Sampling distribution of the mean, 57
Sampling error, 410
Sampling theory, 461, 517
Scalar, 161, 461
Scale, 461
Scale aligning, 426, 449
Scale indeterminacy, 371
Scale scores
  common standard score transformations or conversions, 422–423
  definition, 449
  overview, 410–411, 418–421, 420t, 421f, 422f
Scaling. See also Interval scale; Nominal scale; Ordinal scale; Ratio scale; Scaling models
  data organization and missing data and, 160–162, 161t
  definition, 57, 164
  history of, 142–144, 143f, 144f
  incomplete and missing data, 162
  item response theory and, 375, 439–443, 442f, 443t, 444t
  overview, 20–22, 141–142, 162, 410–411, 514
  psychophysical versus psychological scaling, 144–145, 145t
  response-centered scaling method, 150
  stimulus-centered scaling, 147–148
  test score equating and, 439–443, 442f, 443t, 444t
  Thurstone’s law of comparative judgment, 148–150
  two-parameter logistic IRT model and, 382
Scaling models. See also Scaling
  definition, 164
  Guttman scaling model, 151–153
  importance of, 145t, 146
  order and, 150–151, 151t, 152t
  overview, 142, 162
  subject-centered scaling, 156–160, 157f, 158f, 159f
  test development and, 165
  types of, 145t, 146–147
  unfolding technique, 153–156, 154t, 155f
Scatterplot, 43, 45f, 57, 211f
Score distributions. See also Normal distribution
  reliability and, 204–206, 205t
  shape, central tendency, and variability of, 31–42, 32f, 36t, 37t, 39t, 41f
Score interpretation, 331–332
Score reliability, 260, 288. See also Reliability
Score validity, 103, 167–168, 201. See also Validity
Scores. See also Scoring
  under linear transformation, 411–415, 413t, 415t
  overview, 23–30, 23f, 24t, 25t, 27f, 28f
Scoring
  age- and grade-equivalent scores, 424–425
  common standard score transformations or conversions, 422–423
  linear methods, 428–429
  normalized standard or scale scores, 418–421, 420t, 421f, 422f
  observed score, true score, and ability, 445–447, 447f
  overview, 447–448
  percentile rank scale and, 415–416, 417t
  test development and, 177, 180–181
  test score linking and equating, 425–428, 426f, 428f, 429t
  true score equating, 443, 445
Scree plot, 337, 338f, 405
Second moment, 480–481, 517
Selection, 105–106, 106f, 169t
Selection ratio, 110, 113, 140
Semantic differential item, 178f. See also Test items
Semipartial correlation, 73–74, 75f, 512–514, 513t. See also Correlation
Sensory threshold, 147, 164
Shape
  age- and grade-equivalent scores, 424
  kurtosis and, 485
  normal distribution and, 39–41, 41f
  psychometrics and, 143f, 144f
  of score distributions, 31–32, 32f
Short-term memory. See also GfGc theory; Intellectual constructs
  factor analysis and, 292, 294, 294t, 295f
  generalizability theory and, 266t
  overview, 6–7, 7t, 8f, 455–456, 456t
  reliability and, 204
  rules of correspondence and, 454–455, 455f
  subject-centered scaling and, 156–160, 157f, 158f, 159f
  subtests in the GfGc dataset, 23t
  test development and, 166–167, 168–172, 168f, 169t, 170t, 171t, 172t, 191–192, 191t
  validity continuum and, 62
Sigma notation, 29–31, 57. See also Summation
Significance, 87–90, 89t, 90t, 92
Simple linear regression, 47, 57. See also Linear regression; Regression
Simple structure
  correlated factors and simple structure and, 306–308
  definition, 326
  factor analysis and, 306–308
  overview, 301–302
Single random variable, 486–487
Single-facet crossed design, 274–278, 275t, 276t
Single-facet design, 278–280, 281t
Single-facet person, 266–271, 268t, 269f, 270t
Skewness, 410, 481, 485–486, 517
Slope of a line, 47–48, 57, 90–92
Slope–intercept equation, 376–377, 378f
Smoothing techniques, 424
Spearman–Brown formula, 255
Spearman’s rank order correlation coefficient, 495–496, 496t
Specific objectivity, 371, 405
Specific variance, 310–312, 326
Split-half method, 204
Split-half reliability, 226, 253, 255
Square root of the reliability, 249
Squared multiple correlation, 76, 103
Stability of scores, 228–229. See also Reliability
Standard deviation
  definition, 517
  estimating criterion validity and, 79t
  overview, 34, 481–482
  variance and, 35–36
Standard error, 92, 99–100, 387t–388t, 433, 509
Standard error of ability, 358–362, 360t–361t, 361f
Standard error of equating, 433–435
Standard error of estimation, 244
Standard error of measurement (SEM), 244–249, 255, 263, 281, 288
Standard error of prediction, 244, 250–251, 255
Standard error of the estimate (SEE)
  definition, 57, 104, 255
  overview, 52–53, 53t
  regression analysis and, 94–95, 95f
Standard error of the mean, 410
Standard normal distribution, 42, 57, 143f, 144f. See also Normal distribution
Standard score
  definition, 449
  under linear transformation, 411–415, 413t, 415t
  overview, 408
Standard score conversion tables, 410
Standard setting, 193–194, 194, 201
Standardized regression equation, 93–94. See also Regression equation
Standardized regression slope, 104
Standardized regression slopes, 93
Standardized regression weights, 305
Standards for Educational and Psychological Testing, 60
Standards-referenced method, 194, 201
Statistic
  definition, 57
  generalizability theory and, 261f
  notation and operations overview, 459–460
  overview, 33, 55, 514
  planning a norming study and, 409–410
  reliability and, 231t
  subject-centered scaling and, 160
Statistical control, 70–71, 104
Statistical estimation, 66–68, 475, 517
Statistical foundations, 22–23
Statistical inference, 41
Statistical model, 263–265, 265t, 266t
Statistical power, 76
Stepwise selection, 125
Stimulus intensity, 143
Stimulus-centered scaling method. See also Scaling models
  definition, 164
  overview, 145t, 146, 147–148, 162
  test development and, 165
  Thurstone’s law of comparative judgment and, 149–150
Stratified random sampling, 174, 201. See also Sampling
Strong true score theory, 332–333, 406
Structural equation modeling (SEM). See also Covariance structural modeling
  confirmatory factor analysis and, 319–322, 320f, 321f, 322f, 323f
  definition, 58, 140, 327
  factor analysis and, 133, 289–290, 325
  overview, 46
Structural model, 320, 327
Structural stage, 129t
Subject-centered scaling method. See also Scaling models
  definition, 164
  overview, 145t, 146, 156–160, 157f, 158f, 159f, 162
  test development and, 165
Subjectivity, 126
Subject-matter experts (SMEs), 195–196, 198–199, 201
Substantive stage, 129t
Success ratio, 113, 140
Sum of squares
  analysis of variance and, 96
  definition, 140, 517
  overview, 80–83, 82t, 481
Sum of squares regression, 98, 104
Sum of squares total, 98, 104
Summarizing information, 251–252, 460–461
Summated rating scales, 158, 160, 178f. See also Scaling models; Test items
Summation, 29–31. See also Sigma notation
Sum-of-squares and cross-products matrices, 107
Symmetric distribution, 58
Systematic variance, 71, 104

T

Table of specifications, 170, 201
Tau-equivalence, 219, 220t, 255
t-distribution, 90, 104
Technical manual, 181, 410
Test administration procedures, 179
Test characteristic curve (TCC) method, 445
Test development
  Angoff method, 196–197, 197t
  biserial correlation and, 188–189
  bookmark method, 198–199
  construct validity and, 132
  Ebel method, 196
  factor analysis and, 289–290
  guidelines for, 166–181, 167f, 168f, 169t, 170t, 171t, 172t, 176t, 178f
  item analysis and, 182, 183t, 184t
  item difficulty, 182, 183t, 184t
  item discrimination, 184–186, 185t, 186t
  item reliability and validity, 190–192, 191t, 192f
  item response theory and, 331–332
  Nedelsky method, 195–196
  overview, 165–166, 199
  phi coefficient and, 189, 189f
  planning a norming study and, 409–410
  point–biserial correlation and, 186–187, 188t
  standard setting, 193–194
  tetrachoric correlation, 190
Test equating, 10, 407–408, 447–448. See also Equating
Test form facet, 262, 288
Test information function, 406
Test interpretation, 127t
Test items. See also Item format; Item reliability index; Item validity index; Test development
  analysis of, 182, 183t, 184t
  content of, 174
  definition, 200
  difficulty of, 182, 183t, 184t
  discrimination of, 184–186, 185t, 186t
  Nedelsky method and, 195–196
  writing, 175–179, 176t, 178f
Test score equating
  item response theory and, 439–443, 442f, 443t, 444t
  linear methods, 428–429
  overview, 425–428, 426f, 428f, 429t
Test score linking, 425–428, 426f, 428f, 429t, 449
Test score scaling, 425–428, 426f, 428f, 429t
Test use, 127t
Testing documentation, 181, 410
Testlets, 364–365, 406
Test–retest method, 204, 228–229
Tetrachoric correlation
  definition, 201
  factor analysis and, 324
  matrix, 338
  overview, 190, 505–509, 508t
Theoretically continuous variables, 23–24. See also Continuous variable
Third moment, 481, 517
Three-parameter logistic IRT model
  for dichotomous item responses, 389–399, 393t–396t, 397f, 398f
  item information for, 397–399, 398f
  model comparison approach and, 400–403, 403t
  test score equating and, 439–440
Thurstone’s law of comparative judgment, 148–150, 164
Trait, 62–63, 104, 134, 140, 337
Transformations, 422–423, 425–428, 426f, 428f, 429t
True criterion score, 63, 104
True score
  definition, 255
  factor analysis and, 312
  item response theory and, 443, 445
  overview, 208, 445–447, 447f
True score model
  definition, 255
  equivalence, 219, 220t
  overview, 206–207
  properties and assumptions of, 209–219, 209f, 210t, 211f, 215f
  relationship between observed and true scores, 219–221, 220t
  reliability and, 207–208, 207t, 247–248
  standard error of measurement and, 245–246
True–false format, 176t. See also Test items
Two-facet designs, 281–284, 282t, 283t, 284t, 285t, 286t
Two-factor models, 344
–2 Log Likelihood statistic, 125
Two-parameter logistic IRT model
  for dichotomous item responses, 381–389, 384f, 385t–386t, 387t–388t, 389f
  item information for, 388–389, 389f
  model comparison approach and, 400–403, 403t
  test score equating and, 439–440
U

Unadjusted linear transformation, 411–412, 449. See also Linear transformation
Unbiased estimate, 482, 517
Unfolding technique, 153–156, 154t, 155f. See also Scaling models
Unidimensional model, 335t
Unidimensional scale, 142, 164
Unidimensional unfolding technique, 153–156, 154t, 155f, 164. See also Scaling models
Unidimensionality, 217, 331, 406
Unimodal distribution, 58
Unique factor, 309–312, 313f, 327
Units of measurement, 18–19. See also Measurement
Universe scores, 259–260, 262, 266, 288
Unobservable variables. See also Constructs
  covariance and, 42
  factor loadings and, 294, 296–301, 296t, 297t, 298t
  overview, 5–6
  units of measurement and, 18–19
Unobserved ability, 364
Unstandardized multiple regression equation, 87, 88. See also Regression equation
Unstandardized multiple regression equation (linear), 104

V

Valid negatives, 110, 113–114, 140
Valid positives, 110, 113–114, 140
Validation, 60, 104, 127, 129t
Validity. See also Construct validity; Content validity; Criterion validity
  classification and selection and, 105–106, 106f
  construct-related variance and, 206
  criterion validity, 63
  definition, 104, 255
  discriminant analysis and, 106–114, 110t, 111t, 112t, 113f, 114t
  high-quality criterion and, 63–66
  overview, 59–63, 61f, 102, 137, 141
  scaling and, 22
  test development and, 167–168, 190–192, 191t, 192f
  validity continuum and, 61f
Validity coefficient
  correction for attenuation and, 68–70
  definition, 104
  generalizability theory and, 136–137
  overview, 63
  reliability and, 67–68
Validity continuum, 61f
Values, 461
Variability
  definition, 58
  overview, 22, 23–30, 23f, 24t, 25t, 27f, 28f, 32, 34, 40f
  reliability and, 204–206, 205t
Variable
  definition, 11, 58, 517
  factor analysis and, 323, 323–324
  overview, 6, 23, 456–458
  research studies and, 9
  validity continuum and, 61f
Variance
  definition, 58, 517
  factor analysis and, 293f
  generalizability theory and, 261f, 265t
  normal distribution and, 41, 41f
  overview, 35–36, 36t
  planning a norming study and, 410
  reliability and, 223–225, 224t
  two-facet designs and, 285t, 286t
Variance component, 259, 288
Variance–covariance matrix, 133, 223–225, 224t, 316, 317f
Variance partition, 312, 313f
Variates, 107, 140
Variations, 481–484
Verbal intelligence, 67
Vertical equating, 427, 449
Vignette or scenario item set format, 176t. See also Test items

W

Wechsler Adult Intelligence Scale—Third Edition (WAIS-III), 67
Wechsler Adult Intelligence Scale—Fourth Edition (WAIS-IV), 1
Wechsler Memory Scale—Third Edition (WMS-III), 132
Working memory, 62

Y

Yates’s correction for continuity, 502–503, 517

Z

Z-distribution, 58
Zero-order correlation, 72, 104
z-score
  common standard score transformations or conversions, 422–423
  definition, 58
  normalized standard or scale scores, 418–421, 420t, 421f, 422f
  overview, 37–38, 37t, 40t
About the Author

Larry R. Price, PhD, is Professor of Psychometrics and Statistics at Texas State University, where he is also Director of the Initiative for Interdisciplinary Research Design and Analysis. This universitywide role involves conceptualizing and writing the analytic segments of large-scale competitive grant proposals in collaboration with interdisciplinary research teams. Previously, he served as a psychometrician and statistician at the Emory University School of Medicine (Department of Psychiatry and Behavioral Sciences and the Department of Psychology) and at The Psychological Corporation (now part of Pearson’s Clinical Assessment Group). Dr. Price is a Fellow of the American Psychological Association, Division 5 (Evaluation, Measurement, and Statistics), and an Accredited Professional Statistician of the American Statistical Association.
