The Development of Psychometrics

R.D. Buchanan and S.J. Finch

Abstract

Psychometrics developed as a means for measuring psychological abilities and attributes, usually
via the standardised psychological test. Its emergence as a specialised area of psychology
combined entrenched social policy assumptions, existing educational practices, contemporary
evolutionary thinking and novel statistical techniques. Psychometrics came to offer a pragmatic,
scientific approach that both fulfilled and encouraged the need to rank, classify and select people.

Keywords: true score, error, correlation, norm, reliability, validity

Introduction

Psychometrics can be described as the science of measuring psychological abilities, attributes and characteristics. Not surprisingly, such a ubiquitous and hybridised set of techniques has many proto-scientific and professional antecedents, some dating back to antiquity. Modern psychometrics is embodied by standardised psychological tests. The American psychometrician Lee Cronbach famously remarked in the 1960s that “the general mental test ... stands today as the most important single contribution of psychology to the practical guidance of human affairs.” (Sokal, p. 113) However, psychometrics has come to mean more than just the tests themselves; it also encompasses the mathematical, statistical and professional protocols that underpin tests – how tests are constructed and used, and indeed, how they are evaluated.

Early Precedents

Historians have noted examples of ‘mental testing’ in ancient China and other non-Western civilisations, where forms of proficiency assessment were used to grade or place personnel. The most obvious template for psychometric assessment, with a more direct lineage to modern scientific methods, was the university and school examination. Universities in Europe first started giving formal oral assessments to students in the 13th century. With the spread of paper, the Jesuits introduced written examinations during the 16th century. In England, competitive university examinations began in Oxbridge institutions in the early 1800s. (Rogers) By the end of the 19th century, compulsory forms of education had spread throughout much of the Western world. Greater social mobility and vocational streaming set the scene for practical forms of assessment, as governments, schools, and bureaucracies of industrialised nations began to replace their reliance on personal judgement with a trust in the impartial authority of numbers. (Porter)

Enter Darwin

Darwinian thought was a key example of the challenge of scientific materialism in the
19th century. If humans were a part of nature, then they were subject to natural law. The
notion of continuous variation was central to the new evolutionary thought. Coupled with
an emerging notion of personhood as a relatively stable, skin-bound entity standing apart
from professional function and social worth, Darwin’s ideas paved the way for
measurement-based psychology. Late in the 19th century, Darwin’s cousin Francis Galton
articulated key ideas for modern psychometrics, particularly the focus on human
variation. The distribution of many physical attributes (e.g., height) had already been
shown by Quetelet to approximate the Gaussian curve. Galton suggested that many
psychological characteristics would show similar distributional properties. As early as
1816, Bessel had described “personal equations” of systematic individual differences in
astronomical observations. In contrast, some of the early psychologists of the modern era chose to ignore these types of differences. For instance, Wundt focussed on common or fundamental mechanisms by studying a small number of subjects in depth. Galton shifted psychologists’ attention to investigations of how individuals differed and by how much. (Rogers, Gillham)
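
The Gaussian (or normal) curve to which Quetelet fitted such attributes can be written, in modern notation rather than in Quetelet’s or Galton’s own terms, as

\[ f(x) = \frac{1}{\sigma\sqrt{2\pi}} \exp\!\left( -\frac{(x-\mu)^2}{2\sigma^2} \right) \]

where \( \mu \) is the mean and \( \sigma \) the standard deviation of the attribute in question.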

Mental Testing Pioneers

Galton’s work was motivated by his obsession with eugenics. Widespread interest in the riddles of heritability provided considerable impetus to the development of psychometric testing. If many psychological properties were at least partly innate and inherited then, arguably, it was even more important and useful to measure them. Galton was especially interested in intellectual functioning. By the mid-1880s he had developed a diverse range of what today seem like primitive measures: tests of physical strength and swiftness, visual acuity and memory of forms. Galton was interested in how these measures related to each other, whether scores taken at an early age might predict later scientific or professional eminence, and whether eminence passed from one generation to the next. These were questions of agreement, and the agreement was never going to be perfect. Galton therefore needed an index to calibrate the probabilistic rather than deterministic relationship between two variables. Using scatter plots, he noticed how scores on one variable were useful for predicting scores on another, and developed a measure of the “co-relation” of two sets of scores. (Gillham) His colleague, the biometric statistician Karl Pearson, formalised and extended this work. Using the new terms “normal curve” and “standard deviation” from the mean, Pearson developed what would become the statistical building-blocks for modern psychometrics (e.g., the product-moment correlation, multiple correlation, biserial correlation).
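
In modern notation (a textbook rendering rather than Pearson’s original formulation), the product-moment correlation between paired scores \( x_i \) and \( y_i \) is

\[ r_{xy} = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum_{i=1}^{n} (x_i - \bar{x})^2 \sum_{i=1}^{n} (y_i - \bar{y})^2}} \]

an index that ranges from -1 to +1 and calibrates exactly the kind of probabilistic, imperfect agreement Galton was after.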

By the turn of the century, James Cattell and a number of American psychologists had
developed a more elaborate set of anthropometric measures, including tests of reaction
time and sensory acuity. Cattell was reluctant to measure higher mental processes,
arguing these were a result of more basic faculties that could be measured more precisely.
However, Cattell’s tests did not show consistent relationships with the outcomes they were expected to predict, such as school grades and later professional achievements. Pearson’s colleague and rival Charles Spearman argued that this may have been due to the inherent unreliability of the various measures Cattell and others used. Spearman reasoned that any test would inevitably contain measurement error, and any correlation with other equally error-prone tests would underestimate the true correlation. According to Spearman, one way of estimating the measurement error of a particular test was to correlate the results of successive administrations. Spearman provided a calculation that corrected for this “attenuation” due to “accidental error,” as did Brown independently, and both gave proofs they attributed to Yule. Calibrating measurement error in this way proved foundational. Spearman’s expression of the correlation of two composite measures in terms of their variance and covariance became known later as the “index of reliability.” (Levy)
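
In modern notation (again a standard textbook rendering rather than Spearman’s own symbols), the correction for attenuation estimates the correlation between the error-free quantities from the observed correlation and the reliabilities of the two measures:

\[ \hat{r}_{\text{true}} = \frac{r_{xy}}{\sqrt{r_{xx}\, r_{yy}}} \]

where \( r_{xy} \) is the observed correlation and \( r_{xx} \) and \( r_{yy} \) are the reliabilities of the two tests, estimated, for example, by correlating successive administrations.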

Practical Measures

The first mental testers lacked effective means for assessing the qualities they were
interested in. In France in 1905, Binet introduced a scale that provided a different kind of
measurement. Binet did not attempt to characterise intellectual processes; instead he
assumed that performance on a uniform set of tasks would constitute a basis for a
meaningful ranking of school children’s ability. To do this, Binet thought it necessary to sample complex mental functions, since these most resembled the tasks faced at school and provided for a maximum spread of scores. (Rose)

Binet did not interpret his scale as a measure of innate intelligence; he insisted it was only a
screening device for children with special needs. However, Goddard and many other
American psychologists thought Binet’s test reflected a general factor in intellectual
functioning and also assumed this was largely hereditary. Terman revised the Binet just prior to World War I, paying attention to relevant cultural content and documenting the score profiles of various American age groups of children. But Terman’s revision (called the Stanford-Binet) remained an age-referenced scale, with sets of problems or “items” grouped according to age-appropriate difficulty, yielding an intelligence quotient (IQ) score defined as the ratio of mental age to chronological age.
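
As a worked illustration of the ratio IQ (the figures here are hypothetical), a child with a mental age of twelve and a chronological age of ten scores

\[ \text{IQ} = \frac{\text{mental age}}{\text{chronological age}} \times 100 = \frac{12}{10} \times 100 = 120. \]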

Widespread use of Binet-style tests in the US army during World War I helped streamline
the testing process and standardise its procedures. It was the first large-scale deployment of group testing and multiple-choice response formats with standardised tests. (Sokal,
Danziger)

Branching Out

In the 1920s, criticism of the interpretation of the Army test data – that the average mental age of soldiers, a large sample of the US population, was “below average” – drew attention to the problem of appropriate “normative” samples that gave meaning to test scores. The innovations of the subsequent Wechsler intelligence scales – with test results compared to a representative sample of adult scores – could be seen as a response to the limitations of age-referenced Binet tests. The inter-war period also saw the gradual
emergence of the concept of “validity,” that is, whether the test measured what it was
supposed to. Proponents of Binet-style tests wriggled out of the validity question with a
tautology: intelligence was what intelligence tests measured. However, this stance was
developed more formally as operationism, a stop-gap or creative solution (depending on
your point of view) to the problem of quantitative ontology. In the mid-1930s, Stevens
argued that the theoretical meaning of a psychological concept could be defined by the
operations used to measure it, which usually involved the systematic assignment of
number to quality. For many psychologists, the operations necessary to transform a
concept into something measurable were taken as producing the concept itself. (Rogers,
Michell, Wright)

The practical success of intelligence scales allowed psychologists to extend operationism to various interest, attitude and personality measures. While pencil-and-paper questionnaires dated back to at least Galton’s time, this new branch of testing, which appeared after World War I, took up the standardisation and group comparison techniques of intelligence scales. Psychologists took to measuring what were assumed to be dispositional properties that differed from individual to individual not so much in quality as in amount. New tests of personal characteristics contained short question items sampling seemingly relevant content. These usually had fixed response formats, with response scores combined to form additive, linear scales. Scale totals were interpreted as a quantitative index of the concept being measured, calibrated through comparisons with the distribution of scores of normative groups. Unlike intelligence scales, responses to interest, attitude or personality inventory items were not thought of as unambiguously right or wrong – although different response options usually reflected an underlying psychosocial ordering. Ambiguous item content and poor relationships with other measures saw the first generation of personality and interest tests replaced by instruments whose definitions of what was to be measured were largely determined by reference to external criteria. For example, items on the Minnesota Multiphasic Personality Inventory were selected by contrasting the responses of normal and psychiatric subject groups. (Buchanan)

Grafting on Theoretical Respectability

In the post World War II era, psychologists subtly modified their operationist approach to
measurement. Existing approaches were extended and given theoretical rationalisations.
The factor analytic techniques that Spearman, Thurstone and others had developed and
refined became a mathematical means to derive latent concepts from more directly
measured variables. (Lovie & Lovie, Bartholomew) They also played a role in guaranteeing both the validity and reliability of tests, especially in the construction phase. Items could be selected that apparently measured the same underlying variable. Several key personality and attitude scales, such as R.B. Cattell’s 16PF and Eysenck’s personality
questionnaires, were developed primarily using factor analysis. Thurstone used factor
analysis to question the unitary concept of intelligence. New forms of item analyses and
scaling (e.g., indices of item difficulty, discrimination and consistency) also served to
guide the construction of reliable and valid tests.
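
In its simplest modern form (a textbook rendering rather than Spearman’s or Thurstone’s own notation), the common factor model expresses each observed variable as a weighted combination of latent factors plus a unique term:

\[ x_j = \lambda_{j1} f_1 + \lambda_{j2} f_2 + \cdots + \lambda_{jm} f_m + e_j \]

where the \( f_k \) are the latent factors, the loadings \( \lambda_{jk} \) indicate how strongly each item reflects each factor, and \( e_j \) is the item’s unique component; items with high loadings on the same factor are the ones retained as measuring the same underlying variable.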

In the mid-1950s, the American Psychological Association stepped in to upgrade all aspects of testing, spelling out the empirical requirements of a good test, as well as extending publishing and distribution regulations. It also introduced the concept of “construct validity,” defined as the sum of the test’s conceptual integrity as borne out by its theoretically expected relationships with other measures. Stung by damaging social critiques of cultural or social bias in the 1960s, testers further revived the importance of theory in a historically pragmatic field. Representative content coverage and relevant, appropriate predictive criteria became keystones of fair and valid tests. (Buchanan, Rogers)

The implications of Spearman’s foundational work were finally formalised by Gulliksen in 1950, who spelt out the assumptions the classical “true score model” required. The true score model was given a probabilistic interpretation by Lord and Novick in 1968. (Traub)
More recently, psychometricians have extended item level analyses to formulate
generalised response models. Proponents of item response theory claim it enables the
estimation of latent aptitudes or attributes free from the constraints imposed by particular
populations and item sets. (Bock, Embretson)
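
In modern textbook notation (not a quotation of these sources), the classical true score model and a typical item response model can be summarised as

\[ X = T + E, \qquad \rho_{XX'} = \frac{\sigma_T^2}{\sigma_X^2}, \qquad P(X_{ij} = 1 \mid \theta_i) = \frac{1}{1 + e^{-a_j(\theta_i - b_j)}} \]

where the observed score \( X \) is a true score \( T \) plus error \( E \), reliability \( \rho_{XX'} \) is the proportion of observed-score variance due to true scores, and the two-parameter logistic item response function gives the probability of a correct answer to item \( j \) in terms of the respondent’s latent trait \( \theta_i \) and the item’s discrimination \( a_j \) and difficulty \( b_j \).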

References

Bartholomew, D.J. (1995). Spearman and the Origin and Development of Factor Analysis. British Journal of Mathematical and Statistical Psychology, 48, 211-220.

Bock, R.D. (1997). A Brief History of Item Response Theory. Educational Measurement: Issues and Practice, 16, 21-33.

Buchanan, R.D. (1994). The Development of the MMPI. Journal of the History of the Behavioral Sciences, 30, 148-161.

Buchanan, R.D. (1997). Ink Blots or Profile Plots: The Rorschach versus the MMPI as the Right Tool for a Science-based Profession. Science, Technology and Human Values, 21, 168-206.

Buchanan, R.D. (2002). On not ‘Giving Psychology Away’: The MMPI and Public Controversy over Testing in the 1960s. History of Psychology, 5, 284-309.

Danziger, K. (1990). Constructing the Subject: Historical Origins of Psychological
Research, Cambridge University Press, Cambridge.

Embretson, S.E. (1996). The New Rules of Measurement. Psychological Assessment, 8, 341-349.

Gillham, N.W. (2001). A Life of Sir Francis Galton: From African Exploration to the
Birth of Eugenics. Oxford University Press, New York.

Levy, P. (1995). Charles Spearman’s Contributions to Test Theory. British Journal of Mathematical and Statistical Psychology, 48, 221-235.

Lovie, A.D. & Lovie, P. (1993). Charles Spearman, Cyril Burt, and the Origins of Factor Analysis. Journal of the History of the Behavioral Sciences, 29, 308-321.

Michell, J. (1999). Measurement in Psychology: Critical History of a Methodological Concept, Cambridge University Press, Cambridge.

Porter, T.M. (1995). Trust in Numbers: The Pursuit of Objectivity in Science and Public
Life, Princeton University Press, Princeton, NJ.

Rose, N. (1979). The Psychological Complex: Mental Measurement and Social Administration. Ideology and Consciousness, 5, 5-68.

Rogers, T.B. (1995). The Psychological Testing Enterprise: An Introduction, Brooks/Cole, Pacific Grove, California.

Sokal, M.M. (1987). Psychological Testing and American Society, 1890-1930, Rutgers
University Press, New Brunswick, NJ.

Traub, R.E. (1997). Classical Test Theory in Historical Perspective. Educational
Measurement: Issues and Practice, 16, 8-14.

Wright, B.D. (1997). A History of Social Science Measurement. Educational Measurement: Issues and Practice, 16, 33-52.
