III - Essentials of Test Score Interpretation


ESSENTIALS OF TEST SCORE
INTERPRETATION
BY: SUSANA URBINA
RAW SCORES
A raw score is a number (X) that summarizes or captures
some aspect of a person's performance in the carefully
selected and observed behavior samples that make up
psychological tests.
FRAMES OF REFERENCE FOR TEST-SCORE
INTERPRETATION

1. Norms. Norm-referenced test interpretation uses standards
based on the performance of specific groups of people to
provide information for interpreting scores.
The term norms refers to the test performance or typical
behavior of one or more reference groups.
Norms are usually presented in the form of tables with
descriptive statistics, such as means, standard deviations,
and frequency distributions, that summarize the
performance of the group or groups in question.
When norms are collected from the test performance of
groups of people, these reference groups are labeled
normative or standardization samples.
2. Performance criteria
When the relationship between the items or tasks of a test
and standards of performance is demonstrable and well
defined, test scores may be evaluated via criterion-
referenced interpretation.
This type of interpretation makes use of procedures, such as
sampling from content domains or work-related behaviors,
designed to assess whether and to what extent the desired
levels of mastery or performance criteria have been met.
NORM-REFERENCED TEST
INTERPRETATION
Developmental Norms
Ordinal Scales Based on Behavioral Sequences
Human development is characterized by sequential processes in a
number of behavioral realms.
Theory-Based Ordinal Scales
Ordinal scales may be based on factors other than chronological age.
Such scales are more or less useful depending on whether the theories on which
they are based are sound and applicable to a given segment of a
population or to the population as a whole.
Mental Age Scores
The mental age scores derived from the Binet-Simon scales were computed on
the basis of the child's performance, which earned credits in terms of
years and months, depending on the number of chronologically
arranged tests that were passed. Such age-based norms are subject to change over
time, as well as across cultures and subcultures.
Grade Equivalent Scores
The sequential progression and relative uniformity of school
curricula, especially in the elementary grades, provide
additional bases for interpreting scores in terms of
developmental norms.
Grade-based norms, or grade equivalent score scales, also reflect
the average performance of certain groups of students in
specific grades, at a given time and place. They too are
subject to variation over time, as well as across curricula in
different schools, school districts, and nations.
Within-Group Norms
These norms essentially provide a way of evaluating a
person's performance in comparison to the performance of
one or more appropriate reference groups.

The Normative Sample


The foremost requirement of such samples is that they
should be representative of the kinds of individuals for whom
the tests are intended.
The term normative sample is often used synonymously with
standardization sample, but it can refer to any group from
which norms are gathered.
The standardization sample is the group of individuals on
whom the test is originally standardized in terms of
administration and scoring procedures, as well as in
developing the test's norms.
Data for this group are usually presented in the manual that
accompanies a test upon publication.

Reference group, in contrast, is a term that is used more
loosely to identify any group of people against which test
scores are compared.
Subgroup norms. When large samples are gathered to represent
broadly defined populations, norms can be reported in the
aggregate or can be separated into subgroup norms.
Local norms. There are some situations in which test users
may wish to evaluate scores on the basis of reference groups drawn
from a specific geographic or institutional setting.
Convenience norms. For reasons of expediency or financial
constraints, test developers sometimes use norms based on a group of people
who simply happen to be available at the time the test is being
constructed. These convenience norms are of limited use because
they are not representative of any defined population.
SCORES USED FOR EXPRESSING WITHIN-
GROUP NORMS
Percentiles
Percentiles are the most direct and ubiquitous method used to convey
norm-referenced test results. They are readily understood by test
takers and applicable to most sorts of tests and test populations.
A percentile score indicates the relative position of an individual test taker
compared to a reference group, such as the standardization
sample; specifically, it represents the percentage of persons
in the reference group who scored at or below a given raw
score.
Percentiles are scores that reflect the rank or position of an
individual's performance on a test in comparison to a
reference group; their frame of reference is other people.

Percentage scores reflect the number of correct responses
that an individual obtains out of the total possible number of
correct responses on a test; their frame of reference is the
content of the entire test.
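The contrast between the two kinds of scores can be sketched in a few lines of Python. This is only an illustration: the function names, the 50-item test length, and the reference-group scores are hypothetical and not taken from any published test. The percentile rank follows the "at or below" definition given above.

```python
def percentile_rank(raw_score, reference_scores):
    """Percentage of the reference group scoring at or below raw_score."""
    at_or_below = sum(1 for s in reference_scores if s <= raw_score)
    return 100.0 * at_or_below / len(reference_scores)


def percentage_score(num_correct, num_items):
    """Percentage of the test's items answered correctly."""
    return 100.0 * num_correct / num_items


# Hypothetical reference group: raw scores on a 50-item test.
reference_group = [22, 25, 28, 30, 30, 33, 35, 38, 40, 45]

raw = 40
pr = percentile_rank(raw, reference_group)   # frame of reference: other people
pct = percentage_score(raw, 50)              # frame of reference: test content
print(f"percentile rank = {pr}, percentage score = {pct}")
```

The same raw score yields two different numbers because the first compares the test taker with the group and the second compares the test taker with the content of the test.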
If a test taker reaches the highest score attainable on a test, the
test ceiling, or maximum difficulty level of the test, is
insufficient: one cannot know how much higher the test taker
might have scored if there were additional items or items of
greater difficulty in the test.

If a person fails all the items presented in a test or scores
lower than any of the people in the normative sample, the
problem is one of insufficient test floor.
Standard Scores
One way to surmount the problem of the inequality of
percentile units and still convey the meaning of test scores
relative to a normative or reference group is to transform raw
scores into scales that express the position of scores, relative
to the mean, in standard deviation units.
This is accomplished by means of simple linear transformations.

A linear transformation changes the units in which scores are
expressed while leaving the interrelationships among them
unaltered.
The first linear transformation performed on raw scores is to
convert them into standard-score deviates, or z scores.
A z score expresses the distance between a raw score and the
mean of the reference group in terms of the standard
deviation of the reference group.
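In symbols, the definition above amounts to z = (X - M) / SD. A minimal Python sketch follows; the reference-group scores are hypothetical, and the use of the population standard deviation (pstdev) rather than the sample standard deviation is an assumption made for the illustration.

```python
from statistics import mean, pstdev


def z_score(raw_score, group_mean, group_sd):
    """Distance of raw_score from the group mean, in standard deviation units."""
    return (raw_score - group_mean) / group_sd


# Hypothetical reference group of raw scores.
reference_group = [10, 12, 14, 16, 18]
m = mean(reference_group)     # mean of the reference group
sd = pstdev(reference_group)  # population standard deviation (an assumption)

z = z_score(18, m, sd)
print(f"M = {m}, SD = {sd:.3f}, z = {z:.3f}")
```

A raw score equal to the mean gives z = 0, scores above the mean give positive values, and scores below it give negative values.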
T-scores (M = 50, SD = 10), used in many personality
inventories, such as the Minnesota Multiphasic Personality
Inventory (MMPI) and the California
Psychological Inventory (CPI).
College Entrance Examination Board (CEEB) scores (M = 500,
SD = 100), used by the College Board's SAT as well as by the
Educational Testing Service for many of their graduate and
professional school admission testing programs, such as the
Graduate Record Exam (GRE).
Wechsler scale subtest scores (M = 10, SD = 3), used for all
the subtests of the Wechsler scales, as well as for the subtests
of several other instruments.
Wechsler scale deviation IQs (M = 100, SD = 15), used for the
summary scores of all the Wechsler scales and other tests,
including many that do not label their scores as IQs.
Otis-Lennon School Ability Indices (M = 100, SD = 16), used in
the Otis-Lennon School Ability Test (OLSAT), which is the
current title of the series of group tests that started with the
Otis Group Intelligence Scale.
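Each of the systems listed above is a linear transformation of the z score: new score = (z × new SD) + new M. The sketch below applies that formula using the means and standard deviations given in the list; the dictionary and function names are hypothetical, and the rounding that published tests apply is left out.

```python
# (mean, standard deviation) for each standard score system listed above.
SCALES = {
    "T score": (50, 10),
    "CEEB score": (500, 100),
    "Wechsler subtest score": (10, 3),
    "Wechsler deviation IQ": (100, 15),
    "Otis-Lennon School Ability Index": (100, 16),
}


def from_z(z, scale_name):
    """Linear transformation of a z score into the named standard score system."""
    scale_mean, scale_sd = SCALES[scale_name]
    return scale_mean + z * scale_sd


# A score one standard deviation above the mean, expressed in every system.
converted = {name: from_z(1.0, name) for name in SCALES}
for name, value in converted.items():
    print(f"{name}: {value}")
```

Because the transformation is linear, a score that is one standard deviation above the mean occupies the same relative position in every system; only the units change.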
Nonlinear Transformations
Nonlinear transformations are those that convert a raw score
distribution into a distribution that has a different shape than
the original.
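One common nonlinear transformation, used here purely as an illustration, is normalization: each raw score is replaced by the z score that corresponds to its percentile position under the normal curve. The sketch assumes a mid-percentile convention (half of the tied scores count as falling below) so that no proportion reaches 0 or 1; both that convention and the skewed sample data are assumptions.

```python
from statistics import NormalDist


def normalized_z_scores(raw_scores):
    """Map each distinct raw score to the normal-curve z of its mid-percentile rank."""
    n = len(raw_scores)
    result = {}
    for x in sorted(set(raw_scores)):
        below = sum(1 for s in raw_scores if s < x)
        at = sum(1 for s in raw_scores if s == x)
        proportion = (below + 0.5 * at) / n  # strictly between 0 and 1
        result[x] = NormalDist().inv_cdf(proportion)
    return result


# Hypothetical, positively skewed raw score distribution.
skewed = [1, 1, 1, 2, 2, 3, 4, 6, 9, 15]
normalized = normalized_z_scores(skewed)
for raw, z in normalized.items():
    print(f"raw {raw:>2} -> normalized z {z:+.2f}")
```

The rank order of scores is preserved, but the distances between them are not: the gap between raw scores of 9 and 15 is twice the gap between 6 and 9, yet the normalized gaps do not keep that ratio. That change in spacing is what distinguishes a nonlinear transformation from a linear one.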
Equating Procedures
Alternate forms consist of two or more versions of a test
meant to be used interchangeably, intended for the same
purpose, and administered in identical fashion.
A stricter form of comparability can be achieved with the
type of alternate versions of tests known as parallel forms.
Practice effects are score increases attributable to prior
exposure to test items or to items similar to those in a test.
Anchor tests consist of common sets of items administered
to different groups of examinees in the context of two or
more tests and provide a different solution to the problem of
test score comparability.

Fixed reference groups provide a way of achieving some
comparability and continuity of test scores over time.

Simultaneous norming of two or more tests on the same
standardization sample, often referred to as co-norming, is
yet another method used to achieve comparability of scores.
ITEM RESPONSE THEORY (IRT)

These procedures, which date back to the 1960s and are also
known as latent trait models, are most often grouped under
the label of item response theory (IRT).
The term latent trait reflects the fact that these models seek
to estimate the levels of various unobservable abilities,
traits, or psychological constructs that underlie the
observable behavior of individuals, as demonstrated by their
responses to test items.
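The link between a latent trait and an observable item response can be illustrated with a logistic item response function. The slides do not name a specific model, so the two-parameter logistic form below, along with its parameter names (theta for the trait level, difficulty, discrimination), is an assumption chosen as one widely used example.

```python
import math


def probability_correct(theta, difficulty, discrimination=1.0):
    """Two-parameter logistic item response function.

    theta          : the test taker's level on the latent trait
    difficulty     : trait level at which the probability of success is 0.5
    discrimination : how sharply the probability rises around that level
    """
    return 1.0 / (1.0 + math.exp(-discrimination * (theta - difficulty)))


# Hypothetical item of average difficulty, evaluated at three trait levels.
p_low = probability_correct(-1.0, difficulty=0.0)
p_mid = probability_correct(0.0, difficulty=0.0)
p_high = probability_correct(1.0, difficulty=0.0)
print(f"{p_low:.3f} {p_mid:.3f} {p_high:.3f}")
```

Under this sketch, the probability of a correct response rises smoothly with the trait level, which is what allows trait levels to be estimated from patterns of item responses.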
Computerized Adaptive Testing
One of the main advantages of IRT methodology is that it is
ideally suited for use in computerized adaptive testing (CAT).
Longitudinal Changes in Test Norms
When a test is revised and standardized on a new sample
after a period of several years, even if revisions in its content
are minor, score norms tend to drift in one direction or
another due to changes in the population at different time
periods.
One puzzling longitudinal trend, known as the Flynn effect, is
the long-term rise in average scores on intelligence tests
observed when those tests are restandardized on new samples.
CRITERION-REFERENCED TEST INTERPRETATION
In the realm of educational and occupational assessment,
tests are often used to help ascertain whether a person has
reached a certain level of competence in a field of knowledge
or skill in performing a task.
VARIETIES OF CRITERION-REFERENCED
TEST INTERPRETATION
The term criterion-referenced testing, popularized by Glaser
(1963), is sometimes used as synonymous with domain-
referenced, content-referenced, objective-referenced, or
competency testing.
Criterion-referenced interpretations fall into two main varieties:
(a) those that are based on the amount of knowledge of a
content domain as demonstrated in standardized objective
tests, and
(b) those that are based on the level of competence in a skill
area as displayed in the quality of the performance itself or of
the product that results from exercising the skill.
The term criterion-referenced testing is also used to refer to
interpretations based on the pre-established relationship
between the scores on a test and expected levels of
performance on a criterion, such as a future endeavor or even
another test.
In this particular usage, the criterion is a specific outcome
and may or may not be related to the tasks sampled by the
test.
NORM- VERSUS CRITERION-REFERENCED
TEST INTERPRETATION
Norm-referenced tests seek to locate the performance of one
or more individuals, with regard to the construct the tests
assess, on a continuum created by the performance of a
reference group.
Criterion-referenced tests seek to evaluate the performance
of individuals in relation to standards related to the construct
itself.
Whereas in norm-referenced test interpretation the frame of
reference is always people, in criterion-referenced test
interpretation the frame of reference may be
(a) knowledge of a content domain as demonstrated in
standardized, objective tests; or
(b) level of competence displayed in the quality of a
performance or of a product.
The term criterion-referenced testing is sometimes also
applied to describe test interpretations that use the
relationship between the scores and expected levels of
performance or standing on a criterion as a frame of reference.
When knowledge domains are the frame of reference for test
interpretation, the question to be answered is "How much of
the specified domain has the test taker mastered?" and
scores are most often presented in the form of percentages
of correct answers. This sort of criterion-referenced test
interpretation is often described as content- or domain-
referenced testing.
Planning for such tests requires the development of a table
of specifications with cells that specify the number of items
or tasks to be included in the test for each of the learning
objectives and content areas the test is designed to evaluate.
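A table of specifications can be represented as a simple nested mapping from content areas to learning objectives to item counts. The content areas, objectives, and counts below are hypothetical and serve only to show how the cells add up to a test blueprint.

```python
# Hypothetical table of specifications: content area -> objective -> item count.
table_of_specifications = {
    "Percentiles": {"Recall": 4, "Application": 6},
    "Standard scores": {"Recall": 5, "Application": 10},
    "Equating": {"Recall": 3, "Application": 2},
}

# Row totals: number of items planned for each content area.
items_per_area = {
    area: sum(cells.values()) for area, cells in table_of_specifications.items()
}

# Column totals: number of items planned for each learning objective.
items_per_objective = {}
for cells in table_of_specifications.values():
    for objective, count in cells.items():
        items_per_objective[objective] = items_per_objective.get(objective, 0) + count

total_items = sum(items_per_area.values())
print(items_per_area, items_per_objective, total_items)
```

Checking the row and column totals against the intended emphasis of each area and objective is one way to confirm that the planned test samples the domain as intended.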
The usual methods for evaluating qualitative criteria involve
rating scales or scoring rubrics (i.e., scoring guides) that
describe and illustrate the rules and principles to be applied
in scoring the quality of a performance or product.
Mastery testing. Procedures that evaluate test performance
on the basis of whether the individual test taker does or does
not demonstrate a pre-established level of mastery are
known as mastery tests.
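A mastery decision reduces to comparing a score with a pre-established cut score. In the sketch below the 80 percent cut score and the function name are hypothetical; in practice the standard is set by the test user or developer.

```python
def mastery_decision(num_correct, num_items, cut_percent=80):
    """Return True if the percentage correct meets or exceeds the cut score.

    The comparison is done with integers to avoid floating-point rounding
    at the boundary.
    """
    return num_correct * 100 >= cut_percent * num_items


# Hypothetical 50-item test with an 80 percent mastery standard.
print(mastery_decision(40, 50))  # exactly at the cut score
print(mastery_decision(39, 50))  # one item short of the cut score
```

The outcome is dichotomous: how far a test taker falls above or below the standard does not change the classification.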
Predicting Performance
Sometimes the term criterion-referenced test interpretation
is used to describe the application of empirical data
concerning the link between test scores and levels of
performance, to a criterion such as job performance or
success in a program of study.
Expectancy tables show the distribution of test scores for
one or more groups of individuals, cross-tabulated against
their criterion performance.
Expectancy charts are used when criterion performance in a
job, training program, or program of study can be classified
as either successful or unsuccessful.
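The cross-tabulation behind an expectancy table or chart can be sketched as follows. The records, the 20-point score bands, and the function names are all hypothetical; the output is the percentage of people in each score band who were successful on the criterion.

```python
from collections import defaultdict

# Hypothetical (test score, succeeded on the criterion) pairs.
records = [
    (45, False), (49, False), (52, False), (58, True),
    (61, False), (64, True), (67, True), (72, True), (75, True),
    (81, True), (88, True),
]


def score_band(score, width=20, lowest=40):
    """Label the score band (for example '40-59') that contains the score."""
    start = lowest + ((score - lowest) // width) * width
    return f"{start}-{start + width - 1}"


def expectancy_table(pairs):
    """Percentage of successful cases within each test score band."""
    counts = defaultdict(lambda: [0, 0])  # band -> [successes, total]
    for score, succeeded in pairs:
        band = score_band(score)
        counts[band][1] += 1
        if succeeded:
            counts[band][0] += 1
    return {band: 100.0 * s / t for band, (s, t) in sorted(counts.items())}


table = expectancy_table(records)
for band, percent in table.items():
    print(f"scores {band}: {percent:.0f}% successful")
```

Read this way, a new test taker's score band gives an empirically based estimate of the likelihood of success on the criterion, provided the new test taker resembles the group on which the table was built.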
In norm-referenced testing, the primary objective is to make
distinctions among individuals in terms of the ability or trait
assessed by a test.

In criterion-referenced testing, the primary objective is to
evaluate a person's degree of competence or mastery of a
skill or knowledge domain in terms of a pre-established
standard of performance.
