Chapter 3: Understanding Test Quality-Concepts of Reliability and Validity
Chapter Highlights
1. What makes a good test?
2. Test reliability
3. Interpretation of reliability information from test manuals and reviews
4. Types of reliability estimates
5. Standard error of measurement
6. Test validity
7. Methods for conducting validation studies
8. Using validity evidence from outside studies
9. How to interpret validity information from test manuals and independent reviews
What makes a good test?
A good test generally has the following qualities:
The test measures what it claims to measure consistently or reliably. This means that if a person were to take the test again, the person would get a similar test score.
The test measures what it claims to measure. For example, a test of mental ability
does in fact measure mental ability, and not some other characteristic.
The test is job-relevant. In other words, the test measures one or more
characteristics that are important to the job.
By using the test, more effective employment decisions can be made about
individuals. For example, an arithmetic test may help you to select qualified workers
for a job that requires knowledge of arithmetic operations.
The degree to which a test has these qualities is indicated by two technical
properties: reliability and validity.
Test reliability
Reliability refers to how dependably or consistently a test measures a characteristic. If a
person takes the test again, will he or she get a similar test score, or a much different
score? A test that yields similar scores for a person who repeats the test is said to measure
a characteristic reliably.
How do we account for an individual who does not get exactly the same test score every time he or she takes the test? Some possible reasons are the test taker's temporary physical or emotional state, differences in the testing environment, differences between alternate forms of the test, and differences in judgment among raters when scoring requires judgment.
The discussion in Table 2 should help you develop some familiarity with the different kinds
of reliability estimates reported in test manuals and reviews.
Some constructs are more stable than others. For example, an individual's reading
ability is more stable over a particular period of time than that individual's anxiety
level. Therefore, you would expect a higher test-retest reliability coefficient on a
reading test than you would on a test that measures anxiety. For constructs that are
expected to vary over time, an acceptable test-retest reliability coefficient may be
lower than is suggested in Table 1.
A high parallel form reliability coefficient indicates that the different forms of the test are very similar, which means that it makes virtually no difference which version of the test a person takes. On the other hand, a low parallel form reliability coefficient suggests that the different forms are probably not comparable; they may be measuring different things and therefore cannot be used interchangeably.
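As a rough illustration of how such coefficients are obtained, the sketch below estimates reliability as the Pearson correlation between two sets of scores from the same group of test takers, which could be two administrations of the same test (test-retest) or two parallel forms. The scores and the code are hypothetical, minimal examples rather than a prescribed procedure.

```python
# Minimal sketch: reliability estimated as the Pearson correlation between
# two sets of scores (two administrations or two parallel forms).
# All scores are hypothetical.
from math import sqrt

def pearson_r(x, y):
    """Pearson correlation coefficient between two equal-length score lists."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / sqrt(var_x * var_y)

# Hypothetical scores for six people on two administrations (or two forms).
first_scores = [52, 61, 45, 70, 58, 66]
second_scores = [50, 63, 47, 68, 60, 69]

print(f"Estimated reliability coefficient: {pearson_r(first_scores, second_scores):.2f}")
```

A coefficient near 1.0 would suggest that the two administrations or forms rank people in much the same way; a low coefficient would suggest the opposite.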
Inter-rater reliability indicates how consistent test scores are likely to be if the
test is scored by two or more raters.
On some tests, raters evaluate responses to questions and determine the score.
Differences in judgments among raters are likely to produce variations in test scores.
A high inter-rater reliability coefficient indicates that the judgment process is stable
and the resulting scores are reliable.
Inter-rater reliability coefficients are typically lower than other types of reliability estimates. However, it is possible to obtain higher levels of inter-rater reliability if raters are appropriately trained.
A high internal consistency reliability coefficient for a test indicates that the items on
the test are very similar to each other in content (homogeneous). It is important to
note that the length of a test can affect internal consistency reliability. For example,
a very lengthy test can spuriously inflate the reliability coefficient.
Tests that measure multiple characteristics are usually divided into distinct
components. Manuals for such tests typically report a separate internal consistency
reliability coefficient for each component in addition to one for the whole test.
Test manuals and reviews report several kinds of internal consistency reliability
estimates. Each type of estimate is appropriate under certain circumstances. The test
manual should explain why a particular estimate is reported.
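One widely reported internal consistency estimate is coefficient alpha (often called Cronbach's alpha). The sketch below shows one way it can be computed from a small, hypothetical set of item scores; it is a minimal illustration, not a substitute for the procedures a test manual should describe.

```python
# Minimal sketch: coefficient alpha (Cronbach's alpha) computed from
# hypothetical item scores. Rows are test takers, columns are items.
from statistics import pvariance

def coefficient_alpha(item_scores):
    """Return coefficient alpha for a list of per-person item-score rows."""
    n_items = len(item_scores[0])
    # Variance of each item across test takers.
    item_variances = [pvariance([row[i] for row in item_scores])
                      for i in range(n_items)]
    # Variance of the total test score across test takers.
    total_variance = pvariance([sum(row) for row in item_scores])
    return (n_items / (n_items - 1)) * (1 - sum(item_variances) / total_variance)

scores = [
    [3, 4, 3, 5],
    [2, 2, 3, 3],
    [4, 5, 4, 5],
    [1, 2, 2, 2],
    [3, 3, 4, 4],
]
print(f"Coefficient alpha: {coefficient_alpha(scores):.2f}")
```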
The standard error of measurement (SEM) is a useful measure of the accuracy of individual test scores. The smaller the SEM, the more accurate the measurements.
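A common way to compute the SEM, assuming the test's standard deviation and reliability coefficient are known, is to multiply the standard deviation by the square root of one minus the reliability coefficient. The sketch below uses hypothetical values to show how the SEM can be used to place a rough band around an individual's observed score.

```python
# Minimal sketch: standard error of measurement from hypothetical values,
# using SEM = SD * sqrt(1 - reliability).
from math import sqrt

score_sd = 10.0      # standard deviation of test scores (hypothetical)
reliability = 0.91   # reliability coefficient (hypothetical)
observed_score = 75  # one person's observed score (hypothetical)

sem = score_sd * sqrt(1 - reliability)
print(f"SEM = {sem:.1f}")

# Roughly two-thirds of the time, the person's true score would be expected
# to fall within one SEM of the observed score.
print(f"Approximate band: {observed_score - sem:.1f} to {observed_score + sem:.1f}")
```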
The following are important factors to keep in mind when interpreting reliability information reported in test manuals and reviews:
Types of reliability used. The manual should indicate why a certain type of reliability coefficient was reported. The manual should also discuss sources of random measurement error that are relevant for the test.
How reliability studies were conducted. The manual should indicate the
conditions under which the data were obtained, such as the length of time that
passed between administrations of a test in a test-retest reliability study. In general,
reliabilities tend to drop as the time between test administrations increases.
The characteristics of the sample group. The manual should indicate the
important characteristics of the group used in gathering reliability information, such
as education level, occupation, etc. This will allow you to compare the characteristics
of the people you want to test with the sample group. If they are sufficiently similar,
then the reported reliability estimates will probably hold true for your population as
well.
For more information on reliability, consult the APA Standards, the SIOP Principles, or any
major textbook on psychometrics or employment testing. Appendix A lists some possible
sources.
Test validity
Validity is the most important issue in selecting a test. Validity refers to what
characteristic the test measures and how well the test measures that characteristic.
Validity tells you if the characteristic being measured by a test is related to job
qualifications and requirements.
Validity gives meaning to the test scores. Validity evidence indicates that there is a link between test performance and job performance. It can tell you what you
may conclude or predict about someone from his or her score on the test. If a test
has been demonstrated to be a valid predictor of performance on a specific job, you
can conclude that persons scoring high on the test are more likely to perform well on
the job than persons who score low on the test, all else being equal.
Validity also describes the degree to which you can make specific conclusions or
predictions about people based on their test scores. In other words, it indicates the
usefulness of the test.
Principle of Assessment: Use only assessment procedures and instruments that have
been demonstrated to be valid for the specific purpose for which they are being used.
It is important to understand the differences between reliability and validity. Validity will tell
you how good a test is for a particular situation; reliability will tell you how trustworthy a
score on that test will be. You cannot draw valid conclusions from a test score unless you
are sure that the test is reliable. Even when a test is reliable, it may not be valid. You
should be careful that any test you select is both reliable and valid for your situation.
A test's validity is established in reference to a specific purpose; the test may not be valid for different purposes. For example, the test you use to make valid predictions about someone's technical proficiency on the job may not be valid for predicting his or her leadership skills or absenteeism rate.
Similarly, a test's validity is established in reference to specific groups. These groups are
called the reference groups. The test may not be valid for different groups. For example, a
test designed to predict the performance of managers in situations requiring problem
solving may not allow you to make valid or meaningful predictions about the performance of
clerical employees. If, for example, the kind of problem-solving ability required for the two
positions is different, or the reading level of the test is not suitable for clerical applicants,
the test results may be valid for managers, but not for clerical employees.
Test developers have the responsibility of describing the reference groups used to develop
the test. The manual should describe the groups for whom the test is valid, and the
interpretation of scores for individuals belonging to each of these groups. You must
determine if the test can be used appropriately with the particular type of people you want
to test. This group of people is called your target population or target group.
Principle of Assessment: Use assessment tools that are appropriate for the target
population.
Your target group and the reference group do not have to match on all factors, but they must be sufficiently similar so that the test will yield meaningful scores for your group. For
example, a writing ability test developed for use with college seniors may be appropriate for
measuring the writing ability of white-collar professionals or managers, even though these
groups do not have identical characteristics. In determining the appropriateness of a test for
your target groups, consider factors such as occupation, reading level, cultural differences,
and language barriers.
In order to be certain an employment test is useful and valid, evidence must be collected
relating the test to a job. The process of establishing the job relatedness of a test is
called validation.
The first method, criterion-related validation, involves demonstrating a statistical relationship, typically a correlation, between scores on the test and a measure of job performance (the criterion).
Second, the content validation method may be used when you want to determine if there is a relationship between behaviors measured by a test and behaviors involved in the job. For example, a typing test would provide strong content validation support for a secretarial position, assuming much typing is required each day. If, however, the job required only minimal typing, then the same test would have little content validity. Content validity does not apply to tests measuring learning ability or general problem-solving skills (French, 1990).
Finally, the third method is construct validity. This method often pertains to tests that may
measure abstract traits of an applicant. For example, construct validity may be used when a
bank desires to test its applicants for "numerical aptitude." In this case, an aptitude is not
an observable behavior, but a concept created to explain possible future behaviors. To
demonstrate that the test possesses construct validation support, ". . . the bank would need
to show (1) that the test did indeed measure the desired trait and (2) that this trait
corresponded to success on the job" (French, 1990, p. 260).
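As a rough illustration of criterion-related evidence, a validity coefficient is typically computed as the correlation between test scores and a measure of job performance for the same group of people. The sketch below uses hypothetical numbers only.

```python
# Minimal sketch: a criterion-related validity coefficient as the correlation
# between test scores and job performance measures. All values hypothetical.
from statistics import correlation  # Python 3.10+

test_scores = [52, 61, 45, 70, 58, 66, 49, 74]
job_performance = [3.1, 3.8, 2.9, 4.2, 3.5, 3.9, 3.0, 4.4]  # e.g., supervisor ratings

validity_coefficient = correlation(test_scores, job_performance)
print(f"Validity coefficient: {validity_coefficient:.2f}")
```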
Professionally developed tests should come with reports on validity evidence, including
detailed explanations of how validation studies were conducted. If you develop your own
tests or procedures, you will need to conduct your own validation studies. As the test user,
you have the ultimate responsibility for making sure that validity evidence exists for the
conclusions you reach using the tests. This applies to all tests and procedures you use,
whether they have been bought off-the-shelf, developed externally, or developed in-house.
Validity evidence is especially critical for tests that have adverse impact. When a test has
adverse impact, the Uniform Guidelines require that validity evidence for that specific
employment decision be provided.
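One common screen for adverse impact under the Uniform Guidelines is the four-fifths rule of thumb: a selection rate for any group that is less than four-fifths (80 percent) of the rate for the group with the highest rate is generally regarded as evidence of adverse impact. The sketch below applies that rule to hypothetical applicant and hiring counts.

```python
# Minimal sketch: the four-fifths rule of thumb applied to hypothetical counts.
applicants = {"Group A": 100, "Group B": 80}
hired = {"Group A": 40, "Group B": 20}

selection_rates = {group: hired[group] / applicants[group] for group in applicants}
highest_rate = max(selection_rates.values())

for group, rate in selection_rates.items():
    ratio = rate / highest_rate
    flag = "possible adverse impact" if ratio < 0.8 else "within the four-fifths rule"
    print(f"{group}: selection rate {rate:.2f}, ratio to highest {ratio:.2f} ({flag})")
```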
The particular job for which a test is selected should be very similar to the job for which the
test was originally developed. Determining the degree of similarity will require a job
analysis. Job analysis is a systematic process used to identify the tasks, duties,
responsibilities and working conditions associated with a job and the knowledge, skills,
abilities, and other characteristics required to perform that job.
Job analysis information may be gathered by direct observation of people currently in the
job, interviews with experienced supervisors and job incumbents, questionnaires, personnel
and equipment records, and work manuals. In order to meet the requirements of
the Uniform Guidelines, it is advisable that the job analysis be conducted by a qualified
professional, for example, an industrial and organizational psychologist or other professional
well trained in job analysis techniques. Job analysis information is central in deciding what
to test for and which tests to use.
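As one hypothetical way to keep job analysis results organized so they can be compared with what a test measures, the sketch below records tasks, required knowledge, skills, abilities, and other characteristics (KSAOs), and working conditions for a job. The structure and entries are illustrative assumptions, not a prescribed format.

```python
# Minimal sketch: a hypothetical record of job analysis results.
from dataclasses import dataclass, field

@dataclass
class JobAnalysis:
    job_title: str
    tasks: list[str] = field(default_factory=list)               # tasks, duties, responsibilities
    ksaos: list[str] = field(default_factory=list)               # knowledge, skills, abilities, other characteristics
    working_conditions: list[str] = field(default_factory=list)

clerk = JobAnalysis(
    job_title="Accounting Clerk",
    tasks=["Verify invoices", "Post transactions", "Reconcile accounts"],
    ksaos=["Knowledge of arithmetic operations", "Attention to detail"],
    working_conditions=["Office environment", "Deadline pressure at month end"],
)
print(clerk.job_title, "-", "; ".join(clerk.ksaos))
```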
To rely on validity evidence from outside studies, you should evaluate the following factors:
Validity evidence. The validation procedures used in the studies must be consistent with accepted standards.
Job similarity. A job analysis should be performed to verify that your job and the
original job are substantially similar in terms of ability requirements and work
behavior.
Fairness evidence. Reports of test fairness from outside studies must be
considered for each protected group that is part of your labor market. Where this
information is not available for an otherwise qualified test, an internal study of test
fairness should be conducted, if feasible.
Other significant variables. These include the type of performance measures and
standards used, the essential work activities performed, the similarity of your target
group to the reference samples, as well as all other situational factors that might
affect the applicability of the outside test for your use.
To ensure that the outside test you purchase or obtain meets professional and legal
standards, you should consult with testing professionals. See Chapter 5 for information on
locating consultants.
In evaluating validity information, it is important to determine whether the test can be used in the specific way you intended, and whether your target group is similar to the test reference group. The test manual and independent reviews should report the following:
Available validation evidence supporting use of the test for specific purposes. The
manual should include a thorough description of the procedures used in the
validation studies and the results of those studies.
The possible valid uses of the test. The purposes for which the test can legitimately
be used should be described, as well as the performance criteria that can validly be
predicted.
The sample group(s) on which the test was developed. For example, was the test
developed on a sample of high school graduates, managers, or clerical workers?
What was the racial, ethnic, age, and gender mix of the sample?
The group(s) for which the test may be used.
Here are three scenarios illustrating why you should consider these factors, individually and
in combination with one another, when evaluating validity coefficients:
Scenario One
You are hiring in a situation where the selection ratio is high and the positions being filled do not require a great deal of skill. In this situation, you might be willing to accept a selection tool whose validity is considered "likely to be useful" or even "depends on circumstances," because you need to fill the positions, you do not have many applicants to choose from, and the level of skill required is not that high.
Scenario Two
You are recruiting for jobs that require a high level of accuracy, and a mistake made by a
worker could be dangerous and costly. With these additional factors, a slightly lower validity
coefficient would probably not be acceptable to you because hiring an unqualified worker
would be too much of a risk. In this case you would probably want to use a selection tool
that reported validities considered to be "very beneficial" because a hiring error would be
too costly to your company.
Here is another scenario that shows why you need to consider multiple factors when
evaluating the validity of assessment tools.
Scenario Three
A company you are working for is considering a very costly selection system that results in fairly high levels of adverse impact. The alternative assessment tools you found with lower adverse impact had substantially lower validity and were just as costly, and hiring mistakes would be too much of a risk for your company. Given the difficulty of hiring for these particular positions, the "very beneficial" validity of the assessment, and your failed attempts to find alternative instruments with less adverse impact, your company decides to implement the selection system. However, it will continue efforts to find ways of reducing the system's adverse impact.
When properly applied, the use of valid and reliable assessment instruments will help you
make better decisions. Additionally, by using a variety of assessment tools as part of an
assessment program, you can more fully assess the skills and capabilities of people, while
reducing the effects of errors associated with any one tool on your decision making.