Introduction To Classical Test Theory With CITAS: Nathan A. Thompson, Ph.D.
Contact Information
Assessment Systems Corporation
111 Cheshire Lane, #50
Minnetonka, MN 55305
Voice: 763.476.4764
E-Mail: [email protected]
www.assess.com
This white paper is intended for anyone who is interested in learning how to make tests
and assessments better, by helping you apply international best practices for evaluating the
performance of your assessments. CITAS provides the basic analytics necessary for this evaluation,
and it does so without requiring advanced knowledge of psychometrics or of software
programming. However, if you are interested in more advanced capabilities and more sophisticated
psychometrics, I recommend that you check out www.assess.com/iteman for Classical Test
Theory and www.assess.com/xcalibre for Item Response Theory.
What are the guidelines? There are several resources, and they can differ based on the use of the
test as well as your country. General guidelines are published by APA/AERA/NCME and the
International Test Commission. If you work with professional certifications, look at the National
Commission for Certifying Agencies or the American National Standards Institute. In the US,
there are also the Uniform Guidelines for personnel selection.
This paper will begin by defining the concepts and statistics used in classical item and test
analysis, and then present how the CITAS spreadsheet provides the relevant information. CITAS
was designed to provide software for quantitative analysis of testing data that is as
straightforward as possible – no command code, no data formatting, no complex interface.
A poor rpbis value can usually be traced to one of three causes:
1. A key error;
2. A very attractive distractor;
3. This item is so easy/hard that there are few examinees on one side of the fence,
making it difficult to correlate anything.
All three things are issues with the item that need to be addressed.
Option Statistics
If you wish to dig even deeper into the performance of an item, the next step is an
evaluation of option statistics. With multiple-choice items, the word option refers to the possible
answers available. The correct answer is typically called the key, and the incorrect options are
called distractors.
Evaluating the option statistics for telltale patterns is an important step in diagnosing
items that have been flagged for poor P or rpbis values at the item level. This is done by
evaluating P and rpbis at the option level. In general, we want two things to happen (a short
computational sketch follows this list):
1. The P for the key is greater than the P for any of the distractors. That is, we don’t
want more students choosing one of the distractors than the key. In many (but not all) cases,
this means the distractor is arguably correct or the key is arguably incorrect.
2. The rpbis for the distractors should be negative but the rpbis for the key should be
positive. If an rpbis for a distractor is positive, this means that smart examinees are choosing it,
and we usually want the not-so-smart examinees selecting the incorrect answers. However, this
pattern is very susceptible to fluctuations in small sample sizes; if only 4 examinees select an
option and one or two are of very high ability, that is often enough to produce a positive rpbis and
therefore flag the item.
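To make these two checks concrete, here is a minimal sketch in Python of how option-level
P and rpbis values might be computed. This is an illustration of the standard point-biserial
formula, not CITAS or Iteman code; the names responses and scores are assumptions for the
example.

    import statistics

    def option_stats(responses, scores, options=("A", "B", "C", "D", "E")):
        """Option-level P and rpbis for one item.

        responses: the option each examinee chose for this item, e.g. ["B", "C", ...]
        scores:    each examinee's total test score
        """
        n = len(responses)
        mean_total = statistics.mean(scores)
        sd_total = statistics.pstdev(scores)  # population SD of total scores
        results = {}
        for opt in options:
            chose = [r == opt for r in responses]
            p = sum(chose) / n  # proportion choosing this option
            if 0 < p < 1 and sd_total > 0:
                # Point-biserial: mean total score of choosers vs. overall mean,
                # scaled by the choosing rate.
                mean_opt = statistics.mean(s for s, c in zip(scores, chose) if c)
                rpbis = (mean_opt - mean_total) / sd_total * (p / (1 - p)) ** 0.5
            else:
                rpbis = float("nan")  # undefined if nobody (or everybody) chose it
            results[opt] = (p, rpbis)
        return results

An item would then be flagged for review when the key's P is not the largest, or when any
distractor's rpbis is positive.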
An even deeper analysis uses quantile plots. This methodology splits the sample into
quantiles based on ability (total score) and then evaluates the option P values within each
group, plotting them on a single graph. We want to see the same pattern of performance, which
usually means that the line for the key has a positive slope (positive rpbis) and the lines for the
distractors have negative slopes (negative rpbis). An example of Iteman output is below.
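Before turning to that output, here is a rough sketch of how such a quantile plot could be
drawn with matplotlib, under the same assumptions as the earlier code; it is an illustration,
not how Iteman itself produces its plots.

    import matplotlib.pyplot as plt

    def quantile_plot(responses, scores, key, n_groups=5,
                      options=("A", "B", "C", "D", "E")):
        # Sort examinee indices by total score, then cut into n_groups groups.
        order = sorted(range(len(scores)), key=lambda i: scores[i])
        size = len(order) // n_groups
        groups = [order[g * size:(g + 1) * size] for g in range(n_groups)]
        groups[-1].extend(order[n_groups * size:])  # fold remainder into top group
        for opt in options:
            # Option P value within each ability group
            props = [sum(responses[i] == opt for i in g) / len(g) for g in groups]
            style = "o-" if opt == key else "s--"  # solid line for the key
            plt.plot(range(1, n_groups + 1), props, style, label=opt)
        plt.xlabel("Ability group (1 = lowest total scores)")
        plt.ylabel("Proportion choosing option")
        plt.legend(title="Option")
        plt.show()

A healthy item shows the key's line climbing across the groups while every distractor's
line falls.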
Statistic   Value
Examinees   100
Items       72
Mean        40.32
SD          11.01
Variance    121.29
Min         14
Max         69
KR-20       0.89
SEM         3.67
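A quick consistency check on this table: in classical test theory, the standard error of
measurement is SEM = SD * sqrt(1 - reliability), where reliability here is KR-20. Below is a
small sketch of both formulas; this is illustrative code, not the actual report logic.

    import math
    import statistics

    def kr20(matrix):
        """KR-20 reliability; matrix has one row per examinee of 0/1 item scores."""
        k = len(matrix[0])                     # number of items
        totals = [sum(row) for row in matrix]  # total score per examinee
        var_total = statistics.pvariance(totals)
        sum_pq = 0.0
        for j in range(k):
            p = sum(row[j] for row in matrix) / len(matrix)  # item P value
            sum_pq += p * (1 - p)
        return k / (k - 1) * (1 - sum_pq / var_total)

    def sem(sd, reliability):
        return sd * math.sqrt(1 - reliability)

    # With the rounded values above: 11.01 * sqrt(1 - 0.89) is about 3.65,
    # matching the reported 3.67 up to rounding of the displayed SD and KR-20.
    print(sem(11.01, 0.89))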
Option P values (Items 1-10):

Option    1      2      3      4      5      6      7      8      9      10
A         0.03   0.03   0.16   0.07   0.12   0.07   0.02   0.40   0.14   0.03
B         0.01   0.89   0.60   0.81   0.00   0.08   0.76   0.05   0.10   0.90
C         0.94   0.05   0.09   0.06   0.06   0.58   0.13   0.18   0.61   0.02
D         0.02   0.03   0.14   0.06   0.82   0.27   0.08   0.37   0.14   0.05
E         0.00   0.00   0.00   0.00   0.00   0.00   0.00   0.00   0.00   0.00

Option rpbis values (Items 1-10; n/a means no examinees chose the option, so the
correlation cannot be computed):

Option    1      2      3      4      5      6      7      8      9      10
A         0.02  -0.13  -0.33  -0.30  -0.22  -0.18   0.15   0.49  -0.26  -0.05
B        -0.20   0.33   0.49   0.37   n/a   -0.19  -0.08  -0.26  -0.24   0.23
C         0.18  -0.24  -0.21  -0.05  -0.25   0.43  -0.02  -0.19   0.45  -0.06
D        -0.19  -0.17  -0.17  -0.23   0.34  -0.25   0.08  -0.23  -0.16  -0.24
E         n/a    n/a    n/a    n/a    n/a    n/a    n/a    n/a    n/a    n/a
Here, we can see the general pattern of the key having a strong (positive) rpbis while the
distractors have negative rpbis. The exception, of course, is Item 7, as discussed earlier: the
key (B) has rpbis = -0.08, while both A and D have positive rpbis, albeit with small N. This means
that those two distractors happened to pull some smart students, and the item should be reviewed.
In some cases, you might see a deviation from the desired pattern. In Item 1, we see that
A has a positive but small rpbis (0.02). Only 3 examinees selected A, so this is a case of the
aforementioned situation where one or two smart examinees selecting a distractor are enough to
produce a positive rpbis and flag the item. This item is likely just fine.
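That small-sample caveat can be built directly into a review rule. The sketch below flags a
distractor only when its rpbis is positive and enough examinees chose it for the correlation to
be stable; option_stats is the earlier illustration, and the minimum-N threshold of 10 is a
hypothetical choice, not a published standard.

    def flag_distractors(responses, scores, key, min_n=10):
        """Return distractors with positive rpbis based on at least min_n choosers."""
        n = len(responses)
        flags = []
        for opt, (p, rpbis) in option_stats(responses, scores).items():
            if opt == key:
                continue  # only distractors are checked here
            n_chose = round(p * n)  # how many examinees picked this distractor
            # NaN comparisons are False, so options nobody chose never flag.
            if rpbis > 0 and n_chose >= min_n:
                flags.append((opt, rpbis, n_chose))
        return flags

Under this rule, Item 1's option A (only 3 examinees) would not be flagged, which matches the
judgment above that the item is likely fine.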
Summary
Item analysis is a vital step in the test development cycle, as all tests are composed of
items and good items are necessary for a good test. Classical test theory provides some
methods for evaluating items based on simple statistics like proportions, correlations, and
averages. However, this does not mean item evaluation is easy. I’ve presented some guidelines
and examples, but it really comes down to going through the statistical output and a copy of the
test with an eye for detail. While psychometricians and software can always give you the output
with some explanation, only the item writer, instructor, or other content expert can adequately
evaluate the items, because doing so requires a deep understanding of the test content.
Although CITAS is quite efficient for classical analysis of small-scale assessments and for
teaching classical psychometric methods, it is not designed for large-scale use. That role is
filled by two other programs, FastTest and Iteman 4. Iteman 4 is designed to produce a
comprehensive classical analysis, but in the form of a formal MS Word report ready for
immediate delivery to content experts; please visit www.assess.com/iteman to learn more.
FastTest is ASC’s comprehensive ecosystem for test development, delivery, and analytics. It can
produce Iteman reports directly from the system if you utilize it to deliver your tests.
Further reading
Downing, S.M., & Haladyna, T.M. (Eds.) (2006). Handbook of test development. Philadelphia:
Taylor & Francis.
Furr, R.M., & Bacharach, V.R. (2007). Psychometrics: An introduction. Thousand Oaks, CA: Sage.
Shultz, K.S., & Whitney, D.J. (2005). Measurement theory in action. Thousand Oaks, CA: Sage.