The Use of Value-Added Measures of Teacher Effectiveness in Policy and Practice

EDUCATION CHALLENGES FACING NEW YORK CITY

Sean P. Corcoran
in collaboration with
Annenberg Institute research staff

EDUCATION POLICY FOR ACTION SERIES
About the Annenberg Institute for School Reform

The Annenberg Institute for School Reform is a national policy-research and reform-support organization, affiliated with Brown University, that focuses on improving conditions and outcomes for all students in urban public schools, especially those serving disadvantaged children. The Institute's vision is the transformation of traditional school systems into "smart education systems" that develop and integrate high-quality learning opportunities in all areas of students' lives – at school, at home, and in the community.

The Institute conducts research; works with a variety of partners committed to educational improvement to build capacity in school districts and communities; and shares its work through print and Web publications. Rather than providing a specific reform design or model to be implemented, the Institute's approach is to offer an array of tools and strategies to help districts and communities strengthen their local capacity to provide and sustain high-quality education for all students.
Introduction ............................................................................................................................ 1
1 Value-Added Measurement: Motivation and Context ................................................ 1
2 What Is a Teacher’s Value-Added? .................................................................................. 4
3 Value-Added in Practice: New York City and Houston .............................................. 6
The New York City Teacher Data Initiative 6
The Houston ASPIRE Program 12
5 Discussion ............................................................................................................................ 28
References ........................................................................................................................... 29
Appendix A: Race to the Top Definitions of Teacher Effectiveness
and Student Achievement ...................................................................... 34
Appendix B: Sample New York City Teacher Data Report, 2010 ........................ 35
Appendix C: Sample New York City Teacher Data Report, 2009 ....................... 36
FIGURES
Figure 1 Factors affecting average achievement in two classrooms: hypothetical decomposition ............... 4
Figure 5 Teacher value-added on two reading tests: Houston fourth- and fifth-grade teachers ............... 17
Figure 6 Percent of students with a test score and percent contributing to value-added estimates,
grades four to six, Houston, 1998–2006 ............................................................................................. 20
Figure 7 Average confidence interval width, New York City Teacher Data Reports, 2008-2009 ............ 23
Figure 9 Year-to-year stability in value-added rankings: HISD reading test, 2000–2006 .......................... 26
Figure 10 Year-to-year stability in ELA and math value-added rankings: New York City Teacher Data
Reports, 2007-2008 .................................................................................................................................. 27
About the Author

Sean P. Corcoran is an assistant professor of educational economics at New York University's Steinhardt School of Culture, Education, and Human Development, an affiliated faculty of the Robert F. Wagner Graduate School of Public Service, and a research fellow at the Institute for Education and Social Policy (IESP). He has been a research associate of the Economic Policy Institute in Washington, D.C., since 2004 and was selected as a resident visiting scholar at the Russell Sage Foundation in 2005-2006. In addition to being a member of the board of directors of the Association for Education Finance and Policy (formerly the American Education Finance Association), he serves on the editorial board of the journal Education Finance and Policy. His recent publications can be found in the Journal of Policy Analysis and Management, the Journal of Urban Economics, Education Finance and Policy, and the American Economic Review.

Corcoran's research focuses on three areas: human capital in the teaching profession, education finance, and school choice. His recent papers have examined long-run trends in the quality of teachers, the impact of income inequality and court-ordered school finance reform on the level and equity of education funding in the United States, and the political economy of school choice reforms. In 2009, he led the first evaluation of the Aspiring Principals Program in New York City, and he is currently working on a retrospective assessment of the Bloomberg-Klein reforms to school choice and competition in New York City for the American Institutes for Research. He co-edits a book series on alternative teacher compensation systems for the Economic Policy Institute, and in recent years he has been interested in value-added measures of teacher effectiveness, both in their statistical properties and in the obstacles to their practical implementation.
Introduction

Value-added measures of teacher effectiveness are the centerpiece of a national movement to evaluate, promote, compensate, and dismiss teachers based in part on their students' test results. Federal, state, and local policy-makers have adopted these methods en masse in recent years in an attempt to objectively quantify teaching effectiveness and promote and retain teachers with a demonstrated record of success.

Attention to the quality of the teaching force makes a great deal of sense. No other school resource is so directly and intensely focused on student learning, and research has found that teachers can and do vary widely in their effectiveness (e.g., Rivkin, Hanushek & Kain 2005; Nye, Konstantopoulos & Hedges 2004; Kane, Rockoff & Staiger 2008).1 Furthermore, teacher quality has been found to vary across schools in a way that systematically disadvantages poor, low-achieving, and racially isolated schools (e.g., Clotfelter, Ladd & Vigdor 2005; Lankford, Loeb & Wyckoff 2002; Boyd et al. 2008).

But questions remain as to whether value-added measures are a valid and appropriate tool for identifying and enhancing teacher effectiveness. In this report, I aim to provide an accessible introduction to these new measures of teaching quality and put them into the broader context of concerns over school quality and achievement gaps. Using New York City's Teacher Data Initiative and Houston's ASPIRE (Accelerating Student Progress, Increasing Results and Expectations)

1 Value-Added Measurement: Motivation and Context

Traditional measures of teacher quality have always been closely linked with those found in teacher pay schedules: years of experience, professional certification, and degree attainment. As recently as the 2001 No Child Left Behind Act (NCLB), teacher quality was commonly formalized as a set of minimum qualifications. Under NCLB, "highly qualified" teachers of core subjects were defined as those with at least a bachelor's degree, a state license, and demonstrated competency in the subject matter taught (e.g., through a relevant college major or master's degree).

However, these minimum qualifications have not been found by researchers to be strongly predictive of student outcomes on standardized tests (e.g., Goldhaber 2008; Hanushek & Rivkin 2006; Kane, Rockoff & Staiger 2008). Knowing that a teacher possesses a teaching certificate, a master's degree, or a relevant college major often tells us little about that teacher's likelihood of success in the classroom. There are many reasons not to totally dismiss

1 This literature is frequently misinterpreted as stating that teacher quality is more important for student achievement than any other factor, including family background. Statements such as "Studies show that teachers are the single most important factor determining students' success in school" have appeared in dozens of press releases and publications in recent years. For an example, see the May 4, 2010, statement from the U.S. House Committee on Education and Labor at <http://edlabor.house.gov/newsroom/2010/05/congress-needs-to-support-teac.shtml>. I know of no study that demonstrates this.
squarely on test-score growth. One of the program's major selection criteria, "Great Teachers and Leaders," contributes at least 70 of the 500 possible application points to the linking of teacher evaluation and student test performance. For example, in their applications, states will be judged by the extent to which they or their districts (U.S. Department of Education 2010):

• measure individual student growth;
• implement evaluation systems that use student growth as a significant factor in evaluating teachers and principals;
• include student growth in annual evaluations;
• use these evaluations to inform professional support, compensation, promotion, retention, tenure, and dismissal;
• link student growth to in-state teacher preparation and credentialing programs, for public reporting purposes and the expansion of effective programs;
• incorporate data on student growth into professional development, coaching, and planning.

Race to the Top and the "new view" of teacher effectiveness have stimulated a largely productive and long-overdue discussion among policy-makers, researchers, and the public over how to assess teacher quality and address, develop, and support under-performing teachers. And it is fair to say that there is little enthusiasm for the traditional model of teacher evaluation used in many public schools: infrequent classroom observations and a pro forma tenure process (Toch & Rothman 2008; Weisberg et al. 2009). But whether or not the shift to intensive use of value-added measures of effectiveness will improve our nation's system of teaching and learning remains to be seen. Indeed, there are good reasons to believe these measures may be counterproductive.
[Figure 1. Factors affecting average achievement in two classrooms: hypothetical decomposition. Bars compare average achievement in Mrs. Appleton's and Mr. Johnson's classrooms.]

4 There are many excellent and readable introductions to value-added methods. In writing this section, I benefited greatly from Braun (2005), Buddin et al. (2007), Koretz (2008), Rivkin (2007), Harris (2009), and Hill (2009).

5 This statement assumes something about the scale on which achievement is measured. I return to this point in sections 4 and 5.
that these teachers’ students began the year at achievement gains across teachers. In reality,
very different levels. The idea is illustrated in students are not randomly assigned to classes –
Figure 2: Mrs. Appleton’s students started out in many cases, quite purposefully so. Conse-
at a much different point than Mr. Johnson’s, quently, value-added methods use a statistical
but most of the factors determining these ini- model to answer the question: “How would
tial differences in achievement “net out” in the these students have fared if they had not had
average gain score. The remainder represents [Mrs. Appleton or Mr. Johnson] as a teacher?”
the impact of the fourth-grade teacher, as well
This is a difficult question that is taken up in
as other influences that may have produced
the next section. For now, it is useful to think
differential growth between the two tests.
of a teacher’s value-added as her students’ aver-
Figure 2 shows that Mrs. Appleton’s students age test-score gain, “properly adjusted” for
had an average gain of ten points, while Mr. other influences on achievement. The New
Johnson’s gained an average of four. Can we York City Teacher Data Initiative (TDI) and
now declare Mrs. Appleton the more effective the Houston ASPIRE programs are two promi-
math teacher? Do these gain scores represent nent value-added systems that have their own
these teachers’ value-added? Not necessarily. methods for “properly adjusting” student test
While we may have removed the effects of scores. These two programs and their methods
fixed differences between student populations, are described in the next section.
we need to be confident that we have
accounted for other factors that contributed to Figure 2
changes in test performance from third to Factors affecting year-to-year test score gains in two classrooms: hypothetical
decomposition
fourth grade. These factors are potentially
numerous: family events, school-level interven- 80
tions, the influence of past teachers on knowl- Teacher & other
70
edge of this year’s tested material, or a disrup- Community
Student scores, averages and points gained
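To make the idea of a "properly adjusted" gain concrete, here is a minimal two-step sketch in Python. The data are hypothetical, and operational models control for far more than a single prior score and also adjust for statistical noise; this is an illustration of the logic, not any district's actual method:

```python
import numpy as np
import pandas as pd

# Hypothetical student-level data: standardized current-year scores,
# prior-year scores, and teacher assignment.
df = pd.DataFrame({
    "score":   [0.30, -0.10, 0.50, -0.40, 0.10, 0.70, -0.20, 0.00],
    "prior":   [0.20, -0.30, 0.40, -0.50, 0.00, 0.50, -0.10, -0.20],
    "teacher": ["Appleton"] * 4 + ["Johnson"] * 4,
})

# Step 1: predict each student's score from prior achievement.
slope, intercept = np.polyfit(df["prior"], df["score"], 1)
df["predicted"] = intercept + slope * df["prior"]

# Step 2: a teacher's raw "value-added" is her students' average
# residual -- actual minus predicted achievement.
print((df["score"] - df["predicted"]).groupby(df["teacher"]).mean())
```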
Despite the official ban on using test scores for tenure decisions, the NYCDOE pressed forward with the TDI, with conditional support from Weingarten and the UFT. In accordance with the law, TDI information was explicitly not to be used for rewarding or dismissing teachers. As stated in a joint Klein/Weingarten letter to teachers in 2008, the TDI's value-added reports were to be used solely for professional development, to "help [teachers] pinpoint [their] own strengths and weaknesses, and . . . devise strategies to improve" (Klein & Weingarten 2008).

The TDI released its first complete set of Teacher Data Reports to more than 12,000 teachers in 2009. These reports consisted of separate analyses of English language arts (ELA) and mathematics test results and were generated for teachers who had taught these subjects in grades four to eight in the prior year. (A more detailed description of the report itself is provided later in this section.) A second year of reports was released in 2010, reflecting significant revisions made by the NYCDOE and its new contractor, the Wisconsin Value-Added Research Center.7

The NYCDOE has expressed several broad goals for the TDI program.8 First, its data reports are intended to provide measures of value-added that can be reported to principals and teachers in an accessible and usable form. The reports are to be viewed as "one lens" on teacher quality that should be "triangulated" with other information about classroom effectiveness to improve performance. The reports are seen as an "evolving tool" that will continue to be refined over time based on principal and teacher feedback. Second, it is hoped the reports will "stimulate conversation" about student achievement within schools and promote better instructional practices through professional development. Finally, the measures will help the district learn more about "what works" in the classroom. Value-added measures have already enabled a wide range of studies on teacher effectiveness in New York City (e.g., Boyd et al. 2009; Kane, Rockoff & Staiger 2008; Rockoff 2008), and the introduction of these measures into schools will enable additional research. In 2009, the NYCDOE and UFT signed on to participate in a historic, large-scale Gates Foundation study – the Measures of Effective Teaching, or "MET" project – that intends to benchmark value-added measures against alternative measures of teaching effectiveness and identify practices that are associated with high value-added (Medina 2009).9

New York's law banning the use of test scores for teacher evaluation would later complicate the state's application for Race to the Top funding. Guidelines for the federal grant program explicitly penalized states with such laws, and Mayor Bloomberg and members of the state legislature pushed for a reversal. Speaking in Washington in November 2009, Mayor

6 In another article appearing that month, then–Deputy Chancellor Chris Cerf stated that he was "unapologetic that test scores must be a central component of evaluation" (Keller 2008).

7 See <http://varc.wceruw.org/>.

8 This description is compiled from a phone interview with Amy McIntosh, Joanna Cannon, and Ann Forte of the NYCDOE (January 14, 2010), an October 2008 presentation by Deputy Chancellor Chris Cerf ("NYC Teacher Data Initiative"), and a September 2008 training presentation by Martha Madeira ("NYC Value-Added Data for Teachers Initiative").

9 For information on the MET project, see <http://metproject.org/project>.
experience (in Mr. Jones's case, ten years). That is, results are not reported in units of "achievement," but rather as percentile rankings.12 Thus, value-added, in practice, is a relative concept. Teachers are, in effect, graded on a curve – a feature that is not always obvious to most observers. A district with uniformly declining test scores will still have "high" and "low" value-added teachers; a district's logical aspiration to have exclusively "high value-added" teachers is a technical impossibility. The value-added percentile simply indicates where a teacher fell in the distribution of (adjusted) student test-score gains.

Value-added is reported both for last year's test results (in this case, 2008-2009) and for all prior years' test results for that teacher (in this example, the last four years). Mr. Jones's value-added places him in the 43rd percentile among eighth-grade math teachers last year; that is, 43 percent of teachers had lower value-added than he did (and 57 percent had higher value-added). His value-added based on the last four years of results places him in the 56th percentile. The percentiles are then mapped to one of five performance categories: "high" (above the 95th percentile), "above average" (75th to 95th), "average" (25th to 75th), "below average" (5th to 25th), and "low" (below 5th). Mr. Jones's percentile rankings would appear to place him squarely in the "average" performance category.

Another element of the Teacher Data Report worth noting is the reported range of percentiles associated with Mr. Jones's value-added ranking (the black line extending in two directions from his score). In statistical terminology, this range is referred to as a "confidence interval." It represents the level of uncertainty associated with the value-added percentile measure. As the report's instructions describe these ranges: "We can be 95 percent certain that this teacher's result is somewhere on this line, most likely towards the center." These ranges – or confidence intervals – are discussed more in Section 4. For now, note that Mr. Jones's range for the prior year's test extends from (roughly) the 15th percentile to the 71st. Based on his last four years, his range extends from the 32nd percentile to the 80th. His value-added percentiles – 43 and 56 – fall in the middle of these ranges.

On the second page of the data report (see <http://schools.nyc.gov>), value-added measures and percentiles are reported for several subgroups of students: initially high-, middle-, and low-achieving students (based on their prior year's math achievement); boys and girls; English language learners; and special education students. Mr. Jones performed at the "above average" level with his initially high-achieving students, but fell into the "average" category for all other subgroups. Ranges, or confidence intervals, are also reported for each of these subgroups.

How are these value-added percentiles calculated, exactly?13 Recall that a teacher's value-added can be thought of as her students' aver-

10 Press release PR-510-09, Office of the Mayor, November 25, 2009.

11 On Colorado and Tennessee, see "Colorado Approves Teacher Tenure Law," Education Week, May 21, 2010, and "Tennessee Lawmakers Approve Teacher Evaluation Plan," Memphis Commercial Appeal, January 15, 2010.

12 I address the reported "proficiency" scores (e.g., 3.27, 3.29) later in the report.

13 A useful and concise explanation is provided at the top of the Teacher Data Report itself. We benefited from a more technical explanation of the 2009 methodology in an internal 2009 technical report by the Battelle Memorial Institute. The model is similar to that estimated by Gordon, Kane, and Staiger (2006) and Kane, Rockoff, and Staiger (2008).
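Returning to the five performance categories above: the quoted cutoffs translate directly into a lookup rule. A minimal sketch (Python; how the report treats a teacher sitting exactly on a cutoff is an assumption, since the report does not say):

```python
def performance_category(percentile):
    """Map a value-added percentile (0-100) to the five categories
    used on the NYC Teacher Data Reports (boundary handling assumed)."""
    if percentile > 95:
        return "high"
    elif percentile >= 75:
        return "above average"
    elif percentile >= 25:
        return "average"
    elif percentile >= 5:
        return "below average"
    else:
        return "low"

# Mr. Jones's one-year (43rd) and four-year (56th) percentiles both
# land in the wide middle band:
print(performance_category(43), performance_category(56))
```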
for ten years, we could also average his students' value-added measures over all of those years.14

A few key features of this approach are worth highlighting. First, students' predicted scores are based on how other students with similar characteristics and past achievement performed – students who were taught by other teachers in the district. Thus, value-added is inherently relative: it tells us how teachers measure up when compared with other teachers in the district or state who are teaching similar students. Second, test scores are rarely of the vertical scale type suggested by the above example. That is, we can rarely say that a student like Melissa moved from a 35 to a 42 on the "math" scale as she progressed from third to fourth grade.

As a compromise, most value-added methods rescale test scores to have a mean of zero and a standard deviation (SD) of one.15 This new scale tells us where students are in the distribution of test scores in each grade. For example, Melissa may have moved from a -0.25 to a -0.20 on this scale, from 0.25 SDs below the average third-grader to 0.20 SDs below the average fourth-grader, a gain of 0.05 SD. Under certain assumptions, this scale is appropriate, but it is important to keep in mind that students, like teachers, are being measured on a curve – that is, relative to other tested students in the same grade.

Rather than reporting results on the admittedly less-than-transparent SD scale, the NYCDOE converts value-added results to a scale with which teachers are familiar: the state's "performance levels" (1 to 4). On Mark Jones's report, we see that his twenty-seven students were predicted to perform at an average of 3.29 in math, somewhere between proficient (3) and advanced (4).16 In practice, his class averaged 3.26 last year, for a value-added on this scale of -0.03. Over Mr. Jones's past four years, his value-added was +0.03. All of Mr. Jones's subgroup results are presented in the same way, with predicted and actual performance levels, value-added, and a percentile based on this value-added. Of course, the number of students used to estimate each subgroup value-added measure is smaller; for example, Mr. Jones taught eighteen initially high-achieving students and twenty-seven who initially scored in the middle one-third. He also taught more boys (forty-two) than girls (nineteen).

14 This concept is also known as the year-specific "teacher effect" or "classroom effect" for that teacher and year.

15 A standard deviation is a measure of variation in a distribution. Loosely, it can be thought of as the "average difference from the mean." For example, the average score on a test might be 70, with a standard deviation of 8. We can think of this as saying that, on average, students scored 8 points above or below the average of 70. (This is not technically correct, but it is a useful way of thinking about the standard deviation.) The SD depends on the scale of the original measure. Thus we often put measures on a common ("standardized") scale with a mean of zero and SD of one. In this example of a test with a mean of 70 and SD of 8, a student who scored an 82 would receive a score of 1.5 on the alternate scale (1.5 standard deviations above the mean).

16 New York City's "performance level" scale itself is somewhat puzzling. The performance levels of 1, 2, 3, and 4 are based on cut scores determined by the state at certain points in the test score distribution. They are ordinal categories that represent increasing levels of tested skill (below basic, basic, proficient, and advanced). They are not an interval measure, where the difference in skill between, say, a 2 and a 3 is equivalent to that between a 3 and a 4, nor were they intended to be used in such a way. The NYCDOE further converts its scores to "fractional units" on this scale. For example, a student who receives a raw score that places him or her between the cut scores of 2 ("basic") and 3 ("proficient") might be assigned a performance level of 2.42. Because the proficiency categories are ordinal, it isn't clear what a proficiency level of 2.42 means. It plausibly could be interpreted as being 42 percent of the way between basic and proficient, but it remains that a movement of 0.10 between each point on the scale (1 to 4) will represent different gains in achievement. In practice, this unusual system does not present a problem for the Teacher Data Reports, where the value-added measures are calculated using standardized scores and only converted to performance levels after the fact (source: internal Battelle Memorial Institute technical report, 2009).
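Footnote 15's example can be worked through directly. A minimal sketch of the rescaling (Python; the score vector is hypothetical, constructed to have a mean of 70 and an SD of 8):

```python
import numpy as np

# Hypothetical grade-wide scores with mean 70 and SD 8.
scores = np.array([62.0, 62.0, 62.0, 78.0, 78.0, 78.0])
mu, sd = scores.mean(), scores.std()
print(mu, sd)  # 70.0 8.0

# Standardizing puts any raw score on the mean-0, SD-1 scale:
# a student scoring 82 sits 1.5 SDs above the grade mean.
print((82 - mu) / sd)  # 1.5

# Melissa's "gain" on this scale is a change in relative position,
# not a gain in raw points:
z_grade3, z_grade4 = -0.25, -0.20
print(round(z_grade4 - z_grade3, 2))  # 0.05 SD
```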
subject. The teacher performance measure is a "value-added cumulative gain index" for a given teacher and subject. Finally, Strand III offers a mix of additional bonus opportunities, including a bonus for attendance.

Taken together, the three strands of ASPIRE amount to a maximum bonus that ranges from $6,600 to $10,300 for classroom teachers.17 Teachers of self-contained classes can receive awards in Strand II in as many as five subjects: reading, math, language arts, science, and social studies. A teacher scoring in the top quartile in all five subjects can receive a bonus as high as $7,000. According to the Houston Chronicle, almost 90 percent of eligible school employees received a bonus for 2008-2009, with classroom teachers earning an average of $3,606 and a maximum of $10,890 (Mellon 2010a).

The EVAAS model is considerably more complex and much less transparent than the model adopted by the NYCDOE.18 The model combines results on multiple tests – the Texas Assessment of Knowledge and Skills (TAKS) and the Stanford 10 Achievement Test (or the Aprenda, its Spanish-language equivalent) – and "layers" multiple years of test results to calculate teachers' cumulative value-added (McCaffrey et al. 2004). Like the New York City system, expected scores in each year are estimated for students in each subject and compared with their actual scores. However, unlike the New York City model, the predicted scores rely on a relatively sparse list of student background characteristics.

In February 2010, the HISD board of education voted to approve the use of value-added measures in teacher tenure decisions (Sawchuk 2010). In a letter to the school board, HISD superintendent Terry Grier stated that his intended use for the value-added measures was for them to be added to "the list of reasons that can be used in teacher dismissal." He expressed a willingness to create "a screening process for principals who propose that teachers gain term contract by requiring them to discuss the performance/effectiveness of all probationary teachers," adding, "this discussion will include the review of value-added. . . . If principals want to grant term contracts to teachers with regressive value-added scores, . . . they should be able to provide a compelling reason for doing so" (Mellon 2010b).

Although ASPIRE's value-added model differs markedly from that used by the Teacher Data Reports in New York City, the programs share the same core objective: differentiating teachers based on their contribution to student achievement and recognizing and rewarding effective teachers. Houston's decision to link its value-added measures explicitly to pay and tenure decisions was a precursor to similar decisions in New York City in recent months. In the next section, I provide an overview of the most significant challenges facing value-added measurement in practice, drawing upon data from HISD and the New York City Department of Education to illustrate these challenges.

17 ASPIRE currently receives funding from several sources: the Broad and Bill & Melinda Gates foundations ($4.5 million), a U.S. Department of Education Teacher Incentive Fund (TIF) grant ($11.7 million), and a Texas District Assessment of Teacher Effectiveness grant. The contract firm Battelle for Kids has provided professional development for the ASPIRE program since 2007.

18 EVAAS has been sharply criticized for its lack of transparency and inadequate controls for student background. See, for example, Amrein-Beardsley (2008).
standardized testing: "skills that a computer cannot replicate and that an educated worker in a low-wage country will have a hard time doing" (p. 22). These skills, Blinder argues, include "creativity, inventiveness, spontaneity, flexibility, [and] interpersonal relations . . . not rote memorization" (p. 22). Similarly, in their book calling for a broader conceptualization of school accountability, Rothstein, Jacobsen, and Wilder (2008) highlight the broad scope of skills that students develop in school, including "the ability to reason and think critically, an appreciation of the arts and literature, . . . social skills and a good work ethic, good citizenship, and habits leading to good physical and emotional health."

This is not to say that value-added measurement cannot aid in evaluating certain basic – and even critically important – skills. Rather, such measures are simply too narrow to be relied upon as a meaningful representation of the range of skills, knowledge, and habits we expect teachers and schools to cultivate in their students.

Teachers or schools?

Even in cases where tests do adequately capture desired skills, it behooves us to ask whether value-added – a teacher's individual impact on students' academic progress – is, in fact, what is educationally relevant. Teachers certainly vary in effectiveness, and school leaders should be cognizant of their teachers' contribution to student success. Yet to the extent schooling is a group or team effort involving principals, teachers, and other school professionals (e.g., instructional coaches, librarians, counselors), Herculean efforts to isolate, report, and reward individual value-added ignore critical, interrelated parts of the educational process. At worst, narrow interest in individual results may undermine this process, a point I return to later.

This concern is hardly unique to education. Statistics on narrow metrics of individual productivity have their place in many organizations, from business and government to professional sports. Yet in most cases business leaders and athletic coaches recognize that the success of their organization is much more than the sum of their individual employee or player statistics (Rothstein 2009). HISD and, to a lesser extent, the NYCDOE, with its small school-based performance bonus program (Springer & Winters 2009), have recognized that organizational outcomes are as important to recognize as individual successes. As value-added systems begin to be implemented in school systems nationwide, policy-makers should be aware of the potential educational costs of a narrow focus on individual metrics.
for a variety of valid reasons. This variation may be due to the average ability level in their classroom, priorities of school leadership, parental demands, and so on. Given two teachers of equal effectiveness, the teacher whose classroom instruction happens to be most closely aligned with the test – for whatever reason – will outperform the other in terms of value-added.

Evidence that the choice of test can make a difference to value-added comes from recent research comparing value-added measures on multiple tests of the same content area. Since 1998, Houston has administered two standardized tests every year: the state TAKS and the nationally normed Stanford Achievement Test. Using HISD data, we calculated separate value-added measures for fourth- and fifth-grade teachers for the two tests (Corcoran, Jennings & Beveridge 2010). These measures were based on the same students, tested in the same subject, at approximately the same time of year, using two different tests.

We found that a teacher's value-added can vary considerably depending on which test is used. This is illustrated in Figure 5, which shows how teachers ranked on the two reading tests. Teachers are grouped into five performance categories on each test (1 to 5), with the five TAKS categories on the horizontal axis.22 We see that teachers who had high value-added on one test tended to have high value-added on the other, but there were many inconsistencies. For example, among those who ranked in the top category (5) on the TAKS reading test, more than 17 percent ranked among the lowest two categories on the Stanford test. Similarly, more than 15 percent of the lowest value-added teachers on the TAKS were in the highest two categories on the Stanford.

[Figure 5. Teacher value-added on two reading tests: Houston fourth- and fifth-grade teachers. Vertical axis: percentage of each Stanford value-added quintile (Q1–Q5) within each TAKS quintile (0–50); horizontal axis: quintile of value-added on the TAKS reading test (Q1–Q5).]
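This kind of quintile cross-tabulation is straightforward to construct. A minimal sketch (Python with pandas; the number of teachers and the 0.6 correlation between the two sets of estimates are hypothetical, chosen only to produce the sort of imperfect agreement described above):

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)

# Hypothetical value-added estimates for 500 teachers on two tests
# of the same subject, correlated at roughly 0.6.
n, r = 500, 0.6
taks = rng.standard_normal(n)
stanford = r * taks + np.sqrt(1 - r**2) * rng.standard_normal(n)

def quintile(x):
    # Split estimates into five equal-sized groups, labeled 1-5.
    return pd.qcut(x, 5, labels=[1, 2, 3, 4, 5])

# Row-normalized cross-tab: of teachers in each TAKS quintile, what
# share landed in each Stanford quintile?
table = pd.crosstab(quintile(taks), quintile(stanford), normalize="index")
print(table.round(2))
```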
out, rightly, that teacher effectiveness varies across schools within a district and that to focus only on variation within schools would ignore important variation in teacher quality across schools (e.g., Gordon, Kane & Staiger 2006). The cost of this view, however, is that teacher effects end up confounded with school influences.23

Recent research suggests that school-level factors can and do affect teachers' value-added. Jackson and Bruegmann (2009), for example, found in a study of North Carolina teachers that students perform better, on average, when their teachers have more effective colleagues. That is, Mrs. Appleton might have higher value-added when teaching next door to Mr. Johnson, because she benefits from his example, his mentoring, and his support. Other studies have found effects of principal leadership on student outcomes (Clark, Martorell & Rockoff 2009). Consequently, teachers rewarded or punished for their value-added may, in part, be rewarded or punished based on the teachers with whom they work.24 This possibility certainly runs counter to the intended goal of value-added assessment.

Finally, as argued earlier, in many contexts, attempts to attribute achievement gains to individual teachers may not make sense in principle. This is most true in middle and high school, when students receive instruction from multiple teachers. To assume that none of these teachers' effects "spill over" into other coursework seems a strong – and unrealistic – assumption. Indeed, Koedel (2009) found that reading achievement in high school is influenced by both English and math teachers. Learning may simply not occur in the rigid way assumed by current value-added models.

Who counts?

Another significant limitation of value-added systems in practice is that they ignore a very large share of the educational enterprise. Only a minority of teachers teach tested subjects; moreover, not all students are tested, and not all tested students contribute to value-added measures. In other words, from the standpoint of value-added assessment of teacher quality, these students do not count.25

In most states, including New York and Texas, students are tested in reading and mathematics annually in grades three to eight, and again in high school. Other subjects, including science and social studies, are tested much less often.26 Because value-added requires a recent, prior measure of achievement in the same subject (usually last year's test score), only teachers of reading and math in grades four to eight can be assessed using value-added. Without annual tests, teachers cannot be assessed in other sub-

23 Technically, the value-added model often does not include "school effects."

24 In another study, Rothstein (2010) finds that a student's fifth-grade teacher has large effects on her fourth-grade achievement, a technical impossibility given that the student has not yet advanced to the fifth grade. He suggests that this finding may be due to "dynamic tracking," where a student's assignment to a fifth-grade teacher depends on her fourth-grade experience. When such assignment occurs, it biases measures of value-added.

25 This is not a concern unique to teacher value-added measurement. The same issue arises when considering the design and effects of school accountability systems. When state testing systems (appropriately) allow exclusions for certain categories of students, incentives are created for schools to reclassify students such that they are exempted from the tests (see Figlio & Getzler 2002; Jacob 2005; and Jennings & Beveridge 2009).

26 In New York, students are tested in social studies in fifth and eighth grade, and science in fourth and eighth grade. See <www.emsc.nysed.gov/osa/schedules/2011/3-8schedule1011-021010.pdf>.
[Figure 6. Percent of students with a test score and percent contributing to value-added estimates, grades four to six, Houston, 1998–2006. Vertical axis: percentage (0–100); student groups: all students, economically disadvantaged, Black, Hispanic, recent immigrants, ESL; bars: percent with a test score, and percent with a test score and a lag score. Source: author's calculations using data from the Houston Independent School District. The percent of students with a test score and a lag score is calculated only for students in grades four to six (third grade is the first year of testing).]
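The "lag score" requirement behind Figure 6 is, mechanically, a merge of each student's current and prior-year records. A minimal sketch (Python with pandas; the records are hypothetical):

```python
import pandas as pd

# Hypothetical test-score records: one row per student per year.
scores = pd.DataFrame({
    "student": [1, 1, 2, 3, 3, 4],
    "year":    [2005, 2006, 2006, 2005, 2006, 2006],
    "score":   [0.1, 0.4, -0.2, 0.0, 0.3, -0.5],
})

# A student contributes to a teacher's value-added in 2006 only if
# both a current score and a prior-year (lag) score are present.
current = scores[scores["year"] == 2006]
lagged = scores.loc[scores["year"] == 2005, ["student"]]
contributing = current.merge(lagged, on="student")

print(len(current), "tested;", len(contributing), "contribute to value-added")
```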
Because of high rates of student mobility in this population (in addition to test exemption and absenteeism), the percentage of students who have both a current and prior-year test score – a prerequisite for value-added – is even lower (see Figure 6). Among all grade four to six students in HISD, only 66 percent had both of these scores, a fraction that falls to 62 percent for Black students, 47 percent for ESL students, and 41 percent for recent immigrants.27

The issue of missing data is more than a technical nuisance. To the extent that districts reward or punish teachers on the basis of value-added, they risk ignoring teachers' efforts with a substantial share of their students. Moreover, they provide little incentive for teachers to invest in students who will not count toward their value-added. Unfortunately, districts like New York City and Houston have very large numbers of highly mobile, routinely exempted, and frequently absent students. Moreover, these students are unevenly distributed across schools and classrooms. Teachers serving these students in disproportionate numbers are most likely to be affected by a value-added system that – by necessity – ignores many of their students.

Are value-added scores precise enough to be useful?

As described in sections 2 and 3, value-added is based on a statistical model that effectively compares actual with predicted achievement. The residual gains serve as an estimate of the teacher's value-added. Like all statistical estimates, however, value-added has some level of uncertainty, or margin of error. In New York City's Teacher Data Reports, this uncertainty is expressed visually by a range of possible percentiles for each teacher's performance (the "confidence interval" for the value-added score). Some uncertainty is inevitable in value-added measurement, but for practical purposes it is worth asking: Are value-added measures precise enough to be useful in high-stakes decision-making or for professional development?

Let's return to the case of Mark Jones, whose data report is shown in Appendix B. Based on last year's test results, we learned that Mr. Jones ranked at the 43rd percentile among eighth-grade teachers in math. Taking into account uncertainty in this estimate, however, his range of plausible rankings extends from the 15th to the 71st percentile. Although the 43rd percentile is our best estimate of Mr. Jones's performance, we can't formally rule out estimates ranging from 15 to 71. Using the NYCDOE performance categories, we can conclude that Mr. Jones is a "below average" teacher, an "average" teacher, or perhaps a borderline "above average" teacher.

What is the source of this uncertainty, exactly? Recall that value-added measures are estimates of a teacher's contribution to student test-score gains. The more certain we can be that gains are attributable to a specific teacher, the more precise our estimates will be (and the more

27 The latter is calculated only for students in grades four to six. Because third grade is the first year of testing, none of these students have a prior-year score.
right and should be recognized; teachers persistently in the bottom 5 percent deserve immediate scrutiny. Still, it seems a great deal of effort has been expended to identify a small fraction of teachers. In the end, a tool designed for differentiating teacher effectiveness has done very little of the sort.

To get a better sense of the average level of uncertainty in the Teacher Data Reports, I examined the full set of value-added estimates reported to more than 12,700 teachers on the NYCDOE 2008-2009 reports. As we saw for Mark Jones in Appendix B, each value-added ranking is accompanied by a range of possible estimates. To begin, I simply calculated the width of this interval for every teacher in reading and math. Average widths across teachers were substantial. The widest confidence intervals are found in the Bronx – whose schools serve many disadvantaged students – at 37 percentile points in math and 47 points in ELA (both based on up to three years of data; see Figure 7 on page 23). The most precise estimates, in contrast, are observed in relatively more advantaged Staten Island.

[Figure 7. Average confidence interval width, New York City Teacher Data Reports, 2008-2009. Horizontal axis: percentage points (0–70); bars for ELA and math: citywide (2007-2008), Bronx (three years), Manhattan (three years), Brooklyn (three years), Queens (three years), citywide (three years), Staten Island (three years), and citywide (teachers with three years of data).]

Another way of understanding the effects of uncertainty is to compare two teachers' ranges of value-added estimates and ask whether or not they overlap. For example, suppose that based on her value-added, Mrs. Appleton ranks in the 41st percentile of ELA teachers, with a confidence interval ranging from 24 to 58 (on the low end of the widths presented in Figure 7). And suppose Mr. Johnson ranks in the 51st percentile of ELA teachers, with an equally wide confidence interval from 34 to 68. Based on their "most likely" rankings, Mr. Johnson appears to have out-performed Mrs. Appleton. However, because we can't statistically rule out estimates in their overlapping intervals, we can't say with confidence that this is the case.

Using the 2008-2009 Teacher Data Report estimates, I compared all possible pairs of teacher percentile ranges in the city, within the same grade and subject, to see how many teachers could be statistically distinguished from one another.30 For example, if Mrs.
Appleton’s range of 24 to 58 overlaps with 56 simply be ignored; they represent the extent of
percent of all other fourth-grade teachers, we statistical precision with which the value-added
could not rule out the possibility that she was estimate was calculated. Confidence intervals
equally effective as these 56 percent of teach- such as these are reported in any academic
ers. The results are summarized in Figure 8, study that relies on inferential statistics, and
which shows the fraction of teachers who over- any academic study that attempted to ignore
lap with X percent of all other teachers. For these intervals in drawing between-group com-
example, the bar above 60 shows the fraction parisons would in most cases be rejected out-
of teachers who cannot be statistically distin- right.
guished from 60 percent of all other teachers
in the district.
Given the level of uncertainty reported in the
data reports, half of teachers in grades three to
eight who taught math have wide enough per-
formance ranges that they cannot be statisti-
cally distinguished from 60 percent or more of
all other teachers of math in the same grade.
One in four teachers cannot be distinguished
from 72 percent or more of all teachers. These
comparisons are even starker for ELA, as seen
in Figure 8. In this case, three out of four
teachers cannot be statistically distinguished
from 63 percent or more of all other teachers.
Only a tiny proportion of teachers – about 5
percent in math and less than 3 percent in
ELA – received precise enough percentile
ranges to be distinguished from 20 percent or
fewer other teachers.
As noted before, it is true that teachers’ per-
centile ranking is their “best” or “most likely”
estimate. But the ranges reported here cannot
30
In other words, I compared every teacher’s percentile range with
every other teacher’s percentile range. For example, if there are
1,000 teachers in the city, teacher 1’s percentile range is com-
pared to teachers 2 through 1,000; teacher 2’s percentile range is
compared with teachers 1 and 3 through 1,000, and so on, for
all possible pairs.
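Footnote 30's pairwise comparison is easy to sketch. A minimal example (Python; the list of percentile ranges is hypothetical, with Mrs. Appleton's and Mr. Johnson's ranges taken from the example above):

```python
def overlaps(a, b):
    """True if two (low, high) percentile ranges overlap -- i.e.,
    the two teachers cannot be statistically distinguished."""
    return a[0] <= b[1] and b[0] <= a[1]

# Mrs. Appleton (24-58) and Mr. Johnson (34-68): inconclusive.
print(overlaps((24, 58), (34, 68)))  # True

# Share of all pairwise comparisons that are inconclusive.
ranges = [(24, 58), (34, 68), (5, 25), (80, 96)]
pairs = [(i, j) for i in range(len(ranges)) for j in range(i + 1, len(ranges))]
share = sum(overlaps(ranges[i], ranges[j]) for i, j in pairs) / len(pairs)
print(round(share, 2))  # 0.33
```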
[Figure 9. Year-to-year stability in value-added rankings: HISD reading test, 2000–2006. Vertical axis: percentage of this year's quintiles falling in each of last year's quintiles (0–40); horizontal axis: last year's quintile (Q1–Q5); bars: this year's Q1 through Q5.]
of estimates. But this estimate is one of only a few made available to teachers on their annual report, and thus they are hard to ignore. Inexperi-

[Figure 10. Year-to-year stability in ELA and math value-added rankings, New York City Teacher Data Reports, 2007-2008. Panels: ELA and math.]
References

Alexander, Karl L., Doris R. Entwisle, and Linda S. Olsen. 2001. "Schools, Achievement, and Inequality: A Seasonal Perspective," Educational Evaluation and Policy Analysis 23:171–191.

Amrein-Beardsley, Audrey. 2008. "Methodological Concerns about the Education Value-Added Assessment System," Educational Researcher 37:65–75.

ASPIRE. 2010. "2008-2009 ASPIRE Award Program Highlights," <portal.battelleforkids.org/ASPIRE/Recognize/ASPIRE_Award/2009_aspire_highlights.html>.

Associated Press. 2007. "School District Asks Teachers to Return Pay," New York Times (March 11).

Blinder, Alan S. 2009. "Education for the Third Industrial Revolution." In Creating a New Teaching Profession, edited by Dan Goldhaber and Jane Hannaway, pp. 3–14. Washington, DC: Urban Institute Press.

Blumenthal, Ralph. 2006. "Houston Ties Teacher Pay to Student Test Scores," New York Times (January 13).

Boyd, Donald, Pamela Grossman, Hamilton Lankford, Susanna Loeb, and James Wyckoff. 2006. "How Changes in Entry Requirements Alter the Teacher Workforce and Affect Student Achievement," Education Finance and Policy 1:176–216.

Boyd, Donald J., Pamela L. Grossman, Hamilton Lankford, Susanna Loeb, and James Wyckoff. 2009. "Teacher Preparation and Student Achievement," Educational Evaluation and Policy Analysis 31:416–440.

Boyd, Donald J., Hamilton Lankford, Susanna Loeb, Jonah Rockoff, and James Wyckoff. 2008. "The Narrowing Gap in New York City Teacher Qualifications and Its Implications for Student Achievement in High-Poverty Schools," Journal of Policy Analysis and Management 27:793–818.

Braun, Henry I. 2005. Using Student Progress to Evaluate Teachers: A Primer on Value-Added Models. Policy Information Perspective. Princeton, NJ: Educational Testing Service.

Brosbe, Ruben. 2010. "My Disappointing Data and What to Do With It," Gotham Schools Classroom Tales Blog (March 10), <gothamschools.org/2010/03/10/my-disappointing-data-and-what-to-do-with-it/>.

Buddin, Richard, Daniel F. McCaffrey, Sheila Nataraj Kirby, and Nailing Xia. 2007. "Merit Pay for Florida Teachers: Design and Implementation Issues." RAND Education Working Paper WR-508-FEA. Santa Monica, CA: RAND.

Center for Educator Compensation Reform. 2008. Performance Pay in Houston. Washington, DC: U.S. Department of Education. Downloadable PDF at <www.cecr.ed.gov/guides/summaries/HoustonCaseSummary.pdf>.

Clark, Damon, Paco Martorell, and Jonah Rockoff. 2009. "School Principals and School Performance," CALDER Working Paper No. 38. Washington, DC: Urban Institute.

Clotfelter, Charles T., Helen F. Ladd, and Jacob Vigdor. 2005. "Who Teaches Whom? Race and the Distribution of Novice Teachers," Economics of Education Review 24:377–392.

Kane, Thomas J., Jonah E. Rockoff, and Douglas O. Staiger. 2008. "What Does Certification Tell Us About Teacher Effectiveness? Evidence from New York City," Economics of Education Review 27:615–631.

Keller, Bess. 2008. "Drive On to Improve Evaluation Systems for Teachers," Education Week (January 15).

Klein, Joel, and Randi Weingarten. 2008. Joint letter. Principals' Weekly (October 1).

Koedel, Cory. 2009. "An Empirical Analysis of Teacher Spillover Effects in Secondary School," Economics of Education Review 28:682–692.

Koretz, Daniel. 2008. "A Measured Approach," American Educator (Fall), 18–27, 39.

Lankford, Hamilton, Susanna Loeb, and James Wyckoff. 2002. "Teacher Sorting and the Plight of Urban Schools: A Descriptive Analysis," Educational Evaluation and Policy Analysis 24:37–62.

Martinez, Barbara. 2010. "School Tenure Crackdown," Wall Street Journal (July 30).

McCaffrey, Daniel F., Daniel M. Koretz, J. R. Lockwood, and Laura S. Hamilton. 2004. Evaluating Value-Added Models for Teacher Accountability. Santa Monica, CA: RAND.

McCaffrey, Daniel F., Tim R. Sass, J. R. Lockwood, and Kata Mihaly. 2009. "The Intertemporal Variability of Teacher Effect Estimates," Education Finance and Policy 4:572–606.

McNeil, Linda, and Angela Valenzuela. 2000. The Harmful Impact of the TAAS System of Testing in Texas: Beneath the Accountability Rhetoric. Educational Resources Information Center Report ED 443-872. Washington, DC: U.S. Department of Education.

Medina, Jennifer. 2008a. "Bill Would Bar Linking Class Test Scores to Tenure," New York Times (March 18).

Medina, Jennifer. 2008b. "New York Measuring Teachers by Test Scores," New York Times (January 21).

Medina, Jennifer. 2009. "A Two-Year Study to Learn What Makes Teachers Good," New York Times City Room Blog (September 1), <http://cityroom.blogs.nytimes.com/2009/09/01/a-2-year-study-to-learn-what-makes-teachers-good/>.

Medina, Jennifer. 2010. "Agreement Will Alter Teacher Evaluations," New York Times (May 10).

Mellon, Ericka. 2010a. "HISD to Pay Out More Than $40 Million in Bonuses," Houston Chronicle (January 27).

Mellon, Ericka. 2010b. "HISD Spells Out Teacher Dismissal Process, Part II," Houston Chronicle School Zone Blog (February 8), <http://blogs.chron.com/schoolzone/2010/02/hisd_spells_out_teacher_dismis.html>.

Murnane, Richard J., and David K. Cohen. 1986. "Merit Pay and the Evaluation Problem: Why Most Merit Pay Plans Fail and a Few Survive," Harvard Educational Review 56:1–17.

Nye, Barbara, Spyros Konstantopoulos, and Larry V. Hedges. 2004. "How Large Are Teacher Effects?" Educational Evaluation and Policy Analysis 26:237–257.

…ment." In Grading Teachers, Grading Schools: Is Student Achievement a Valid Evaluation Measure? edited by Jason Millman. Thousand Oaks, CA: Corwin Press.

Sass, Tim R. 2008. The Stability of Value-Added Measures of Teacher Quality and Implications for Teacher Compensation Policy. CALDER Policy Brief #4. Washington, DC: National Center for Analysis of Longitudinal Data in Education Research.

Sawchuk, Stephen. 2010. "Houston Approves Use of Test Scores in Teacher Dismissals," Education Week Teacher Beat Blog (February 12), <http://blogs.edweek.org/edweek/teacherbeat/2010/02/houston_approves_use_of_test_s.html>.

Shepard, Lorrie A. 1988. "The Harm of Measurement-Driven Instruction." Paper presented at the annual meeting of the American Educational Research Association, Washington, DC.

Shepard, Lorrie A., and Katherine Dougherty. 1991. "Effects of High-Stakes Testing on Instruction." Paper presented at the annual meetings of the American Educational Research Association and the National Council of Measurement in Education, Chicago, IL.

Springer, Matthew G., and Marcus A. Winters. 2009. The NYC Teacher Pay-for-Performance Program: Early Evidence from a Randomized Trial. Civic Report No. 56. New York: Manhattan Institute.

Toch, Thomas, and Robert Rothman. 2008. Rush to Judgment: Teacher Evaluation in Public Education. Washington, DC: Education Sector.

U.S. Department of Education. 2010. "Overview Information: Race to the Top Fund; Notice Inviting Applications for New Awards for Fiscal Year (FY) 2010," Federal Register 75, no. 71 (April 14), Part III, pp. 19,499–19,500. Washington, DC: U.S. GPO. Downloadable PDF at <www2.ed.gov/legislation/FedRegister/announcements/2010-2/041410a.pdf>.

Weisberg, Daniel, Susan Sexton, Jennifer Mulhern, and David Keeling. 2009. The Widget Effect: Our National Failure to Acknowledge and Act on Differences in Teacher Effectiveness. Brooklyn, NY: New Teacher Project.
Appendix B: Sample New York City Teacher Data Report, 2010
Providence
Brown University
Box 1985
Providence, RI 02912
T 401.863.7990
F 401.863.1290
New York
233 Broadway, Suite 720
New York, NY 10279
T 212.328.9290
F 212.964.1057
www.annenberginstitute.org