(Robert J. Marzano) Classroom Assessment Grading (B-Ok - CC)

Assessments That Encourage Learning
Of the four principles of effective classroom assessment discussed in Chapter 1, the second
principle—that it should encourage students to improve—is probably the most challenging to
implement. As we saw in Chapter 1, feedback can have varying effects on student learning. If done the
wrong way, it can discourage learning. Figure 1.2 in Chapter 1 illustrates that simply telling students
their answers are right or wrong has a negative influence on student learning. The positive effects of
feedback are not automatic. This chapter presents three techniques that encourage learning.
Tracking Students' Progress
One of the most powerful and straightforward ways a teacher can provide feedback that encourages
learning is to have students keep track of their own progress on topics. An easy way to do this is to
provide students with a form like that shown in Figure 5.1 for each topic or selected topics addressed
during a grading period. Each column in the line chart represents a different assessment for the topic
probability. The first column represents the student's score on the first assessment, the second column
represents the score on the second assessment, and so on. This technique provides students with a visual
representation of their progress. It also provides a vehicle for students to establish their own learning
goals and to define success in terms of their own learning as opposed to their standing relative to other
students in the class. As discussed in Chapter 1, motivational psychologists such as Martin Covington
(1992) believe that this simple change in perspective can help motivate students. In the parlance of
motivational psychologists, allowing students to see their “knowledge gain” throughout a grading period
elicits “intrinsic” motivation.
FIGURE 5.1 Student Progress Chart
Keeping Track of My Learning
Name: IH
Measurement Topic: Probability
My score at the beginning: 1.5
My goal is to be at 3 by Nov. 30
Specific things I am going to do to improve: Work 15 min. three times a week
[Chart: vertical axis shows scores from 0 to 4; horizontal axis shows assessments a through j.]
a. Oct. 5   b. Oct. 12   c. Oct. 20   d. Oct. 30   e. Nov. 12   f. Nov. 26   (g through j blank)
Figure 5.2 illustrates how a teacher might track the progress of her four language arts classes. This
chart is different in that it represents the percentage of students above a specific score point or
“performance standard” for the measurement topic effective paragraphs. Chapter 3 addressed the
concept of a performance standard. Briefly, it is the score on the scale (in this case the complete
nine-point scale) that is the desired level of performance or understanding for all students. In
Figure 5.2, 50 percent of the students in Ms. Braun's class were at or above the performance
standard on November 2, as they were for the next two assessments. However, by December 15, 70
percent of her students were at the performance standard or above.
FIGURE 5.2 Class Chart
Recording Student Achievement: Classroom
Teacher Name: Ms. Braun
Measurement Topic: Effective Paragraphs
Class/Subject: Lang. Arts
Grading Period/Time Span: Quarter 2
Total # of Students Represented on This Graph: 110
[Chart: vertical axis shows % Proficient or Above from 0 to 100; horizontal axis shows assessments 1 through 10.]
1. 11-2 Holiday Paragraph
2. 11-15 New Year Paragraph
3. 12-5 Science Paragraph
4. 12-15 Hobby
5. 1-6 Book Report
(6 through 10 blank)
This type of aggregated data can provide teachers and administrators with a snapshot of the progress
of entire grade levels or an entire school. Individual teachers or teams of teachers can use such aggregated
data to identify future instructional emphases. If the aggregated data indicate that an insufficient
percentage of students in a particular grade level are at or above the designated performance standard, then
the teachers at that grade level might mount a joint effort to enhance student progress for the measurement
topic.
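The class-level value plotted in a chart like Figure 5.2 is a simple percentage. A minimal Python sketch, assuming a hypothetical list of student scores and a hypothetical performance standard of 3.0 (neither value comes from Ms. Braun's data):

```python
def percent_at_or_above(scores, standard):
    """Return the percentage of scores that meet or exceed the standard."""
    if not scores:
        raise ValueError("scores must not be empty")
    meeting = sum(1 for s in scores if s >= standard)
    return 100.0 * meeting / len(scores)


# Hypothetical scores for ten students on one assessment (illustration only).
scores = [1.5, 2.0, 2.5, 3.0, 3.0, 3.5, 2.0, 3.0, 4.0, 2.5]
print(percent_at_or_above(scores, standard=3.0))
```

Running the same calculation for each assessment date would give the series of points that a teacher plots on the class chart.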
Encouraging Self-Reflection
Another way to encourage student learning is to ensure that students have an opportunity to reflect on
their learning using information derived from classroom assessments. There are at least two ways to do
this.
The first way to encourage self-reflection is to allow students to engage in self-assessment. Student
self-assessment is mentioned quite frequently in the literature on classroom assessment (see Stiggins,
Arter, Chappuis, & Chappuis, 2004), and a growing body of evidence supports its positive influence on
student learning (Andrade & Boulay, 2003; Butler & Winne, 1995; Ross, Hogaboam-Gray, & Rolheiser,
2002). In the context of this book, self-assessment refers to students assigning their own scores for each
assessment. For example, reconsider Figure 5.1, in which a student recorded the scores his teacher had
given him for a series of classroom assessments. For each of these assessments, students could be invited
to assign their own scores.
To facilitate self-assessment, the teacher can provide students with a simplified version of the
scoring scale. Figure 5.3 presents student versions of the simplified five-point and complete nine-point
scales.
One of the primary uses of student self-assessment is to provide a point of contrast with the teacher's
assessment. Specifically, the teacher would compare the scores she gave to students on a particular
assessment with the scores they gave themselves. Discrepancies provide an opportunity for teacher and
students to interact. If a student scored himself higher than the teacher did, the teacher would point out
areas that need improvement before the student actually attains the score representing his perceived status.
If the student scored himself lower than the teacher did, the teacher would point out areas of strength the
student might not be aware of.
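The comparison described above can be sketched in a few lines of Python. The student names and scores are hypothetical, and the three follow-up labels simply paraphrase the paragraph above:

```python
def compare_scores(teacher_scores, self_scores):
    """Map each student to a follow-up based on self score versus teacher score."""
    follow_up = {}
    for name, teacher in teacher_scores.items():
        own = self_scores[name]
        if own > teacher:
            follow_up[name] = "discuss areas that need improvement"
        elif own < teacher:
            follow_up[name] = "point out strengths the student may not see"
        else:
            follow_up[name] = "scores agree"
    return follow_up


# Hypothetical scores on one assessment (illustration only).
teacher_scores = {"Student A": 2.0, "Student B": 3.0, "Student C": 2.5}
self_scores = {"Student A": 3.0, "Student B": 2.0, "Student C": 2.5}
print(compare_scores(teacher_scores, self_scores))
```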
A second way to stimulate self-reflection is to have students articulate their perceptions regarding
their learning. K. Patricia Cross (1998) has developed a number of techniques to this end. For example,
she offers the “minute paper” as a vehicle for self-reflection:
Shortly before the end of a class period, the instructor asks students to write brief answers to these two questions: What is the
most important thing that you learned in class today? and What is the main unanswered question you leave class with today? (p. 6)
A variation of the minute paper is the “muddiest point.” Here students simply describe what they are
most confused about in class. The teacher reads each student's muddiest point and uses the information to
plan further instruction and organize students into groups.
FIGURE 5.3 Student Versions of Scoring Scales
Simplified Scale
4.0 I know (can do) it well enough to make connections that weren't taught.
3.0 I know (can do) everything that was taught without making mistakes.
2.0 I know (can do) all the easy parts, but I don't know (can't do) the harder parts.
1.0 With help, I know (can do) some of what was taught.
0.0 I don't know (can't do) any of it.
Complete Scale
4.0 I know (can do) it well enough to make connections that weren't taught, and I'm right about those connections.
3.5 I know (can do) it well enough to make connections that weren't taught, but I'm not always right about those connections.
3.0 I know (can do) everything that was taught (the easy parts and the harder parts) without making mistakes.
2.5 I know (can do) all the easy parts and some (but not all) of the harder parts.
2.0 I know (can do) all the easy parts, but I don't know (can't do) the harder parts.
1.5 I know (can do) some of the easier parts, but I make some mistakes.
1.0 With help, I know (can do) some of the harder parts and some of the easier parts.
0.5 With help, I know (can do) some of the easier parts but not the harder parts.
0.0 I don't know (can't do) any of it.
The student scales shown in Figure 5.3 can be used to help identify the muddiest point. To illustrate,
consider the score of 2.0 on the simplified scale and the complete scale. Students who assign themselves
this score are acknowledging that they are confused about some of the content. If students also were asked
to describe what they find confusing, they would be identifying the muddiest points. For Cross (1998), the
most sophisticated form of reflection is the “diagnostic learning log,” which involves students responding
to four questions:
1. Briefly describe the assignment you just completed. What do you think was the purpose of this assignment?
2. Give an example of one or two of your most successful responses. Explain what you did that made them successful.
3. Provide an example of where you made an error or where your responses were less complete. Why were these items
incorrect or less successful?
4. What can you do different when preparing next week's assignment? (p. 9)
Cross recommends that the teacher tabulate these responses, looking for patterns that will form the basis
for planning future interactions with the whole class, groups of students, and individuals.
These examples illustrate the basic nature of self-reflection—namely, students commenting on their
involvement and understanding of classroom tasks. Such behavior is what Deborah Butler and Philip
Winne (1995) refer to as “self-regulated learning.”
Focusing on Learning at the End of the Grading Period
The ultimate goal of assessing students on measurement topics is to estimate their learning at the end of
the grading period. To illustrate, consider Figure 5.4, which shows one student's scores on five
assessments over a nine-week period on the measurement topic probability. The student obtained a score
of 1.0 on each of the first two assessments, 2.5 on the third, and so on. At the end of the grading period,
the teacher will compute a final score that represents the student's performance on this topic. To do this,
a common approach is to average the scores. In fact, one might say that K–12 education has a “bias” in
favor of averaging. Many textbooks on classroom assessment explicitly or implicitly recommend
averaging (see Airasian, 1994; Haladyna, 1999). As we shall see in the next chapter, in some situations
computing an average makes sense. However, those situations generally do not apply to students'
formative assessment scores over a period of time. Figure 5.5 helps to illustrate why this is so. As before,
the bars represent the student's scores on each of the five assessments. The average—in this case 2.0—has
been added, represented by the dashed line. To understand the implication of using the average of 2.0 as
the final score for a student, recall the discussion in Chapter 3 about the concept of true score. Every score
that a student receives on every assessment is made up of two parts—the true score and the error score.
Ideally, the score a student receives on an assessment (referred to as the observed score) consists mostly
of the student's true score. However, the error part of a student's score can dramatically alter the observed
score. For example, a student might receive a score of 2.5 on an assessment but really deserve a
3.0. The 0.5 error is due to the fact that the student misread or misunderstood some items on the
assessment. Conversely, a student might receive a score of 2.5 but really deserve a 2.0 because she
guessed correctly about some items.
FIGURE 5.4 Bar Graph of Scores for One Student on One Topic over Time
[Bar graph: one bar for each of Score 1 through Score 5.]
FIGURE 5.5 Bar Graph of Scores with Line for Average
[Bar graph: one bar for each of Score 1 through Score 5, with a dashed line marking Average Score = 2.0.]
The final score a student receives for a given measurement topic is best thought of as a final estimate
of the student's true score for the topic. Returning to Figure 5.5, if we use the student's average score as an
estimate of her true score at the end of a grading period, we would have to conclude that her true score is
2.0. This implies that the student has mastered the simple details and processes but has virtually no
knowledge of the more complex ideas and processes. However, this interpretation makes little sense
when we carefully examine all the scores over the grading period. In the first two assessments, the
student's responses indicate that without help she could do little. However, from the third assessment on,
the student never dropped below a score of 2.0, indicating that the simpler details and processes were not
problematic. In fact, on the third assessment the student demonstrated partial knowledge of the complex
information and processes, and on the fifth assessment the student demonstrated partial ability to go
beyond what was addressed in class. Clearly in this instance the average of 2.0 does not represent the
student's true score on the topic at the end of the grading period.
The main problem with averaging students' scores on formative assessments is that averaging
assumes that no learning has occurred from assessment to assessment. This concept is inherent in classical
test theory. Indeed, measurement theorists frequently define true score in terms of averaging test scores
for a specific student. To illustrate, Frederic Lord (1959), architect of much of the initial thinking
regarding classical test theory and item response theory, explains that the true score is “frequently defined
as the average of the scores that the examinee would make on all possible parallel tests if he did not
change during the testing process [emphasis added]” (p. 473). In this context, parallel tests can be thought
of as those for which a student might have different observed scores but identical true scores.
Consequently, when a teacher averages test scores for a given student, she is making the tacit
assumption that the true score for the student is the same on each test. Another way of saying this is that
use of the average assumes the differences in observed scores from assessment to assessment are simply a
consequence of “random error,” and the act of averaging will “cancel out” the random error from test to
test (Magnusson, 1966, p. 64).
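The assumption behind averaging can be written out in standard classical test theory notation. The symbols below (observed score X_i, true score T, error E_i, number of assessments n) are introduced here for illustration and do not appear in the text:

```latex
X_i = T + E_i, \qquad i = 1, \dots, n

\bar{X} = \frac{1}{n}\sum_{i=1}^{n} X_i = T + \frac{1}{n}\sum_{i=1}^{n} E_i
```

If the errors are random and average out to roughly zero, the mean of the observed scores approximates T. If the true score changes from assessment to assessment, so that X_i = T_i + E_i, the same mean approximates the average of T_1 through T_n rather than T_n, the true score at the end of the period.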
Unfortunately, the notion that a student's true score is the same from assessment to assessment
contradicts what we know about learning and the formative assessments that are designed to track that
learning. Learning theory and common
sense tell us that a student might start a grading period with little or no knowledge regarding a topic but
end the grading period with a great deal of knowledge. Learning theorists have described this
phenomenon in detail. Specifically, one of the most ubiquitous findings in the research in cognitive
psychology (for a discussion, see Anderson, 1995) is that learning resembles the curve shown in Figure
5.6. As depicted in the figure, the student in question begins with no understanding of the topic—with
zero knowledge. Although this situation is probably never the case, or is at least extremely rare, it
provides a useful perspective on the nature of learning. An interesting aspect of the learning curve is that
the amount of learning from session to session is large at first—for example, it goes from zero to more
than 20 percent after one learning session—but then it tapers off. In cognitive psychology, this trend in
learning (introduced by Newell & Rosenbloom, 1981) is referred to as “the power law of learning”
because the mathematical function describing the line in Figure 5.6 can be computed using a power
function.
Technical Note 5.1 provides a more detailed discussion of the power law. Briefly, though, it has been
used to describe learning in a wide variety of situations. Researcher John Anderson (1995) explains that
“since its identification by Newell and Rosenbloom, the power law has attracted a great deal of attention
in psychology, and researchers have tried to understand why learning should take the same form in all
experiments” (p. 196). In terms of its application to formative assessment, the power law of learning
suggests a great deal about the best estimate of a given student's true score at the end of a grading period.
Obviously it supports the earlier discussion that the average score probably doesn't provide a good
estimate of a student's score for a given measurement topic at the end of the grading period. In effect,
using the average is tantamount to saying to a student, “I don't think you've learned over this grading
period. The differences in your scores for this topic are due simply to measurement error.”
The power law of learning also suggests another way of estimating the student's true score at the end
of a grading period. Consider Figure 5.7, which depicts the score points for each assessment that one
would estimate using the power law. That is, the first observed score for the student was 1.0; however, the
power law estimates a true score of 0.85. The second observed score for the student was 1.0, but the
power law estimates the true score to be 1.49, and so on. At the end of the grading period, the power law
estimates the student's true score to be 3.07—much higher than the average score of 2.00. The power law
makes these estimates by examining the pattern of the five observed scores over the grading period. (See
Technical Note 5.1 for a discussion.) Given this pattern, it is (mathematically) reasonable to assume that
the second observed score of 1.0 had some error that artificially deflated the observed score, and the
third observed score had some error that artificially inflated the observed score.
FIGURE 5.6 Depiction of the Power Law of Learning
[Line graph: horizontal axis shows # of Learning Sessions.]
FIGURE 5.7 Bar Graph with Power Law Scores
FIGURE 5.8 Comparisons of Observed Scores, Average Scores, and Estimated Power Law Scores
Assessment                                   1      2      3      4      5      Total Difference
Observed Score                               1.00   1.00   2.50   2.00   3.50   n/a
Average Score                                2.00   2.00   2.00   2.00   2.00   n/a
Estimated Power Law Score                    0.85   1.49   1.95   2.32   3.07   n/a
Difference Between Observed Score
  and Average Score                          1.00   1.00   0.50   0.00   1.50   4.00
Difference Between Observed Score
  and Estimated Power Law Score              0.15   0.49   0.55   0.32   0.43   1.94
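To make the idea of a power law estimate concrete, here is a rough Python sketch. It is not the procedure from Technical Note 5.1, which is not reproduced in this chapter. It assumes the curve has the generic form y = a * x**b, treats the five assessments as equally spaced (x = 1 through 5), and fits the curve by ordinary least squares on the logarithms of x and y. Under these assumptions the estimates need not match the values reported in Figure 5.7.

```python
import math


def fit_power_law(scores):
    """Fit y = a * x**b by least squares on (ln x, ln y), with x = 1, 2, ..., n.

    Assumptions: assessments are equally spaced and every score is above zero.
    """
    xs = [math.log(i) for i in range(1, len(scores) + 1)]
    ys = [math.log(s) for s in scores]
    n = len(scores)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    b = sxy / sxx
    a = math.exp(mean_y - b * mean_x)
    return a, b


observed = [1.0, 1.0, 2.5, 2.0, 3.5]  # the five observed scores from the text
a, b = fit_power_law(observed)
estimates = [a * i ** b for i in range(1, len(observed) + 1)]
average = sum(observed) / len(observed)
print([round(e, 2) for e in estimates], average)
```

Even with these simplifying assumptions, the fitted curve rises across the grading period and its final value sits above the average of 2.0, which is the qualitative point the figure makes.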
It is important to note that these estimates of the true score are just that: estimates. In fact,
measurement theorists tell us that a student's true score on a given test is unobservable directly. We are
always trying to estimate it (see Gulliksen, 1950; Lord & Novick, 1968; Magnusson, 1966). However,
within a measurement topic, the final power law estimate of a student's true score is almost always
superior to the true score estimate based on the average. To illustrate, consider Figure 5.8. The figure
dramatizes the superiority of the power law as an estimate of a student's true scores over the average by
contrasting the differences between the two true score estimates (average and power law) and the
observed scores. For the first observed score of 1.00, the average estimates the true score to be 2.00, but
the power law estimates the true score to be 0.85. The average is 1.00 units away from the observed score,
and the power law estimate is 0.15 units away. For the second observed score of 1.00, the average
estimates the true score to be 2.00 (the average will estimate the same true score for every observed
score), but the power law estimates it to be 1.49. The average is 1.00 units away from the observed score,
and the power law estimate is 0.49 units away. Looking at the last column in Figure 5.8, we see that the
total difference between estimated and observed scores for the five assessments is 4.00 for the average
and 1.94 for the power law. Taken as a set, the power law estimates are closer to the observed scores than
are the estimates based on the average. The power law estimates “fit the observed data” better than the
estimates based on the average. We will consider this concept of “best fit” again in Chapter 6.
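The totals in the last column of Figure 5.8 can be checked directly. The short Python sketch below uses the observed scores and the power law estimates exactly as reported in the figure; the function name is only illustrative.

```python
observed = [1.00, 1.00, 2.50, 2.00, 3.50]
power_law = [0.85, 1.49, 1.95, 2.32, 3.07]  # estimates reported in Figure 5.8
average = sum(observed) / len(observed)


def total_abs_difference(observed_scores, estimates):
    """Sum the absolute gaps between each observed score and its estimate."""
    return sum(abs(o - e) for o, e in zip(observed_scores, estimates))


total_avg = total_abs_difference(observed, [average] * len(observed))
total_pl = total_abs_difference(observed, power_law)
print(round(total_avg, 2), round(total_pl, 2))  # 4.0 1.94
```

The smaller total for the power law estimates is what the text means by a better fit to the observed data.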
The discussion thus far makes a strong case for using the power law to estimate each student's true
score on each measurement topic at the end of a grading period. Obviously teachers should not be
expected to do the necessary calculations on their own. In Chapter 6 we consider some technology
solutions to this issue—computer software that does the calculations automatically. We might consider
this the high-tech way of addressing the issue. However, teachers can also use a low-tech solution that
does not require the use of specific computer software. I call this solution “the method of mounting
evidence.”
The Method of Mounting Evidence
The method of mounting evidence is fairly intuitive and straightforward. To follow it a teacher must use
a grade book like that shown in Figure 5.9, which is different from the typical grade book. One obvious
difference is that it has space for only about five students per page. (For ease of discussion, Figure 5.9
shows the scores for only one student.) Instead of one page accommodating all scores for a class of 30
students, this type of grade book would require six pages. A high school teacher working with five classes
of 30 students each, or 150 students overall, would need a grade book with 30 pages (6 pages for each
class). Although this is more pages than the traditional grade book, it is still not inordinate; and it is
easy to create blank forms using standard word processing software. Additionally, it is important to keep
in mind that a grade book like this should be considered an interim step only, used by teachers who simply
wish to try out the system. Once a teacher becomes convinced that this system will be the permanent method
of record keeping, then appropriate computer software can be purchased, as discussed in Chapter 6.
FIGURE 5.9 Grade Book for Method of Mounting Evidence
Note: A circle indicates that the teacher gave the student an opportunity to raise his score from the previous assessment. A box
indicates that the student is judged to have reached a specific score level from that point on.
The columns in Figure 5.9 show the various measurement topics that the teacher is addressing over a
given grading period. In this case the teacher has addressed five science topics: matter and energy, force
and motion, reproduction and heredity, earth processes, and adaptation. The teacher has also kept track of
the life skill topics behavior, work completion, and class participation. First we will consider the
academic topics.
To illustrate how this grade book is used, consider Aida's scores for the topic matter and energy. In
each cell of the grade book, the scores are listed in order of assignment, going from the top left to the
bottom and the top right to the bottom. Thus, for matter and energy Aida has received six scores, in the
following order: 1.5, 2.0, 2.0, 2.0, 2.5, and 2.5. Also note that the second score of 2.0 has a circle around
it. This represents a situation in which the teacher gave Aida an opportunity to raise her score on a given
assessment. This dynamic is at the heart of the method of mounting evidence. Aida received a score of 1.5
for the first assessment for this measurement topic. She demonstrated partial knowledge of the simpler
aspects of the topic by correctly answering some Type I items but incorrectly answering other Type I
items. However, after returning the assessment to Aida, the teacher talked with her and pointed out her
errors on the Type I items, explaining why Aida's paper was scored a 1.5. The teacher also offered Aida
the chance to demonstrate that her errors on the test for Type I items were not a true reflection of her
understanding of the topic. In other words, the teacher offered Aida an opportunity to demonstrate that 1.5
was not an accurate reflection of her true score. The teacher might have allowed Aida to complete some
exercises at the end of one of the textbook chapters that pertained to the topic, or she might have
constructed some exercises that Aida could complete, or she might have asked Aida to devise a way to
demonstrate her true knowledge.
Such an offer is made to students when their scores on a particular assessment for a particular topic
are not consistent with their behavior in class. For example, perhaps in class discussions about matter and
energy, Aida has exhibited an understanding of the basic details and processes, indicating that she
deserves a score of 2.0. The results on the first assessment, then, don't seem consistent with
the informal information the teacher has gained about Aida in class. The teacher uses this earlier
knowledge of Aida to guide her evaluation regarding this particular topic. Based on this prior
knowledge, the teacher has decided that she needs to gather more evidence about Aida's level of
understanding and skill on this particular topic. Notice that the teacher doesn't simply change the score
on the assessment. Rather, she gives Aida an opportunity to provide more information about this
particular measurement topic. If the new information provided by Aida corroborates the teacher's
perception that Aida is at level 2.0 for the topic, the teacher changes the score in the grade book and
circles it to indicate that it represents a judgment based on additional information.
Another convention to note in Figure 5.9 is that some scores—such as Aida's fourth score of 2.0—are
enclosed in a box. When a teacher uses this convention it means that she has seen enough evidence to
conclude that a student has reached a certain point on the scale. By the time the teacher entered the fourth
score for Aida, she was convinced that Aida had attained a score of 2.0. From that assessment on, the
teacher examined Aida's responses for evidence that she had exceeded this score. That is, from that point
on, the teacher examined Aida's assessments for evidence that she deserved a score greater than a 2.0.
This does not mean that Aida is allowed to miss Type I items. Indeed, any assessment on which Aida
does not correctly answer Type I items would be returned to her with the directions that she must correct
her errors in a way that demonstrates the accuracy of her assigned score of 2.0. However, the teacher
would consider these errors to be lapses in effort or reasoning or both, as opposed to an indication that
Aida's true score is less than 2.0.
The underlying dynamic of the method of mounting evidence, then, is that once a student has
provided enough evidence for the teacher to conclude that a certain score level has been reached, that
score is considered the student's true score for the topic at that point in time. Using this as a foundation,
the teacher seeks evidence for the next score level up. Once enough evidence has been gathered, the
teacher concludes that this next score level represents the true score, and so on until the end of the grading
period. Mounting evidence, then, provides the basis for a decision that a student has reached a certain
level of understanding or skill.
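The ratchet-like logic of this method can be sketched in Python. The text leaves the question of how much evidence is "enough" to the teacher's judgment, so the rule used here is purely hypothetical: a level counts as confirmed once the student has scored at or above it on `needed` assessments (three by default). The function name and threshold are illustrative assumptions, not part of the method as described.

```python
def confirmed_levels(scores, needed=3):
    """After each score, return the level treated as confirmed so far, or None.

    Hypothetical rule: a level is confirmed once `needed` scores are at or
    above it. Later low scores never lower a confirmed level.
    """
    confirmed = []
    for i in range(1, len(scores) + 1):
        seen = sorted(scores[:i], reverse=True)
        confirmed.append(seen[needed - 1] if i >= needed else None)
    return confirmed


# Aida's six scores for matter and energy, as listed in the text.
aida = [1.5, 2.0, 2.0, 2.0, 2.5, 2.5]
print(confirmed_levels(aida))
```

Because the rule counts accumulated evidence, a single low score after a level is confirmed leaves that level unchanged, which mirrors the treatment of such scores as lapses rather than as signs of a lower true score.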
This approach has a strong underlying logic and can be supported from various research and
theoretical perspectives. First, recall from Figure 1.2 in Chapter 1 that a gain of 20 percentile points is
associated with the practice of asking students to repeat an activity until they demonstrate they can do it
correctly. The
method of mounting evidence certainly has aspects of this “mastery-oriented” approach. Indeed, some of
the early work of Benjamin Bloom (1968, 1976, 1984) and Tom Guskey (1980, 1985, 1987, 1996a) was
based on a similar approach. The method of mounting evidence can also be supported from the
perspective of a type of statistical inference referred to as “Bayesian inference.” For a more thorough
discussion of Bayesian inference, see Technical Note 5.2. Briefly, though, Bayesian inference takes the
perspective that the best estimate of a student's true score at any point in time must take into consideration
what we know about the student from past experiences. Each assessment is not thought of as an isolated
piece of information; rather, each assessment is evaluated from the perspective of what is already known
about the student relative to a specific measurement topic. In a sense, Bayesian inference asks the
question, “Given what is known about the student regarding this measurement topic, what is the best
estimate of her true score on this assessment?” It is a generative form of evaluation that seeks more
information when a teacher is uncertain about a specific score on a specific assessment.
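A small Python sketch can illustrate the Bayesian idea of combining prior knowledge with a new score. It is not the treatment in Technical Note 5.2, which is not reproduced here. The prior weights and the error model (observed score equals true score plus normally distributed error with a standard deviation of 0.5) are hypothetical values chosen only for illustration.

```python
import math


def posterior(prior, observed, sd=0.5):
    """Combine a prior over true-score levels with one observed score (Bayes' rule).

    Assumption: the observed score is the true score plus normally distributed
    error with standard deviation `sd` (a hypothetical error model).
    """
    unnormalized = {
        level: p * math.exp(-((observed - level) ** 2) / (2 * sd ** 2))
        for level, p in prior.items()
    }
    total = sum(unnormalized.values())
    return {level: w / total for level, w in unnormalized.items()}


# Hypothetical prior: classwork suggests Aida is most likely at 2.0.
prior = {1.5: 0.2, 2.0: 0.6, 2.5: 0.2}
post = posterior(prior, observed=1.5)
best = max(post, key=post.get)
print(best, {level: round(p, 2) for level, p in post.items()})
```

With this prior, a single observed score of 1.5 raises the probability of a true score of 1.5 but still leaves 2.0 as the most likely level, which is the situation in which the teacher seeks more evidence instead of accepting the score at face value.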
The Life Skill Topics
Life skill topics might also be approached from the method of mounting evidence, but with a slight
variation on the theme. Consider Aida's life skill scores in Figure 5.9. These scores are not tied to specific
assessments. As mentioned in Chapter 4, once a week the teacher has scored students on these three
topics, perhaps using the last few minutes of class each Friday. The teacher has recorded nine scores for
behavior, one for each week of the grading period. Again, the scores are entered from the top left to the
bottom, and then from the top right to the bottom. Thus, Aida's scores in the order in which they were
assigned are 3.0, 3.0, 2.5, 3.0, 3.5, 3.5, 3.0, 3.5, and 3.5. Notice that a number of these scores have been
enclosed in a box. Again, the box signifies that the teacher judges it to be the student's true score at a
particular moment in time. Therefore, Aida's second score of 3.0, which is enclosed in a box, indicates
that at that point in time the teacher concluded it to be Aida's true score for behavior. Notice that the next
score is a 2.5—a half point lower than the teacher's estimate the previous week (assuming life skill scores
are recorded every week on Friday). Given the drop in performance, the teacher met with Aida and told
her that she must bring her score back up to a 3.0 by the next week. In this case, Aida did just that. The
teacher then enclosed that next score in a box to reaffirm that 3.0 was, in fact, Aida's true score.
Summary and Conclusions

Effective formative assessment should encourage students to improve. Three techniques can help
accomplish this goal. The first involves students tracking their progress on specific measurement topics
using graphs. The second engages students in different forms of self-reflection regarding their progress
on measurement topics. The third addresses estimating students' true scores at the end of a grading period.
In particular, the practice of averaging scores on formative assessments is a questionable way to produce a
valid estimate of final achievement status. Two alternatives are preferable. One uses the power law to
estimate students' final status. The second uses mounting evidence to estimate students' final status.
