Teaching Statistics With Sports Examples
2 authors, including:
Paul H. Kvam
Georgia Institute of Technology
All content following this page was uploaded by Paul H. Kvam on 09 May 2014.
Abstract
Class material for introductory and advanced statistics can be colorfully illustrated by using appropriate data and examples from sports. Specific methods, including statistical graphics (e.g., boxplots), ball-and-urn probabilities, and statistical regression are demonstrated. Examples are drawn from popular American sports such as baseball, basketball, soccer and American football. Classroom feedback indicates that most students enjoy sports examples as a way to learn abstract concepts using familiar, recreational settings.
Editor's note: This is a pdf copy of an html document which resides at http://archive.ite.journal.informs.org/Vol5No1/KvamSokol/
3. Statistical Graphics
Graphs in statistics, including bar charts (e.g., Figure 1), pie charts and histograms, represent a broad interface between statistics and the general public. Statistical graphics are mandatory in the print media, and it is now commonplace to see a political candidate use statistical charts to support their point of view, especially in debates. Ross Perot used charts in his presidential bid in 1992. Dennis J. Kucinich, during his 2004 campaign for the Democratic presidential nomination, actually came to a National Public Radio debate prepared with a pie chart to argue his point about the Pentagon budget (to show the other candidates, he claimed).
(1) http://archive.ite.journal.informs.org/Vol5No1/KvamSokol/soccerAttendance.xls
INFORMS Transactions on Education 5:1(75-87) 77 © INFORMS ISSN: 1532-0545
KVAM&SOKOL
Teaching Statistics with Sports Examples
3.2. Boxplots
Below is an example of how a box plot can summarize salary differences in Major League Baseball (MLB) (2) for the 2003 season. In this case, outlying data points (Alex Rodriguez - Texas, Carlos Delgado - Toronto) draw attention away from the bars, and a plot without plotted outliers (an option in most statistical packages) can show more with respect to team salary quartiles.
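The quantities behind such a plot can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the salary figures are hypothetical (they are not the 2003 MLB data), the function name `boxplot_summary` is ours, and the 1.5 x IQR fence is the common Tukey convention, which may differ from the rule used by a particular statistical package.

```python
import statistics

def boxplot_summary(values):
    """Return the quartiles, whisker ends, and outliers a boxplot would show."""
    data = sorted(values)
    # Quartiles; the "inclusive" method treats the data as the full population.
    q1, median, q3 = statistics.quantiles(data, n=4, method="inclusive")
    iqr = q3 - q1
    # Assumed convention: points beyond 1.5 * IQR from the box are outliers.
    low_fence = q1 - 1.5 * iqr
    high_fence = q3 + 1.5 * iqr
    outliers = [v for v in data if v < low_fence or v > high_fence]
    inside = [v for v in data if low_fence <= v <= high_fence]
    return {
        "q1": q1,
        "median": median,
        "q3": q3,
        "whisker_low": min(inside),
        "whisker_high": max(inside),
        "outliers": outliers,
    }

# Hypothetical team salaries in millions of dollars (illustrative only).
salaries = [0.3, 0.5, 1.0, 2.0, 3.0, 4.0, 5.0, 22.0]
summary = boxplot_summary(salaries)
print(summary)
```

Dropping the values in `summary["outliers"]` before plotting corresponds to the "plot without plotted outliers" option, which lets the quartile box fill more of the axis.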
3.3. Graphical Summary for Basketball Games
Innovative plots have been developed for special sets of data. Westfall (1990) presented a simple, yet revealing graphical summary of a basketball game by plotting the point difference between the two teams' scores across time. In basketball, perhaps more than any other of the mainstream American sports, the game is difficult to summarize in a simple box score. Figure 5 below shows the summary of the February 1, 2004 NBA game between the Minnesota Timberwolves and the Philadelphia 76ers. Minnesota won the game 106-101. The box score, shown in Table 1, fails to summarize what happened in the game: Minnesota overcame an 18-point deficit and pulled ahead for the first time late in the game. Students can learn about the power of statistical graphics through such novel uses of charts. We note that this type of chart can also be used in a stochastics course to illustrate the idea of one-dimensional random walks with varying step sizes.
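As a rough sketch of how the series behind such a chart could be built, the Python below accumulates scoring events into a running point difference. The event list is hypothetical (it is not the play-by-play of the Minnesota-Philadelphia game), and the function name and the tuple layout `(minutes_elapsed, team, points)` are assumptions made for illustration.

```python
from itertools import accumulate

def point_difference(events, home="home"):
    """Return (time, home score minus away score) after each scoring event."""
    # Each basket is a step of size 1, 2, or 3: up for the home team,
    # down for the visitors. This is the "varying step size" random walk.
    steps = [pts if team == home else -pts for _, team, pts in events]
    times = [t for t, _, _ in events]
    return list(zip(times, accumulate(steps)))

# Hypothetical scoring events: (minutes elapsed, scoring team, points).
events = [
    (0.5, "away", 2),
    (1.2, "away", 3),
    (2.0, "home", 2),
    (3.1, "home", 3),
    (4.0, "home", 2),
]
path = point_difference(events)
largest_deficit = min(diff for _, diff in path)
final_margin = path[-1][1]
print(path, largest_deficit, final_margin)
```

Plotting `path` as a step function against time gives the kind of chart described above; the minimum of the series is the largest deficit the home team faced.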
Table 1: Box score for NBA game between Minnesota and Philadelphia, 2/1/2004
(2) http://archive.ite.journal.informs.org/Vol5No1/KvamSokol/MLBSalaries.xls
Figure 5: Point difference in NBA game between Minnesota Timberwolves and Philadelphia 76ers, 2/1/2004.
4. Teaching Simpson's Paradox with Sports Statistics
Simpson's paradox occurs with categorical data that has three variables when an association between two of the variables is consistent across all of the levels of the third variable, but is completely different if one aggregates over the third.
The paradox is best described using a pair of two-by-two contingency tables, and baseball presents many examples of Simpson's paradox. The three variables, each at two levels, are player (two batters), batting outcome (hit or out), and batting situation (runners in scoring position or not). Table 2 below shows one of 56 pairings in which this paradox took place in the 2003 MLB season. It shows how Dustan Mohr (Minnesota Twins and San Francisco Giants) had a higher batting average (hits per at-bat) than Darin Erstad (Anaheim Angels) in both batting situations when examined separately, but overall Erstad had a higher batting average than Mohr. The key to the paradox, of course, is that the proportions being compared are based on different sample sizes. In this case, Erstad appeared with runners in scoring position a smaller proportion of the time (20%) than did Mohr (28%). (The reason for the disparity in at-bats with runners in scoring position is that Mohr generally batted after more players who were likely to get on base; see Sokol (2003) for more discussion of the effect of batting order placement.) Other pairings that illustrate Simpson's paradox include Carl Everett vs. Hideki Matsui, Jose Reyes vs. Carlos Beltran, and Frank Thomas vs. Josh Phelps.
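The arithmetic of the paradox can be checked with a short Python sketch. The hit and at-bat counts below are hypothetical, chosen only so that the reversal appears; they are not the Mohr and Erstad figures from Table 2, and the player labels "A" and "B" are placeholders.

```python
def batting_average(hits, at_bats):
    """Batting average: hits per at-bat."""
    return hits / at_bats

def overall_average(splits):
    """Pool hits and at-bats over all batting situations."""
    hits = sum(h for h, _ in splits.values())
    at_bats = sum(ab for _, ab in splits.values())
    return batting_average(hits, at_bats)

# Hypothetical (hits, at-bats) by situation; "risp" = runners in scoring position.
players = {
    "A": {"risp": (15, 100), "other": (78, 260)},  # 100/360 at-bats with RISP
    "B": {"risp": (11, 80), "other": (94, 320)},   # 80/400 at-bats with RISP
}

for situation in ("risp", "other"):
    avg_a = batting_average(*players["A"][situation])
    avg_b = batting_average(*players["B"][situation])
    print(situation, round(avg_a, 3), round(avg_b, 3))

print("overall", round(overall_average(players["A"]), 3),
      round(overall_average(players["B"]), 3))
```

In this made-up example player A leads in each situation, yet player B leads overall, because A has a larger share of at-bats in the situation where both players hit worse.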
(3) http://archive.ite.journal.informs.org/Vol5No1/KvamSokol/MLB_Regressiondata2003.xls
(4) http://sports.espn.go.com/mlb/stats/batting?league=mlb
(5) http://www.baseball1.com/statistics/
where (a,b) are estimated as (-0.6228, 0.0292) with standard errors (0.0231, 0.0017). Figure 7 charts the number of observations collected at the respective point spreads. This data was collected from the daily college basketball scoreboard pages at Yahoo.com. Unfortunately, there is no easy way to download and parse all of the scores; we wrote our own C and Unix C-shell code, specialized for our system, to compile the data.
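To see what the estimates imply, the sketch below evaluates a fitted curve at a few point spreads. It assumes the standard logistic form p(x) = 1 / (1 + exp(-(a + b x))), with x the point spread; the model equation itself is not reproduced in this excerpt, so the functional form, the sign convention for x, and the function name `win_probability` are all assumptions rather than the authors' stated specification.

```python
import math

# Point estimates quoted in the text.
A_HAT = -0.6228
B_HAT = 0.0292

def win_probability(point_spread, a=A_HAT, b=B_HAT):
    """Estimated win probability under an assumed standard logistic model."""
    return 1.0 / (1.0 + math.exp(-(a + b * point_spread)))

# Evaluate the assumed curve over a range of point spreads.
for spread in (0, 10, 20, 30):
    print(spread, round(win_probability(spread), 3))
```

Under this assumed form, the positive slope estimate means the fitted probability rises with the point spread, and the small standard errors indicate that both coefficients are estimated precisely.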
Figure 6: Observed win probabilities (blue) and logistic regression estimates (white) for home games at a given point
spread.
Table 3: Results of play in the Big Ten Conference, 1999 (winner's score is listed first).
This is an interesting example that makes students think about the relative benefits of different statistical models. If the power ratings are defined using a linear programming approach (where the error in a prediction is defined as its absolute difference from the observation), then Michigan and Michigan State are much closer to Wisconsin. On the other hand, if the power ratings are defined using a linear regression with the error defined as the squared difference, then Wisconsin has a much larger advantage.
We describe both of these models in more detail in the appendix.
ISyE 2028, Basic Statistical Methods: A sophomore-level course covering parameter estimation, statistical decision-making, and analysis and modeling of relationships between variables. Students taking this course have already seen basic calculus and probability, but may not have taken any other ISyE courses.
In all of the undergraduate courses, most of our students are American and have at least a basic understanding of the sports involved. Even so, there are always some who are unfamiliar with even the basic
Appendix
1. Power Rating Models
In Section 7, we refer to two different statistical models that can be used to determine power ratings. In this
appendix, we describe each model mathematically. For both models, we define the following notation:
$G$: set of games played, where each game $g \in G$ is an unordered set of teams $\{i,j\}$.
$p_{ig}$: points scored by team $i$ in game $g$
$r_i$: power rating assigned to team $i$
Both models use the standard technique of minimizing a function of the total error in their predictions. We define the error $e_g$ for a single game $g = \{i, j\}$ to be the difference in predicted point spread and actual point spread, or $e_g = (r_i - r_j) - (p_{ig} - p_{jg})$.
The only difference between the two models is that one finds ratings that minimize the total absolute error $\sum_{g \in G} |e_g|$ while the other finds ratings that minimize the total squared error $\sum_{g \in G} e_g^2$.
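We can formulate the problem of finding ratings to minimize the total absolute error as a linear program:
Minimize $\sum_{g \in G} a_g$
subject to $e_g = (r_i - r_j) - (p_{ig} - p_{jg})$ $\forall\, g = \{i,j\} \in G$,
$a_g \ge e_g$ $\forall\, g = \{i,j\} \in G$, (3)
$a_g \ge -e_g$ $\forall\, g = \{i,j\} \in G$. (4)
In this model, the variables $a_g$ denote the absolute error for game $g$. Constraints (3) and (4) ensure that $a_g \ge |e_g|$ for each game $g$, while the minimization objective ensures that one of constraints (3) and (4) will be binding at optimality for each game $g$, and thus $a_g = |e_g|$.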
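A minimal sketch of this linear program in Python follows, assuming SciPy is available and using its `linprog` solver with the HiGHS backend. The three-team schedule is hypothetical, the function name `absolute_error_ratings` is ours, and pinning the first team's rating to zero is an added assumption that removes the arbitrary additive constant in the ratings. The error $e_g$ is substituted directly into constraints (3) and (4) rather than kept as a separate variable.

```python
from scipy.optimize import linprog

def absolute_error_ratings(teams, games):
    """Minimize total absolute error; games are (team_i, team_j, pts_i, pts_j)."""
    n, m = len(teams), len(games)
    index = {team: k for k, team in enumerate(teams)}
    # Variables: n ratings r, then m absolute-error variables a_g.
    cost = [0.0] * n + [1.0] * m
    A_ub, b_ub = [], []
    for g, (ti, tj, pts_i, pts_j) in enumerate(games):
        margin = pts_i - pts_j
        # Constraint (3): (r_i - r_j) - margin <= a_g
        row = [0.0] * (n + m)
        row[index[ti]], row[index[tj]], row[n + g] = 1.0, -1.0, -1.0
        A_ub.append(row)
        b_ub.append(margin)
        # Constraint (4): -(r_i - r_j) + margin <= a_g
        row = [0.0] * (n + m)
        row[index[ti]], row[index[tj]], row[n + g] = -1.0, 1.0, -1.0
        A_ub.append(row)
        b_ub.append(-margin)
    # Assumption: fix the first rating at 0; other ratings are free; a_g >= 0.
    bounds = [(0.0, 0.0)] + [(None, None)] * (n - 1) + [(0.0, None)] * m
    result = linprog(cost, A_ub=A_ub, b_ub=b_ub, bounds=bounds, method="highs")
    if not result.success:
        raise RuntimeError(result.message)
    return dict(zip(teams, result.x[:n])), result.fun

# Hypothetical results (not the Big Ten data in Table 3).
teams = ["A", "B", "C"]
games = [("A", "B", 80, 70), ("B", "C", 75, 70), ("A", "C", 90, 72)]
ratings, total_abs_error = absolute_error_ratings(teams, games)
print(ratings, total_abs_error)
```

The optimal ratings need not be unique for this objective, so different solvers may return different ratings with the same total absolute error.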
We can formulate the standard regression problem of minimizing the squared error as the following mathematical program:
Minimize $\sum_{g \in G} e_g^2$ (5)
subject to $e_g = (r_i - r_j) - (p_{ig} - p_{jg})$ $\forall\, g = \{i,j\} \in G$. (6)
This mathematical program is similar to the linear program in Section 9.1. In fact, it is well-known that we can solve this problem by substituting (6) into the objective for $e_g$, setting partial derivatives taken with respect to each $r_i$ equal to zero, and solving the resulting system of linear equations; therefore, this model too can be optimized using linear programming software.
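The system of linear equations obtained from the partial derivatives can be sketched in plain Python as follows. Setting the derivative with respect to each $r_i$ to zero gives, for every team, (games played) x $r_i$ minus the sum of its opponents' ratings equal to its total point margin. The schedule below is hypothetical, the function name is ours, and replacing one redundant equation with the normalization that the ratings sum to zero is an added assumption, needed because the ratings are determined only up to an additive constant. The sketch also assumes every team is linked to every other through the schedule.

```python
def squared_error_ratings(teams, games):
    """Solve the normal equations; games are (team_i, team_j, pts_i, pts_j)."""
    n = len(teams)
    index = {team: k for k, team in enumerate(teams)}
    A = [[0.0] * n for _ in range(n)]
    b = [0.0] * n
    for ti, tj, pts_i, pts_j in games:
        i, j = index[ti], index[tj]
        margin = pts_i - pts_j
        A[i][i] += 1.0; A[i][j] -= 1.0; b[i] += margin
        A[j][j] += 1.0; A[j][i] -= 1.0; b[j] -= margin
    # Assumption: replace the last (redundant) equation with sum of ratings = 0.
    A[n - 1] = [1.0] * n
    b[n - 1] = 0.0
    # Gaussian elimination with partial pivoting.
    for col in range(n):
        pivot = max(range(col, n), key=lambda row: abs(A[row][col]))
        A[col], A[pivot] = A[pivot], A[col]
        b[col], b[pivot] = b[pivot], b[col]
        for row in range(col + 1, n):
            factor = A[row][col] / A[col][col]
            for k in range(col, n):
                A[row][k] -= factor * A[col][k]
            b[row] -= factor * b[col]
    # Back substitution.
    ratings = [0.0] * n
    for k in range(n - 1, -1, -1):
        known = sum(A[k][c] * ratings[c] for c in range(k + 1, n))
        ratings[k] = (b[k] - known) / A[k][k]
    return dict(zip(teams, ratings))

# Hypothetical results (not the Big Ten data in Table 3).
teams = ["A", "B", "C"]
games = [("A", "B", 80, 70), ("B", "C", 75, 70), ("A", "C", 90, 72)]
ls_ratings = squared_error_ratings(teams, games)
print(ls_ratings)
```

Only differences between ratings matter for predicting a point spread, so a different normalization would shift every rating by the same constant without changing any prediction.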