Teaching Statistics With Sports Examples

Article in INFORMS Transactions on Education · September 2004


DOI: 10.1287/ited.5.1.75



Teaching Statistics with Sports Examples


Paul H. Kvam
Joel Sokol
School of Industrial and Systems Engineering
Georgia Institute of Technology
[email protected]
[email protected]

Abstract
Class material for introductory and advanced statistics can be colorfully illustrated by using appropriate data and examples from sports. Specific methods, including statistical graphics (e.g., boxplots), ball-and-urn probabilities, and statistical regression are demonstrated. Examples are drawn from popular American sports such as baseball, basketball, soccer and American football. Classroom feedback indicates that most students enjoy sports examples as a way to learn abstract concepts using familiar, recreational settings.

Editor's note: This is a pdf copy of an html document which resides at http://archive.ite.journal.informs.org/Vol5No1/KvamSokol/

INFORMS Transactions on Education 5:1(75-87) © INFORMS ISSN: 1532-0545

1. Introduction

Modern statistics education has emphasized the application of tangible and interesting examples to motivate students learning about statistical concepts. Introductory texts aimed at special audiences (e.g., business students, epidemiology students, or engineering students) feature problems and illustrations relevant to those audiences, complementing course material from related classes. The current textbook (Hayter, 2002) used for the Georgia Institute of Technology's introductory statistics course in the School of Industrial and Systems Engineering includes a strong emphasis on science and engineering; more than half of the exercises in the text are simple and illustrative examples that are related to topics studied by engineering undergraduates.

So why should one consider teaching statistics using sports examples? Clearly, an introductory course that is dominated by such examples is inappropriate for students who will apply statistical methods in business, science, or engineering. Most sports examples found in the statistics literature are based on sports that are mainly popular in North America or Europe, the most commonly cited topic being baseball. While American and European instructors might be familiar with such sports examples, an increasing proportion of students in western universities are not from western countries, and have less experience with these sports.

In our experience, however, when it comes to choosing projects for various data analyses (regression, contingency tables, analysis of variance), the most popular themes, year after year, are sports related. We're surprised to find students from China or India eager to analyze attendance data for Atlanta Braves home games or apply goodness-of-fit tests to National Collegiate Athletic Association (NCAA, 2003a) college basketball outcomes. While engineering examples have a clear purpose in teaching students in our College of Engineering, sports examples seem to bring an added level of excitement to the classroom experience.

Introductory statistical techniques lend themselves to endless applications in sports, especially baseball, where statistics are collected on almost all aspects of player performance. Albert (2002), a professor at Bowling Green State University, outlines a basic statistics course that can be taught entirely through baseball examples. Simonoff (1998) focused on the home run race between Sammy Sosa and Mark McGwire during the 1998 baseball season, and utilized both introductory statistics (graphs, categorical data analysis, analysis of variance) along with more advanced methods (logistic regression and smoothing methods).

The statistics literature features several more sports examples; in general, they are used to motivate or illustrate new and advanced methods of statistical inference, e.g., Cochran (2002), Samaniego and Watnik (1997), Harville and Smith (1994), Crowder, et al.
(2002), Gill (2000). For its eight most published sports topics, the Current Index to Statistics (CIS) lists 230 articles that appeared in statistics-related journals between 1960 and 2002. Figure 1 charts the frequency of the eight sports in the database; although many of the international journals in the CIS are published outside the United States, baseball still dominates the list. This is partly due to baseball's close affinity with statistics and statistical analysis, and because so much statistical information about baseball is readily available on the Internet. Another reason U.S. sports dominate the literature is that mostly U.S. authors are submitting sports-related research papers to refereed journals (case in point: peruse the author list of this special issue!). The modest goal of this article is to show different ways sports examples can be used to illustrate simple statistical methods or to motivate project work in an introductory class. Examples are limited to the sports seen in Figure 1 as most frequently published on, notably baseball, (American) football, basketball, and soccer.

Figure 1: No. Articles in CIS.

2. NBA Draft Lottery: A Tiring Exercise in Probability

For teaching elementary probability, a colorful substitute for the standard ball-and-urn examples can be found in the National Basketball Association (NBA) draft-order determination held in spring before the summer draft. Prior to 1985, the last-place finishers in each of the two conferences would flip a coin to determine which team picked first and which picked second. A lottery system started in 1985 prevented the teams with the worst records from automatically receiving the first two picks, so that teams would not intentionally lose games to ensure a top draft pick. Each of the seven teams that failed to make the playoffs had an equal chance of drafting first. The first year proved to be memorable as the New York Knicks received the first pick (with a one in seven chance) and selected Patrick Ewing weeks later on draft day.

After a few seasons, critics pointed out that the first selection in the draft generally had not gone to the worst or even second worst team in the league. In response, the draft lottery changed in 1990 to a weighted probability system. Since then, the NBA draft lottery has provided probability and statistics instructors with non-trivial alternatives to the bland ball-in-urn homework problems seen in most introductory textbooks.

In the 1990 draft, the eleven worst teams participated in the lottery and the ith "best" team (of the 11) would receive a weight of w_i = i. Although this change made the worst team eleven times more likely to receive the number one pick than the 11th worst team, luck came to the Orlando Magic in 1993 (the 11th worst team) when they received the first pick with the highly unlikely chance of 1/(1+2+...+11) = 0.0152.

Critics again demanded a change in the system, perhaps not fully understanding the rarity of occurrence for the 1993 draft outcome, and this "catastrophic error" rate changed from 0.015 to 0.005. The prerequisite for understanding the draft lottery probabilities evolved even more in subsequent years. Fourteen numbered balls were placed in a drum, and four were chosen without replacement (14-choose-4 = 1001 ways). One thousand combinations were assigned to the 11 lottery teams, with 250 of the combinations belonging to the worst team and 5 to the best (one combination was left over; drawing it would lead to a re-drawing).

In 1995, the lottery brought in two more teams and reassigned some of the 1000 combinations, keeping 250 for the worst team and reducing the chances for the 2nd to 6th worst teams. Each augmentation provides different probability distributions for the lottery teams, and each one offers interesting insights to probability students computing and comparing the probabilities associated with lottery ranking. The NBA has posted several web pages associated with the draft lottery and the history of lottery picks and probabilities (see National Basketball Association, 2003a-c).
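
The lottery probabilities quoted in this section can be checked with a few lines of Python. The sketch below is only an illustration of the arithmetic: the function name first_pick_probability is our own, and the only inputs are the weights w_i = i and the combination counts stated in the text.

```python
from fractions import Fraction
from math import comb

# Weighted lottery of 1990-1993 as described in the text: the i-th "best"
# of the 11 lottery teams gets weight w_i = i, so the worst team has weight 11.
weights = list(range(1, 12))
total_weight = sum(weights)  # 1 + 2 + ... + 11

def first_pick_probability(i):
    """Chance that the i-th best lottery team (i = 1..11) draws the first pick."""
    return Fraction(weights[i - 1], total_weight)

p_best = first_pick_probability(1)    # Orlando's position in 1993
p_worst = first_pick_probability(11)

# Later format: 4 of 14 numbered balls are drawn without replacement.
n_combinations = comb(14, 4)

# 5 of the 1000 assigned combinations belong to the best lottery team;
# the single unassigned combination only triggers a re-draw.
p_best_later = Fraction(5, 1000)

print(float(p_best), float(p_worst), n_combinations, float(p_best_later))
```

Working with Fraction keeps the ratios exact, which makes it easy for students to confirm that the worst team was eleven times as likely as the 11th worst team to pick first.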

3. Statistical Graphics

Graphs in statistics, including bar charts (e.g., Figure 1), pie charts and histograms, represent a broad interface between statistics and the general public. Statistical graphics are mandatory in the print media, and it is now commonplace to see a political candidate use statistical charts to support their point of view, especially in debates. Ross Perot used charts in his presidential bid in 1992. Dennis J. Kucinich, during his 2004 campaign for the Democratic presidential nomination, actually came to a National Public Radio debate prepared with a pie chart to argue his point about the Pentagon budget (to show the other candidates, he claimed).

Television, magazines, and newspapers all rely on charts to communicate data. USA Today relies on charts to communicate anything from national trends to entertaining trivia. Occasionally, bar charts are used in the sports pages. While sports examples can easily be used to motivate bar charts, there are less common sports examples that show more powerfully how statistical graphics can communicate information. In fact, sports provide numerous examples for illustrating statistics with pie charts, scatter plots, Pareto charts, bubble charts, surface plots and box plots.

3.1. Uses and Misuses of Statistical Graphics

Statistical lies are most frequently committed in graphical form, where the eyes can be more easily deceived by spurious trends suggested in a picture. A common abuse is manipulating scales on charts and graphs by truncating, censoring or transforming the axis values. Figure 2 shows two different charts showing an increase in average attendance at NCAA Women's Soccer games between 1998 and 2003 (1). The (blue) chart on the right is the default Microsoft Excel chart; many statistical software packages, in fact, will restrict both axes to a small set of values that contains the data, which helps the reader focus on chart differences more clearly. However, it also removes the scale of difference from the picture, which has potential to mislead readers who pay little attention to the axis labels.

Figure 2: Two different charts showing average attendance at NCAA Women's Soccer (season) matches.

(1) http://archive.ite.journal.informs.org/Vol5No1/KvamSokol/soccerAttendance.xls

The reader's sense of proportion can be manipulated further with image-based charts, which are standard in publications such as USA Today. As an example, Figure 3 below graphs the season wins for the New England Patriots using clip-art in place of vertical bars. While the height of the football icons corresponds to the information the graph is meant to communicate, the size of the footballs does not; the Patriots improved 56% in wins between 2002 and 2003, but the increase in area of the football icons is over 140%.

Figure 3: Regular season wins for the New England Patriots, 2002-2003.
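
The distortion in an image-based chart follows from simple geometry: when an icon is scaled so that its height tracks the data, its width grows by the same factor, so its area grows with the square of the ratio. The short sketch below works through the Patriots example. The win totals of 9 and 14 are our assumption (they match the Patriots' 2002 and 2003 regular-season records and the 56% figure in the text), and the icon is assumed to be scaled uniformly in both directions.

```python
# Assumed regular-season win totals, consistent with the 56% increase in the text.
wins_2002, wins_2003 = 9, 14

def percent_increase(old, new):
    """Relative change from old to new, in percent."""
    return 100.0 * (new - old) / old

# Icon height is proportional to wins; uniform scaling squares the ratio for area.
height_ratio = wins_2003 / wins_2002
area_ratio = height_ratio ** 2

win_increase = percent_increase(wins_2002, wins_2003)
area_increase = percent_increase(1.0, area_ratio)

print(f"wins up {win_increase:.0f}%, icon area up {area_increase:.0f}%")
```

A bar chart with bars of fixed width avoids the problem, because bar area then grows in direct proportion to the value plotted.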

3.2. Boxplots

Below is an example of how a box plot (2) can summarize salary differences in Major League Baseball (MLB) for the 2003 season. In this case, outlying data points (Alex Rodriguez - Texas, Carlos Delgado - Toronto) draw attention away from the bars, and a plot without plotted outliers (an option in most statistical packages) can show more with respect to team salary quartiles.

Figure 4: Box plot for player salaries of MLB teams in 2003.
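
To show students how a statistical package decides which points to draw as outliers, the quartiles and fences can be computed by hand. The sketch below is a minimal illustration that assumes the common 1.5 x IQR convention (the paper does not say which rule its software used). The salary values are hypothetical, in millions of dollars, and are not taken from the MLB salary file.

```python
from statistics import quantiles

def tukey_outliers(values):
    """Return (outliers, (q1, q3)) using the 1.5 * IQR fence convention."""
    q1, _median, q3 = quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    low_fence, high_fence = q1 - 1.5 * iqr, q3 + 1.5 * iqr
    flagged = [v for v in values if v < low_fence or v > high_fence]
    return flagged, (q1, q3)

# HYPOTHETICAL team payroll (millions of dollars), one highly paid star.
salaries = [0.3, 0.3, 0.4, 0.5, 0.75, 1.0, 2.0, 3.5, 5.0, 22.0]

outliers, (q1, q3) = tukey_outliers(salaries)
print("quartiles:", q1, q3, "outliers:", outliers)
```

With one star salary far above the rest, the box itself occupies only a small part of the plot, which is the effect described above for the 2003 data.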

3.3. Graphical Summary for Basketball Games

Innovative plots have been developed for special sets of data. Westfall (1990) presented a simple, yet revealing graphical summary of a basketball game by plotting the point difference between the two teams' scores across time. In basketball, perhaps more than any other of the mainstream American sports, the game is difficult to summarize in a simple box score. Figure 5 below shows the summary of the February 1, 2004 NBA game between the Minnesota Timberwolves and the Philadelphia 76ers. Minnesota won the game 106-101. The box score, shown in Table 1, fails to summarize what happened in the game: Minnesota overcame an 18-point deficit and pulled ahead for the first time late in the game. Students can learn about the power of statistical graphics through such novel uses of charts. We note that this type of chart can also be used in a stochastics course to illustrate the idea of one-dimensional random walks with varying step sizes.

Table 1: Box score for NBA game between Minnesota and Philadelphia, 2/1/2004

(2) http://archive.ite.journal.informs.org/Vol5No1/KvamSokol/MLBSalaries.xls

Figure 5: Point difference in NBA game between Minnesota Timberwolves and Philadelphia 76ers, 2/1/2004.
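
The point-difference chart is a running sum of signed scoring events, which is why it doubles as a random walk with steps of size 1, 2 or 3. The sketch below builds such a path from a short, hypothetical play-by-play list (not the actual game log); the team labels and variable names are illustrative only.

```python
from itertools import accumulate

# HYPOTHETICAL scoring events in game order: (team, points scored).
events = [("PHI", 2), ("PHI", 3), ("MIN", 2), ("PHI", 2),
          ("MIN", 3), ("MIN", 3), ("MIN", 1)]

# Signed steps from Minnesota's point of view: positive when MIN scores.
steps = [pts if team == "MIN" else -pts for team, pts in events]

# Running point difference after each event (the quantity plotted over time).
path = list(accumulate(steps))

largest_deficit = -min(path)                              # worst MIN deficit
first_lead = next(i for i, d in enumerate(path) if d > 0)  # index of first MIN lead
final_margin = path[-1]

print(path, largest_deficit, first_lead, final_margin)
```

Plotting path against game time instead of event index gives the kind of chart shown in Figure 5, and the largest deficit and first lead can be read off directly.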

4. Teaching Simpson's Paradox with Sports Statistics

Simpson's paradox occurs with categorical data that has three variables when an association between two of the variables is consistent across all of the levels of the third variable, but is completely different if one aggregates over the third.

The paradox is best described using a pair of two-by-two contingency tables, and baseball presents many examples of Simpson's paradox. The three variables, each at two levels, are player (two batters), batting outcome (hit or out), and batting situation (runners in scoring position or not). Table 2 below shows one of 56 pairings in which this paradox took place in the 2003 MLB season. It shows how Dustan Mohr (Minnesota Twins and San Francisco Giants) had a higher batting average (hits per at-bat) than Darin Erstad (Anaheim Angels) in both batting situations when examined separately, but overall Erstad had a higher batting average than Mohr. The key to the paradox, of course, is that the proportions being compared are based on different sample sizes. In this case, Erstad appeared with runners in scoring position a smaller proportion of the time (20%) than did Mohr (28%). (The reason for the disparity in at-bats with runners in scoring position is that Mohr generally batted after more players who were likely to get on base; see Sokol (2003) for more discussion of the effect of batting order placement.) Other pairings that illustrate Simpson's paradox include Carl Everett vs. Hideki Matsui, Jose Reyes vs. Carlos Beltran, and Frank Thomas vs. Josh Phelps.

Table 2: Simpson's Paradox in MLB batting averages
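
The reversal is easy to reproduce numerically. The sketch below uses hypothetical hit and at-bat counts, chosen only to exhibit the pattern; they are not the Mohr and Erstad figures from Table 2. Batter A has the higher average in each situation, yet Batter B has the higher overall average because the two batters split their at-bats between the situations differently.

```python
from fractions import Fraction

# HYPOTHETICAL (hits, at-bats) by batting situation; not the data in Table 2.
batters = {
    "Batter A": {"scoring position": (20, 100), "other": (90, 300)},
    "Batter B": {"scoring position": (9, 50), "other": (104, 350)},
}

def average(hits, at_bats):
    """Batting average as an exact fraction."""
    return Fraction(hits, at_bats)

def overall(splits):
    """Batting average aggregated over both situations."""
    hits = sum(h for h, _ in splits.values())
    at_bats = sum(ab for _, ab in splits.values())
    return average(hits, at_bats)

for name, splits in batters.items():
    by_situation = {s: round(float(average(*c)), 3) for s, c in splits.items()}
    print(name, by_situation, "overall", round(float(overall(splits)), 3))

a, b = batters["Batter A"], batters["Batter B"]
a_wins_each_split = all(average(*a[s]) > average(*b[s]) for s in a)
b_wins_overall = overall(b) > overall(a)
print(a_wins_each_split, b_wins_overall)
```

Each overall average is a weighted mean of the two situational averages, with weights given by the share of at-bats in each situation, so unequal weights are enough to reverse the comparison.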


5. Regression Analysis

Student projects involving large sets of real data are a vital part of effective statistics classes. Projects are ideal for teaching linear regression because students have a high degree of freedom to select their own models to characterize the relationship between the response and the regressors.

One of the richest examples we have found for use in a statistics class is the problem of modeling a baseball player's value based on their individual statistics. For each player, batter or pitcher, there are dozens of potential regressor variables to consider in the model. The Microsoft Excel file MLB.xls (3) contains the 2003 MLB batting statistics for 336 major league batters and lists 23 basic statistics (more refined databases have many more statistics to consider). We used a "fantasy league value" as the response of interest. This fantasy league value, from The Sporting News 2003 Fantasy Players Guide, is related to player performance via statistics such as hits, RBI, runs, home runs, stolen bases, but the functional link cannot easily be characterized in a linear or nonlinear regression because many other variables influence the response. Other variables that influence fantasy value are age, team, position, injury history, and consensus findings from scouting reports. Up-to-date data sets can be obtained from many on-line sources such as ESPN.com (4). Historical data (every player, every season) can be downloaded from The Baseball Archive (5).

Students usually work in pairs, and with so many possible regression models, it is possible that no two groups arrive at the same model. As instructors, we could not help but notice that students who knew the most about baseball did not derive the best fitting model. Often, a pair of students knowing little about the nuances of the game would garner the best model (with a small number of regressors) relying entirely on empirical results of the data to guide their model selection. Some baseball fans, on the other hand, tended to interject regressors they subjectively preferred but were not optimal variables to add into the regression model. More advanced students can consider categorical (or nominal) inputs (e.g., player's team) to form general linear models, regression diagnostics, and variable transformations to improve model fit.

(3) http://archive.ite.journal.informs.org/Vol5No1/KvamSokol/MLB_Regressiondata2003.xls
(4) http://sports.espn.go.com/mlb/stats/batting?league=mlb
(5) http://www.baseball1.com/statistics/

6. Logistic Regression Analysis

Examples from sports can also be used to teach more advanced regression techniques such as logistic regression. Examples of logistic regressions are usually limited to biostatistics and other life sciences, but the following example, which examines the effects of home court advantage in college basketball, shows how statistics can be used to provide students with new insights into a familiar problem.

Many NCAA basketball conferences play full or partial home-and-home round-robin schedules, so that the conference teams play each other twice during the season, once at each school. Using data collected from the 1999-2000 season through the 2002-2003 season, we seek to answer the question "Given that team A beat team B at home (or on the road) by X points, how likely are they to win the return match on the road (or at home)?"

College students, especially those at a school like Georgia Tech with a major basketball program, often give a question like this much more passionate thought than it might deserve (especially when asked close to NCAA tournament selection time), so it might make capturing their attention an easier task. However, answering the question might not be as easy as they would expect, because the model is more complex than they first imagine: in addition to modeling binomial data by linking the success probability to the observed point difference, students observe grossly unequal sample sizes; that is, there are very few observations of extreme cases because few teams ever win or lose a game by more than 40 points. Figure 6 shows the observed probability of winning a road game given the previously observed point spread in the home game (blue bars) along with the estimated probability based on the logistic regression model (white bars) with
p(x) = exp(a + bx) / (1 + exp(a + bx)),

where x is the point spread in the home game and (a,b) are estimated as (-0.6228, 0.0292) with standard errors (0.0231, 0.0017). Figure 7 charts the number of observations collected at the respective point spreads. This data was collected from the daily college basketball scoreboard pages at Yahoo.com. Unfortunately, there is no easy way to download and parse all of the scores; we wrote our own C and Unix C-shell code, specialized for our system, to compile the data.

Figure 6: Observed win probabilities (blue) and logistic regression estimates (white) for home games at a given point
spread.

Figure 7: Number of games at various point spreads.
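
The fitted coefficients can be turned into numbers that students can inspect directly. The sketch below assumes the model has the standard logistic form p(x) = 1/(1 + exp(-(a + bx))), with x the margin by which a team won its home game and p(x) the estimated probability that it wins the return game on the road; the function and variable names are ours. It also computes the home margin at which the road game becomes a coin flip, and half of that margin, which is the implied home-court advantage if a team that is p points better is expected to win by p+h at home and by p-h on the road.

```python
from math import exp

# Estimates reported in the text.
A, B = -0.6228, 0.0292

def road_win_probability(home_margin):
    """Assumed logistic model: chance of winning the return road game."""
    z = A + B * home_margin
    return 1.0 / (1.0 + exp(-z))

# The probability equals 1/2 exactly where a + b*x = 0.
break_even_margin = -A / B

# At break-even the road margin p - h is zero, so the home margin p + h is 2h.
home_court_advantage = break_even_margin / 2

for margin in (0, 10, 20, 30, 40):
    print(margin, round(road_win_probability(margin), 3))
print(round(break_even_margin, 1), round(home_court_advantage, 1))
```

Under this assumed form the break-even margin is roughly 21 points, which corresponds to a home-court advantage of a little under 11 points.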

A benefit of using this example to teach statistics is that, in addition to learning more about statistics, students also see how properly applied statistical methods can give sports fans a new understanding of an old problem. In this case, they can see for themselves that home-court advantage, usually valued at 3-5 points (see, for example, Sagarin (2004)), is probably really worth about 10-12 points. The model indicates that a team needs to win a home game by 20-24 points, or twice the home-court advantage, in order to have an approximately 50% chance of beating that same opponent in a road game. (Mathematically, suppose h is the value of home-court advantage. If Team A is p points more skillful than Team B, then we would expect Team A to win at home by p+h points and have a p-h point-differential on the road. Therefore, when we observe (see Figure 6) that p-h is approximately zero (a 50% chance of winning the road game) when p+h is in the 20-24 range, it is easy to deduce that h must be between 10 and 12.)

The moral of the story? In addition to having learned (and gained an appreciation for) statistical methods, students now know it's worthwhile walking across campus to the basketball arena when that "unbeatable" opponent comes to play; with a 12-point advantage, who knows what might happen!

7. Regression Vs. Linear Programming

Statistical regression methods are also often used to obtain relative ratings of sports teams. In statistics classes (and in optimization classes), power-rating (a widely-used measure for predicting a game's point differential; see, for example, Sagarin (2003)) examples from sports can help teach students this use of regression as well. For example, in college football many conferences are too large for full round-robin play. The conference winner is still determined by won-lost record within the conference, but some teams play more difficult schedules than others. In the 1999 Big Ten example below, for example, students might wonder whether Wisconsin's easier schedule led to their finishing with a better record than Michigan and/or Michigan State.

Table 3: Results of play in the Big Ten Conference, 1999 (winner's score is listed first).

This is an interesting example that makes students think about the relative benefits of different statistical models. If the power ratings are defined using a linear programming approach (where the error in a prediction is defined as its absolute difference from the observation), then Michigan and Michigan State are much closer to Wisconsin. On the other hand, if the power ratings are defined using a linear regression with the error defined as the squared difference, then Wisconsin has a much larger advantage.


Table 4: Power ratings calculated using two simple regression models.
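
One way to see why the two error definitions can rank teams differently is to look at the simplest possible case. The sketch below is a deliberate simplification and not the appendix model: it assumes only two teams that meet several times, so the single unknown is the rating difference d. The margins are hypothetical. Minimizing squared error over d gives the mean margin, while minimizing absolute error gives the median margin, so one lopsided score moves the least-squares rating far more.

```python
from statistics import mean, median

# HYPOTHETICAL margins (team A score minus team B score); the last is a blowout.
margins = [3, 7, -4, 10, 45]

def squared_error(d):
    """Least-squares objective for a predicted margin d."""
    return sum((m - d) ** 2 for m in margins)

def absolute_error(d):
    """Absolute-deviation objective (the one a linear program would minimize)."""
    return sum(abs(m - d) for m in margins)

# Brute-force search over a grid of candidate rating differences.
candidates = [x / 10 for x in range(-100, 501)]
best_ls = min(candidates, key=squared_error)
best_lad = min(candidates, key=absolute_error)

print(best_ls, mean(margins))     # least squares agrees with the mean
print(best_lad, median(margins))  # absolute deviations agree with the median
```

With many teams the same contrast holds in spirit: squared error gives large weight to blowouts, whereas absolute error treats every point of prediction error the same.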

We describe both of these models in more detail in the appendix.

8. Classroom Experience

In this section, we describe our experiences with using these examples in the classroom. Because our experience covers multiple courses, we first describe the courses in which we have used this material, and where those courses fit into the curriculum.

All of the courses in which we have used this material are in the School of Industrial and Systems Engineering (ISyE) at Georgia Tech. The undergraduate courses are both required for and restricted to ISyE majors, so we have a relatively homogeneous set of students. The graduate-level course is required for ISyE students who are pursuing a Master's degree in Operations Research (MSOR), and is taken by most ISyE students who are pursuing a Master's degree in Industrial Engineering (MSIE). The course also attracts first-year ISyE PhD students who may not have seen mathematical programming in their undergraduate curricula, as well as Master's and PhD students from other disciplines whose research relates to optimization.

We have used this material in the following set of courses:

ISyE 2027, Probability with Applications: A sophomore-level course covering conditional probability, probability distributions and Poisson processes. Basic calculus is required.

ISyE 2028, Basic Statistical Methods: A sophomore-level course covering parameter estimation, statistical decision-making, and analysis and modeling of relationships between variables. Students taking this course have already seen basic calculus and probability, but may not have taken any other ISyE courses.

ISyE 3039, Methods of Quality Improvement: A junior-level course covering design of experiments, measurement, statistical process analysis and control, and acceptance sampling. Students in this course must have already taken statistics (see ISyE 2028 above) and stochastics.

ISyE 4231, Engineering Optimization: A senior-level course covering optimization modeling and solution techniques, mathematical programming, and network and graph models. Students in this course are usually near the end of their ISyE curriculum, all have taken ISyE 2028, and most have taken ISyE 3039.

ISyE 6669, Deterministic Optimization: A Master's-level course covering linear, discrete, and nonlinear optimization models, algorithms, and computations. The students in this course have a nominal requirement of ISyE 4231 (or an equivalent course from their undergraduate institution), but many or most actually take the course without having taken the prerequisite. This course also attracts Master's and PhD students from other disciplines, giving us a very diverse set of student backgrounds.

ISyE 6739, Basic Statistical Methods: A Master's-level (service) course intended for graduate students who want an overview of basic tools for probability and statistics, and covers most of the material in courses ISyE 2027-2028.

In all of the undergraduate courses, most of our students are American and have at least a basic understanding of the sports involved. Even so, there are always some who are unfamiliar with even the basic
rules of the games, either from lack of exposure (in the case of many foreign exchange students) or lack of interest. Interestingly, as we noted in Section 5, those students who are most familiar with the sports in question are not necessarily the ones who do the best analysis. Often, they bring their own learned biases to the analysis, whereas students who are unfamiliar with the application can approach the problem with a fresh perspective. (We note that we have observed the same phenomenon with other, non-sports applications as well; for example, students who have co-op or internship experience in logistics sometimes get bogged down in minor details, e.g., where the truck driver will stop for lunch, and miss the overall analytical benefit of using a mathematical model.) Moreover, even in the graduate course where the majority of students may not be from the US, students have no difficulty understanding the concept of the underlying model of sports or games, just as they can quickly pick up models of factories despite generally having little to no experience inside of one.

Before enrolling in the probability and statistics courses, students have taken a year of basic calculus. Fundamental probability is covered in ISyE 2027 (and ISyE 6739), where the lottery example is used to illustrate basic counting problems. Statistical graphics are a core subject for ISyE 2028, and also introduced in more abbreviated form in ISyE 6739. Both of these courses finish the term with a regression project, and every usage of the MLB data set has proved successful. More than textbook data sets, the MLB example introduces students to the gray issues of over-fitting versus parsimony. In final course evaluations, the baseball project receives more praise than any other specific project or homework assignment.

In the optimization courses, we teach students who are further along in their curriculum; almost all of the students in ISyE 4231 are seniors, and students in ISyE 6669 are all Master's or PhD-level. Therefore, the more advanced examples described in this paper are useful for two reasons. First, the students are more advanced and can understand more complex models. Second, students in both courses will have previously seen statistical concepts such as regression: undergraduates will have already taken ISyE 2028, and graduate students should have seen regression as undergraduates. Therefore, the football power-rating example of Section 7 has the benefit of showing linkage between optimization and statistics. From this exercise, students learn that regression can be formulated as a convex quadratic program (standard least squares) or a linear program (absolute-error fitting) (see Appendix), and also that statistical parameter estimation in general is a type of optimization problem. We find that the students enjoy this type of example quite a bit, because it makes the curriculum seem more unified rather than a set of unrelated methodologies.

9. Conclusion

In this paper, we have described several ways in which introductory and advanced statistical concepts can be illustrated using examples from sports. Based on student feedback, we find that most students enjoy sports examples. The fact that the abstract concepts they learn can be applied in recreational ways often gets them thinking about other real-life situations, not just traditional industrial engineering applications, where statistics can be useful. In fact, when local television and radio stations reported on the success of a predictive model (Kvam and Sokol, 2004) we created based partially on the logistic regression example of Section 6, we even had several students approach us asking if we would supervise them in independent research on these topics.

Overall, we have had a lot of success using these and other sports examples in the classroom. We find that students are very receptive to the application of statistics to sports, even if they are not sports fans themselves, and that they enjoy seeing how the material they learn can be applied in settings other than those of traditional industrial engineering. Educationally, we have observed that the students' enjoyment leads to increased interest in the material and therefore, we hope, increased learning.

References Samaniego, F.J. and Watnik, M.R. (1997), "The Separa-


tion Principle in Linear Regression," Journal of
Albert, J. (2002), "A Baseball Statistics Course," Journal Statistics Education, Vol. 5, No. 3.
of Statistics Education, Vol. 10, No. 2.
Simonoff, J.S. (1998), "Move Over, Roger Maris:
Cochran, J. (2002), "Data Management, Exploratory Breaking Baseball's Most Famous Record,"
Data Analysis, and Regression Analysis with Journal of Statistics Education, Vol. 6, No. 3.
1969-2000 Major League Baseball Attendance,"
Journal of Statistics Education, Vol. 10, No. 2. Sokol, J.S. (2003), "A Robust Heuristic for Batting Order
Optimization Under Uncertainty," Journal of
Crowder, M., Dixon, M., Ledford, A. and Robinson, Heuristics, Vol. 9, pp. 353-370.
M. (2002), "Dynamic Modelling and Prediction
of English League Football Matches for Bet- Westfall, P. H. (1990) "Graphical Presentation of a
ting," The Statistician, Vol. 51, No. 2, pp. 157- Basketball Game," The American Statistician,
168. Vol. 44, No. 1, pp. 35-38.

Gill, P.S. (2000), "Late-game Reversals in Professional


Basketball, Football, and Hockey," The Ameri-
can Statistician, Vol. 54, No. 2, pp. 94-99.
Harville, D. A. and Smith, M. H. (1994), "The Home
Court Advantage: How large is it, and does it
vary from team to team?," The American
Statistician, Vol. 48, No. 1, pp. 22-28.
Hayter, Anthony J. (2002), Probability and Statistics
for Engineers and Scientists, 2nd edition.
Duxbury Press.
Kvam, P. and Sokol, J.S. (2004), "A Successful Logistic
Regression/Markov Chain Model for NCAA
Basketball," Working paper, School of Industri-
al and Systems Engineering, Georgia Institute
of Technology.
National Basketball Association (2003a), Evolution of
the Draft Lottery,
http://www.nba.com/history/draft_evolu-
tion.html
National Basketball Association (2003b), Year by Year
Lottery Picks,
http://www.nba.com/history/lottery_picks.html
National Basketball Association (2003c), Year by Year
Lottery Probabilities,
http://www.nba.com/history/lottery_probabil-
ities.html
Sagarin, J. (2003), Jeff Sagarin NCAA Football Ratings,
http://www.usato-
day.com/sports/sagarin/fbt03.htm
Sagarin, J. (2004), Jeff Sagarin NCAA Basketball Rat-
ings,
http://www.usato-
day.com/sports/sagarin/bkt0304.htm


Appendix
1. Power Rating Models

In Section 7, we refer to two different statistical models that can be used to determine power ratings. In this
appendix, we describe each model mathematically. For both models, we define the following notation:

$G$       set of games played, where each game $g \in G$ is an unordered set of teams $\{i,j\}$.
$p_{ig}$  points scored by team $i$ in game $g$
$r_i$     power rating assigned to team $i$

Both models use the standard technique of minimizing a function of the total error in their predictions. We
define the error $e_g$ for a single game $g = \{i,j\}$ to be the difference in predicted point spread and actual point
spread, or $e_g = (r_i - r_j) - (p_{ig} - p_{jg})$.

The only difference between the two models is that one finds ratings that minimize the total absolute error
$\sum_{g \in G} |e_g|$ while the other finds ratings that minimize the total squared error $\sum_{g \in G} e_g^2$.

1.1. Linear Programming Using Absolute Error

We can formulate the problem of finding ratings to minimize the total absolute error as a linear program:

Minimize $\sum_{g \in G} a_g$ (1)

Subject to $e_g = (r_i - r_j) - (p_{ig} - p_{jg}) \quad \forall g = \{i,j\} \in G$, (2)

$a_g \ge e_g \quad \forall g = \{i,j\} \in G$, (3)

$a_g \ge -e_g \quad \forall g = \{i,j\} \in G$. (4)

In this model, the variables $a_g$ denote the absolute error for game $g$. Constraints (3) and (4) ensure that $a_g \ge |e_g|$
for each game $g$, while the minimization objective ensures that at least one of constraints (3) and (4) will be binding at
optimality for each game $g$, and thus $a_g = |e_g|$.
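
As an illustration only, the linear program (1)-(4) can be sketched in Python with SciPy's `linprog`. The function name `lad_power_ratings`, the `(i, j, points_i, points_j)` game format, and the three-game schedule are hypothetical and are not taken from the paper. The sketch substitutes (2) into (3) and (4), and it pins $r_0 = 0$ as an added assumption, because the ratings are otherwise determined only up to an additive constant.

```python
import numpy as np
from scipy.optimize import linprog


def lad_power_ratings(games, n_teams):
    """Ratings minimizing total absolute error, following LP (1)-(4).

    games: list of (i, j, points_i, points_j) tuples (hypothetical format).
    Returns (ratings, total_absolute_error).
    """
    m = len(games)
    # Decision vector: [r_0, ..., r_{n-1}, a_0, ..., a_{m-1}].
    cost = np.concatenate([np.zeros(n_teams), np.ones(m)])  # objective (1)
    A_ub = np.zeros((2 * m, n_teams + m))
    b_ub = np.zeros(2 * m)
    for g, (i, j, p_i, p_j) in enumerate(games):
        spread = p_i - p_j
        # (3) with (2) substituted: a_g >= e_g  <=>  (r_i - r_j) - a_g <= spread
        A_ub[2 * g, [i, j, n_teams + g]] = [1.0, -1.0, -1.0]
        b_ub[2 * g] = spread
        # (4) with (2) substituted: a_g >= -e_g  <=>  -(r_i - r_j) - a_g <= -spread
        A_ub[2 * g + 1, [i, j, n_teams + g]] = [-1.0, 1.0, -1.0]
        b_ub[2 * g + 1] = -spread
    # Assumption (not in the paper): fix r_0 = 0 to remove the additive freedom.
    # The other ratings are free in sign; each a_g is nonnegative.
    bounds = [(0, 0)] + [(None, None)] * (n_teams - 1) + [(0, None)] * m
    result = linprog(cost, A_ub=A_ub, b_ub=b_ub, bounds=bounds, method="highs")
    if not result.success:
        raise RuntimeError(result.message)
    return result.x[:n_teams], result.fun


# Made-up scores: team 0 beats team 1 by 7, team 1 beats team 2 by 3,
# and team 0 beats team 2 by 13.
games = [(0, 1, 27, 20), (1, 2, 24, 21), (0, 2, 30, 17)]
ratings, total_abs_error = lad_power_ratings(games, n_teams=3)
print(ratings, total_abs_error)
```

In this toy schedule the three spreads are mutually inconsistent (7 + 3 is not 13), so no set of ratings fits every game exactly; the smallest achievable total absolute error is 3. Several different rating vectors attain that optimum, which is typical of absolute-error fits.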

1.2. Mathematical Programming Using Squared Error

We can formulate the standard regression problem of minimizing the squared error as the following
mathematical program:

Minimize $\sum_{g \in G} e_g^2$ (5)

Subject to $e_g = (r_i - r_j) - (p_{ig} - p_{jg}) \quad \forall g = \{i,j\} \in G$. (6)

This mathematical program is similar to the linear program in Section 1.1. In fact, it is well known that we can
solve this problem by substituting (6) into the objective for $e_g$, setting the partial derivatives taken with respect to
each $r_i$ equal to zero, and solving the resulting system of linear equations; therefore, this model too can be
optimized using linear programming software.
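
The substitution-and-derivative step can be sketched with NumPy as follows. This is a minimal illustration, not the authors' implementation: the function name `ls_power_ratings`, the `(i, j, points_i, points_j)` game format, and the three-game schedule are hypothetical. Writing each game as a row of a matrix $X$ (with $+1$ for team $i$ and $-1$ for team $j$) and the actual spreads as a vector $y$, setting the partial derivatives to zero gives the linear system $X^T X r = X^T y$. The sketch pins $r_0 = 0$ as an added assumption so that this system has a unique solution, which also requires every team to be linked to team 0 through some chain of games.

```python
import numpy as np


def ls_power_ratings(games, n_teams):
    """Ratings minimizing total squared error, following (5)-(6).

    games: list of (i, j, points_i, points_j) tuples (hypothetical format).
    Returns (ratings, total_squared_error).
    """
    m = len(games)
    X = np.zeros((m, n_teams))
    y = np.zeros(m)
    for g, (i, j, p_i, p_j) in enumerate(games):
        X[g, i], X[g, j] = 1.0, -1.0  # predicted spread is r_i - r_j
        y[g] = p_i - p_j              # actual spread
    # Assumption (not in the paper): pin r_0 = 0 by dropping its column,
    # since ratings are otherwise determined only up to an additive constant.
    X_free = X[:, 1:]
    # Zero partial derivatives give the linear system (X^T X) r = X^T y.
    r_free = np.linalg.solve(X_free.T @ X_free, X_free.T @ y)
    ratings = np.concatenate([[0.0], r_free])
    sse = float(np.sum((X @ ratings - y) ** 2))
    return ratings, sse


# Made-up scores: team 0 beats team 1 by 7, team 1 beats team 2 by 3,
# and team 0 beats team 2 by 13.
games = [(0, 1, 27, 20), (1, 2, 24, 21), (0, 2, 30, 17)]
ls_ratings, sse = ls_power_ratings(games, n_teams=3)
print(ls_ratings, sse)
```

For this toy schedule the system has the single solution $r = (0, -8, -12)$ with total squared error 3: the squared-error fit spreads the inconsistency evenly across the three games, each of which is mispredicted by one point.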
