CHAPTER 1 Examining Distributions

Moore-3620020 psbe August 16, 2010 23:30
Photo caption: An iPod can hold thousands of songs. Apple has developed the iTunes playlist
to organize data about the songs on an iPod. Example 1.1 discusses how these data are
organized. (Photo: CORBIS)
CHAPTER OUTLINE
1.1 Displaying Distributions with Graphs
1.2 Describing Distributions with Numbers
1.3 Density Curves and the Normal Distributions

Introduction

Statistics is the science of data. Data are numerical facts. In this chapter, we
will master the art of examining data.
A statistical analysis starts with a set of data. We construct a set of data
by first deciding which cases or units we want to study. For each case, we
record information about characteristics that we call variables.
Cases, Labels, Variables, and Values
Cases are the objects described by a set of data. Cases may be customers,
companies, subjects in a study, or other objects.
A label is a special variable used in some data sets to distinguish the
different cases.
A variable is a characteristic of a case.
Different cases can have different values for the variables.
EXAMPLE 1.1 Over 5 Billion Sold

Apple's music-related products and services generated $1.05 billion in the first quarter
of 2008 and accounted for 13% of the company's revenue. Since Apple started
marketing iTunes in 2003, it has sold over 5 billion songs. Let's take a look at this
remarkable product. Figure 1.1 is part of an iTunes playlist named PSBE. The four songs
shown are cases. They are numbered from 1 to 4 in the first column. These numbers are
the labels that distinguish the four songs. The following five columns give the name (of
the song), time (the length of time it takes to play the song), artist, album, and genre.
FIGURE 1.1 Part of an iTunes playlist, for Example 1.1.
Some variables, like the name of a song and the artist, simply place cases into
categories. Others, like the length of a song, take numerical values for which we can do
arithmetic. It makes sense to give an average length of time for a collection of songs, but
it does not make sense to give an average album. We can, however, count the numbers
of songs for different albums, and we can do arithmetic with these counts.
Categorical and Quantitative Variables

A categorical variable places a case into one of several groups or categories.
A quantitative variable takes numerical values for which arithmetic operations such
as adding and averaging make sense.
The distribution of a variable tells us what values it takes and how often it takes these
values.
EXAMPLE 1.2 Categorical and Quantitative Variables in the iTunes Playlist

The PSBE iTunes playlist contains five variables. These are the name, time, artist, album, and
genre. The time is a quantitative variable. Name, artist, album, and genre are categorical variables.
An appropriate label for your cases should be chosen carefully. In our iTunes example,
a natural choice of a label would be the name of the song. However, if you have
more than one artist performing the same song, or the same artist performing the same
song on different albums, then the name of the song would not uniquely label each of
the songs in your playlist.
A quantitative variable such as the time in the iTunes playlist requires some special
attention before we can do arithmetic with its values. The first song in the playlist has
time equal to 3:29; that is, 3 minutes and 29 seconds. To do arithmetic with this variable,
we should first convert all of the values so that they have a single unit of measurement.
We could convert to seconds; 3 minutes is 180 seconds, so the total time is 180 + 29, or
209 seconds. An alternative would be to convert to minutes; 29 seconds is 0.483 minutes,
so the time calculated in this way is 3.483 minutes.
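The conversion described above can be sketched in a few lines of Python. This is only an illustration: the function names `to_seconds` and `to_minutes` are our own, and we assume the time is stored as text in the form minutes:seconds.

```python
def to_seconds(time_text):
    """Convert a 'minutes:seconds' string such as '3:29' to total seconds."""
    minutes, seconds = time_text.split(":")
    return int(minutes) * 60 + int(seconds)


def to_minutes(time_text):
    """Convert a 'minutes:seconds' string to minutes as a decimal number."""
    return to_seconds(time_text) / 60


first_song = "3:29"  # the first song in the playlist
print(to_seconds(first_song))            # 209
print(round(to_minutes(first_song), 3))  # 3.483
```

Either unit works for arithmetic, as long as every value in the column uses the same one.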
APPLY YOUR KNOWLEDGE
1.1 Time in the iTunes playlist. In the iTunes playlist, do you prefer to convert the
time to seconds or minutes? Give a reason for your answer.
In practice, any set of data is accompanied by background information that helps
us understand the data. When you plan a statistical study or explore data from someone
else's work, ask yourself the following questions:
1. Who? What cases do the data describe? How many cases appear in the data?
2. What? How many variables do the data contain? What are the exact definitions of
these variables? In what unit of measurement is each variable recorded?
3. Why? What purpose do the data have? Do we hope to answer some specific
questions? Do we want to draw conclusions about cases other than the ones we
actually have data for? Are the variables that are recorded suitable for the intended
purpose?
EXAMPLE 1.3 Data for Students in a Statistics Class

Figure 1.2 shows part of a data set for students enrolled in an introductory statistics class. Each
row gives data on one student. The values for the different variables are in the columns. This data
set has eight variables. ID is an identifier, or label, for each student. Exam1, Exam2, Homework,
Final, and Project give the points earned, out of a total of 100 possible, for each of these course
requirements. Final grades are based on a possible 200 points for each exam and the final, 300 points
for Homework, and 100 points for Project. TotalPoints is the variable that gives the composite
score. It is computed by adding 2 times Exam1, Exam2, and Final, 3 times Homework, and 1 times
Project. Grade is the grade earned in the course. This instructor used cutoffs of 900, 800, 700, etc.
for the letter grades.
Microsoft Excel
    A    B      C      D         E      F        G            H
1   ID   Exam1  Exam2  Homework  Final  Project  TotalPoints  Grade
2   101  89     94     88        87     95       899          A
3   102  78     84     90        89     94       866          B
4   103  71     80     75        79     95       780          C
5   104  95     98     97        96     93       962          A
6   105  79     88     85        88     96       861          B
FIGURE 1.2 Spreadsheet for Example 1.3.
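The TotalPoints rule in Example 1.3 can be checked against the spreadsheet with a short Python sketch. The list-of-dictionaries layout and the function name `total_points` are our own choices for illustration, not part of the original spreadsheet.

```python
# Rows copied from Figure 1.2 (the TotalPoints and Grade columns are omitted).
students = [
    {"ID": 101, "Exam1": 89, "Exam2": 94, "Homework": 88, "Final": 87, "Project": 95},
    {"ID": 102, "Exam1": 78, "Exam2": 84, "Homework": 90, "Final": 89, "Project": 94},
    {"ID": 103, "Exam1": 71, "Exam2": 80, "Homework": 75, "Final": 79, "Project": 95},
    {"ID": 104, "Exam1": 95, "Exam2": 98, "Homework": 97, "Final": 96, "Project": 93},
    {"ID": 105, "Exam1": 79, "Exam2": 88, "Homework": 85, "Final": 88, "Project": 96},
]


def total_points(row):
    """2 times each exam and the final, 3 times Homework, 1 times Project."""
    exams = row["Exam1"] + row["Exam2"] + row["Final"]
    return 2 * exams + 3 * row["Homework"] + row["Project"]


for row in students:
    print(row["ID"], total_points(row))
```

The computed totals reproduce the TotalPoints column of Figure 1.2 (899, 866, 780, 962, 861).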
1.2 Who, what, and why for the statistics class data. Answer the Who, What, and
Why questions for the statistics class data set.
1.3 Read the spreadsheet. Refer to Figure 1.2. Give the values of the variables
Exam1, Exam2, and Final for the student with ID equal to 103.
1.4 Calculate the grade. A student whose data do not appear on the spreadsheet
scored 88 on Exam1, 85 on Exam2, 77 for Homework, 90 on the Final, and 80 on the
Project. Find TotalPoints for this student and give the grade earned.
The display in Figure 1.2 is from an Excel spreadsheet. Spreadsheets are very useful
for doing the kind of simple computations that you did in Exercise 1.4. You can type in
a formula and have the same computation performed for each row.
Note that the names we have chosen for the variables in our spreadsheet do not
have spaces. For example, we could have used the name Exam 1 for the first exam
score rather than Exam1. In some statistical software packages, however, spaces are not
allowed in variable names. For this reason, when creating spreadsheets for eventual use
with statistical software, it is best to avoid spaces in variable names. Another convention
is to use an underscore (_) where you would normally use a space. For our data set, we
could use Exam_1, Exam_2, and Final_Exam.
EXAMPLE 1.4 Cases and Variables for the Statistics Class Data
The data set in Figure 1.2 was constructed to keep track of the grades for students in an introductory
statistics course. The cases are the students in the class. There are 8 variables in this data set. These
include an identifier for each student and scores for the various course requirements. There are no
units of measurement for ID and grade; they are categorical variables. The other variables all are
measured in points; since it makes sense to do arithmetic with these values, these variables are
quantitative variables.
EXAMPLE 1.5 Statistics Class Data for a Different Purpose

Suppose the data for the students in the introductory statistics class were also to be used to study
relationships between student characteristics and success in the course. For this purpose, we might
want to use a data set that includes other variables such as Gender, PrevStat (whether or not the
student has taken a statistics course previously), and Year (student classification as first, second,
third, or fourth year). ID is a categorical variable, total points is a quantitative variable, and the
remaining variables are all categorical.
In our example, the possible values for the grade variable are A, B, C, D, and F.
When computing grade point averages, many colleges and universities translate these
letter grades into numbers using A = 4, B = 3, C = 2, D = 1, and F = 0. The transformed
variable with numeric values is considered to be quantitative because we can average
the numerical values across different courses to obtain a grade point average.
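As a small illustration of this transformation, the sketch below maps letter grades to numbers and averages them. The list of grades is hypothetical, and the average is unweighted; many institutions also weight each course by its credit hours, which this sketch does not attempt.

```python
# The translation given in the text: A = 4, B = 3, C = 2, D = 1, F = 0.
GRADE_POINTS = {"A": 4, "B": 3, "C": 2, "D": 1, "F": 0}


def grade_point_average(letter_grades):
    """Unweighted average of the numeric values of the letter grades."""
    points = [GRADE_POINTS[grade] for grade in letter_grades]
    return sum(points) / len(points)


# Hypothetical grades for one student in four courses.
print(grade_point_average(["A", "B", "B", "C"]))  # 3.0
```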
Sometimes, experts argue about numerical scales such as this. They ask whether
or not the difference between an A and a B is the same as the difference between a D
and an F. Similarly, many questionnaires ask people to respond on a 1 to 5 scale with 1
representing strongly agree, 2 representing agree, etc. Again we could ask whether or not
the five possible values for this scale are equally spaced in some sense. From a practical
point of view, the averages that can be computed when we convert categorical scales such
as these to numerical values frequently provide a very useful way to summarize data.
1.5 Apartment rentals for students. A data set lists apartments available for students
to rent. Information provided includes the monthly rent, whether or not cable is included
free of charge, whether or not pets are allowed, the number of bedrooms, and the
distance to the campus. Describe the cases in the data set, give the number of variables,
and specify whether each variable is categorical or quantitative.
Knowledge of the context of data includes an understanding of the variables that
are recorded. Often the variables in a statistical study are easy to understand: height in
centimeters, study time in minutes, and so on. But each area of work also has its own
special variables. A psychologist uses the Minnesota Multiphasic Personality Inventory
(MMPI), and a physical fitness expert measures VO2 max, the volume of oxygen
consumed per minute while exercising at your maximum capacity. Both of these variables
are measured with special instruments. VO2 max is measured by exercising while
breathing into a mouthpiece connected to an apparatus that measures oxygen consumed.
Scores on the MMPI are based on a long questionnaire, which is also an instrument. Part
of mastering your field of work is learning what variables are important and how they are
best measured. Because details of particular measurements usually require knowledge
of the particular field of study, we will say little about them.
Be sure that each variable really does measure what you want it to. A poor choice
of variables can lead to misleading conclusions. Often, for example, the rate at which
something occurs is a more meaningful measure than a simple count of occurrences.
EXAMPLE 1.6 Insurance for Passenger Cars and Motorcycles

Should insurance rates be higher for passenger cars than for motorcycles or should they be lower?
Part of the answer to this question can be found by examining accidents for these two types of
vehicles. The government's Fatal Accident Reporting System says that 22,856 passenger cars
were involved in fatal accidents in 2007. Only 5306 motorcycles had fatal accidents that year.1
Does this mean that motorcycles are safer than cars? Not at all: there are many more cars than
motorcycles, so we expect cars to have a higher count of fatal accidents.
A better measure of the dangers of driving is a rate, the number of fatal accidents divided by
the number of vehicles on the road. In 2007, passenger cars had about 16.6 fatal accidents for each
100,000 vehicles registered. There were about 74.3 fatal accidents for each 100,000 motorcycles
registered. The rate for motorcycles is more than four times the rate for cars. Motorcycles are, as
we might guess, much more dangerous than cars.
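The arithmetic behind a rate can be sketched as follows. The counts and rates are the ones quoted in Example 1.6. The numbers of registered vehicles are not given in the example, so the sketch back-calculates them from the quoted figures; treat those implied values as rough approximations, not reported data.

```python
def rate_per_100k(count, registered):
    """Fatal accidents per 100,000 registered vehicles."""
    return count / registered * 100_000


# Figures quoted in Example 1.6 (2007).
car_count, car_rate = 22_856, 16.6
motorcycle_count, motorcycle_rate = 5_306, 74.3

# Implied registrations, back-calculated from count and rate (an assumption).
implied_cars = car_count / car_rate * 100_000
implied_motorcycles = motorcycle_count / motorcycle_rate * 100_000
print(f"Implied registered cars: {implied_cars:,.0f}")
print(f"Implied registered motorcycles: {implied_motorcycles:,.0f}")

# Comparing rates rather than counts reverses the impression given by the counts.
ratio = motorcycle_rate / car_rate
print(f"Motorcycle rate is about {ratio:.1f} times the car rate")
```

The counts alone favor motorcycles by a factor of about four, while the rates point the other way by a similar factor.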
1.1 Displaying Distributions with Graphs

Statistical tools and ideas help us examine data to describe their main features. This
examination is called exploratory data analysis. Like an explorer crossing unknown
lands, we want first to simply describe what we see. Here are two basic strategies that
help us organize our exploration of a set of data:
Begin by examining each variable by itself. Then move on to study the relationships
among the variables.
Begin with a graph or graphs. Then add numerical summaries of specific aspects of
the data.
We will follow these principles in organizing our learning. This chapter presents methods
for describing a single variable. We study relationships among two or more variables in
Chapter 2. Within each chapter, we begin with graphical displays, then add numerical
summaries for a more complete description.
Categorical variables: bar graphs and pie charts

The values of a categorical variable are labels for the categories, such as Yes and No.
The distribution of a categorical variable lists the categories and gives either the count
or the percent of cases that fall in each category.
EXAMPLE 1.7 GPS Market Share

DATA FILE: GPS
The Global Positioning System (GPS) uses satellites to transmit microwave signals that enable
GPS receivers to determine the exact location of the receiver. Here are the market shares for the
major GPS receiver brands sold in the United States.2
Company    Percent
Garmin     47
TomTom     19
Magellan   17
Mio         7
Other      10
Company is the categorical variable in this example, and the values of this variable are the names
of the companies that provide GPS receivers in this market.
Note that the last value of the variable Company is Other, which includes all
receivers sold by companies other than the four listed by name. For data sets that have a
large number of values for a categorical variable, we often create a category such as this
that includes categories that have relatively small counts or percents. Careful judgment is
needed when doing this. You don't want to cover up some important piece of information
contained in the data by combining data in this way.
When we look at the GPS market share data set, we see that Garmin dominates the
market with almost half of the sales. Graphical methods make this information and
other characteristics of the data easy to see. We now examine two graphical
ways to do this.
EXAMPLE 1.8 Bar Graph for the GPS Market Share Data
DATA FILE: GPS
Figure 1.3 displays the GPS market share data using a bar graph. The heights of the five bars
show the market shares for the four companies and the Other category.
FIGURE 1.3 Bar graph for the GPS data in Example 1.8. (Vertical axis: Market Share, 0 to 50.
Bars, left to right: Garmin, TomTom, Magellan, Mio, Other.)
The categories in a bar graph can be put in any order. In Figure 1.3, we ordered
the companies based on their market share, with the Other category coming last. For
other data sets, an alphabetical ordering or some other arrangement might produce a
more useful graphical display. You should always consider the best way to order the
values of the categorical variable in a bar graph. Choose an ordering that will be useful
to you. If you are uncertain, ask a friend whether your choice communicates what you
expect.
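The ordering used in Figure 1.3 (largest share first, with Other last) can be sketched in Python. The function name `ordered_categories` and the text-only bars are our own illustration; a plotting package would normally draw the graph.

```python
# Market shares (percent) from Example 1.7.
market_share = {"Garmin": 47, "TomTom": 19, "Magellan": 17, "Mio": 7, "Other": 10}


def ordered_categories(shares, last="Other"):
    """Sort categories from largest to smallest share, keeping `last` at the end."""
    named = sorted((name for name in shares if name != last),
                   key=lambda name: shares[name], reverse=True)
    return named + ([last] if last in shares else [])


# A rough text version of the bar graph: one '#' per percentage point.
for company in ordered_categories(market_share):
    share = market_share[company]
    print(f"{company:<9}{'#' * share} {share}%")
```

Changing the sort key, for example to alphabetical order, gives an alternative arrangement to compare.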
EXAMPLE 1.9 Pie Chart for the GPS Market Share Data
DATA FILE: GPS
The pie chart in Figure 1.4 helps us see what part of the whole each group forms. Even if we did
not include the percents, it would be very easy to see that Garmin has about half of the market.
FIGURE 1.4 Pie chart for the GPS data in Example 1.9. (Title: Market Share. Slices: Garmin 47%,
TomTom 19%, Magellan 17%, Mio 7%, Other 10%.)
To make a pie chart, you must include all the categories that make up a whole. A
category such as Other in this example can be used, but the sum of the percents for all
of the categories should be 100%.
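The requirement that the categories make up a whole can be checked before a pie chart is drawn. The sketch below is one way to do this; the function name `pie_angles` is our own, and the percents for the three majors are hypothetical values used only to show the check failing.

```python
def pie_angles(percents):
    """Return the slice angle in degrees for each category.

    Raises ValueError when the percents do not add up to 100, because
    the categories then do not make up a whole.
    """
    total = sum(percents.values())
    if total != 100:
        raise ValueError(f"percents sum to {total}, not 100")
    return {name: pct * 360 / 100 for name, pct in percents.items()}


gps = {"Garmin": 47, "TomTom": 19, "Magellan": 17, "Mio": 7, "Other": 10}
print(pie_angles(gps))

# Hypothetical percents of students in three majors: not a whole.
majors = {"Biology": 10, "Business": 25, "Political science": 8}
try:
    pie_angles(majors)
except ValueError as error:
    print("Cannot draw a pie chart:", error)
```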
Bar graphs are more flexible. For example, you can use a bar graph to compare the
numbers of students at your college majoring in biology, business, and political science.
A pie chart cannot make this comparison, because not all students fall into one of these
three majors.
We use graphical displays to help us learn things from data. Here is another example.
EXAMPLE 1.10 The Cost Is $164 Billion!
DATA FILE: CRASHES
Auto accidents cost $164 billion each year.3 How can this enormous burden on the economy
be reduced? Let's look at some data.4 Figure 1.5 is a bar graph that gives the percents of auto
accidents for each day of the week. What do we learn from this graph? The highest percent is
on Saturday, about 17%, and the lowest is on Monday, about 10%. If we were to seek government
funding for a program to reduce accidents, we might do some research on the Saturday
accidents.
FIGURE 1.5 Bar graph for the automobile accident data, for Example 1.10. (Vertical axis:
Percent, 0 to 18. Bars, left to right: Monday through Sunday.)
The categories in Figure 1.5 are ordered by the days of the week, Monday through
Sunday. In exploring what these data tell us about accidents, we focused on the day of the
week with the highest percent of accidents. Let's pursue this idea a little further and order
the categories from highest percent to lowest percent. A bar graph whose categories are
ordered from most frequent to least frequent is called a Pareto chart.5
EXAMPLE 1.11 Pareto Chart for Automobile Accidents

DATA FILE: CRASHES
Figure 1.6 displays the Pareto chart for the automobile accident data. Here it is easy to see that
Saturday is the highest. Friday, Wednesday, and Thursday are also relatively high. Tuesday and
Sunday are a bit lower. Monday is the lowest.
FIGURE 1.6 Pareto chart for the automobile accident data, for Example 1.11. (Vertical axis:
Percent, 0 to 18. Bars ordered from highest to lowest percent, Saturday first and Monday last.)
Pareto charts are frequently used in quality control settings. Here, the purpose is
often to identify common types of defects in a manufactured product. Deciding upon
strategies for corrective action can then be based on what would be most effective.
Chapter 12 gives more examples of settings where Pareto charts are used.
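To illustrate the quality control use, here is a minimal sketch that puts defect categories into Pareto order. The defect names and counts are hypothetical, invented only for this illustration, and the function name `pareto_order` is our own.

```python
# Hypothetical defect counts for a manufactured product (illustration only).
defect_counts = {"Scratch": 12, "Dent": 31, "Misalignment": 7, "Discoloration": 19}


def pareto_order(counts):
    """Return (category, count) pairs from most frequent to least frequent."""
    return sorted(counts.items(), key=lambda item: item[1], reverse=True)


total = sum(defect_counts.values())
for defect, count in pareto_order(defect_counts):
    print(f"{defect:<14}{count:>3}  {100 * count / total:5.1f}%")
```

The category at the top of the list is the natural first candidate for corrective action.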
Bar graphs, pie charts, and Pareto charts help an audience grasp a distribution
quickly. When you prepare them, keep this purpose in mind. We now move on to
quantitative variables, where graphs are essential tools.
DATA FILE: CANADIAN POPULATION
1.6 Population of Canadian provinces and territories. Here are populations of
13 Canadian provinces and territories based on the 2006 census:6
Province/territory Population
Alberta 3,290,350
British Columbia 4,113,487
Manitoba 1,148,401
New Brunswick 729,997
Newfoundland and Labrador 505,469
Northwest Territories 41,464
Nova Scotia 913,462
Nunavut 29,474
Ontario 12,160,282
Prince Edward Island 135,851
Quebec 7,546,131
Saskatchewan 968,157
Yukon 30,372
(a) Display these data in a bar graph using the alphabetical order of provinces and
territories in the table.
(b) Use a Pareto chart to display these data.
(c) Compare the two graphs. Which do you prefer? Give a reason for your answer.
DATA FILE: GPSEUROPE
1.7 GPS market share in Europe. In Examples 1.7 to 1.9, we examined
the U.S. market share of several companies that sell GPS receivers. Here is a similar
table for the European market:7
Company    Market share (%)
TomTom     38
Other      26
Garmin     19
(a) Display the data in a bar graph. Be sure to choose the ordering for the companies
carefully. Explain why you made this choice.
(b) Compare this graph with the bar graph in Figure 1.3. Garmin has its world
headquarters in Olathe, Kansas, while TomTom's registered address is
Amsterdam, the Netherlands. Explain how this information helps you to
understand the differences between the two bar graphs.
Quantitative variables: histograms

Quantitative variables often take many values. A graph of the distribution is clearer if
nearby values are grouped together. The most common graph of the distribution of a
single quantitative variable is a histogram.
CASE 1.1 Treasury Bills
DATA FILE: TBILLRATES
Treasury bills, also known as T-bills, are bonds issued by the U.S. Department
of the Treasury. You buy them at a discount from their face value, and they mature in a
fixed period of time. For example, you might buy a $1000 T-bill for $980. When it matures,
six months later, you would receive $1000: your original $980 investment plus $20 interest.
This interest rate is $20 divided by $980, which is 2.04% for six months. Interest is usually
reported as a rate per year, so for this example the interest rate would be 4.08%. Rates are
determined by an auction that is held every four weeks. The data set contains the interest
rates for T-bills for each auction from December 12, 1958, to October 3, 2008.8
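The interest calculation in this case can be written out as a short Python sketch. The function names are our own. The annual figure simply multiplies the six-month rate by the number of six-month periods in a year, which matches the 4.08% in the text; it ignores compounding.

```python
def period_rate(price, face_value):
    """Interest earned over the holding period, as a fraction of the price paid."""
    return (face_value - price) / price


def annual_rate(price, face_value, periods_per_year):
    """Simple (non-compounded) annual rate, as in the text."""
    return period_rate(price, face_value) * periods_per_year


six_month = period_rate(980, 1000)
print(f"{six_month:.2%}")                   # 2.04%
print(f"{annual_rate(980, 1000, 2):.2%}")   # 4.08%
```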
Our data set contains 2600 cases. The two variables in the data set are the date of
the auction and the interest rate. To learn something about T-bill interest rates, we begin
with a histogram.
EXAMPLE 1.12 A Histogram of T-bill Interest Rates
CASE 1.1
DATA FILE: TBILLRATES
To make a histogram of the T-bill interest rates, we proceed as follows.
Step 1. Divide the range of the interest rates into classes of equal width. The T-bill interest rates
range from 0.85% to 15.76%, so we choose as our classes
 0.00 ≤ rate < 2.00
 2.00 ≤ rate < 4.00
 ...
14.00 ≤ rate < 16.00
Be sure to specify the classes precisely so that each case falls into exactly one class. An interest
rate of 1.98% would fall into the first class, but 2.00% would fall into the second.
Step 2. Count the number of cases in each class. Here are the counts:
Class                  Count    Class                   Count
0.00 ≤ rate < 2.00     178       8.00 ≤ rate < 10.00    235
2.00 ≤ rate < 4.00     575      10.00 ≤ rate < 12.00     64
4.00 ≤ rate < 6.00     951      12.00 ≤ rate < 14.00     58
6.00 ≤ rate < 8.00     501      14.00 ≤ rate < 16.00     38
Step 3. Draw the histogram. Mark on the horizontal axis the scale for the variable whose
distribution you are displaying. The variable is interest rate in this example. The scale runs from
0 to 16 to span the data. The vertical axis contains the scale of counts. Each bar represents a class.
The base of the bar covers the class, and the bar height is the class count. Notice that the scale on
the vertical axis runs from 0 to 1000 to accommodate the tallest bar, which has a height of 951.
There is no horizontal space between the bars unless a class is empty, so that its bar has height
zero. Figure 1.7 is our histogram.
FIGURE 1.7 Histogram for T-bill interest rates, for Example 1.12. (Horizontal axis: Interest
rate (%), 0 to 16. Vertical axis: Count, 0 to 1000.)
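Steps 1 and 2 of Example 1.12 can be sketched in Python. The function names and the short list of test rates are our own illustration (the full data set is not reproduced here); the class counts are copied from the table in Step 2.

```python
def class_index(value, width=2.0, start=0.0):
    """Index of the class start + k*width <= value < start + (k+1)*width."""
    return int((value - start) // width)


def class_counts(values, width=2.0, start=0.0, n_classes=8):
    """Count how many values fall in each of n_classes equal-width classes."""
    counts = [0] * n_classes
    for value in values:
        counts[class_index(value, width, start)] += 1
    return counts


# Each value falls into exactly one class: 1.98 goes in the first, 2.00 in the second.
print(class_index(1.98), class_index(2.00))  # 0 1

# A few illustrative rates, including the smallest and largest in the data set.
print(class_counts([0.85, 1.98, 2.00, 5.50, 15.76]))

# The counts reported in Step 2 account for every case in the data set.
reported_counts = [178, 575, 951, 501, 235, 64, 58, 38]
print(sum(reported_counts))  # 2600
```

Because each class includes its left endpoint and excludes its right endpoint, no value can be counted twice.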
Our eyes respond to the area of the bars in a histogram.9 Because the classes are
all the same width, area is determined by height and all classes are fairly represented.
There is no one right choice of the classes in a histogram. Too few classes will give
a "skyscraper" graph, with all values in a few classes with tall bars. Too many will
produce a "pancake" graph, with most classes having one or no observations. Neither
choice will give a good picture of the shape of the distribution. You must always use your
judgment in choosing classes to display the shape. Statistics software will choose the
classes for you. The computer's choice is usually a good one. Sometimes, however, the
classes chosen by software differ from the natural choices that you would make. Usually,
options are available for you to change them. The next example illustrates a situation
where the wrong choice of classes will cause you to miss a very important characteristic
of a data set.
EXAMPLE 1.13 Calls to a Customer Service Center
DATA FILE: CALLCENTER80
Many businesses operate call centers to serve customers who want to place an order or make an
inquiry. Customers want their requests handled thoroughly. Businesses want to treat customers
well, but they also want to avoid wasted time on the phone. They therefore monitor the length of
calls and encourage their representatives to keep calls short.
We have data on the length of all 31,492 calls made to the customer service center of a small
bank in a month. Table 1.1 displays the lengths of the first 80 calls.10
Take a look at the data in Table 1.1. In this data set the cases are calls made to the bank's call
center. The variable recorded is the length of each call. The units of measurement are seconds. We
see that the call lengths vary a great deal. The longest call lasted 2631 seconds, almost 44 minutes.
More striking is that 8 of these 80 calls lasted less than 10 seconds. What's going on?
TABLE 1.1 Service times (seconds) for calls to a customer service center
77 289 128 59 19 148 157 203
126 118 104 141 290 48 3 2
372 140 438 56 44 274 479 211
179 1 68 386 2631 90 30 57
89 116 225 700 40 73 75 51
148 9 115 19 76 138 178 76
67 102 35 80 143 951 106 55
4 54 137 367 277 201 52 9
700 182 73 199 325 75 103 64
121 11 9 88 1148 2 465 25
We started our study of the customer service center data by examining a few cases,
the ones displayed in Table 1.1. It would be very difficult to examine all 31,492 cases in
this way. We need a better method. Let's try a histogram.
EXAMPLE 1.14 Histogram for Customer Service Center Call Lengths
DATA FILE: CALLCENTER
Figure 1.8 is a histogram of the lengths of all 31,492 calls. We did not plot the few lengths greater
than 1200 seconds (20 minutes). As expected, the graph shows that most calls last between about 1
and 5 minutes, with some lasting much longer when customers have complicated problems. More
striking is the fact that 7.6% of all calls are no more than 10 seconds long. It turns out that the bank
penalized representatives whose average call length was too long, so some representatives just
hung up on customers in order to bring their average length down. Neither the customers nor the
bank was happy about this. The bank changed its policy, and later data showed that calls under
10 seconds had almost disappeared.
FIGURE 1.8 The distribution of call lengths for 31,492 calls to a bank's customer service
center, for Example 1.14. The data show a surprising number of very short calls. These are
mostly due to representatives deliberately hanging up in order to bring down their
average call length. (Horizontal axis: Service time (seconds), 0 to 1200. Vertical axis: Count
of calls, 0 to 2500. Annotation: 7.6% of all calls are ≤ 10 seconds long.)
The choice of the classes is an important part of making a histogram. Let's look at
the customer service center call lengths again.
EXAMPLE 1.15 Another Histogram for Customer Service Center Call Lengths
DATA FILE: CALLCENTER
Figure 1.9 is a histogram of the lengths of all 31,492 calls with class boundaries of 0, 100,
200, etc. seconds. Statistical software made this choice as a default option. Notice that the spike
representing the very brief calls that appears in Figure 1.8 is covered up in the 0 to 100 seconds
class in Figure 1.9.
FIGURE 1.9 The default histogram produced by software for the call lengths, for Example 1.15.
This choice of classes hides the large number of very short calls that is revealed by the
histogram of the same data in Figure 1.8. (Horizontal axis: Service time (seconds), 0 to 1200.
Vertical axis: Count of calls, 0 to 14,000.)
If we let software choose the classes, we would miss one of the most important
features of the data, the calls of very short duration. We were alerted to this unexpected
characteristic of the data by our examination of the 80 cases displayed in Table 1.1.
Beware of letting statistical software do your thinking for you. Example 1.15 illustrates
the danger of doing this. To do an effective analysis of data, we often need to look at
data in more than one way. For histograms, looking at several choices of classes will
lead us to a good choice. Fortunately, with software, examining choices such as this is
relatively easy.
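The effect of class width can be seen even in the 80 calls of Table 1.1. The sketch below counts the calls in the first class for two class widths; the function name `first_class_count` is our own, and the data are copied from Table 1.1.

```python
# Service times (seconds) for the first 80 calls, from Table 1.1.
call_lengths = [
    77, 289, 128, 59, 19, 148, 157, 203,
    126, 118, 104, 141, 290, 48, 3, 2,
    372, 140, 438, 56, 44, 274, 479, 211,
    179, 1, 68, 386, 2631, 90, 30, 57,
    89, 116, 225, 700, 40, 73, 75, 51,
    148, 9, 115, 19, 76, 138, 178, 76,
    67, 102, 35, 80, 143, 951, 106, 55,
    4, 54, 137, 367, 277, 201, 52, 9,
    700, 182, 73, 199, 325, 75, 103, 64,
    121, 11, 9, 88, 1148, 2, 465, 25,
]


def first_class_count(values, width):
    """Number of values in the first class, 0 <= value < width."""
    return sum(1 for value in values if 0 <= value < width)


# With 100-second classes the very short calls are buried among many others.
print(first_class_count(call_lengths, 100))  # 38
# With 10-second classes they stand out as a class of their own.
print(first_class_count(call_lengths, 10))   # 8
```

Trying several widths in this way is a quick check on whether a default choice is hiding something.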
1.8 Exam grades in a statistics course. The table below summarizes the exam scores
of students in an introductory statistics course. Use the summary to sketch a histogram
that shows the distribution of scores.
Class               Count
60 ≤ score < 70     11
70 ≤ score < 80     36
80 ≤ score < 90     57
90 ≤ score < 100    29
1.9 Suppose some students scored 100. No students earned a perfect score of 100
on the exam described in the previous exercise. Note that the last class included only
scores that were greater than or equal to 90 and less than 100. Explain how you would
change the class definitions for a similar exam on which some students earned a perfect
score.
Quantitative variables: stemplots

Histograms are not the only graphical display of distributions of quantitative variables.
For small data sets, a stemplot is quicker to make and presents more detailed information.
It is sometimes referred to as a back-of-the-envelope technique. Popularized by the
statistician John Tukey, it was designed to give a quick and informative look at the
distribution of a quantitative variable. A stemplot was originally designed to be made by
hand, although many statistical software packages include this capability.
Stemplot
To make a stemplot:
1. Separate each observation into a stem consisting of all but the final (rightmost)
digit and a leaf, the final digit. Stems may have as many digits as needed, but
each leaf contains only a single digit.
2. Write the stems in a vertical column with the smallest at the top, and draw a
vertical line at the right of this column.
3. Write each leaf in the row to the right of its stem, in increasing order out from
the stem.
EXAMPLE 1.16 A Stemplot of T-bill Interest Rates
CASE 1.1
DATA FILE: TBILLRATES50
The histogram that we produced in Example 1.12 to examine the T-bill interest rates used all 2600
cases in the data set. To illustrate the idea of a stemplot, we will take a simple random sample of
size 50 from this data set. We will learn more about how to take such samples in Chapter 3. Here
are the data:
7.2 5.7 6.0 5.0 12.8 7.8 11.6 4.6 2.7 4.9
5.8 13.8 1.5 4.6 3.7 8.3 7.0 3.2 5.8 1.0
7.2 8.0 3.2 7.5 5.4 5.3 6.9 5.8 5.0 9.4
10.4 4.3 6.8 1.0 5.5 5.1 4.6 6.6 4.7 6.1
5.7 1.0 3.8 7.3 6.5 3.0 3.9 8.0 3.0 7.9
The original data set gave the interest rates with two digits after the decimal point. To make the job
of preparing our stemplot easier, we first rounded the values to one place following the decimal.
Figure 1.10 illustrates the key steps in constructing the stemplot for these data. How does the
stemplot for this sample of size 50 compare with the histogram based on all 2600 interest rates
that we examined in Figure 1.7 (page 13)?
You can choose the classes in a histogram. The classes (the stems) of a stemplot are
given to you. When the observed values have many digits, it is often best to round the
numbers to just a few digits before making a stemplot, as we did in Example 1.16.
FIGURE 1.10 Steps in creating a stemplot for the sample of 50 T-bill interest rates, for
Example 1.16. (a) Write the stems in a column, from smallest to largest, and draw a vertical
line to their right. (b) Add each leaf to the right of its stem. (c) Arrange each leaf in
increasing order out from its stem.
 1 |       1 | 5000            1 | 0005
 2 |       2 | 7               2 | 7
 3 |       3 | 7228090         3 | 0022789
 4 |       4 | 696367          4 | 366679
 5 |       5 | 70884380517     5 | 00134577888
 6 |       6 | 098615          6 | 015689
 7 |       7 | 2802539         7 | 0223589
 8 |       8 | 300             8 | 003
 9 |       9 | 4               9 | 4
10 |      10 | 4              10 | 4
11 |      11 | 6              11 | 6
12 |      12 | 8              12 | 8
13 |      13 | 8              13 | 8
  (a)         (b)                 (c)
You can also split stems to double the number of stems when all the leaves would
otherwise fall on just a few stems. Each stem then appears twice. Leaves 0 to 4 go on
the upper stem and leaves 5 to 9 go on the lower stem. Rounding and splitting stems are
matters for judgment, like choosing the classes in a histogram. Stemplots work well for
small sets of data. When there are more than 100 observations, a histogram is almost
always a better choice.
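The stemplot steps, including the option of splitting stems, can be sketched in a few lines of Python. This is only an illustrative sketch, not a procedure from the text: the function name `stemplot` and its `split` parameter are hypothetical, and it assumes nonnegative values already rounded to one decimal place, so that the stem is the whole-number part and the leaf is the tenths digit. The data are the 50 rounded T-bill rates from Example 1.16.

```python
from collections import defaultdict

# The 50 rounded T-bill interest rates listed in Example 1.16.
TBILL_SAMPLE = [
    7.2, 5.7, 6.0, 5.0, 12.8, 7.8, 11.6, 4.6, 2.7, 4.9,
    5.8, 13.8, 1.5, 4.6, 3.7, 8.3, 7.0, 3.2, 5.8, 1.0,
    7.2, 8.0, 3.2, 7.5, 5.4, 5.3, 6.9, 5.8, 5.0, 9.4,
    10.4, 4.3, 6.8, 1.0, 5.5, 5.1, 4.6, 6.6, 4.7, 6.1,
    5.7, 1.0, 3.8, 7.3, 6.5, 3.0, 3.9, 8.0, 3.0, 7.9,
]


def stemplot(values, split=False):
    """Return the rows of a stemplot as strings (hypothetical helper).

    Assumes nonnegative values with one decimal place. With split=True,
    each stem appears twice: leaves 0-4 on the upper row, 5-9 on the lower.
    """
    leaves = defaultdict(list)
    for v in values:
        tenths = round(v * 10)            # 7.2 -> 72, sidesteps float noise
        stem, leaf = divmod(tenths, 10)   # 72 -> stem 7, leaf 2
        upper_or_lower = leaf >= 5 if split else False
        leaves[(stem, upper_or_lower)].append(leaf)

    lowest = min(stem for stem, _ in leaves)
    highest = max(stem for stem, _ in leaves)
    halves = (False, True) if split else (False,)

    rows = []
    for stem in range(lowest, highest + 1):   # smallest stem at the top
        for half in halves:
            row = "".join(str(d) for d in sorted(leaves.get((stem, half), [])))
            rows.append(f"{stem:>2} | {row}".rstrip())
    return rows


for line in stemplot(TBILL_SAMPLE):
    print(line)
```

Calling `stemplot(TBILL_SAMPLE, split=True)` gives 26 rows instead of 13. Empty stems are kept on purpose, so gaps in the distribution remain visible.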
Special considerations apply for very large data sets. It is often useful to take a
sample and examine it in detail as a first step. This is what we did in Example 1.16.
Sampling can be done in many different ways. A company with a very large number of
customer records, for example, might look at those from a particular region or country
for an initial analysis.
Interpreting histograms and stemplots

Making a statistical graph is not an end in itself. The purpose of the graph is to help us
understand the data. After you make a graph, always ask, "What do I see?" Once you
have displayed a distribution, you can see its important features as follows.
Examining a Distribution
In any graph of data, look for the overall pattern and for striking deviations from that
pattern.
You can describe the overall pattern of a histogram by its shape, center, and spread.
An important kind of deviation is an outlier, an individual value that falls outside the
overall pattern.
We will learn how to describe center and spread numerically in Section 1.2. For now,
we can describe the center of a distribution by its midpoint, the value with roughly half
the observations taking smaller values and half taking larger values. We can describe the
spread of a distribution by giving the smallest and largest values.
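As a rough illustration of this informal description, the sketch below reports the midpoint and the smallest and largest values for the sample of 50 rounded T-bill rates from Example 1.16. Using Python's `statistics.median` for the midpoint is our choice here, not something the text prescribes; the variable names are hypothetical.

```python
from statistics import median

# The 50 rounded T-bill interest rates listed in Example 1.16.
tbill_sample = [
    7.2, 5.7, 6.0, 5.0, 12.8, 7.8, 11.6, 4.6, 2.7, 4.9,
    5.8, 13.8, 1.5, 4.6, 3.7, 8.3, 7.0, 3.2, 5.8, 1.0,
    7.2, 8.0, 3.2, 7.5, 5.4, 5.3, 6.9, 5.8, 5.0, 9.4,
    10.4, 4.3, 6.8, 1.0, 5.5, 5.1, 4.6, 6.6, 4.7, 6.1,
    5.7, 1.0, 3.8, 7.3, 6.5, 3.0, 3.9, 8.0, 3.0, 7.9,
]

# Center: the midpoint, with about half the observations on each side.
midpoint = median(tbill_sample)

# Spread: simply the smallest and largest observations.
smallest = min(tbill_sample)
largest = max(tbill_sample)

print(f"center (midpoint): {midpoint}")
print(f"spread: from {smallest} to {largest}")
```

For this sample the midpoint is 5.7% and the values run from 1.0% to 13.8%, which agrees with what the sorted stemplot in Figure 1.10(c) shows.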
EXAMPLE 1.17 The Distribution of T-bill Interest Rates
CASE 1.1 Let's look again at the histogram in Figure 1.7. There appear to be some relatively large interest
rates. The largest is 15.76%. What do we think about this value? Is it so extreme relative to the
other values that we would call it an outlier? To qualify for this status an observation should stand
apart from the other observations either alone or with very few other cases. A careful examination
of the data indicates that this 15.76% does not qualify for outlier status. There are interest rates of
15.72%, 15.68%, and 15.58%. In fact, there are 15 auctions with interest rates of 15% or higher.
The distribution has a single peak at around 5%. The distribution is somewhat right-skewed;
that is, the right tail extends farther from the peak than does the left tail.
When you describe a distribution, concentrate on the main features. Look for major
peaks, not for minor ups and downs in the bars of the histogram. Look for clear outliers,
not just for the smallest and largest observations. Look for rough symmetry or clear
skewness.
Symmetric and Skewed Distributions

A distribution is symmetric if the right and left sides of the histogram are approximately
mirror images of each other.
A distribution is skewed to the right if the right side of the histogram (containing the
half of the observations with larger values) extends much farther out than the left side. It is
skewed to the left if the left side of the histogram extends much farther out than the right
side. We also use the term "skewed toward large values" for distributions that are
skewed to the right. This is the most common type of skewness seen in real data.
EXAMPLE 1.18 IQ Scores of Fifth-Grade Students
Figure 1.11 displays a histogram of the IQ scores of 60 fifth-grade students (data file IQ). There is
a single peak around 110 and the distribution is approximately symmetric. The tails decrease smoothly
as we move away from the peak. Measures such as this are usually constructed so that they have nice
distributions like the one shown in Figure 1.11.
FIGURE 1.11 Histogram of the IQ scores of 60 fifth-grade students, for Example 1.18.
[Histogram. Horizontal axis: IQ score, 80 to 150. Vertical axis: count of students.]
The overall shape of a distribution is important information about a variable. Some

types of data regularly produce distributions that are symmetric or skewed. For example,
data on the diameters of ball bearings produced by a manufacturing process tend to be
symmetric. Data on incomes (whether of individuals, companies, or nations) are usually
strongly skewed to the right. There are many moderate incomes, some large incomes,
and a few very large incomes. Do remember that many distributions have shapes that
are neither symmetric nor skewed. Some data show other patterns. Scores on an exam,
for example, may have a cluster near the top of the scale if many students did well. Or
they may show two distinct peaks if a tough problem divided the class into those who
did and didn't solve it. Use your eyes and describe what you see.
1.10 Make a stemplot. Make a stemplot for a distribution that has a single peak,
approximately symmetric with one high and two low outliers.
1.11 Make another one. Make a stemplot of a distribution that is skewed toward
large values.
Time plots
Many variables are measured at intervals over time. We might, for example, measure the
cost of raw materials for a manufacturing process each month or the price of a stock at
the end of each day. In these examples, our main interest is change over time. To display
change over time, make a time plot.
Time Plot
A time plot of a variable plots each observation against the time at which it was
measured. Always put time on the horizontal scale of your plot and the variable you are
measuring on the vertical scale. Connecting the data points by lines helps emphasize any
change over time.
More details about how to analyze data that vary over time are given in Chapter 13,
"Time Series Forecasting." For now, we will examine how a time plot can reveal some
additional important information about T-bill interest rates.
EXAMPLE 1.19 A Time Plot for T-bill Interest Rates
CASE 1.1 The Web site of the Federal Reserve Bank of St. Louis provided a very interesting graph of T-bill
interest rates.11 It is shown in Figure 1.12. A time plot shows us the relationship between two
variables, in this case interest rate and the auctions that occurred at four-week intervals. Notice
how the Federal Reserve Bank included information about a third variable in this plot. The third
variable is a categorical variable that indicates whether or not the United States was in a recession.
It is indicated by the shaded areas in the plot.
CASE 1.1 1.12 What does the time plot show? Carefully examine the time plot in
Figure 1.12.
(a) How do the T-bill interest rates vary over time?
(b) What can you say about the relationship between the rates and the recession
periods?
FIGURE 1.12 Time plot for the T-bill interest rates, for Example 1.19.
[Time plot titled "6-Month Treasury Bill: Secondary Market Rate (WTB6MS)." Horizontal axis: year, 1950
to 2010. Vertical axis: percent, 0 to 20. Shaded areas indicate US recessions as determined by the NBER.]
Source: Board of Governors of the Federal Reserve System. 2008 Federal Reserve Bank of St. Louis:
research.stlouisfed.org
In Example 1.12 (page 12) we examined the distribution of T-bill interest rates for
the period December 12, 1958, to October 3, 2008. The histogram in Figure 1.7 showed
us the shape of the distribution. By looking at the time plot in Figure 1.12, we now
see that there is more to this data set than is revealed by the histogram. This scenario
illustrates the types of steps used in an effective statistical analysis of data. We are rarely
able to completely plan our analysis in advance, set up the appropriate steps to be taken,
and then click on the appropriate buttons in a software package to obtain useful results.
An effective analysis requires that we proceed in an organized way, use a variety of
analytical tools as we proceed, and exercise careful judgment at each step in the process.
SECTION 1.1 Summary
A data set contains information on a number of cases. Cases may be people, animals,
or things. For each case, the data give values for one or more variables. A variable
describes some characteristic of an individual, such as a person's height, gender, or
salary. Variables can have different values for different cases.
Some variables are categorical and others are quantitative. A categorical variable
places each case into a category, such as male or female. A quantitative variable has
numerical values that measure some characteristic of each case, such as height in
centimeters or salary in dollars per year.
Exploratory data analysis uses graphs and numerical summaries to describe the
variables in a data set and the relations among them.
The distribution of a variable describes what values the variable takes and how often
it takes these values.
To describe a distribution, begin with a graph. Bar graphs and pie charts describe
the distribution of a categorical variable, and Pareto charts identify the most im-
portant categories for a categorical variable. Histograms and stemplots graph the
distributions of quantitative variables.
When examining any graph, look for an overall pattern and for notable deviations
from the pattern.
Shape, center, and spread describe the overall pattern of a distribution. Some dis-
tributions have simple shapes, such as symmetric and skewed. Not all distributions
have a simple overall shape, especially when there are few observations.
Outliers are observations that lie outside the overall pattern of a distribution. Always
look for outliers and try to explain them.
When observations on a variable are taken over time, make a time plot that graphs
time horizontally and the values of the variable vertically. A time plot can reveal
interesting patterns in a set of data.
SECTION 1.1 Exercises
For Exercise 1.1, see page 4; for 1.2 to 1.4, see page 5; for 1.5, see page 6; for 1.6 and 1.7,
see pages 11-12; for 1.8 and 1.9, see pages 15-16; for 1.10 and 1.11, see page 19; and for 1.12,
see page 19.

1.13 Employee application data. The personnel department keeps records on all employees in a
company. Here is the information that they keep in one of their data files: employee identification
number, last name, first name, middle initial, department, number of years with the company, salary,
education (coded as high school, some college, or college degree), and age.
(a) What are the cases for this data set?
(b) Identify each item kept in the data files as a label, a quantitative variable, or a categorical
variable.
(c) Set up a spreadsheet that could be used to record the data. Give appropriate column headings
and five sample cases.

1.14 Where should you locate your business? You are interested in choosing a new location for
your business. Create a list of criteria that you would use to rank cities. Include at least eight
variables and give reasons for your choices. Classify each variable as quantitative or categorical.

1.15 Survey of students. A survey of students in an introductory statistics class asked the
following questions: (a) age; (b) do you like to dance? (yes, no); (c) can you play a musical
instrument (not at all, a little, pretty well); (d) how much did you spend on food last week?
(e) height; (f) do you like broccoli? (yes, no). Classify each of these variables as categorical or
quantitative and give reasons for your answers.

1.16 What questions would you ask? Refer to the previous exercise. Make up your own survey
questions with at least six questions. Include at least two categorical variables and at least two
quantitative variables. Tell which variables are categorical and which are quantitative. Give
reasons for your answers.

1.17 Study habits of students. You are planning a survey to collect information about the study
habits of college students. Describe two categorical variables and two quantitative variables that
you might measure for each student. Give the units of measurement for the quantitative variables.

1.18 What color should you use for your product? What is your favorite color? One survey
produced the following summary of responses to that question: blue, 42%; green, 14%; purple, 14%;
red, 8%; black, 7%; orange, 5%; yellow, 3%; brown, 3%; gray, 2%; and white, 2%.12 Make a bar graph
of the percents and write a short summary of the major features of your graph. (Data file:
FAVORITECOLORS)

1.19 Least-favorite colors. Refer to the previous exercise. The same study also asked people
about their least-favorite color. Here are the results: orange, 30%; brown, 23%; purple, 13%;
yellow, 13%; gray, 12%; green, 4%; white, 4%; red, 1%; black, 0%; and blue, 0%. Make a bar graph of
these percents and write a summary of the results. (Data file: LEASTFAVORITECOLORS)

1.20 Market share doubles in a year. The market share of iPhones doubled from 5.3% to 10.8%
between the first quarter of 2008 and the first quarter of 2009.13 One of the attractions of the
iPhone is the Web browser, which they market as "the most advanced Web browser on a mobile device."
Users of iPhones were asked to respond to the statement "I do a lot more browsing on the iPhone
than I did on my previous mobile phone." Here are the results:14 (Data file: BROWSING)

Response             Percent
Strongly agree          54
Mildly agree            22
Mildly disagree         16
Strongly disagree        8

(a) Make a bar graph to display the distribution of the responses.
(b) Display the distribution with a pie chart.
(c) Summarize the information in these charts.
(d) Do you prefer the bar graph or the pie chart? Give a reason for your answer.
1.21 What did the iPhone replace? The survey in the previous exercise also asked iPhone users
what phone, if any, did the iPhone replace. Here are the responses: (Data file: PHONEREPLACEMENT)

Response          Percent     Response            Percent
Motorola Razr      23.8       Blackberry           13.0
Symbian             3.9       Windows Mobile       13.9
Sidekick            4.1       Replaced nothing     10.0
Palm                6.7       Other phone          24.5

Make a bar graph for these data. Carefully consider how you will order the responses. Explain why
you chose the ordering that you did.

1.22 Garbage is big business. The formal name for garbage is "municipal solid waste." In the
United States, approximately 254 million tons of garbage are generated in a year. Below is a
breakdown of the materials that made up American municipal solid waste in 2007.15 (Data file:
GARBAGE)

Material                     Weight (million tons)   Percent of total
Food scraps                          31.7                  12.5
Glass                                13.6                   5.3
Metals                               20.8                   8.2
Paper, paperboard                    83.0                  32.7
Plastics                             30.7                  12.1
Rubber, leather, textiles            19.4                   7.6
Wood                                 14.2                   5.6
Yard trimmings                       32.6                  12.8
Other                                 8.2                   3.2
Total                               254.1                 100.0

(a) Add the weights. The sum is not exactly equal to the value of 254.1 million tons given in the
table. Why?
(b) Make a bar graph of the percents. The graph gives a clearer picture of the main contributors
to garbage if you order the bars from tallest to shortest.
(c) Also make a pie chart of the percents. Comparing the two graphs, notice that it is easier to
see the small differences among "Food scraps," "Plastics," and "Yard trimmings" in the bar graph.

1.23 Market share for search engines. The following table gives the market share for the major
search engines.16 (Data file: SEARCHENGINES)

Search engine     Market share     Search engine            Market share
Google-Global        79.9%         Microsoft Live Search        1.6%
Yahoo-Global         11.3%         Ask-Global                   1.2%
MSN-Global            3.4%         Other                        0.2%
AOL-Global            2.4%

(a) Use a bar graph to display the market shares.
(b) Summarize what the graph tells you about market shares for search engines.

1.24 Market share for computer operating systems. The following table gives the market share
for the major computer operating systems.17 (Data file: OPERATINGSYSTEMS)

Operating system   Market share     Operating system   Market share
Windows               90.29%        Playstation            0.03%
Mac                    8.23%        SunOS                  0.01%
Linux                  0.91%        Other                  0.21%
iPhone                 0.32%

(a) Make a bar graph of this market share data.
(b) Write a short paragraph summarizing these data.

1.25 Your Facebook app can generate a million dollars a month. A report on Facebook suggests
that Facebook apps can generate large amounts of money, as much as one million dollars a month.18
The market is international. The following table gives the numbers of Facebook users by country for
the top 20 countries (excluding the United States) as of September 29, 2008.19 (Data file:
FACEBOOKBYCOUNTRY)

                  Facebook users                     Facebook users
Country            (in millions)     Country          (in millions)
United Kingdom        11.39          Venezuela            1.01
Canada                 9.51          South Africa         0.97
Turkey                 3.50          Hong Kong            0.91
Australia              3.36          Egypt                0.80
Colombia               2.69          Denmark              0.79
Chile                  2.46          Spain                0.77
France                 2.45          India                0.77
Norway                 1.14          Germany              0.70
Sweden                 1.14          Israel               0.61
Mexico                 1.01          Italy                0.57

(a) Use a bar graph to describe these data.
(b) Describe the major features of your chart in a short paragraph.

1.26 Facebook use increases, by country. Facebook use has been increasing rapidly. Data are
available on the increases between February 8, 2008, and September 29, 2008.20 The table below
gives the percent increase in the numbers of Facebook users for the same 20 countries that we
studied in the previous exercise. Note that there is no entry for Hong Kong, because the number of
users as of February 8, 2008, is not reported. (Data file: FACEBOOKINCREASES)
                  Increase in                        Increase in
Country           Facebook users     Country         Facebook users
United Kingdom         31%           Venezuela           683%
Canada                  9%           South Africa         33%
Turkey                 23%           Hong Kong
Australia              43%           Egypt                31%
Colombia              246%           Denmark              92%
Chile                2197%           Spain               132%
France                 92%           India                42%
Norway                  7%           Germany              44%
Sweden                  4%           Israel               42%
Mexico                 69%           Italy               139%

(a) Summarize the data by carefully examining the table. Are there any extreme outliers? Which
ones would you classify in this way?
(b) Use a stemplot to describe these data. You can list any extreme outliers separately from the
plot.
(c) Describe the major features of these data using your plot and your list of outliers.
(d) How effective is the stemplot for summarizing these data? Give reasons for your answer.

1.27 U.S. unemployment rates. An unemployment rate is the number of people who are not working
but who are available for work divided by the total number of people in the workforce, expressed as
a percent. Table 1.2 gives the U.S. unemployment rates for each state as of August 2008.21 (Data
file: UNEMPLOYMENT)
(a) Construct a histogram of these rates.
(b) Prepare a stemplot of the rates.
(c) Discuss the advantages and disadvantages of (a) and (b). Which do you prefer for this set of
data? Explain your answer.

1.28 Unemployment rates in Canadian provinces. Here are 2007 unemployment rates for 10 Canadian
provinces:22 (Data file: UNEMPLOYMENTCANADA)

Province                       Unemployment rate
Alberta                               3.5%
British Columbia                      4.2%
Manitoba                              4.4%
New Brunswick                         7.5%
Newfoundland and Labrador            13.6%
Nova Scotia                           8.0%
Ontario                               6.4%
Prince Edward Island                 10.3%
Quebec                                7.2%
Saskatchewan                          4.2%

(a) Construct a histogram of these rates.
(b) Prepare a stemplot of the rates.
(c) Discuss the advantages and disadvantages of (a) and (b). Which do you prefer for this set of
data? Explain your answer.

TABLE 1.2 Unemployment rates by state, August 2008

State Rate State Rate State Rate
Alabama 4.9 Louisiana 4.7 Ohio 7.4
Alaska 6.9 Maine 5.5 Oklahoma 4.0
Arizona 5.6 Maryland 4.5 Oregon 6.5
Arkansas 4.8 Massachusetts 5.3 Pennsylvania 5.8
California 7.7 Michigan 8.9 Rhode Island 8.5
Colorado 5.4 Minnesota 6.2 South Carolina 7.6
Connecticut 6.5 Mississippi 7.7 South Dakota 3.3
Delaware 4.9 Missouri 6.6 Tennessee 6.6
Florida 6.5 Montana 4.4 Texas 5.0
Georgia 6.3 Nebraska 3.5 Utah 3.7
Hawaii 4.2 Nevada 7.1 Vermont 4.9
Idaho 4.6 New Hampshire 4.2 Virginia 4.6
Illinois 7.3 New Jersey 5.9 Washington 6.0
Indiana 6.4 New Mexico 4.6 West Virginia 4.1
Iowa 4.6 New York 5.8 Wisconsin 5.1
Kansas 4.7 North Carolina 6.9 Wyoming 3.9
Kentucky 6.8 North Dakota 3.6
1.29 Vehicle colors. Vehicle colors differ among types of vehicle. Here are data on the most
popular colors in 2007 for luxury cars and for intermediate-price cars in North America:23 (Data
file: VEHICLECOLORS)

Color           Luxury car (%)   Intermediate-price car (%)
Black                22                    10
Silver               16                    25
White Pearl          14                     4
Gray                 12                    12
White                11                     8
Blue                  7                    13
Red                   7                    10
Yellow/Gold           6                     4
Other                 5                    14

(a) Make a bar graph for the luxury car percents.
(b) Make a bar graph for the intermediate-price car percents.
(c) Now, be creative: make one bar graph that compares the two vehicle types as well as comparing
colors. Arrange your graph so that it is easy to compare the two types of vehicle.

1.30 Procter & Gamble sales. The 2007 annual report of the Procter & Gamble Company (P&G) states
that global net sales were over $76 billion. The sales information is organized into global
segments. The following summary gives the net sales for each global segment of P&G:24 (Data file:
PANDGSALES)

Segment                          Net sales ($ millions)
Beauty                                  22,981
Health care                              8,964
Fabric care and home care               18,971
Baby care and family care               12,726
Snacks, coffee, and pet care             4,537
Blades and razors                        5,229
Duracell and Braun                       4,031

Summarize these data graphically and write a paragraph describing the net sales of P&G.

1.31 Products for senior citizens. The market for products designed for senior citizens in the
United States is expanding. Here is a stemplot of the percents of residents aged 65 and older in
the 50 states, for 2006, as estimated by the U.S. Census Bureau.25 The stems are whole percents and
the leaves are tenths of a percent. (Data file: POPOVER65BYSTATE)

 7 | 0
 8 | 8
 9 | 9
10 | 01
11 | 0177889
12 | 122225567799
13 | 0000112333345556699
14 | 033678
15 | 25
16 |
17 | 0

(a) There is an outlier: Florida has the highest percent of residents aged 65 and older and clearly
stands out. Alaska has the lowest percent, but it is at the end of a relatively flat tail on the
low end of the distribution. What are the percents for these two states?
(b) Describe the shape, center, and spread of this distribution.

1.32 U.S. population 65 and older. Make a stemplot of the percent of residents aged 65 and older
in the states other than Alaska and Florida by splitting stems 8 to 15 in the plot from the
previous exercise. Which plot do you prefer? Why? (Data file: POPOVER65BYSTATE)

1.33 The Canadian market. Refer to Exercise 1.31. Here are similar data for the 13 Canadian
provinces and territories:26 (Data file: CANADIANPOPULATION)

Province/Territory             Percent over 65
Alberta                             10.7
British Columbia                    14.6
Manitoba                            14.1
New Brunswick                       14.7
Newfoundland and Labrador           13.9
Northwest Territories                4.8
Nova Scotia                         15.1
Nunavut                              2.7
Ontario                             13.6
Prince Edward Island                14.9
Quebec                              14.3
Saskatchewan                        15.4
Yukon                                7.5

(a) Display the data graphically and describe the major features of your plot.
(b) Explain why you chose the particular format for your graphical display. What other types of
graph could you have used? What are the strengths and weaknesses of each for displaying this set of
data?

1.34 Left-skew. Sketch a histogram for a distribution that is skewed to the left. Suppose that
you and your friends emptied your pockets of coins and recorded the year marked on each coin.
The distribution of dates would be skewed to the left. Explain why.

1.35 Is the supply adequate? How much oil the wells in a given field will ultimately produce is
key information in deciding whether to drill more wells. Here are the estimated total amounts of
oil recovered from 64 wells in the Devonian Richmond Dolomite area of the Michigan basin, in
thousands of barrels:27 (Data file: OILWELLS)

 21.7   53.2   46.4   42.7   50.4   97.7  103.1   51.9   43.4   69.5
156.5   34.6   37.9   12.9    2.5   31.4   79.5   26.9   18.5   14.7
 32.9  196.0   24.9  118.2   82.2   35.1   47.6   54.2   63.1   69.8
 57.4   65.6   56.4   49.4   44.9   34.6   92.2   37.0   58.8   21.3
 36.6   64.9   14.8   17.6   29.1   61.4   38.6   32.5   12.0   28.3
204.9   44.5   10.3   37.7   33.7   81.1   12.1   20.1   30.5    7.1
 10.1   18.0    3.0    2.0

Graph the distribution and describe its main features.

1.36 The changing age distribution of the United States. The distribution of the ages of a
nation's population has a strong influence on economic and social conditions. Table 1.3 shows the
age distribution of U.S. residents in 1950 and 2075, in millions of people. The 1950 data come from
that year's census, while the 2075 data are projections made by the Census Bureau. (Data file:
USPOPULATION)
(a) Because the total population in 2075 is much larger than the 1950 population, comparing
percents in each age group is clearer than comparing counts. Make a table of the percent of the
total population in each age group for both 1950 and 2075.
(b) Make a histogram with vertical scale in percents of the 1950 age distribution. Describe the
main features of the distribution. In particular, look at the percent of children relative to the
rest of the population.
(c) Make a histogram with vertical scale in percents of the projected age distribution for the year
2075. Use the same scales as in (b) for easy comparison. What are the most important changes in the
U.S. age distribution projected for the years between 1950 and 2075?

TABLE 1.3 Age distribution in the United States, 1950 and 2075 (in millions of persons)

Age group            1950     2075
Under 10 years       29.3     53.3
10-19 years          21.8     53.2
20-29 years          24.0     51.2
30-39 years          22.8     50.5
40-49 years          19.3     47.5
50-59 years          15.5     44.8
60-69 years          11.0     40.7
70-79 years           5.5     30.9
80-89 years           1.6     21.7
90-99 years           0.1      8.8
100-109 years         0.0      1.1
Total               151.1    403.7

1.37 Reliability of household appliances. Always ask whether a particular variable is really a
suitable measure for your purpose. You are writing an article for a consumer magazine based on a
survey of the magazine's readers on the reliability of their household appliances. Of 13,376
readers who reported owning Brand A dishwashers, 2942 required a service call during the past year.
Only 192 service calls were reported by the 480 readers who owned Brand B dishwashers.
(a) Why is the count of service calls (2942 versus 192) not a good measure of the reliability of
these two brands of dishwashers?
(b) Use the information given to calculate a suitable measure of reliability. What do you conclude
about the reliability of Brand A and Brand B?

1.38 A multimillion-dollar business is threatened. Bristol Bay, Alaska, has typically produced
more wild-caught sockeye salmon, Oncorhynchus nerka, than any other region in the world. In good
years, the runs typically exceed 50 million fish. The sockeye salmon industry here provides
thousands of jobs and generates millions of dollars per year.28 Here are the numbers of sockeye
salmon in runs at Bristol Bay between 1988 and 2007:29 (Data file: BERINGSEAFISH)

        Runs               Runs               Runs               Runs
Year  (millions)   Year  (millions)   Year  (millions)   Year  (millions)
1988    22.9       1993    52.7       1998    18.1       2003    26.5
1989    44.5       1994    50.3       1999    39.5       2004    43.5
1990    47.1       1995    60.8       2000    28.4       2005    39.3
1991    42.0       1996    37.0       2001    22.0       2006    43.1
1992    45.6       1997    18.8       2002    17.2       2007    44.3

(a) Make a graph to display the distribution of salmon run size, then describe the pattern and any
striking deviations that you see.
(b) Make a time plot of run size and describe its pattern. As is often the case with data measured
at specific time intervals, a time plot is needed to understand what is happening.

1.39 Watch those scales! The impression that a time plot gives depends on the scales you use on
the two axes. If you stretch the vertical axis and compress the time axis, data appear to be more
variable. Compressing the vertical axis and stretching the time axis make variations appear to be
smaller. Make two time plots of the data in the previous exercise to illustrate this idea. Make one
plot that makes variability appear to be larger and one plot that makes variability appear to be
smaller. The moral of this exercise is: pay close attention to the scales when you look at a time
plot. (Data file: BERINGSEAFISH)

1.2 Describing Distributions with Numbers

In the previous section, we used the shape, center, and spread as ways to describe the
overall pattern of any distribution for a quantitative variable. In this section, we will
learn specific ways to use numbers to measure the center and the spread of a distribution.
The numbers, like the graphs of Section 1.1, are aids to understanding the data, not the
answer in themselves.
CASE 1.2 Time to Start a Business An entrepreneur faces many bureaucratic and legal hurdles
when starting a new business. The World Bank collects information about starting
businesses throughout the world. It has determined the time, in days, to complete all of the
procedures required to start a business.30 Data for 195 countries are included in the data set.
For this section we will examine data for a sample of 24 of these countries. Here are the
data (data file TIMETOSTART24):
23 4 29 44 47 24 40 23 23 44 33 27
60 46 61 11 23 62 31 44 77 14 65 42
EXAMPLE 1.20 The Distribution of Business Start Times

CASE 1.2 The stemplot in Figure 1.13 shows us the shape, center, and spread of the business start times.
The stems are tens of days, and the leaves are days. As is often the case when there are few
observations, the shape of the distribution is irregular. There are peaks in the 20s and the 40s. The
values range from 4 to 77 days, with a center somewhere in the middle of these two extremes.
There do not appear to be any outliers.
0 | 4
1 | 14
2 | 3333479
3 | 13
4 | 0244467
5 |
6 | 0125
7 | 7
FIGURE 1.13 Stemplot for sample of 24 business start times, for Example 1.20.
Measuring center: the mean

A description of a distribution almost always includes a measure of its center. The most
common measure of center is the ordinary arithmetic average, or mean.
The Mean $\bar{x}$
To find the mean of a set of observations, add their values and divide by the number of
observations. If the n observations are $x_1, x_2, \ldots, x_n$, their mean is
\[ \bar{x} = \frac{x_1 + x_2 + \cdots + x_n}{n} \]
or, in more compact notation,
\[ \bar{x} = \frac{1}{n} \sum x_i \]

The $\sum$ (capital Greek sigma) in the formula for the mean is short for "add them all
up." The subscripts on the observations $x_i$ are just a way of keeping the n observations
distinct. They do not necessarily indicate order or any other special facts about the data.
The bar over the x indicates the mean of all the x-values. Pronounce the mean $\bar{x}$ as
"x-bar." This notation is very common. When writers who are discussing data use $\bar{x}$ or
$\bar{y}$, they are talking about a mean.
EXAMPLE 1.21 Mean Time to Start a Business
CASE 1.2 The mean time to start a business is
\[ \bar{x} = \frac{x_1 + x_2 + \cdots + x_n}{n} = \frac{23 + 4 + \cdots + 42}{24} = \frac{897}{24} = 37.375 \]
The mean time to start a business for the 24 countries in our data set is 37.4 days. Note that we
have rounded the answer. Our goal in using the mean to describe the center of a distribution is not
to demonstrate that we can compute with great accuracy. The additional digits do not provide any
additional useful information. In fact, they distract our attention from the important digits that are
meaningful. Do you think it would be better to report the mean as 37 days?
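The arithmetic in Example 1.21 can be checked with a short Python sketch. The helper name `mean` is ours, introduced only for illustration; it simply adds the values and divides by the number of observations, as the definition says. The data are the 24 start times listed for Case 1.2.

```python
# The 24 business start times (in days) listed for Case 1.2.
start_times = [23, 4, 29, 44, 47, 24, 40, 23, 23, 44, 33, 27,
               60, 46, 61, 11, 23, 62, 31, 44, 77, 14, 65, 42]


def mean(observations):
    """Add the values and divide by the number of observations."""
    return sum(observations) / len(observations)


total = sum(start_times)       # the numerator in Example 1.21
n = len(start_times)           # the number of observations
x_bar = mean(start_times)

print(f"sum = {total}, n = {n}, mean = {x_bar}")
print(f"reported to one decimal place: {round(x_bar, 1)} days")
```

The sum is 897 and n is 24, so the mean is 37.375, which the example reports as 37.4 days.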
In practice, you can key the data into your calculator and hit the Mean key. You don't
have to actually add and divide. But you should know that this is what the calculator is
doing.
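What the calculator does can also be written out in a few lines of code. The Python sketch below is an illustration only and not part of the original example. It takes the 24 start times from the ordered list given in Example 1.22; the names `start_times` and `mean` are chosen here for convenience.

```python
# Times (in days) to start a business for the 24 countries of Case 1.2,
# copied from the ordered list in Example 1.22.
start_times = [4, 11, 14, 23, 23, 23, 23, 24, 27, 29, 31, 33,
               40, 42, 44, 44, 44, 46, 47, 60, 61, 62, 65, 77]


def mean(values):
    """Add the observations and divide by the number of observations."""
    return sum(values) / len(values)


x_bar = mean(start_times)
print(sum(start_times), len(start_times))  # 897 24
print(x_bar)                               # 37.375
print(round(x_bar, 1))                     # 37.4, the value reported in the text
```

The order of the observations does not matter for the mean, which is why the sorted list can stand in for the original one.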
Data file: TIMETOSTART25
CASE 1.2 1.40 Include the outlier. The complete business start time data set with
195 countries has a few with very large start times. In constructing the data set for
Case 1.2, a random sample of 25 countries was selected. This sample included the
South American country of Suriname, where the start time is 694 days. This case was
deleted for Case 1.2. Reconstruct the original random sample by including Suriname.
Show that the mean has increased to 64 days. (This is a rounded number. You should
report the mean with two digits after the decimal.)
Exercise 1.40 illustrates an important fact about the mean as a measure of center: it
is sensitive to the influence of one or more extreme observations. These may be outliers,
but a skewed distribution that has no outliers will also pull the mean toward its long tail.
Because the mean cannot resist the influence of extreme observations, we say that it is
not a resistant measure of center.
Data file: CALLCENTER80
1.41 Calls to a customer service center. The service times for 80 calls to a customer
service center are given in Table 1.1 (page 14). Use these data to compute the mean
service time.
Data file: STATCOURSE
1.42 Find the mean of the first-exam scores. Here are the scores on the first exam
in an introductory statistics course for 10 students:
80 73 92 85 75 98 93 55 80 90
Find the mean first-exam score for these students.
Measuring center: the median

In Section 1.1, we used the midpoint of a distribution as an informal measure of center.
The median is the formal version of the midpoint, with a specific rule for calculation.
The Median M
The median M is the midpoint of a distribution, the number such that half the
observations are smaller and the other half are larger. To find the median of a distribution:
1. Arrange all observations in order of size, from smallest to largest.
2. If the number of observations n is odd, the median M is the center observation in
the ordered list. Find the location of the median by counting (n + 1)/2
observations up from the bottom of the list.
3. If the number of observations n is even, the median M is the mean of the two
center observations in the ordered list. The location of the median is again
(n + 1)/2 from the bottom of the list.
Note that the formula (n + 1)/2 does not give the median, just the location of the
median in the ordered list. Medians require little arithmetic, so they are easy to find by
hand for small sets of data. Arranging even a moderate number of observations in order
is very tedious, however, so that finding the median by hand for larger sets of data is
unpleasant. Even simple calculators have an $\bar{x}$ button, but you will need software or a
graphing calculator to automate finding the median.
EXAMPLE 1.22 Median Time to Start a Business
Data file: TIMETOSTART24
CASE 1.2 To find the median time to start a business for our 24 countries, we first arrange the data in order
from smallest to largest:
4 11 14 23 23 23 23 24 27 29 31 33
40 42 44 44 44 46 47 60 61 62 65 77
The count of observations n = 24 is even. The median, then, is the average of the two center
observations in the ordered list. To find the location of the center observations, we first compute
$$\text{location of } M = \frac{n + 1}{2} = \frac{25}{2} = 12.5$$
Therefore, the center observations are the 12th and 13th observations in the ordered list. The
median is
$$M = \frac{33 + 40}{2} = 36.5$$
Note that you can use the stemplot directly to compute the median. In the stemplot
the cases are already ordered and you simply need to count from the top or the bottom
to the desired location.
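The rule for the median can be sketched in code as follows. This is a minimal Python illustration, not the book's own procedure for software; the function name `median` and the variable names are assumptions made for the sketch. It follows the boxed steps: sort, find the location (n + 1)/2, then take the center observation or the mean of the two center observations.

```python
def median(values):
    """Median by the rule in the text."""
    ordered = sorted(values)            # step 1: arrange from smallest to largest
    n = len(ordered)
    location = (n + 1) / 2              # a position in the list, not the median itself
    if n % 2 == 1:
        # Odd n: the location is a whole number; positions count from 1,
        # Python indexes count from 0.
        return ordered[int(location) - 1]
    # Even n: the location ends in .5, so average the two center observations.
    lower = ordered[n // 2 - 1]
    upper = ordered[n // 2]
    return (lower + upper) / 2


start_times = [4, 11, 14, 23, 23, 23, 23, 24, 27, 29, 31, 33,
               40, 42, 44, 44, 44, 46, 47, 60, 61, 62, 65, 77]

print(median(start_times))          # 36.5 (mean of the 12th and 13th observations)
print(median(start_times + [694]))  # 40 (the 13th of 25 observations)
```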
Data file: TIMETOSTART25
CASE 1.2 1.43 Include the outlier. Include Suriname, where the start time is
694 days, in the data set and show that the median is 40 days. Note that with this
case included, the sample size is now 25 and the median is the 13th observation in the
ordered list. Write out the ordered list and circle the outlier. Describe the effect of the
outlier on the median for this set of data.
Data file: CALLCENTER80
1.44 Calls to a customer service center. The service times for 80 calls to a customer
service center are given in Table 1.1 (page 14). Use these data to compute the median
service time.
Data file: STATCOURSE
1.45 Find the median of the first-exam scores. Here are the scores on the first exam
in an introductory statistics course for 10 students:
80 73 92 85 75 98 93 55 80 90
Find the median first-exam score for these students.
Comparing the mean and the median

Exercises 1.40 and 1.43 illustrate an important difference between the mean and the
median. Suriname pulls the mean time to start a business up from 37 days to 64 days.
The increase in the median is a lot less, from 36.5 days to 40 days.
The median is more resistant than the mean. If the largest starting time in the data
set were 1200 days, the median for all 25 countries would still be 40 days. The largest
observation just counts as one observation above the center, no matter how far above the
center it lies. The mean uses the actual value of each observation and so will chase a
single large observation upward.
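A short computation makes the contrast concrete. The sketch below is an illustration under stated assumptions: it uses Python's standard `statistics` module, takes the 24 start times from Example 1.22, and appends either Suriname's 694 days or the hypothetical 1200 days mentioned above.

```python
from statistics import mean, median

start_times = [4, 11, 14, 23, 23, 23, 23, 24, 27, 29, 31, 33,
               40, 42, 44, 44, 44, 46, 47, 60, 61, 62, 65, 77]

# Without the extreme observation.
print(mean(start_times), median(start_times))  # 37.375 36.5

# Add one extreme observation and watch which measure moves.
for largest in (694, 1200):
    sample = start_times + [largest]
    print(largest, round(mean(sample), 2), median(sample))
# 694 63.64 40
# 1200 83.88 40
```

The median stays at 40 days in both cases because the added value counts only as one observation above the center, while the mean moves with the actual size of that value.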
The best way to compare the response of the mean and median to extreme observations
is to use an interactive applet that allows you to place points on a line and then drag
them with your computer's mouse. Exercises 1.68 to 1.70 use the Mean and Median
applet on the Web site for this book, www.whfreeman.com/psbe, to compare mean and
median.
The mean and median of a symmetric distribution are close together. If the distri-
bution is exactly symmetric, the mean and median are exactly the same. In a skewed
distribution, the mean is farther out in the long tail than is the median.
Consider the prices of existing single-family homes in the United States. The mean
price in 2007 was $266,200 while the median was $217,900. This distribution is strongly
skewed to the right. There are many moderately priced houses and a few very ex-
pensive mansions. The few expensive houses pull the mean up but do not affect the
median.
Reports about house prices, incomes, and other strongly skewed distributions usually
give the median (midpoint) rather than the mean (arithmetic average). However, if
you are a tax assessor interested in the total value of houses in your area, use the mean.
The total is the mean times the number of houses, but it has no connection with the
median. The mean and median measure center in different ways, and both are useful.
Data file: GDP12
1.46 Gross domestic product. The success of companies expanding to developing
regions of the world depends in part on the prosperity of the countries in those regions.
Here are World Bank data on the growth of gross domestic product (percent per year)
for the period 2000 to 2004 in countries in Asia (not including Japan):
Country Growth
Bangladesh 5.2
China 9.4
Hong Kong 3.2
India 6.2
Indonesia 4.6
Korea (South) 4.7
Malaysia 4.4
Pakistan 4.1
Philippines 3.9
Singapore 2.9
Thailand 5.4
Vietnam 7.2
(a) Make a stemplot of the data. Note the high outlier.

(b) Find the mean and median growth rates. How does the outlier explain the
difference between your two results?
(c) Find the mean and median growth rates without the outlier. How does comparing
your results in (b) and (c) illustrate the resistance of the median and the lack of
resistance of the mean?
Measuring spread: the quartiles

A measure of center alone can be misleading. Two nations with the same median house-
hold income are very different if one has extremes of wealth and poverty and the other
has little variation among households. A drug with the correct mean concentration of
active ingredient is dangerous if some batches are much too high and others much too
low. We are interested in the spread or variability of incomes and drug potencies as well
as their centers. The simplest useful numerical description of a distribution consists
of both a measure of center and a measure of spread.
One way to measure spread is to give the smallest and largest observations. For
example, the times to start a business in our data set that included Suriname ranged from
4 to 694 days. Without Suriname, the range is 4 to 77 days. These largest and smallest
observations show the full spread of the data and are highly influenced by outliers.
We can improve our description of spread by also giving several percentiles. The
pth percentile of a distribution is the value such that p percent of the observations fall
at or below it. The median is just the 50th percentile, so the use of percentiles to report
spread is particularly appropriate when the median is our measure of center.
The most commonly used percentiles other than the median are the quartiles. The
first quartile is the 25th percentile, and the third quartile is the 75th percentile. That
is, the first and third quartiles show the spread of the middle half of the data. (The
second quartile is the median itself.) To calculate a percentile, arrange the observations
in increasing order and count up the required percent from the bottom of the list. Our
definition of percentiles is a bit inexact because there is not always a value with exactly
p percent of the data at or below it. We will be content to take the nearest observation
for most percentiles, but the quartiles are important enough to require an exact recipe.
The rule for calculating the quartiles uses the rule for the median.
The Quartiles $Q_1$ and $Q_3$

To calculate the quartiles:
1. Arrange the observations in increasing order and locate the median M in the
ordered list of observations.
2. The first quartile $Q_1$ is the median of the observations whose position in the
ordered list is to the left of the location of the overall median.
3. The third quartile $Q_3$ is the median of the observations whose position in the
ordered list is to the right of the location of the overall median.
Here is an example that shows how the rules for the quartiles work for both odd and
even numbers of observations.
EXAMPLE 1.23 Finding the Quartiles
Data file: TIMETOSTART24
CASE 1.2 Here is the ordered list of the times to start a business in our sample of 24 countries:
4 11 14 23 23 23 23 24 27 29 31 33
40 42 44 44 44 46 47 60 61 62 65 77
The count of observations n = 24 is even, so the median is at position (24 + 1)/2 = 12.5, that
is, between the 12th and the 13th observation in the ordered list. There are 12 cases above this
position and 12 below it. The first quartile is the median of the first 12 observations, and the third
quartile is the median of the last 12 observations. Check that $Q_1$ = 23 and $Q_3$ = 46.5.
Notice that the quartiles are resistant. For example, $Q_3$ would have the same value
if the highest start time was 770 days rather than 77 days.
There are slight differences in the methods used by software to compute percentiles.
However, the results will generally be quite similar except in cases where the sample
sizes are very small.
Be careful when several observations take the same numerical value. Write down
all the observations and apply the rules just as if they all had distinct values.
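The recipe for the quartiles can be sketched in code. The Python functions below are an illustration written for this passage (the names `median` and `quartiles` are assumptions, not a library API). They implement the rule stated in the box, so their results may differ slightly from the percentile routines built into statistical software.

```python
def median(values):
    """Middle observation, or the mean of the two middle observations."""
    ordered = sorted(values)
    n = len(ordered)
    mid = n // 2
    if n % 2 == 1:
        return ordered[mid]
    return (ordered[mid - 1] + ordered[mid]) / 2


def quartiles(values):
    """Q1 and Q3 as the medians of the observations to the left and to the
    right of the location of the overall median."""
    ordered = sorted(values)
    n = len(ordered)
    half = n // 2
    lower = ordered[:half]        # positions left of the median's location
    upper = ordered[n - half:]    # positions right of it (skips the middle value when n is odd)
    return median(lower), median(upper)


start_times = [4, 11, 14, 23, 23, 23, 23, 24, 27, 29, 31, 33,
               40, 42, 44, 44, 44, 46, 47, 60, 61, 62, 65, 77]

print(quartiles(start_times))                 # (23.0, 46.5), even n = 24
print(quartiles(start_times + [694]))         # (23.0, 53.5), odd n = 25
print(quartiles(start_times[:-1] + [770]))    # (23.0, 46.5), Q3 unchanged by the larger maximum
```

Tied values, such as the four countries at 23 days, are kept as separate entries in the list, exactly as the text advises.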
The five-number summary and boxplots

The smallest and largest observations tell us little about the distribution as a whole, but
they give information about the tails of the distribution that is missing if we know only
$Q_1$, M, and $Q_3$. To get a quick summary of both center and spread, combine all five
numbers. The result is the five-number summary and a graph based on it.
The Five-Number Summary and Boxplots

The five-number summary of a distribution consists of the smallest observation, the first
quartile, the median, the third quartile, and the largest observation, written in order from
smallest to largest. In symbols, the five-number summary is
Minimum   $Q_1$   M   $Q_3$   Maximum
A boxplot is a graph of the five-number summary.
• A central box spans the quartiles.
• A line in the box marks the median.
• Lines extend from the box out to the smallest and largest observations.
Boxplots are most useful for side-by-side comparison of several distributions.
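As a rough illustration, the five-number summary for the 24 start times of Case 1.2 can be assembled in a few lines of Python. The helper names below (`median`, `five_number_summary`) are assumptions made for this sketch; the quartiles follow the rule given earlier in this section rather than any particular software default.

```python
def median(values):
    """Middle observation, or the mean of the two middle observations."""
    ordered = sorted(values)
    n = len(ordered)
    mid = n // 2
    return ordered[mid] if n % 2 == 1 else (ordered[mid - 1] + ordered[mid]) / 2


def five_number_summary(values):
    """Return (minimum, Q1, M, Q3, maximum) using the text's quartile rule."""
    ordered = sorted(values)
    n = len(ordered)
    half = n // 2
    q1 = median(ordered[:half])        # median of the lower half
    q3 = median(ordered[n - half:])    # median of the upper half
    return ordered[0], q1, median(ordered), q3, ordered[-1]


start_times = [4, 11, 14, 23, 23, 23, 23, 24, 27, 29, 31, 33,
               40, 42, 44, 44, 44, 46, 47, 60, 61, 62, 65, 77]

print(five_number_summary(start_times))  # (4, 23.0, 36.5, 46.5, 77)
```

These five values are exactly what a boxplot draws: the ends of the lines, the edges of the box, and the line inside the box.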
You can draw boxplots either horizontally or vertically. Be sure to include a numer-
ical scale in the graph. When you look at a boxplot, first locate the median, which marks
the center of the distribution. Then look at the spread. The quartiles show the spread of the
middle half of the data, and the extremes (the smallest and largest observations) show
the spread of the entire data set. We now have the tools for a preliminary examination of
the customer service center call lengths.
EXAMPLE 1.24 Service Center Call Lengths

Data file: CALLCENTER80
Table 1.1 (page 14) displays the customer service center call lengths for a random sample of 80
calls that we discussed in Example 1.13. The five-number summary for these data is 1.0, 54.4,
103.5, 200, 2631. The distribution is highly skewed. The mean is 197 seconds, a value that is very
close to the third quartile. The boxplot is displayed in Figure 1.14. The skewness of the distribution
is the major feature that we see in this plot. Note that the mean is marked with a + and appears
very close to the upper edge of the box.
FIGURE 1.14 Boxplot for sample of 80 service center call lengths, for Example 1.24.
[Vertical axis: call length (seconds), from 0 to 3000; n = 80; the mean is marked with a +.]
Because of the skewness in this distribution, we selected a software option to plot
extreme points individually in Figure 1.14. This is one of several different ways to
improve the appearance of boxplots for particular data sets. These variations are called
modified boxplots.
Boxplots can show the symmetry or skewness of a distribution. In a symmetric
distribution, the first and third quartiles are equally distant from the median. This is not
what we see in Figure 1.14. Here, the distribution is skewed to the right. The third quartile
is farther above the median than the first quartile is below it. The extremes behave the
same way. Boxplots do not always give a clear indication of the nature of a skewed set of
data. For example, the quartiles may indicate right-skewness while the whiskers indicate
left-skewness.
Boxplots are particularly useful for comparing several distributions. Here is an
example.
EXAMPLE 1.25 Fuel Efficiency Sells Cars

Data file: MPG
Fuel efficiency has become a major issue for people thinking about buying a new car. The
Environmental Protection Agency provides data on the fuel efficiencies of vehicles sold in the United
States each year.³¹ Figure 1.15 gives side-by-side boxplots of the miles per gallon (mpg) for four
vehicle classes: convertibles, pickup trucks, SUVs, and small cars. Small cars appear to have better
efficiency than the other three classes. Pickup trucks show less variation than the other classes; the
range of mpg values is less, and the first and third quartiles are closer together. The distributions
for SUVs and small cars show some skewness, with some vehicles having particularly good fuel
efficiency. However, note that the mean (marked with a +) and the median are very close for all
four classes.
FIGURE 1.15 Side-by-side boxplots of fuel efficiency for selected model year 2009
vehicle classes, for Example 1.25.
[Vertical axis: miles per gallon, from 0 to 40; horizontal axis: car class (Convertible,
Pickup truck, SUV, Small car); the mean of each class is marked with a +.]
Data file: TIMETOSTART24
CASE 1.2 1.47 Time to start a business. Refer to the data on times to start a business
in 24 countries described in Case 1.2 on page 26. Use a boxplot to display the distribution.
Discuss the features of the data that you see in the boxplot, and compare it with
the stemplot in Figure 1.13. Which do you prefer? Give reasons for your answer.
Data file: STATCOURSE
1.48 First-exam scores. Here are the scores on the first exam in an introductory
statistics course for 10 students:
80 73 92 85 75 98 93 55 80 90
Display the distribution with a boxplot. Discuss whether or not a stemplot would
provide a better way to look at this distribution.
Measuring spread: the standard deviation

The five-number summary is not the most common numerical description of a distri-
bution. That distinction belongs to the combination of the mean to measure center and
the standard deviation to measure spread. The standard deviation measures spread by
looking at how far the observations are from their mean.
The Standard Deviation s

The variance $s^2$ of a set of observations is essentially the average of the squares of the
deviations of the observations from their mean. In symbols, the variance of n observations
$x_1, x_2, \ldots, x_n$ is
$$s^2 = \frac{(x_1 - \bar{x})^2 + (x_2 - \bar{x})^2 + \cdots + (x_n - \bar{x})^2}{n - 1}$$
or, more compactly,
$$s^2 = \frac{1}{n - 1} \sum (x_i - \bar{x})^2$$
The standard deviation s is the square root of the variance $s^2$:
$$s = \sqrt{\frac{1}{n - 1} \sum (x_i - \bar{x})^2}$$
Notice that the average in the variance $s^2$ divides the sum by 1 less than the
number of observations, that is, $n - 1$ rather than n. The reason is that the deviations
$x_i - \bar{x}$ always sum to exactly 0, so that knowing $n - 1$ of them determines the last one.
Only $n - 1$ of the squared deviations can vary freely, and we average by dividing the
total by $n - 1$. The number $n - 1$ is called the degrees of freedom of the variance or
standard deviation. Many calculators offer a choice between dividing by n and dividing
by $n - 1$, so be sure to use $n - 1$.
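The two facts in this paragraph, that the deviations sum to 0 and that dividing by n and by n − 1 gives different answers, can be checked on a small data set. The Python sketch below uses five hypothetical observations chosen only to keep the arithmetic simple; they do not come from the text. It assumes the standard `statistics` module, whose `variance` divides by n − 1 and whose `pvariance` divides by n.

```python
from math import isclose
from statistics import pvariance, variance

# Hypothetical observations, chosen only so the arithmetic is easy to follow.
data = [1, 2, 3, 4, 10]
n = len(data)
x_bar = sum(data) / n                               # 4.0

deviations = [x - x_bar for x in data]              # [-3.0, -2.0, -1.0, 0.0, 6.0]
print(sum(deviations))                              # 0.0

sum_of_squares = sum(d ** 2 for d in deviations)    # 50.0
divide_by_n_minus_1 = sum_of_squares / (n - 1)      # 12.5, the variance s^2 of the text
divide_by_n = sum_of_squares / n                    # 10.0, the other calculator option

# The standard library offers the same two choices.
print(isclose(divide_by_n_minus_1, variance(data)))  # True
print(isclose(divide_by_n, pvariance(data)))         # True
```

Because the first four deviations add to −6, the fifth is forced to be 6; that is the sense in which only n − 1 of them vary freely.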
In practice, use software or your calculator to obtain the standard deviation from
keyed-in data. Doing an example step-by-step will help you understand how the variance
and standard deviation work, however.
EXAMPLE 1.26 Hourly Wages

Data file: BLSWAGES
Planning to be a lawyer or other legal professional? The Bureau of Labor Statistics lists average
hourly wages for 9 categories of law-related occupations (OCC Code 23-0000) (the units are
dollars per hour):³²
75 38 27 48 23 23 20 20 26
First find the mean:
$$\bar{x} = \frac{75 + 38 + 27 + 48 + 23 + 23 + 20 + 20 + 26}{9} = \frac{300}{9} = 33.33 \text{ dollars per hour}$$
We organize the rest of the arithmetic in a table. This is a good way to do calculations such as this
when you need to work through all the details.
Observations    Deviations                 Squared deviations
$x_i$           $x_i - \bar{x}$            $(x_i - \bar{x})^2$
75              75 - 33.33 =  41.67        (41.67)^2  = 1736.39
38              38 - 33.33 =   4.67        (4.67)^2   =   21.81
27              27 - 33.33 =  -6.33        (-6.33)^2  =   40.07
48              48 - 33.33 =  14.67        (14.67)^2  =  215.21
23              23 - 33.33 = -10.33        (-10.33)^2 =  106.71
23              23 - 33.33 = -10.33        (-10.33)^2 =  106.71
20              20 - 33.33 = -13.33        (-13.33)^2 =  177.69
20              20 - 33.33 = -13.33        (-13.33)^2 =  177.69
26              26 - 33.33 =  -7.33        (-7.33)^2  =   53.73
                       sum =   0.03               sum = 2636.01
The variance is the sum of the squared deviations divided by 1 less than the number of observations:
$$s^2 = \frac{2636.01}{8} = 329.5$$
The standard deviation is the square root of the variance:
$$s = \sqrt{329.5} = 18.15 \text{ dollars per hour}$$
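The same steps can be retraced in code. The following Python sketch is an illustration, with variable names chosen here; it repeats the table's arithmetic for the nine wages and then compares the result with `statistics.stdev`, which also divides by n − 1.

```python
from math import isclose, sqrt
from statistics import stdev

# Average hourly wages (dollars per hour) from Example 1.26.
wages = [75, 38, 27, 48, 23, 23, 20, 20, 26]

n = len(wages)
x_bar = sum(wages) / n                                  # 300 / 9
squared_deviations = [(x - x_bar) ** 2 for x in wages]  # one entry per table row
s2 = sum(squared_deviations) / (n - 1)                  # divide by n - 1 = 8
s = sqrt(s2)

print(round(x_bar, 2), round(s2, 1), round(s, 2))       # 33.33 329.5 18.15
print(isclose(s, stdev(wages)))                         # True
```

Carried at full precision, the deviations sum to 0 and the squared deviations sum to 2636; the 0.03 and 2636.01 in the table are roundoff from using 33.33 for the mean.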
More important than the details of hand calculation are the properties that determine
the usefulness of the standard deviation:
• s measures spread about the mean and should be used only when the mean is chosen
as the measure of center.
• s = 0 only when there is no spread. This happens only when all observations have
the same value. Otherwise, s is greater than zero. As the observations become more
spread out about their mean, s gets larger.
• s has the same units of measurement as the original observations. For example, if
you measure wages in dollars per hour, s is also in dollars per hour.
• Like the mean $\bar{x}$, s is not resistant. Strong skewness or a few outliers can greatly
increase s.
Data files: TIMETOSTART24, TIMETOSTART25
CASE 1.2 1.49 Time to start a business. Verify the statement in the last bullet
above using the data on the time to start a business. First, use the 24 cases from Case
1.2 (page 26) to calculate a standard deviation. Next, include the country Suriname,
where the time to start a business is 694 days. Show that the inclusion of this single
outlier increases the standard deviation from 19 to 133.
You may rightly feel that the importance of the standard deviation is not yet clear.
We will see in the next section that the standard deviation is the natural measure of
spread for an important class of symmetric distributions, the Normal distributions. The
usefulness of many statistical procedures is tied to distributions with particular shapes.
This is certainly true of the standard deviation.
Choosing measures of center and spread

How do we choose between the five-number summary and $\bar{x}$ and s to describe the center
and spread of a distribution? Because the two sides of a strongly skewed distribution have
different spreads, no single number such as s describes the spread well. The five-number
summary, with its two quartiles and two extremes, does a better job.
Choosing a Summary
The five-number summary is usually better than the mean and standard deviation for
describing a skewed distribution or a distribution with extreme outliers. Use $\bar{x}$ and s only
for reasonably symmetric distributions that are free of outliers.

Data file: STATCOURSE
1.50 First-exam scores. Below are the scores on the first exam in an introductory
statistics course for 10 students. We found the mean of these scores in Exercise 1.42
(page 28) and the median in Exercise 1.45 (page 29).
80 73 92 85 75 98 93 55 80 90
(a) Make a stemplot of these data.
(b) Compute the standard deviation.
(c) Are the mean and the standard deviation effective in describing the distribution of
these scores? Explain your answer.
Data file: CALLCENTER80
1.51 Calls to a customer service center. We displayed the distribution of the lengths
of 80 calls to a customer service center in Figure 1.14 (page 32).
(a) Compute the mean and the standard deviation for these 80 calls (the data are
given in Table 1.1, page 14).
(b) Find the five-number summary.
(c) Which summary does a better job of describing the distribution of these calls?
Give reasons for your answer.
BEYOND THE BASICS: Risk and Return

A central principle in the study of investments is that taking bigger risks is rewarded by
higher returns, at least on the average over long periods of time. It is usual in finance
to measure risk by the standard deviation of returns, on the grounds that investments
whose returns show a large spread from year to year are less predictable and therefore
more risky than those whose returns have a small spread. Compare, for example, the
approximate mean and standard deviation of the annual percent returns on American
common stocks and U.S. Treasury bills over a fifty-year period starting in 1950:
Investment Mean return Standard deviation

Common stocks 14.0% 16.9%
Treasury bills 5.2% 2.9%
Stocks are risky. They went up 14% per year on the average during this period, but they
dropped almost 28% in the worst year. The large standard deviation reflects the fact that
stocks have produced both large gains and large losses. When you buy a Treasury bill,
on the other hand, you are lending money to the government for one year. You know
that the government will pay you back with interest. That is much less risky than buying
stocks, so (on the average) you get a smaller return.
FIGURE 1.16(a) Stemplot of the annual returns on Treasury bills for 50 years. The stems are
percents.
(a) T-bills
 0 | 9
 1 | 255668
 2 | 15779
 3 | 01155899
 4 | 24778
 5 | 112225668
 6 | 24569
 7 | 278
 8 | 048
 9 | 8
10 | 45
11 | 3
12 |
13 |
14 | 7
FIGURE 1.16(b) Stemplot of the annual returns on common stocks for 50 years. The stems
are percents.
(b) Stocks
-2 | 8
-1 | 91100
-0 | 9643
 0 | 000123899
 1 | 1334466678
 2 | 0112344457799
 3 | 0113467
 4 | 5
 5 | 0
Are $\bar{x}$ and s good summaries for distributions of investment returns? Figures 1.16(a)
and 1.16(b) display stemplots of the annual returns for both investments. You see that
returns on Treasury bills have a right-skewed distribution. Convention in the financial
world calls for $\bar{x}$ and s because some parts of investment theory use them. For describing
this right-skewed distribution, however, the five-number summary would be more
informative.
Remember that a graph gives the best overall picture of a distribution. Numerical
measures of center and spread report specific facts about a distribution, but they do not
describe its entire shape. Numerical summaries do not disclose the presence of multiple
peaks or gaps, for example. Always plot your data.
SECTION 1.2 Summary
• A numerical summary of a distribution should report its center and its spread or
variability.
• The mean $\bar{x}$ and the median M describe the center of a distribution in different
ways. The mean is the arithmetic average of the observations, and the median is the
midpoint of the values.
• When you use the median to indicate the center of the distribution, describe its spread
by giving the quartiles. The first quartile $Q_1$ has one-fourth of the observations
below it, and the third quartile $Q_3$ has three-fourths of the observations below it.
• The five-number summary consisting of the median, the quartiles, and the high
and low extremes provides a quick overall description of a distribution. The median
describes the center, and the quartiles and extremes show the spread.
• Boxplots based on the five-number summary are useful for comparing several
distributions. The box spans the quartiles and shows the spread of the central half of the
distribution. The median is marked within the box. Lines extend from the box to the
extremes and show the full spread of the data.
• The variance $s^2$ and especially its square root, the standard deviation s, are common
measures of spread about the mean as center. The standard deviation s is zero when
there is no spread and gets larger as the spread increases.
• A resistant measure of any aspect of a distribution is relatively unaffected by changes
in the numerical value of a small proportion of the total number of observations, no
matter how large these changes are. The median and quartiles are resistant, but the
mean and the standard deviation are not.
• The mean and standard deviation are good descriptions for symmetric distributions
without outliers. They are most useful for the Normal distributions, introduced in the
next section. The five-number summary is a better exploratory summary for skewed
distributions.
For Exercises 1.41 and 1.42, see page 28; for 1.43 to 1.45, see page 29; for 1.46, see
page 30; for 1.47 and 1.48, see pages 33–34; for 1.49, see page 35; and for 1.50 and 1.51,
see page 36.
Data file: COUNTRIES120
1.52 Gross domestic product growth in 120 countries. The gross domestic product (GDP)
of a country is the total value of all goods and services produced in the country. It is an
important measure of the health of a country's economy. For this exercise, you will analyze
the growth in GDP, expressed as a percent, for 120 countries.³³
(a) Compute the mean and the standard deviation.
(b) Which two countries are outliers for this variable?
(c) Recompute the mean and standard deviation without the outliers. Explain how the mean
and standard deviation changed when you deleted the outliers.
Data file: COUNTRIES120
1.53 Use the resistant measures for GDP. Repeat parts (a) and (c) of the previous exercise
using the median and the quartiles. Summarize your results and compare them with those of
the previous exercise.
Data file: COUNTRIES120
1.54 Trade balance for 120 countries. Trade balance is another important variable that
describes a country's economy. It is defined as the difference between the value of a
country's exports and its imports. A negative trade balance occurs when a country imports
more than it exports. Similarly, the trade balance will be positive for a country that
exports more than it imports. Note that values of this variable are missing for five
countries. In this data set, missing values are coded as periods.
(a) Describe the distribution of trade balance using the mean and the standard deviation.
(b) Do the same using the median and the quartiles.
(c) Using only the information from parts (a) and (b), give a description of the data.
Do not look at any graphical summaries or other numerical summaries for this part of the
exercise.
Data file: COUNTRIES120
1.55 What do the trade balance graphical summaries show? Refer to the previous exercise.
(a) Use graphical summaries to describe the distribution of the trade balance for these
countries.
(b) Give the names of the countries that correspond to extreme values in this distribution.
(c) Reanalyze the data without the outliers.
(d) Summarize what you have learned about the distribution of the trade balance for these
countries. Include appropriate graphical and numerical summaries as well as comments about
the outliers.
Data file: UNEMPLOYMENT
1.56 U.S. unemployment rates. Refer to Exercise 1.27 and Table 1.2 (page 23) for the U.S.
unemployment rates for each of the 50 states.
(a) Find the mean and the standard deviation.
(b) Find the five-number summary.
(c) Draw a boxplot.
(d) How do you prefer to summarize these data? Include numerical and graphical summaries
and explain the reasons for your preference.
Data file: UNEMPLOYMENTCANADA
1.57 Canadian unemployment rates. Unemployment rates for 10 Canadian provinces are given
in Exercise 1.28 (page 23). Answer the questions in the previous exercise for these data.
The U.S. data set has 50 cases while the Canadian data set has 10 cases. Discuss how this
difference influences the way in which you summarize the data.
Data files: UNEMPLOYMENT, UNEMPLOYMENTCANADA
1.58 Compare U.S. and Canadian unemployment rates. Refer to the previous two exercises.
(a) Use side-by-side boxplots to give a graphical summary of the two sets of unemployment
rates.
(b) Use a back-to-back stemplot to compare the two sets of rates. A back-to-back stemplot
has a single stem with leaves on the left for one group and leaves on the right for the
other.
(c) Summarize the major differences and similarities between the two sets of unemployment
rates.
(d) Which graphical comparison do you prefer? Give reasons for your answer.
Data file: OILWELLS
1.59 Recoverable oil. The estimated amounts of recoverable oil from 64 oil wells in the
Devonian Richmond Dolomite area of Michigan are given in Exercise 1.35 (page 25).
(a) Find the mean and the standard deviation.
(b) Find the five-number summary.
(c) Draw a boxplot.
(d) How do you prefer to summarize these data? Include numerical and graphical summaries
and explain the reasons for your preference.
Data file: POTATOES
1.60 Variability of an agricultural product. A quality product is one that is consistent
and has very little variability in its characteristics. Controlling variability can be more
difficult with agricultural products than with those that are manufactured. The following
table gives the weights, in ounces, of the 25 potatoes sold in a 10-pound bag.
7.8 7.9 8.2 7.3 6.7 7.9 7.9 7.9 7.6 7.8 7.0 4.7 7.6
6.3 4.7 4.7 4.7 6.3 6.0 5.3 4.3 7.9 5.2 6.0 3.7
by IBM, at $59,031 million; Microsoft, at $59,007 million; GE, at $53,086 million; and
Toyota, at $34,050 million. For this exercise you will use the brand values, reported in
millions of dollars, for the top 100 brands.
Data file: BRANDS
(a) Graphically display the distribution of the values of these brands.
(b) Use numerical measures to summarize the distribution.
(c) Write a short paragraph discussing the dollar values of the top 100 brands. Include the
results of your analysis.
Data file: BEER
1.62 The alcohol content of beer. Brewing beer involves a variety of steps that can affect
the alcohol content. A Web site gives the percent alcohol for 86 domestic brands of beer.³⁵
(a) Use graphical and numerical summaries of your choice to describe these data. Give
reasons for your choice.
(b) The data set contains an outlier. Explain why this particular beer is unusual and how
its outlier status is related to how it is marketed.
Data file: BEER
1.63 An outlier for alcohol content of beer. Refer to the previous exercise.
(a) Calculate the mean with and without the outlier. Do the same for the median. Explain
how these values change when the outlier is excluded.
(b) Calculate the standard deviation with and without the outlier. Do the same for the
quartiles. Explain how these values change when the outlier is excluded.
(c) Write a short paragraph summarizing what you have learned in this exercise.
Data file: BEER
1.64 Calories in beer. Refer to the previous two exercises. The data set also gives the
calories per 12 ounces of beverage.
(a) Analyze the data and summarize the distribution of calories for these 86 brands of beer.
(b) In Exercise 1.62 you identified one brand of beer as an outlier. To what extent is this
brand an outlier in the distribution of calories? Explain your answer.
(c) The distribution of calories suggests that there may be two groups of beers that might
be marketed differently. Examine the data file carefully and explain the characteristics of
the two groups.
(a) Summarize the data graphically and numerically. Give rea- 1.65 Create a data set. Create a data set for which the median
sons for the methods you chose to use in your summaries. would change by a large amount if the smallest observation is
(b) Do you think that your numerical summaries do an effective deleted.
job of describing these data? Why or why not?
(c) There appear to be two distinct clusters of weights for these 1.66 Salaries of the Chicago Cubs. The mean salary of the
potatoes. Divide the sample into two subsamples based on the players on the 2008 Chicago Cubs baseball team is $5,274,108,
clustering. Give the mean and standard deviation for each sub- while the median salary is $4,350,000. What explains the differ-
sample. Do you think that this way of summarizing these data is ence between these two measures of center?
better than a numerical summary that uses all the data as a single
1.67 Discovering outliers. Whether an observation is an out-
sample? Give a reason for your answer.
lier is a matter of judgment. It is convenient to have a rule for
1.61 The value of brands. A brand is a symbol or images that identifying suspected outliers. The 1.5 IQR rule is in common
are associated with a company. An effective brand identifies the use:
company and its products. Using a variety of measures, dollar 1. The interquartile range IQR is the distance between the first
values for brands can be calculated.34 The most valuable brand and third quartiles, IQR = Q 3 Q 1 . This is the spread of the
is Coca-Cola, with a value of $66,667 million. Coke is followed middle half of the data.
2. An observation is a suspected outlier if it lies more than 1.5 × IQR below the first
quartile $Q_1$ or above the third quartile $Q_3$.
The stemplot in Exercise 1.31 (page 24) displays the distribution of the percents of residents
aged 65 and older in the 50 states. Stemplots help you find the five-number summary because
they arrange the observations in increasing order.
DATA FILE: POPOVER65BYSTATE
(a) Give the five-number summary of this distribution.
(b) Does the 1.5 × IQR rule identify Alaska and Florida as suspected outliers? Does it also
flag any other states?
The following three exercises use the Mean and Median applet available at
www.whfreeman.com/psbe to explore the behavior of the mean and median.
1.68 Mean = median? (APPLET) Place two observations on the line by clicking below it. Why does
only one arrow appear?
1.69 Extreme observations. (APPLET) Place three observations on the line by clicking below it,
two close together near the center of the line and one somewhat to the right of these two.
(a) Pull the rightmost observation out to the right. (Place the cursor on the point, hold down
a mouse button, and drag the point.) How does the mean behave? How does the median behave?
Explain briefly why each measure acts as it does.
(b) Now drag the rightmost point to the left as far as you can. What happens to the mean? What
happens to the median as you drag this point past the other two (watch carefully)?
1.70 Don't change the median. (APPLET) Place 5 observations on the line by clicking below it.
(a) Add 1 additional observation without changing the median. Where is your new point?
(b) Use the applet to convince yourself that when you add yet another observation (there are
now 7 in all), the median does not change no matter where you put the 7th point. Explain why
this must be true.
1.71 $\bar{x}$ and s are not enough. The mean $\bar{x}$ and standard deviation s measure center
and spread but are not a complete description of a distribution. Data sets with different
shapes can have the same mean and standard deviation. To demonstrate this fact, find $\bar{x}$
and s for these two small data sets. Then make a stemplot of each and comment on the shape of
each distribution.
DATA FILE: ABDATA
Data A: 9.14 8.14 8.74 8.77 9.26 8.10 6.13 3.10 9.13 7.26 4.74
Data B: 6.58 5.76 7.71 8.84 8.47 7.04 5.25 5.56 7.91 6.89 12.50
CASE 1.1 1.72 Returns on Treasury bills. Figure 1.16(a) (page 37) is a stemplot of the annual
returns on U.S. Treasury bills for fifty years. (The entries are rounded to the nearest tenth
of a percent.)
DATA FILE: TBILLRATES50
(a) Use the stemplot to find the five-number summary of T-bill returns.
(b) The mean of these returns is about 5.19%. Explain from the shape of the distribution why
the mean return is larger than the median return.
1.73 Salary increase for the owners. Last year a small accounting firm paid each of its five
clerks $30,000, two junior accountants $65,000 each, and the firm's owner $355,000.
(a) What is the mean salary paid at this firm? How many of the employees earn less than the
mean? What is the median salary?
(b) This year the firm gives no raises to the clerks and junior accountants, while the owner's
take increases to $455,000. How does this change affect the mean? How does it affect the
median?
1.74 A skewed distribution. Sketch a distribution that is skewed to the left. On your sketch,
indicate the approximate position of the mean and the median. Explain why these two values are
not equal.
1.75 A standard deviation contest. You must choose four numbers from the whole numbers 10 to
20, with repeats allowed.
(a) Choose four numbers that have the smallest possible standard deviation.
(b) Choose four numbers that have the largest possible standard deviation.
(c) Is more than one choice possible in either (a) or (b)? Explain.
1.76 Imputation. Various problems with data collection can cause some observations to be
missing. Suppose a data set has 20 cases. Here are the values of the variable x for 10 of
these cases:
DATA FILE: IMPUTATION
17 6 12 14 20 23 9 12 16 21
The values for the other 10 cases are missing. One way to deal with missing data is called
imputation. The basic idea is that missing values are replaced, or imputed, with values that
are based on an analysis of the data that are not missing. For a data set with a single
variable, the usual choice of a value for imputation is the mean of the values that are not
missing. The mean for this data set is 15.
(a) Verify that the mean is 15 and find the standard deviation for the 10 cases for which x is
not missing.
(b) Create a new data set with 20 cases by setting the values for the 10 missing cases to 15.
Compute the mean and standard deviation for this data set.
(c) Summarize what you have learned about the possible effects of this type of imputation on
the mean and the standard deviation.
1.77 A different type of mean. The trimmed mean is a measure of center that is more resistant
than the mean but uses more of the available information than the median. To compute the 5%
trimmed mean, discard the highest 5% and the lowest 5% of the observations and compute the
mean of the remaining 90%. Trimming eliminates the effect of a small number of outliers. Use
the data on the values of the top 100 brands that we studied in Exercise 1.61 (page 39) to find
the 5% trimmed mean. Compare this result with the value of the mean computed in the usual way.
DATA FILE: BRANDS
1.3 Density Curves and the Normal Distributions

We now have a kit of graphical and numerical tools for describing distributions. What
is more, we have a clear strategy for exploring data on a single quantitative variable:
1. Always plot your data: make a graph, usually a histogram or a stemplot.
2. Look for the overall pattern (shape, center, spread) and for striking deviations
such as outliers.
3. Calculate a numerical summary to briefly describe center and spread.
Here is one more step to add to this strategy:
4. Sometimes the overall pattern of a large number of observations is so regular that
we can describe it by a smooth curve.
Density curves
A density curve is a mathematical model for a distribution. Mathematical models are
idealized descriptions. They allow us to easily make many statements in an idealized
world. The statements are useful when the idealized world is similar to the real world.
The density curves that we will study give a compact picture of the overall pattern of
data. They ignore minor irregularities as well as outliers. For some situations, we are
able to capture all of the essential characteristics of a distribution with a density curve.
For other situations, our idealized model misses some important characteristics. As with
so many things in statistics, your careful judgment is needed to decide what is important
and how close is good enough.
EXAMPLE 1.27 Gas Mileage

DATA FILE: MPG2009
Figure 1.17 is a histogram of the city gas mileage achieved by all 1140 motor vehicles (2009 model
year) listed in the government's annual fuel economy report.36 Superimposed on the histogram is
a density curve. The histogram shows that there are a few vehicles with very good fuel efficiency.
These are high outliers in the distribution. The distribution is somewhat skewed to the right,
reflecting the successful attempts of the auto industry to produce high-fuel-efficiency vehicles.
There is a single peak around 15 miles per gallon. Both tails fall off quite smoothly. The density
curve in Figure 1.17 is close to the histogram in many places but fails to capture some important
characteristics of the distribution displayed by the histogram.
FIGURE 1.17 Histogram of fuel efficiency (miles per gallon) of 1140 autos (model year 2009), for
Example 1.27. The smooth curve shows the overall shape of the distribution. (Vertical axis:
Percent; horizontal axis: Miles per gallon.)
If we use a density curve that ignores vehicles that are outliers, we would capture the
main features of the distribution of fuel efficiency for 2009 vehicles. On the other hand,
we would miss the fact that some of these vehicles have been engineered to give excellent
fuel efficiency. A marketing campaign based on this outstanding performance could be
very effective for selling vehicles in an economy with high fuel prices. Be careful about
how you deal with outliers. They may be data errors or they may be the most important
feature of the distribution. Computer software cannot make this judgment. Only you
can.
Here are some details about density curves. We need these basic ideas to understand
the rest of this chapter.
Density Curve
A density curve is a curve that
is always on or above the horizontal axis and
has area exactly 1 underneath it.
A density curve describes the overall pattern of a distribution. The area under the curve
and above any range of values is the proportion of all observations that fall in that range.
The median and mean of a density curve

Our measures of center and spread apply to density curves as well as to actual sets of
observations. The median and quartiles are easy. Areas under a density curve represent
proportions of the total number of observations. The median is the point with half the
observations on either side. So the median of a density curve is the equal-areas
point, the point with half the area under the curve to its left and the remaining half of
the area to its right. The quartiles divide the area under the curve into quarters. One-
fourth of the area under the curve is to the left of the first quartile, and three-fourths
of the area is to the left of the third quartile. You can roughly locate the median and
quartiles of any density curve by eye by dividing the area under the curve into four equal
parts.
EXAMPLE 1.28 Symmetric Density Curves

Because density curves are idealized patterns, a symmetric density curve is exactly symmetric.
The median of a symmetric density curve is therefore at its center. Figure 1.18(a) shows the median
of a symmetric curve.
FIGURE 1.18(a) The median and mean of a symmetric density curve, for Example 1.28.
The situation is different for skewed density curves. Here is an example.
EXAMPLE 1.29 Skewed Density Curves

It isn't so easy to spot the equal-areas point on a skewed curve. There are mathematical ways of
finding the median for any density curve. We did that to mark the median on the skewed curve in
Figure 1.18(b).
FIGURE 1.18(b) The median and mean of a right-skewed density curve, for Example 1.29.
1.78 Another skewed curve. Sketch a curve similar to Figure 1.18(b) for a left-
skewed density curve. Be sure to mark the location of the mean and the median.
What about the mean? The mean of a set of observations is their arithmetic average.
If we think of the observations as weights strung out along a thin rod, the mean is the
point at which the rod would balance. This fact is also true of density curves. The mean
is the point at which the curve would balance if made of solid material.
EXAMPLE 1.30 Mean and Median

Figure 1.19 illustrates this fact about the mean. A symmetric curve balances at its center because
the two sides are identical. The mean and median of a symmetric density curve are equal,
as in Figure 1.18(a). We know that the mean of a skewed distribution is pulled toward the long
tail. Figure 1.18(b) shows how the mean of a skewed density curve is pulled toward the long tail
more than is the median. It's hard to locate the balance point by eye on a skewed curve. There are
mathematical ways of calculating the mean for any density curve, so we are able to mark the mean
as well as the median in Figure 1.18(b).
FIGURE 1.19 The mean is the balance point of a density curve.
Median and Mean of a Density Curve

The median of a density curve is the equal-areas point, the point that divides the area
under the curve in half.
The mean of a density curve is the balance point, at which the curve would balance if
made of solid material.
The median and mean are the same for a symmetric density curve. They both lie at the
center of the curve. The mean of a skewed curve is pulled away from the median in the
direction of the long tail.
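The equal-areas point and the balance point can also be located numerically. The following Python sketch is an illustration only: the right-skewed density f(x) = 2(1 - x) on the interval from 0 to 1 is an assumed example that does not appear in the text, and the helper names are hypothetical.

```python
# Illustrative right-skewed density curve (an assumption, not from the text):
# f(x) = 2(1 - x) for 0 <= x <= 1, and 0 elsewhere.
def density(x):
    return 2.0 * (1.0 - x) if 0.0 <= x <= 1.0 else 0.0


def area(lo, hi, steps=2000):
    """Approximate the area under the density between lo and hi (midpoint rule)."""
    width = (hi - lo) / steps
    return sum(density(lo + (i + 0.5) * width) for i in range(steps)) * width


def point_with_area_to_left(target, lo=0.0, hi=1.0):
    """Bisection search for the point that has `target` area to its left."""
    for _ in range(40):
        mid = (lo + hi) / 2.0
        if area(0.0, mid) < target:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2.0


def balance_point(steps=2000):
    """Mean of the density: the average position weighted by the curve's height."""
    width = 1.0 / steps
    xs = [(i + 0.5) * width for i in range(steps)]
    return sum(x * density(x) for x in xs) * width


total_area = area(0.0, 1.0)
first_quartile = point_with_area_to_left(0.25)
median = point_with_area_to_left(0.50)
third_quartile = point_with_area_to_left(0.75)
mean = balance_point()

print(f"total area = {total_area:.4f}")
print(f"Q1 = {first_quartile:.4f}, median = {median:.4f}, Q3 = {third_quartile:.4f}")
print(f"mean = {mean:.4f}")
```

For this assumed curve the mean (about 0.33) lies to the right of the median (about 0.29), that is, in the direction of the long tail.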
We can roughly locate the mean, median, and quartiles of any density curve by eye.
This is not true of the standard deviation. When necessary, we can once again call on
more advanced mathematics to learn the value of the standard deviation. The study of
mathematical methods for doing calculations with density curves is part of theoretical
statistics. Though we are concentrating on statistical practice, we often make use of the
results of mathematical study.
Because a density curve is an idealized description of the distribution of data, we
need to distinguish between the mean and standard deviation of the density curve and
the mean $\bar{x}$ and standard deviation s computed from the actual observations. The usual
notation for the mean of an idealized distribution is $\mu$ (the Greek letter mu). We write
the standard deviation of a density curve as $\sigma$ (the Greek letter sigma).
1.79 A symmetric curve. Sketch a density curve that is symmetric but has a shape
different from that of the curve in Figure 1.18(a).
FIGURE 1.20 The density curve of a uniform distribution, for Exercise 1.80.
FIGURE 1.21 Three density curves, for Exercise 1.81. Panels (a), (b), and (c) each mark three
points A, B, and C.
1.80 A uniform distribution. Figure 1.20 displays the density curve of a uniform
distribution. The curve takes the constant value 1 over the interval from 0 to 1 and is
0 outside that range of values. This means that data described by this distribution take
values that are uniformly spread between 0 and 1. Use areas under this density curve
to answer the following questions.
(a) Why is the total area under this curve equal to 1?
(b) What percent of the observations lie above 0.8?
(c) What percent of the observations lie below 0.6?
(d) What percent of the observations lie between 0.25 and 0.75?
(e) What is the mean of this distribution?
1.81 Three curves. Figure 1.21 displays three density curves, each with three points
marked. At which of these points on each curve do the mean and the median fall?
Normal distributions
One particularly important class of density curves has already appeared in Figure 1.18(a).
These density curves are symmetric, single-peaked, and bell-shaped. They are called
Normal curves, and they describe Normal distributions. All Normal distributions have
the same overall shape. The exact density curve for a particular Normal distribution is
described by giving its mean $\mu$ and its standard deviation $\sigma$. The mean is located at
the center of the symmetric curve and is the same as the median. Changing $\mu$ without
changing $\sigma$ moves the Normal curve along the horizontal axis without changing its
spread. The standard deviation $\sigma$ controls the spread of a Normal curve. Figure 1.22
shows two Normal curves with different values of $\sigma$. The curve with the larger standard
deviation is more spread out.
FIGURE 1.22 Two Normal curves, showing the mean $\mu$ and the standard deviation $\sigma$.
The standard deviation $\sigma$ is the natural measure of spread for Normal distributions.
Not only do $\mu$ and $\sigma$ completely determine the shape of a Normal curve, but we can
locate $\sigma$ by eye on the curve. Here's how. Imagine that you are skiing down a mountain
that has the shape of a Normal curve. At first, you descend at an ever-steeper angle as
you go out from the peak:
Fortunately, before you find yourself going straight down, the slope begins to grow flatter
rather than steeper as you go out and down:
The points at which this change of curvature takes place are located along the
horizontal axis at distance $\sigma$ on either side of the mean $\mu$. Remember that $\mu$ and $\sigma$
alone do not specify the shape of most distributions, and that the shape of density curves
in general does not reveal $\sigma$. These are special properties of Normal distributions.
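The change of curvature can be checked numerically. This Python sketch uses the standard library's statistics.NormalDist and estimates the second derivative of the Normal density by a finite difference; the mean 0 and standard deviation 1 are arbitrary values chosen only for illustration, and the helper name curvature is hypothetical.

```python
from statistics import NormalDist

# Arbitrary illustrative values, not taken from the text.
mu, sigma = 0.0, 1.0
curve = NormalDist(mu, sigma)


def curvature(x, h=1e-4):
    """Central-difference estimate of the second derivative of the density at x."""
    return (curve.pdf(x + h) - 2.0 * curve.pdf(x) + curve.pdf(x - h)) / h**2


inside = curvature(mu + 0.5 * sigma)     # closer to the mean than one standard deviation
at_sigma = curvature(mu + sigma)         # exactly one standard deviation from the mean
outside = curvature(mu + 1.5 * sigma)    # farther out than one standard deviation

print(f"inside: {inside:.4f}, at sigma: {at_sigma:.6f}, outside: {outside:.4f}")
```

A negative value means the slope is still growing steeper as you move away from the peak, and a positive value means it has begun to flatten. The sign changes one standard deviation from the mean.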
Why are the Normal distributions important in statistics? Here are three reasons.
First, Normal distributions are good descriptions for some distributions of real data. Dis-
tributions that are often close to Normal include scores on tests taken by many people
(such as GMAT exams), repeated careful measurements of the same quantity (such as
measurements taken from a production process), and characteristics of biological popu-
lations (such as yields of corn). Second, Normal distributions are good approximations to
the results of many kinds of chance outcomes, such as tossing a coin many times. Third,
and most important, many of the statistical inference procedures that we will study in
later chapters are based on Normal distributions.
The 68–95–99.7 rule

Although there are many Normal curves, they all have common properties. In particular,
all Normal distributions obey the following rule.
The 68–95–99.7 Rule

In the Normal distribution with mean $\mu$ and standard deviation $\sigma$:
68% of the observations fall within $\sigma$ of the mean $\mu$.
95% of the observations fall within $2\sigma$ of $\mu$.
99.7% of the observations fall within $3\sigma$ of $\mu$.
Figure 1.23 illustrates the 68–95–99.7 rule. By remembering these three numbers,
you can think about Normal distributions without constantly making detailed
calculations.
EXAMPLE 1.31 Using the 68–95–99.7 Rule

The distribution of weights of 9-ounce bags of a particular brand of potato chips is approximately
Normal with mean $\mu = 9.12$ ounces and standard deviation $\sigma = 0.15$ ounce. Figure 1.24 shows
what the 68–95–99.7 rule says about this distribution.
FIGURE 1.23 The 68–95–99.7 rule for Normal distributions.
FIGURE 1.24 The 68–95–99.7 rule applied to the distribution of weights of bags of potato chips,
for Example 1.31.
Two standard deviations is 0.3 ounces for this distribution. The 95 part of the 68–95–99.7
rule says that the middle 95% of 9-ounce bags weigh between $9.12 - 0.3$ and $9.12 + 0.3$ ounces,
that is, between 8.82 ounces and 9.42 ounces. This fact is exactly true for an exactly Normal dis-
tribution. It is approximately true for the weights of 9-ounce bags of chips because the distribution
of these weights is approximately Normal.
The other 5% of bags have weights outside the range from 8.82 to 9.42 ounces. Because the
Normal distributions are symmetric, half of these bags are on the heavy side. So the heaviest 2.5%
of 9-ounce bags are heavier than 9.42 ounces.
The 99.7 part of the 68–95–99.7 rule says that almost all bags (99.7% of them) have weights
between $\mu - 3\sigma$ and $\mu + 3\sigma$. This range of weights is 8.67 to 9.57 ounces.
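The ranges in this example can be reproduced with software. The sketch below assumes Python's standard statistics.NormalDist as the software; the variable names are illustrative.

```python
from statistics import NormalDist

mu, sigma = 9.12, 0.15            # bag weights in ounces, from Example 1.31
weights = NormalDist(mu, sigma)

within = {}
for k in (1, 2, 3):
    low, high = mu - k * sigma, mu + k * sigma
    # Proportion between low and high = difference of two areas to the left.
    within[k] = weights.cdf(high) - weights.cdf(low)
    print(f"within {k} sd: {low:.2f} to {high:.2f} ounces, proportion {within[k]:.4f}")
```

The computed proportions are close to, but not exactly, 68%, 95%, and 99.7%. The rule rounds them so that they are easy to remember.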
Because we will mention Normal distributions often, a short notation is helpful. We
abbreviate the Normal distribution with mean $\mu$ and standard deviation $\sigma$ as $N(\mu, \sigma)$.
For example, the distribution of weights in the previous example is $N(9.12, 0.15)$.
1.82 Heights of young men. Product designers often must consider physical char-
acteristics of their target population. For example, the distribution of heights of men
aged 20 to 29 years is approximately Normal with mean 69 inches and standard de-
viation 2.5 inches. Draw a Normal curve on which this mean and standard deviation
are correctly located. (Hint: Draw the curve first, locate the points where the curvature
changes, then mark the horizontal axis.)
1.83 More on young men's heights. The distribution of heights of young men is
approximately Normal with mean 69 inches and standard deviation 2.5 inches. Use the
68–95–99.7 rule to answer the following questions.
(a) What percent of these men are taller than 74 inches?
(b) Between what heights do the middle 95% of young men fall?
(c) What percent of young men are shorter than 66.5 inches?
1.84 Test scores. Many states have programs for assessing the skills of students in
various grades. The Indiana Statewide Testing for Educational Progress (ISTEP) is
one such program.37 In a recent year, 76,531 tenth-grade Indiana students took the
English/language arts exam. The mean score was 572 and the standard deviation was
51. Assuming that these scores are approximately Normally distributed, $N(572, 51)$,
use the 68–95–99.7 rule to give a range of scores that includes 95% of these students.
1.85 Use the 68–95–99.7 rule. Refer to the previous exercise. Use the 68–95–99.7
rule to give a range of scores that includes 99.7% of these students.
The standard Normal distribution

As the 68–95–99.7 rule suggests, all Normal distributions share many common properties.
In fact, all Normal distributions are the same if we measure in units of size $\sigma$ about
the mean $\mu$ as center. Changing to these units is called standardizing. To standardize a
value, subtract the mean of the distribution and then divide by the standard deviation.
Standardizing and z-Scores

If x is an observation from a distribution that has mean $\mu$ and standard deviation $\sigma$, the
standardized value of x is
$$z = \frac{x - \mu}{\sigma}$$
A standardized value is often called a z-score.
A z-score tells us how many standard deviations the original observation falls away
from the mean, and in which direction. Observations larger than the mean are pos-
itive when standardized, and observations smaller than the mean are negative when
standardized.
EXAMPLE 1.32 Standardizing Potato Chip Bag Weights

The weights of 9-ounce potato chip bags are approximately Normal with $\mu = 9.12$ ounces and
$\sigma = 0.15$ ounce. The standardized weight is
$$z = \frac{\text{weight} - 9.12}{0.15}$$
A bag's standardized weight is the number of standard deviations by which its weight differs from
the mean weight of all bags. A bag weighing 9.3 ounces, for example, has standardized weight
$$z = \frac{9.3 - 9.12}{0.15} = 1.2$$
or 1.2 standard deviations above the mean. Similarly, a bag weighing 8.7 ounces has standardized
weight
$$z = \frac{8.7 - 9.12}{0.15} = -2.8$$
or 2.8 standard deviations below the mean bag weight.
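The same standardization is a one-line calculation in software. A minimal Python sketch, in which the function name z_score is a hypothetical helper:

```python
def z_score(x, mu, sigma):
    """Standardized value: how many standard deviations x lies from the mean."""
    return (x - mu) / sigma


mu, sigma = 9.12, 0.15               # bag weights in ounces, from Example 1.32
heavy_bag = z_score(9.3, mu, sigma)  # a bag above the mean gives a positive z-score
light_bag = z_score(8.7, mu, sigma)  # a bag below the mean gives a negative z-score

print(f"z for 9.3 ounces: {heavy_bag:.1f}")
print(f"z for 8.7 ounces: {light_bag:.1f}")
```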
If the variable we standardize has a Normal distribution, standardizing does more

than give a common scale. It makes all Normal distributions into a single distribution, and
this distribution is still Normal. Standardizing a variable that has any Normal distribution
produces a new variable that has the standard Normal distribution.
Standard Normal Distribution

The standard Normal distribution is the Normal distribution N (0, 1) with mean 0 and
standard deviation 1.
If a variable x has any Normal distribution $N(\mu, \sigma)$ with mean $\mu$ and standard
deviation $\sigma$, then the standardized variable
$$z = \frac{x - \mu}{\sigma}$$
has the standard Normal distribution.
1.86 SAT versus ACT. Eleanor scores 680 on the Mathematics part of the SAT. The
distribution of SAT scores in a reference population is Normal, with mean 500 and
standard deviation 100. Gerald takes the American College Testing (ACT) Mathematics
test and scores 27. ACT scores are Normally distributed with mean 18 and standard
deviation 6. Find the standardized scores for both students. Assuming that both tests
measure the same kind of ability, who has the higher score?
Normal distribution calculations

Areas under a Normal curve represent proportions of observations from that Normal
distribution. There is no easy formula for areas under a Normal curve. To find areas of
interest, either software that calculates areas or a table of areas can be used. The table
and most software calculate one kind of area: cumulative proportions. A cumulative
proportion is the proportion of observations in a distribution that lie at or below a given
value. When the distribution is given by a density curve, the cumulative proportion is
the area under the curve to the left of a given value. Figure 1.25 shows the idea more
clearly than words do.
The key to calculating Normal proportions is to match the area you want with areas
that represent cumulative proportions. Then get areas for cumulative proportions. The
following examples illustrate the methods.
FIGURE 1.25 The cumulative proportion for a value x is the proportion of all observations
from the distribution that are less than or equal to x. This is the area to the left of x under
the Normal curve.
EXAMPLE 1.33 The NCAA Standard for SAT Scores

The National Collegiate Athletic Association (NCAA) requires Division I athletes to get a com-
bined score of at least 820 on the SAT Mathematics and Verbal tests to compete in their first
college year. (Higher scores are required for students with poor high school grades.) The scores
of the 1.4 million students in the class of 2003 who took the SATs were approximately Normal
with mean 1026 and standard deviation 209. What proportion of all students had SAT scores of at
least 820?
Here is the calculation in pictures: the proportion of scores above 820 is the area under the
curve to the right of 820. That's the total area under the curve (which is always 1) minus the
cumulative proportion up to 820.
$$\text{area right of 820} = \text{total area} - \text{area left of 820}$$
$$0.8379 = 1 - 0.1621$$
That is, the proportion of all SAT takers who would be NCAA qualifiers is 0.8379, or about 84%.
There is no area under a smooth curve and exactly over the point 820. Consequently,
the area to the right of 820 (the proportion of scores $> 820$) is the same as the area at or
to the right of this point (the proportion of scores $\ge 820$). The actual data may contain a
student who scored exactly 820 on the SAT. That the proportion of scores exactly equal
to 820 is 0 for a Normal distribution is a consequence of the idealized smoothing of
Normal distributions for data.
EXAMPLE 1.34 NCAA Partial Qualifiers

The NCAA considers a student a "partial qualifier" (eligible to practice and receive an athletic
scholarship, but not to compete) if the combined SAT score is at least 720. What proportion of
all students who take the SAT would be partial qualifiers? That is, what proportion have scores
between 720 and 820? Here are the pictures:
$$\text{area between 720 and 820} = \text{area left of 820} - \text{area left of 720}$$
$$0.0905 = 0.1621 - 0.0716$$
About 9% of all students who take the SAT have scores between 720 and 820.
How do we find the numerical values of the areas in Examples 1.33 and 1.34? If
you use software, just plug in mean 1026 and standard deviation 209. Then ask for the
cumulative proportions for 820 and for 720. (Your software will probably refer to these
as cumulative probabilities. We will learn in Chapter 4 why the language of probability
fits.) If you make a sketch of the area you want, you will rarely go wrong.
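As one possible illustration of the software route, the sketch below uses Python's standard statistics.NormalDist, whose cdf method returns a cumulative proportion. The choice of Python is an assumption, since the text does not name a particular package.

```python
from statistics import NormalDist

sat = NormalDist(mu=1026, sigma=209)   # SAT scores, from Example 1.33

below_820 = sat.cdf(820)               # cumulative proportion up to 820
below_720 = sat.cdf(720)               # cumulative proportion up to 720

at_least_820 = 1 - below_820           # Example 1.33: area right of 820
between_720_and_820 = below_820 - below_720   # Example 1.34: area between

print(f"proportion at least 820:     {at_least_820:.4f}")
print(f"proportion from 720 to 820:  {between_720_and_820:.4f}")
```

Both results are built from cumulative proportions only, matching the pictures in the two examples.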
You can use the Normal Curve applet on the text CD and Web site to find Normal
proportions. The applet is more flexible than most software: it will find any Normal
proportion, not just cumulative proportions. The applet is an excellent way to understand
Normal curves. But, because of the limitations of Web browsers, the applet is not as
accurate as statistical software.
If you are not using software, you can find cumulative proportions for Normal curves
from a table. That requires an extra step, as we now explain.
Using the standard Normal table

The extra step in finding cumulative proportions from a table is that we must first stan-
dardize to express the problem in the standard scale of z-scores. This allows us to get by
with just one table, a table of standard Normal cumulative proportions. Table A in the
back of the book gives cumulative proportions for the standard Normal distribution. Ta-
ble A also appears on the inside front cover. The pictures at the top of the table remind us
that the entries are cumulative proportions, areas under the curve to the left of a value z.
EXAMPLE 1.35 Find the Proportion from z

What proportion of observations on a standard Normal variable Z take values less than z = 1.47?
Solution: To find the area to the left of 1.47, locate 1.4 in the left-hand column of Table A, then
locate the remaining digit 7 as .07 in the top row. The entry opposite 1.4 and under .07 is 0.9292.
This is the cumulative proportion we seek. Figure 1.26 illustrates this area.
Now that you see how Table A works, let's redo the NCAA Examples 1.33 and 1.34
using the table.
FIGURE 1.26 The area under the standard Normal curve to the left of the point z = 1.47 is
0.9292, for Example 1.35.
EXAMPLE 1.36 Find the Proportion from x

What proportion of all students who take the SAT have scores of at least 820? The picture that leads
to the answer is exactly the same as in Example 1.33. The extra step is that we first standardize in
order to read cumulative proportions from Table A. If X is SAT score, we want the proportion of
students for whom X ≥ 820.
Step 1. Standardize. Subtract the mean, then divide by the standard deviation, to transform the
problem about X into a problem about a standard Normal Z:
$$X \ge 820$$
$$\frac{X - 1026}{209} \ge \frac{820 - 1026}{209}$$
$$Z \ge -0.99$$
Step 2. Use the table. Look at the pictures in Example 1.33. From Table A, we see that the
proportion of observations less than −0.99 is 0.1611. The area to the right of −0.99 is therefore
1 − 0.1611 = 0.8389. This is about 84%.
The area from the table in Example 1.36 (0.8389) is slightly less accurate than the
area from software in Example 1.33 (0.8379) because we must round z to two places
when we use Table A. The difference is rarely important in practice.
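The small gap between the table answer and the software answer can be reproduced directly. This sketch assumes Python's standard `statistics.NormalDist` and the N(1026, 209) model for SAT scores used in the standardization above; it computes the proportion once with the unrounded z-score and once with z rounded to two places, as Table A requires. The variable names are illustrative.

```python
from statistics import NormalDist

# SAT score model used in Example 1.36: mean 1026, standard deviation 209.
sat = NormalDist(mu=1026, sigma=209)

# Software-style answer: work with the unrounded value directly.
exact = 1 - sat.cdf(820)

# Table-style answer: standardize, round z to two decimal places, then look up.
z = round((820 - 1026) / 209, 2)
table_style = 1 - NormalDist().cdf(z)

print(f"Unrounded z:       proportion = {exact:.4f}")
print(f"z rounded to {z}: proportion = {table_style:.4f}")
```

The two answers differ only in the third decimal place, which is why the rounding rarely matters in practice.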
EXAMPLE 1.37 Proportion of Partial Qualifiers

What proportion of all students who take the SAT would be partial qualifiers in the eyes of the
NCAA? That is, what proportion of students have SAT scores between 720 and 820? First, sketch
the areas, exactly as in Example 1.34. We again use X as shorthand for an SAT score.
Step 1. Standardize.
$$720 \le X < 820$$
$$\frac{720 - 1026}{209} \le \frac{X - 1026}{209} < \frac{820 - 1026}{209}$$
$$-1.46 \le Z < -0.99$$
Step 2. Use the table.
area between −1.46 and −0.99 = (area left of −0.99) − (area left of −1.46)
= 0.1611 − 0.0721 = 0.0890
As in Example 1.34, about 9% of students would be partial qualifiers.
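The "area between" calculation is a difference of two cumulative proportions, and software can do the subtraction without standardizing first. Here is a hedged sketch using Python's standard `statistics.NormalDist`; the helper name `proportion_between` is hypothetical and introduced only for illustration.

```python
from statistics import NormalDist


def proportion_between(dist, low, high):
    """Area under the Normal curve `dist` between low and high."""
    return dist.cdf(high) - dist.cdf(low)


# SAT score model used in Example 1.37: mean 1026, standard deviation 209.
sat = NormalDist(mu=1026, sigma=209)

partial_qualifiers = proportion_between(sat, 720, 820)
print(f"Proportion with scores between 720 and 820: {partial_qualifiers:.3f}")
```

Because a Normal distribution is continuous, it makes no difference whether the endpoints 720 and 820 are included.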

Sometimes we encounter a value of z more extreme than those appearing in Table A.
For example, the area to the left of z = −4 is not given directly in the table. The z-values
in Table A leave only area 0.0002 in each tail unaccounted for. For practical purposes,
we can act as if there is zero area outside the range of Table A.
1.87 Find the proportion. Use the fact that the ISTEP scores from Exercise 1.84
(page 48) are approximately Normal, N (572, 51). Find the proportion of students who
have scores less than 600. Find the proportion of students who have scores greater than
or equal to 600. Sketch the relationship between these two calculations using pictures
of Normal curves similar to the ones given in Example 1.33.
1.88 Find another proportion. Use the fact that the ISTEP scores are approximately
Normal, N (572, 51). Find the proportion of students who have scores between 600
and 650. Use pictures of Normal curves similar to the ones given in Example 1.34 to
illustrate your calculations.
Inverse Normal calculations

Examples 1.33 to 1.36 illustrate the use of Normal distributions to find the proportion
of observations in a given event, such as SAT score between 720 and 820. We may
instead want to find the observed value corresponding to a given proportion.
Statistical software will do this directly. Without software, use Table A backward,
finding the desired proportion in the body of the table and then reading the corresponding
z from the left column and top row.
EXAMPLE 1.38 How High for the Top 10%?

Scores on the SAT Verbal test in recent years follow approximately the N (505, 110) distribution.
How high must a student score in order to place in the top 10% of all students taking the SAT?
Again, the key to the problem is to draw a picture. Figure 1.27 shows that we want the score
x with area above it 0.10. That's the same as area below x equal to 0.90.
Statistical software has a function that will give you the x for any cumulative proportion you
specify. The function often has a name such as inverse cumulative probability. Plug in mean 505,
standard deviation 110, and cumulative proportion 0.9. The software tells you that x = 645.97.
We see that a student must score at least 646 to place in the highest 10%.
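As one concrete instance of such a function, Python's standard `statistics.NormalDist` class provides an `inv_cdf` method. The sketch below assumes Python 3.8 or later; the variable names are illustrative.

```python
from statistics import NormalDist

# SAT Verbal model from Example 1.38: mean 505, standard deviation 110.
sat_verbal = NormalDist(mu=505, sigma=110)

# Inverse cumulative probability: the x with cumulative proportion 0.90,
# which is the same as the x with area 0.10 above it.
cutoff = sat_verbal.inv_cdf(0.90)

print(f"Score needed for the top 10%: {cutoff:.2f}")

# Check: the area above the cutoff should be 0.10.
area_above = 1 - sat_verbal.cdf(cutoff)
print(f"Area above the cutoff: {area_above:.2f}")
```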
FIGURE 1.27 Locating the point on a Normal curve with area 0.10 to its right, for Example
1.38. The result is x = 646, or z = 1.28 in the standard scale.
Without software, first find the standard score z with cumulative proportion 0.9, then
unstandardize to find x. Here is the two-step process:
1. Use the table. Look in the body of Table A for the entry closest to 0.9. It is 0.8997.
This is the entry corresponding to z = 1.28. So z = 1.28 is the standardized value
with area 0.9 to its left.
2. Unstandardize to transform the solution from z back to the original x scale. We
know that the standardized value of the unknown x is z = 1.28. So x itself satisfies
$$\frac{x - 505}{110} = 1.28$$
Solving this equation for x gives
x = 505 + (1.28)(110) = 645.8
This equation should make sense: it finds the x that lies 1.28 standard deviations
above the mean on this particular Normal curve. That is the unstandardized
meaning of z = 1.28. The general rule for unstandardizing a z-score is
$$x = \mu + z\sigma$$
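The two-step process above amounts to one line of arithmetic. The following sketch writes standardizing and unstandardizing as a pair of small Python functions; the function names are hypothetical and chosen only for illustration.

```python
def standardize(x, mean, sd):
    """Return the z-score of x: how many standard deviations x lies from the mean."""
    return (x - mean) / sd


def unstandardize(z, mean, sd):
    """Return the x that lies z standard deviations from the mean: x = mean + z * sd."""
    return mean + z * sd


# Table-based answer for Example 1.38: z = 1.28 on the N(505, 110) curve.
x = unstandardize(1.28, mean=505, sd=110)
print(f"x = {x:.1f}")

# Standardizing the result recovers the z-score we started from.
print(f"z = {standardize(x, mean=505, sd=110):.2f}")
```

The table-based x is slightly smaller than the software value because z = 1.28 is only the closest two-decimal entry in Table A.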
1.89 What score is needed to be in the top 5%? Consider the ISTEP scores, which
are approximately Normal, N (572, 51). How high a score is needed to be in the top
5% of students who take this exam?
1.90 Find the score that 60% of students will exceed. Consider the ISTEP scores,
which are approximately Normal, N (572, 51). Sixty percent of the students will score
above x on this exam. Find x.
Assessing the Normality of data

The Normal distributions provide good models for some distributions of real data. Ex-
amples include the miles per gallon ratings of vehicles, average payrolls of Major League
Baseball teams, and statewide unemployment rates. The distributions of some other com-
mon variables are usually skewed and therefore distinctly non-Normal. Examples include
personal income, gross sales of business firms, and the service lifetime of mechanical or
electronic components. While experience can suggest whether or not a Normal model is
plausible in a particular case, it is risky to assume that a distribution is Normal without
actually inspecting the data.
The decision to describe a distribution by a Normal model may determine the later
steps in our analysis of the data. Calculations of proportions, as we have done above,
and statistical inference based on such calculations follow from the choice of a model.
How can we judge whether data are approximately Normal?
A histogram or stemplot can reveal distinctly non-Normal features of a distribution,
such as outliers, pronounced skewness, or gaps and clusters. If the stemplot or histogram
appears roughly symmetric and single-peaked, however, we need a more sensitive way
to judge the adequacy of a Normal model. The most useful tool for assessing Normality
is another graph, the Normal quantile plot.
Some software calls these graphs Normal probability plots. There is a technical distinction between the two
types of graphs, but the terms are often used loosely.
Here is the idea of a simple version of a Normal quantile plot. It is not feasible
to make Normal quantile plots by hand, but software makes them for us, using more
sophisticated versions of this basic idea.
1. Arrange the observed data values from smallest to largest. Record what percentile
of the data each value occupies. For example, the smallest observation in a set of 20
is at the 5% point, the second smallest is at the 10% point, and so on.
2. Find the same percentiles for the Normal distribution using Table A or statistical
software. Percentiles of the standard Normal distribution are often called Normal
scores. For example, z = −1.645 is the 5% point of the standard Normal
distribution, and z = −1.282 is the 10% point.
3. Plot each data point x against the corresponding Normal score z. If the data
distribution is close to standard Normal, the plotted points will lie close to the
45-degree line x = z. If the data distribution is close to any Normal distribution, the
plotted points will lie close to some straight line.
Any Normal distribution produces a straight line on the plot because standardizing
turns any Normal distribution into a standard Normal distribution. Standardizing is a
linear transformation that can change the slope and intercept of the line in our plot but
cannot turn a line into a curved pattern.
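The three steps can be sketched in a few lines of Python using the standard `statistics.NormalDist` class. Two choices here are assumptions rather than anything stated in the text. First, the percentile assigned to the i-th smallest of n observations is (i − 0.5)/n, one common convention; the simple version described above uses i/n, which would place the largest observation at the 100% point, where no finite Normal score exists. Second, the data are made up so that they sit exactly at the quantiles of N(100, 15). The function names are hypothetical.

```python
from statistics import NormalDist


def normal_scores(n):
    """Normal scores for n ordered observations, at percentiles (i - 0.5) / n."""
    standard_normal = NormalDist()
    return [standard_normal.inv_cdf((i - 0.5) / n) for i in range(1, n + 1)]


def quantile_plot_points(data):
    """Pair each sorted observation x with its Normal score z, as (z, x)."""
    xs = sorted(data)
    return list(zip(normal_scores(len(xs)), xs))


# Made-up data placed exactly at the quantiles of N(100, 15).
scores = normal_scores(20)
data = [100 + 15 * z for z in scores]

points = quantile_plot_points(data)
for z, x in points[:3]:
    print(f"z = {z:6.3f}   x = {x:7.2f}")

# Every point satisfies x = 100 + 15 z: a straight line with intercept equal
# to the mean and slope equal to the standard deviation.
on_line = all(abs(x - (100 + 15 * z)) < 1e-9 for z, x in points)
print("All points on the line x = 100 + 15 z:", on_line)
```

Real data will scatter around such a line rather than fall exactly on it, which is why the plots in this section are judged by their overall pattern.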
Use of Normal Quantile Plots

If the points on a Normal quantile plot lie close to a straight line, the plot indicates that
the data are Normal. Systematic deviations from a straight line indicate a non-Normal
distribution. Outliers appear as points that are far away from the overall pattern of the plot.
Figures 1.28 to 1.31 are Normal quantile plots for data we have met earlier. The data
x are plotted vertically against the corresponding Normal scores z plotted horizontally.
For small data sets, the z axis extends from −3 to 3 because almost all of a standard
Normal curve lies between these values. With larger sample sizes, values in the extremes
are more likely, and the z axis will extend farther from zero. These figures show how
Normal quantile plots behave.
EXAMPLE 1.39 IQ Scores Are Normal
DATA FILE: IQ
In Example 1.18 we examined the distribution of IQ scores for a sample of 60 fifth-grade students.
Figure 1.28 gives a Normal quantile plot for these data. Notice that the points have a pattern
that is pretty close to a straight line. This pattern indicates that the distribution is approximately
Normal. When we constructed a histogram of the data in Figure 1.11 (page 18), we noted that
the distribution has a single peak, is approximately symmetric, and has tails that decrease in a
smooth way. We can now add to that description by stating that the distribution is approximately
Normal.
Figure 1.28 does, of course, show some deviation from a straight line. Real data
almost always show some departure from the theoretical Normal model. It is important
to confine your examination of a Normal quantile plot to searching for shapes that show
clear departures from Normality. Don't overreact to minor wiggles in the plot. When we
discuss statistical methods that are based on the Normal model, we will pay attention to
the sensitivity of each method to departures from Normality. Many common methods
work well as long as the data are reasonably symmetric and outliers are not present.
FIGURE 1.28 Normal quantile plot for the IQ data, for Example 1.39. This pattern
indicates that the data are approximately Normal. (Vertical axis: IQ; horizontal axis:
Normal score.)
EXAMPLE 1.40 T-bill Interest Rates Are Not Normal
CASE 1.1 DATA FILE: TBILLRATES
We made a histogram for the distribution of interest rates for T-bills in Example 1.12 (page 12).
A Normal quantile plot for these data is shown in Figure 1.29. This plot shows some interesting
features of the distribution. First, in the central part, from about z = −2 to z = 1, the points fall
approximately on a straight line. This suggests that the distribution is approximately Normal in
this range. Then there is the region from slightly above z = 1 to slightly above z = 2, where
the points also fall approximately on a straight line. This line, however, has a different slope.
Combined, these features suggest that the distribution of interest rates may actually be a mixture
or a combination of two Normal populations. Finally, in both the lower and the upper extremes
the points flatten out. This occurs at an interest rate of around 1% for the lower tail and at 15% for
the upper tail. There may be some market considerations that restrain interest rates from going
outside these bounds.

FIGURE 1.29 Normal quantile plot for the T-bill interest rates, for Example 1.40. These data
are not approximately Normal. (Vertical axis: interest rate, in percent; horizontal axis:
Normal score.)
The idea that distributions are approximately Normal within a range of values is an
old tradition. The remark "All distributions are approximately Normal in the middle"
has been attributed to the statistician Charlie Winsor.38
CASE 1.2 1.91 Length of time to start a business. In Exercise 1.40 we noted that
the sample of times to start a business from 25 countries contained an outlier. For
Suriname, the reported time is 694 days. This case is the most extreme in the entire
data set, which includes 195 countries. Figure 1.30 shows the Normal quantile plot for
these data with Suriname excluded.
DATA FILE: TIMETOSTART25

(a) These data are skewed to the right. How does this feature appear in the Normal
quantile plot?
(b) Compare the shape of the upper portion of this Normal quantile plot with the
upper portion of the plot for the T-bill interest rates in Figure 1.29, and with the
upper portion of the plot for the IQ scores in Figure 1.28. Make a general
statement about what the shape of the upper portion of a Normal quantile plot
tells you about the upper tail of a distribution.
1.92 Customer service center call lengths. Figure 1.31 is a Normal quantile plot
for the customer center call lengths. We looked at these data in Example 1.14, and we
examined the distribution using a histogram in Figure 1.8 (page 14). There are clearly
some very large outliers. In making the Normal quantile plot, we eliminated all calls
that lasted longer than 2 hours (7200 seconds). This distribution is strongly skewed to
the right. How does this show up in the Normal quantile plot?
DATA FILE: CALLCENTER
BEYOND THE BASICS: Density Estimation

A density curve gives a compact summary of the overall shape of a distribution. Fig-
ure 1.17 (page 41) shows a Normal density curve that summarizes the distribution of
miles per gallon ratings for 1140 vehicles. It captures some characteristics of the distri-
bution but misses others.
Many distributions do not have the Normal shape. There are other families of density
curves that are used as mathematical models for various distribution shapes.
Modern software offers a more flexible option: density estimation. A density
estimator does not start with any specific shape, such as the Normal shape. It looks at
the data and draws a density curve that describes the overall shape of the data.
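The text does not say which method the software uses. One widely used approach is kernel density estimation, which places a small Normal "bump" at each observation and averages the bumps. The sketch below is an illustration under that assumption, not a description of any particular package; the function name, the bandwidth of 2.0, and the data are all made up for the example.

```python
import math


def gaussian_kde(data, bandwidth):
    """Return a density estimate f(x): the average of Normal bumps, one per data value."""
    n = len(data)
    scale = 1.0 / (n * bandwidth * math.sqrt(2.0 * math.pi))

    def density(x):
        return scale * sum(math.exp(-0.5 * ((x - xi) / bandwidth) ** 2) for xi in data)

    return density


# Made-up data with two clusters, for illustration only.
sample = [18, 19, 20, 20, 21, 22, 34, 35, 36, 36, 37, 38]
f = gaussian_kde(sample, bandwidth=2.0)

# Like any density curve, the estimate should have total area close to 1.
# Here the area is approximated by a Riemann sum over the range 0 to 60.
step = 0.1
total_area = sum(f(i * step) * step for i in range(601))
print(f"Total area under the estimate: {total_area:.3f}")

# The estimate is high near each cluster and low in the gap between them.
print(f"f(20) = {f(20):.4f}   f(28) = {f(28):.4f}   f(36) = {f(36):.4f}")
```

The bandwidth controls how smooth the curve is: a larger value blurs the two clusters together, and a smaller value follows individual observations more closely.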

FIGURE 1.30 Normal quantile plot for the length of time required to start a business, for
Exercise 1.91. Suriname, with a time of 694 days, has been excluded. (Vertical axis: time to
start a business, in days; horizontal axis: Normal score.)

FIGURE 1.31 Normal quantile plot for the customer service center call lengths, for Exercise
1.92. Data for calls lasting more than 7200 seconds (2 hours) have been excluded. (Vertical
axis: call length, in seconds; horizontal axis: Normal score.)
EXAMPLE 1.41 Fuel Efficiency Data
DATA FILE: MPG2009
Figure 1.32 gives the histogram of the miles per gallon distribution with a density estimate produced
by software. Compare this figure with Figure 1.17 (page 41). Notice how the density estimate
captures more of the unusual features of the distribution than the Normal density curve does.
FIGURE 1.32 Histogram of fuel efficiency for 1140 vehicles, with a density estimate, for
Example 1.41. (Vertical axis: percent; horizontal axis: miles per gallon.)
Density estimates can capture other unusual features of a distribution. Here is an example.
EXAMPLE 1.42 StubHub!

DATA FILE: STUBHUB
StubHub! is a Web site where fans can buy and sell tickets to sporting events. Ticket holders
wanting to sell their tickets provide the location of their seats and the selling price. People wanting
to buy tickets can choose from among the tickets offered for a given event.39
On Saturday, October 18, 2008, the eleventh-ranked Missouri football team was scheduled
to play the first-ranked Texas team in Austin. On Thursday, October 16, 2008, StubHub! listed
64 pairs of tickets for the game. One pair was offered at $883 per ticket. It was noted that these
seats were in a suite and that food and bar were included. We discarded this outlier and examined
the distribution of the price per ticket for the remaining 63 pairs of tickets. The histogram with
a density estimate is given in Figure 1.33. The distribution has two peaks, one around $160 and
another around $360. This is the identifying characteristic of a bimodal distribution. Since the
stadium has upper- and lower-level seats, we suspect that the difference in price between these
two types of seats is responsible for the two peaks. (Texas won 56 to 31.)
FIGURE 1.33 Histogram of StubHub! price per seat for tickets to the Missouri-Texas
football game on October 18, 2008, with a density estimate, for Example 1.42. One outlier,
with a price per seat of $883, was deleted. (Vertical axis: percent; horizontal axis: price,
in dollars.)
Example 1.42 reminds us of a continuing theme for data analysis. We looked at a
histogram and a density estimate and saw something interesting. This led us to speculate.
Additional data on the type and location of the seats may explain more about the prices
than we see in Figure 1.33.
SECTION 1.3 Summary
We can sometimes describe the overall pattern of a distribution by a density curve.
A density curve has total area 1 underneath it. An area under a density curve gives
the proportion of observations that fall in a range of values.
A density curve is an idealized description of the overall pattern of a distribution that
smooths out the irregularities in the actual data. We write the mean of a density curve
as μ and the standard deviation of a density curve as σ to distinguish them from the
mean x̄ and standard deviation s of the actual data.
The mean, the median, and the quartiles of a density curve can be located by eye.
The mean is the balance point of the curve. The median divides the area under
the curve in half. The quartiles and the median divide the area under the curve into
quarters. The standard deviation cannot be located by eye on most density curves.
The mean and median are equal for symmetric density curves. The mean of a skewed
curve is located farther toward the long tail than is the median.
The Normal distributions are described by a special family of bell-shaped, symmetric
density curves, called Normal curves. The mean μ and standard deviation σ
completely specify a Normal distribution N(μ, σ). The mean μ is the center of the
curve, and σ is the distance from μ to the change-of-curvature points on either side.
To standardize any observation x, subtract the mean of the distribution and then
divide by the standard deviation. The resulting z-score
$$z = \frac{x - \mu}{\sigma}$$
says how many standard deviations x lies from the distribution mean.
All Normal distributions are the same when measurements are transformed to the
standardized scale. In particular, all Normal distributions satisfy the 68–95–99.7 rule,
which describes what percent of observations lie within one, two, and three standard
deviations of the mean.
If x has the N(μ, σ) distribution, then the standardized variable z = (x − μ)/σ has
the standard Normal distribution N(0, 1) with mean 0 and standard deviation 1.
Table A gives the proportions of standard Normal observations that are less than z for
many values of z. By standardizing, we can use Table A for any Normal distribution.
The adequacy of a Normal model for describing a distribution of data is best assessed
by a Normal quantile plot, which is available in most statistical software packages.
A pattern on such a plot that deviates substantially from a straight line indicates that
the data are not Normal.
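The claim that all Normal distributions share the same proportions on the standardized scale can be checked numerically. This sketch assumes Python's standard `statistics.NormalDist`; the function name `proportion_within` is hypothetical, and N(505, 110) is borrowed from Example 1.38 purely as a second distribution to compare against the standard Normal.

```python
from statistics import NormalDist


def proportion_within(k, mean=0.0, sd=1.0):
    """Proportion of an N(mean, sd) distribution within k standard deviations of the mean."""
    dist = NormalDist(mu=mean, sigma=sd)
    return dist.cdf(mean + k * sd) - dist.cdf(mean - k * sd)


for k in (1, 2, 3):
    standard = proportion_within(k)
    sat_verbal = proportion_within(k, mean=505, sd=110)
    print(f"within {k} sd:  N(0, 1) = {standard:.4f}   N(505, 110) = {sat_verbal:.4f}")
```

The proportions come out the same for both distributions, and rounding them to the nearest percent (or tenth of a percent for three standard deviations) gives the 68, 95, and 99.7 of the rule.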
For Exercise 1.78, see page 43; for 1.79 to 1.81, see pages 44–45; for 1.82 to 1.85, see page 48;
for 1.86, see page 49; for 1.87 and 1.88, see page 53; for 1.89 and 1.90, see page 54; and for 1.91
and 1.92, see page 57.
1.93 Sketch some Normal curves.
(a) Sketch a Normal curve that has mean 10 and standard deviation 3.
(b) On the same x axis, sketch a Normal curve that has mean 20 and standard deviation 3.
(c) How does the Normal curve change when the mean is varied but the standard deviation stays
the same?
1.94 The effect of changing the standard deviation.
(a) Sketch a Normal curve that has mean 10 and standard deviation 3.
(b) On the same x axis, sketch a Normal curve that has mean 10 and standard deviation 1.
(c) How does the Normal curve change when the standard deviation is varied but the mean stays
the same?
1.95 Know your density. Sketch density curves that might describe distributions with the
following shapes.
(a) Symmetric, but with two peaks (that is, two strong clusters of observations).
(b) Single peak and skewed to the left.
1.96 Gross domestic product. Refer to Exercise 1.52, where we examined the gross domestic
product of 120 countries.
DATA FILE: COUNTRIES120
(a) Compute the mean and the standard deviation.
(b) Apply the 68–95–99.7 rule to this distribution.
(c) Compare the results of the rule with the actual percents within one, two, and three standard
deviations of the mean.
(d) Summarize your conclusions.
1.97 Do women talk more? Conventional wisdom suggests that women are more talkative than
men. One study designed to examine this stereotype collected data on the speech of 42 women
and 37 men in the United States.40
(a) The mean number of words spoken per day by the women was 14,297 with a standard
deviation of 9065. Use the 68–95–99.7 rule to describe this distribution.
(b) Do you think that applying the rule in this situation is reasonable? Explain your answer.
(c) The men averaged 14,060 words per day with a standard deviation of 9056. Answer the
questions in parts (a) and (b) for the men.
(d) Do you think that the data support the conventional wisdom? Explain your answer. Note that
in Section 7.2 we will learn formal statistical methods to answer this type of question.
1.98 Data from Mexico. Refer to the previous exercise. A similar study in Mexico was conducted
with 31 women and 20 men. The women averaged 14,704 words per day with a standard
deviation of 6215. For men the mean was 15,022 and the standard deviation was 7864.
(a) Answer the questions from the previous exercise for the Mexican study.
(b) The means for both men and women are higher for the Mexican study than for the U.S. study.
What conclusions can you draw from this observation?
1.99 Total scores. Below are the total scores of 10 students in an introductory statistics course:
DATA FILE: STATCOURSE
68 54 92 75 73 98 64 55 80 70
Previous experience with this course suggests that these scores should come from a distribution
that is approximately Normal with mean 70 and standard deviation 10.
(a) Using these values for μ and σ, standardize the scores of these 10 students.
(b) If the grading policy is to give a grade of A to the top 15% of scores based on the Normal
distribution with mean 70 and standard deviation 10, what is the cutoff for an A in terms of a
standardized score?
(c) Which students earned an A for this course?
1.100 Assign more grades. Refer to the previous exercise. The grading policy says that the
cutoffs for the other grades correspond to the following: the bottom 5% receive an F, the next 10%
receive a D, the next 40% receive a C, and the next 30% receive a B. These cutoffs are based on
the N(70, 10) distribution.
(a) Give the cutoffs for the grades in terms of standardized scores.
(b) Give the cutoffs in terms of actual scores.
(c) Do you think that this method of assigning grades is a good one? Give reasons for your answer.
1.101 Selling apartment buildings. Owning an apartment building can be very profitable, as can
selling an apartment building. Data for this exercise are selling prices (in dollars) and building
square footages for 18 apartment buildings sold in a particular city during 2005.41
DATA FILE: APARTMENTS
(a) Use statistical software to obtain histograms and Normal quantile plots of selling prices and
building square footages.
(b) Do either of these variables appear to be Normally distributed? Explain in what way the plots
match (or don't match) what you would expect to see for Normally distributed data.
(c) One apartment building appears to be an outlier with respect to both selling price and square
footage. Report the selling price and square footage for this apartment building.
1.102 Selling apartment buildings. Continue with the data from the previous exercise. Create a
new variable (call it Sale Price Per Sqft) by dividing the selling price for each apartment building
by the square footage for each apartment building.
DATA FILE: APARTMENTS
(a) When plotting selling prices or building square footages, one apartment building stands out as
an outlier. Does this same apartment building stand out in terms of the new variable you created
for this exercise? Explain your response clearly.
(b) Use statistical software to obtain a histogram and a Normal quantile plot of the new variable
Sale Price Per Sqft.
(c) Does the distribution of Sale Price Per Sqft appear to be Normal? Describe precisely what
about the histogram and the Normal quantile plot leads you to your conclusion.
1.103 Selling apartment buildings. Continue with the variable Sale Price Per Sqft created in the
previous exercise.
DATA FILE: APARTMENTS
(a) Calculate the mean and standard deviation of the Sale Price Per Sqft values.
(b) Calculate the intervals x̄ ± s, x̄ ± 2s, and x̄ ± 3s.
(c) Create a table that allows one to easily compare the distribution of Sale Price Per Sqft with
the 68–95–99.7 rule for the three intervals calculated in part (b).
(d) Does your table from part (c) provide a clear indication of Normality (or non-Normality) for
the data values?
1.104 Exploring Normal quantile plots.
(a) Create three data sets: one that is clearly skewed to the right, one that is clearly skewed to the
left, and one that is clearly symmetric and mound-shaped. (As an alternative to creating data sets,
you can look through this chapter and find an example of each type of data set requested.)
(b) Using statistical software, obtain Normal quantile plots for each of your three data sets.
(c) Clearly describe the pattern of each data set in the Normal quantile plots from part (b).
The table below contains data on a random sample of 22 telecom stocks, companies that
specialize in telecommunication products. For each company, trading volume and revenue growth
(over the last year) have been reported. Exercises 1.105 to 1.108 concern these data.42

Ticker symbol    Trading volume    Revenue growth
AATK                     68,654            0.0482
ALLN                      3,500            0.0300
ATGN                      5,650            0.1514
AVCI                     68,482            0.2580
AXE                      85,900            0.0739
CGN                         100            0.1098
COVD                  2,410,204            0.0166
CTV                     254,600            0.0437
CYBD                      6,900            0
ETCIA                     1,741            0.2391
GCOM                     27,392            0.4337
HLIT                    690,026            0.1765
PCTU                      6,500            0.2898
PTSC                    314,680            0.556
QCOM                  6,696,185            0.2001
SRTI                      2,000            0.0006
TCCO                      1,100            0.0856
TKLC                    246,101            0.0009
VERA                     25,000            0.0081
WJCI                     59,408            0.3544
XXIA                  1,750,027            0.1930
ZOOM                     21,295            0.1298

1.105 Telecom shares traded.
DATA FILE: TELECOMSTOCKS
(a) Calculate the mean and standard deviation of the 22 trading-volume values.
(b) Calculate x̄ ± 3s.
(c) Clearly explain why your calculations in part (b) show that the distribution of trading volume
is not symmetric and mound-shaped.
1.106 Telecom revenue growth.
DATA FILE: TELECOMSTOCKS
(a) Calculate the mean and standard deviation of the 22 revenue growth values.
(b) Calculate the ranges x̄ ± s, x̄ ± 2s, and x̄ ± 3s.
(c) Determine the percent of revenue growth values that fall into each of the three ranges that you
calculated in part (b). How do these percents compare with the 68–95–99.7 rule?
1.107 Telecom shares traded.
DATA FILE: TELECOMSTOCKS
(a) Use statistical software to create a histogram of the trading volumes for these 22 telecom stocks.
(b) The histogram shows that these data are clearly right-skewed. Sketch what you think a Normal
quantile plot of these data will look like.
(c) Use statistical software to create a Normal quantile plot of these data. How well does your
sketch from part (b) match the plot generated by your software?
1.108 Telecom revenue growth.
DATA FILE: TELECOMSTOCKS
(a) Construct a stemplot of the revenue growth for these 22 telecom stocks. You will need a −0 and
a 0 on the stem. Use the tenths place of these values on the stem and the hundredths place
as the leaves. For example, 0.556 rounds to 0.56 and would appear as 5|6 in the stemplot.
(b) Describe the distribution of these revenue growth values. Sketch what you think a Normal
quantile plot of these data will look like.
(c) Use statistical software to create a Normal quantile plot of these data. How well does your
sketch from part (b) match the plot generated by your software?
1.109 Visualizing the standard deviation. Figure 1.34 shows two Normal curves, both with
mean 0. Approximately what is the standard deviation of each of these curves?
FIGURE 1.34 Two Normal curves with the same mean but different standard deviations,
for Exercise 1.109. (Horizontal axis marked from −1.6 to 1.6 in steps of 0.4.)
1.110 Length of pregnancies. Some health insurance companies treat pregnancy as a preexisting
condition when it comes to paying for maternity expenses for a new policyholder. Sometimes
the exact date of conception is unknown, so the insurance company must count back from the
expected due date to judge whether or not conception occurred before or after the new policy
began. The length of human pregnancies from conception to birth varies according to a
distribution that is approximately Normal with mean 266 days and standard deviation 16 days. Use
the 68–95–99.7 rule to answer the following questions.
(a) Between what values do the lengths of the middle 95% of all pregnancies fall?
(b) How short are the shortest 2.5% of all pregnancies?
(c) How likely is it that a woman with an expected due date 218 days after her policy began
conceived the child after her policy began?
1.111 Use Table A. Use Table A to find the proportion of observations from a standard Normal
distribution that falls in each of the following regions. In each case, sketch a standard Normal
curve and shade the area representing the region.
(a) z 2.30
(b) z 2.30
(c) z > 1.70
(d) −2.30 < z < 1.70
1.112 Use Table A. Use Table A to find the value of z for each of the situations below. In each
case, sketch a standard Normal curve and shade the area representing the region.
(a) Ten percent of the values of a standard Normal distribution are greater than z.
(b) Ten percent of the values of a standard Normal distribution are greater than or equal to z.
(c) Ten percent of the values of a standard Normal distribution are less than z.
(d) Fifty percent of the values of a standard Normal distribution are less than z.
1.113 Use Table A. Consider a Normal distribution with mean 100 and standard deviation 10.
(a) Find the proportion of the distribution with values between 90 and 105. Illustrate your
calculation with a sketch.

(b) Find the values of x1 and x2 such that the proportion of the distribution with values between
x1 and x2 include the central 85% of the distribution. Illustrate your calculation with a sketch.
1.114 Length of pregnancies. The length of human pregnancies from conception to birth varies
according to a distribution that is approximately Normal with mean 266 days and standard
deviation 16 days.
(a) What percent of pregnancies last fewer than 240 days (that's about 8 months)?
(b) What percent of pregnancies last between 240 and 270 days (roughly between 8 and 9 months)?
(c) How long do the longest 25% of pregnancies last?
1.115 Quartiles of Normal distributions. The median of any Normal distribution is the same as
its mean. We can use Normal calculations to find the quartiles for Normal distributions.
(a) What is the area under the standard Normal curve to the left of the first quartile? Use this to
find the value of the first quartile for a standard Normal distribution. Find the third quartile
similarly.
(b) Your work in (a) gives the Normal scores z for the quartiles of any Normal distribution. What
are the quartiles for the lengths of human pregnancies? (Use the distribution given in the previous
exercise.)
1.116 Deciles of Normal distributions. The deciles of any distribution are the 10th, 20th, ...,
90th percentiles. The first and last deciles are the 10th and 90th percentiles, respectively.
(a) What are the first and last deciles of the standard Normal distribution?
(b) The weights of 9-ounce potato chip bags are approximately Normal with mean 9.12 ounces
and standard deviation 0.15 ounce. What are the first and last deciles of this distribution?
1.117 Normal random numbers. Use software to generate 100 observations from the standard
Normal distribution. Make a histogram of these observations. How does the shape of the
histogram compare with a Normal density curve? Make a Normal quantile plot of the data. Does
the plot suggest any important deviations from Normality? (Repeating this exercise several times
is a good way to become familiar with how Normal quantile plots look when data actually are
close to Normal.)
1.118 Uniform random numbers. Use software to generate 100 observations from the
distribution described in Exercise 1.80 (page 44). (The software will probably call this a uniform
distribution.) Make a histogram of these observations. How does the histogram compare with the
density curve in Figure 1.20? Make a Normal quantile plot of your data. According to this plot,
how does the uniform distribution deviate from Normality?
STATISTICS IN SUMMARY
Data analysis is the art of describing data using graphs and numerical summaries. The
purpose of data analysis is to describe the most important features of a set of data. This
chapter introduces data analysis by presenting statistical ideas and tools for describing
the distribution of a single variable. The Statistics in Summary figure below will help
you organize the big ideas. The question marks at the last two stages remind us that the
usefulness of numerical summaries and models such as Normal distributions depends on
what we find when we examine the data using graphs. Here is a review list of the most
important skills you should have acquired from your study of this chapter.
[Statistics in Summary figure: Plot your data (stemplot, histogram); interpret what you see (shape, center, spread, outliers); numerical summary? (x̄ and s, five-number summary); mathematical model? (Normal distribution?)]
A. Data
1. Identify the cases and variables in a set of data.
2. Identify each variable as categorical or quantitative. Identify the units in which
each quantitative variable is measured.
B. Displaying Distributions
1. Make a bar graph, pie chart, and/or Pareto chart of the distribution of a
categorical variable. Interpret bar graphs, pie charts, and Pareto charts.
2. Make a histogram of the distribution of a quantitative variable.
3. Make a stemplot of the distribution of a small set of observations. Round leaves
or split stems as needed to make an effective stemplot.
C. Inspecting Distributions (Quantitative Variable)
1. Look for the overall pattern and for major deviations from the pattern.
2. Assess from a histogram or stemplot whether the shape of a distribution is
roughly symmetric, distinctly skewed, or neither. Assess whether the
distribution has one or more major peaks.
3. Describe the overall pattern by giving numerical measures of center and spread
in addition to a verbal description of shape.
4. Decide which measures of center and spread are more appropriate: the mean and
standard deviation (especially for symmetric distributions) or the five-number
summary (especially for skewed distributions).
5. Recognize outliers.
D. Time Plots
1. Make a time plot of data, with the time of each observation on the horizontal
axis and the value of the observed variable on the vertical axis.
2. Recognize patterns in a time plot.
E. Measuring Center
1. Find the mean x̄ of a set of observations.
2. Find the median M of a set of observations.
3. Understand that the median is more resistant (less affected by extreme
observations) than the mean. Recognize that skewness in a distribution moves
the mean away from the median toward the long tail.
F. Measuring Spread
1. Find the quartiles Q1 and Q3 for a set of observations.
2. Give the five-number summary and draw a boxplot; assess center, spread,
symmetry, and skewness from a boxplot.
3. Using a calculator or software, find the standard deviation s for a set of
observations.
4. Know the basic properties of s: s ≥ 0 always; s = 0 only when all observations
are identical, and s increases as the spread increases; s has the same units as the
original measurements; s is pulled strongly up by outliers or skewness.
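The skills in parts E and F can be tried out in a general-purpose language as well as in a statistical package. The following is a minimal Python sketch, not the output of any particular package. It assumes the quartile rule in which Q1 and Q3 are the medians of the lower and upper halves of the ordered data (software often uses slightly different rules), and the data values are hypothetical, chosen only for illustration.

```python
from statistics import mean, median, stdev


def quartiles(values):
    """Return (Q1, Q3) as the medians of the lower and upper halves.

    Assumed rule: when the number of observations is odd, the overall
    median is left out of both halves.
    """
    data = sorted(values)
    if len(data) < 2:
        raise ValueError("need at least two observations")
    half = len(data) // 2
    return median(data[:half]), median(data[-half:])


def five_number_summary(values):
    """Return (minimum, Q1, median, Q3, maximum)."""
    data = sorted(values)
    q1, q3 = quartiles(data)
    return data[0], q1, median(data), q3, data[-1]


if __name__ == "__main__":
    sample = [12, 15, 15, 18, 20, 21, 24, 27, 30]  # hypothetical data
    with_outlier = sample + [95]                   # same data plus one extreme value

    for label, data in (("original", sample), ("with outlier", with_outlier)):
        print(label)
        print("  five-number summary:", five_number_summary(data))
        print("  mean:", round(mean(data), 2), " s:", round(stdev(data), 2))
```

For these hypothetical values, adding the single extreme observation moves the median only from 20 to 20.5 but pulls the mean from about 20.2 up to 27.7, and s increases as well. This is the resistance of the median, and the sensitivity of the mean and s, described in E.3 and F.4.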
G. Density Curves
1. Know that areas under a density curve represent proportions of all observations
and that the total area under a density curve is 1.
2. Approximately locate the median (equal-areas point) and the mean (balance
point) on a density curve.
3. Know that the mean and median both lie at the center of a symmetric density
curve and that the mean moves farther toward the long tail of a skewed curve.
H. Normal Distributions
1. Recognize the shape of Normal curves and be able to estimate by eye both the
mean and the standard deviation from such a curve.
2. Use the 68–95–99.7 rule and symmetry to state what percent of the observations
from a Normal distribution fall between two points when the points lie one, two,
or three standard deviations on either side of the mean.
3. Find the standardized value (z-score) of an observation. Interpret z-scores and
understand that any Normal distribution becomes standard Normal N(0, 1)
when standardized.
4. Given that a variable has the Normal distribution with a stated mean μ and
standard deviation σ, calculate the proportion of values above a stated number,
below a stated number, or between two stated numbers.
5. Given that a variable has the Normal distribution with a stated mean μ and
standard deviation σ, calculate the point having a stated proportion of all values
above it. Also calculate the point having a stated proportion of all values below it.
6. Assess the Normality of a set of data by inspecting a Normal quantile plot.
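The Normal calculations in H.2 through H.5 can also be carried out in software rather than with a table. The following is a minimal Python sketch using the standard library's NormalDist class. The function names and the values μ = 50 and σ = 10 are hypothetical, chosen only for illustration.

```python
from statistics import NormalDist

STANDARD_NORMAL = NormalDist(mu=0.0, sigma=1.0)


def z_score(x, mu, sigma):
    """Standardize x: how many standard deviations x lies from the mean."""
    return (x - mu) / sigma


def proportion_below(x, mu, sigma):
    """Proportion of an N(mu, sigma) distribution with values below x."""
    return STANDARD_NORMAL.cdf(z_score(x, mu, sigma))


def proportion_above(x, mu, sigma):
    """Proportion of an N(mu, sigma) distribution with values above x."""
    return 1.0 - proportion_below(x, mu, sigma)


def proportion_between(a, b, mu, sigma):
    """Proportion of an N(mu, sigma) distribution with values between a and b."""
    return proportion_below(b, mu, sigma) - proportion_below(a, mu, sigma)


def value_with_proportion_below(p, mu, sigma):
    """Backward calculation: the point with proportion p of all values below it."""
    z = STANDARD_NORMAL.inv_cdf(p)
    return mu + z * sigma


if __name__ == "__main__":
    mu, sigma = 50.0, 10.0  # hypothetical mean and standard deviation
    for k in (1, 2, 3):
        p = proportion_between(mu - k * sigma, mu + k * sigma, mu, sigma)
        print(f"within {k} standard deviation(s) of the mean: {p:.4f}")
    print("point with 90% of values below it:",
          round(value_with_proportion_below(0.90, mu, sigma), 2))
```

The three printed proportions are approximately 0.68, 0.95, and 0.997, in agreement with the 68–95–99.7 rule. Because every calculation passes through the z-score, the same functions work for any choice of μ and σ.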
CHAPTER 1 Review Exercises
1.119 Identify the histograms. A survey of a large college class asked the following questions:
(a) Are you female or male? (In the data, male = 0, female = 1.)
(b) Are you right-handed or left-handed? (In the data, right = 0, left = 1.)
(c) What is your height in inches?
(d) How many minutes do you study on a typical weeknight?
Figure 1.35 shows histograms of the student responses, in scrambled order and without scale markings. Which histogram goes with each variable? Explain your reasoning.

FIGURE 1.35 Match each histogram with its variable, for Exercise 1.119. [Four histograms, labeled (a) through (d).]

1.120 How much does it cost to make a movie? Making movies is a very expensive activity and many cost more than they earn. On the other hand, enormous profits are also a possibility. For this exercise you will analyze the budgets for 160 films made between 2003 and 2007.⁴³ (Data file: BOXOFFICE160)
(a) Examine the distribution of the budgets for these 160 films graphically. Describe key features of the distribution.
(b) Plot the budgets versus time. Describe any patterns that you see.
(c) Provide appropriate numerical summaries for the budgets of these 160 films.
(d) Write a summary of what you learned from these data that would be useful to someone who would like to invest in making movies.

1.121 Customers' home state. A sample of 1095 customers entering a retail store were asked to fill out a brief survey. One question on the survey asked each person to identify his or her current state of residency. The data from this question are summarized in the table below. (Data file: IOWA)
(a) The state in which the retail store resides is easily deduced from the table. In which state is this store located?
(b) One way to make a pie chart of these data would be to use one slice in the pie chart for each state in the table. Give at least one reason why this would not result in a useful pie chart.
(c) Group all customers from states other than Iowa (IA) into a category called Other and make a pie chart with an Other slice. Be sure to include the percent or count for each slice of your pie chart.

State  Count    State  Count
AR         1    MI         2
AZ         1    MO         2
CA         2    MS         2
CO         1    NE         3
FL         1    NY         1
GA         2    OH         2
IA      1053    OK         1
ID         2    OR         5
IL         6    TN         1
KS         1    TX         1
LA         1    UT         1
MA         1    WI         2

1.122 Help-wanted advertising in newspapers. One source of revenue for newspapers is printing help-wanted ads for companies that are looking for new employees. For this exercise we will use monthly data on help-wanted advertising in newspapers from January 1951 to April 2005. The time series uses an index value with 1987 as the base year. That is, the monthly average for 1987 is taken to be 100, so a month with an index value of 50 had only half as much help-wanted advertising in newspapers as the monthly average for 1987, while a month with an index value of
140 had 40% more help-wanted advertising in newspapers than the monthly average for 1987.⁴⁴ (Data file: HELPWANTED)
(a) Using statistical software, obtain a time plot of the index values for help-wanted advertising in newspapers. Add a horizontal line to your time plot at the value of x̄ for these data.
(b) What do you notice about the beginning years of the time series relative to the overall average of the time series? Which month in the time series is the first to be greater than the overall average?
(c) Describe the trend of the index values beginning in January 2000. Which month is the last month to be greater than the time series average?
(d) Propose at least one reasonable explanation for the observed trend in help-wanted advertising in newspapers since January 2000.

1.123 A closer look at customer refunds. A retail store specializing in children's clothing and toys has a relatively strict "no refunds" policy. Exceptions to this policy are sometimes granted in specific cases as determined by management. The store would like to look at refund activity for the year 2005. Data recorded include the date, amount, and item count for all refund transactions in 2005. Of the 10,939 transactions conducted between the store and customers during 2005, only 103 of these transactions were refunds (less than 1%). (Data file: REFUNDS)
(a) Using statistical software, calculate the five-number summary for refund amounts. (Note: All refunds are recorded as negative numbers.)
(b) What percent of all refunds in 2005 were $10 or less?
(c) Construct a boxplot of the refund amounts based on your five-number summary.
(d) What does your boxplot indicate about the skewness of these data?

1.124 A closer look at customer refunds. Continue with the data on refunds described in the previous exercise. Upon inspection of the item counts for the refunds, we see that 83 of the 103 refunds were for one item. Using only this information and without using software or a calculator, answer the following questions. (Data file: REFUNDS)
(a) Provide the first four numbers of the five-number summary for the item counts. (You cannot determine the maximum item count using only the information given in this exercise.)
(b) Construct a boxplot for the item counts using 14 as the maximum item count. How long is the box in your boxplot? Explain why this makes sense, given the data on item counts.
(c) What does your boxplot indicate about the skewness of these data?

1.125 Telecom revenue growth. The data on revenue growth for a random sample of telecommunications companies displayed before Exercise 1.105 (page 62) closely follow a Normal distribution with a mean of 0.0224 and a standard deviation of 0.2180. Take as a model for telecom revenue growth the N(0.0224, 0.2180) distribution and answer the following questions. (Data file: TELECOMSTOCKS)
(a) Calculate μ + 3σ for the model for telecom revenue growth.
(b) From the population of all telecom companies, what percent should we expect to have revenue growth greater than μ + 3σ? Explain how you arrived at your response.
(c) What percent of the telecom companies in our sample have revenue growth greater than μ + 3σ? Is this percent different from your response to part (b)? Clearly explain why these two percents being different is not inconsistent with our assumption of a Normal distribution for the model for telecom revenue growth.

1.126 Telecom revenue growth. Take the N(0.0224, 0.2180) distribution as the model for telecom revenue growth as described in the previous exercise and answer the following questions. (Data file: TELECOMSTOCKS)
(a) What percent of telecom companies had negative revenue growth over the past year? Show your work.
(b) What does negative revenue growth mean for a company?
(c) What percent of telecom companies had revenue growth greater than 0.50 (50%)? Show your work.
(d) In terms of revenue growth, the top 25% of all telecom companies had revenue growth greater than what value? Show your work.

1.127 What influences buying? Product preference depends in part on the age, income, and gender of the consumer. A market researcher selects a large sample of potential car buyers. For each consumer, she records gender, age, household income, and automobile preference. Which of these variables are categorical and which are quantitative?

1.128 Evaluating the improvement in a product. Corn is an important animal food. Normal corn lacks certain amino acids, which are building blocks for protein. Plant scientists have developed new corn varieties that contain these amino acids. To test a new corn as an animal food, a group of 20 one-day-old male chicks was fed a ration containing the new corn. A control group of another 20 chicks was fed a ration that was identical except that it contained normal corn. Here are the weight gains (in grams) after 21 days:⁴⁵ (Data file: CORN)

Normal corn          New corn
380 321 366 356      361 447 401 375
283 349 402 462      434 403 393 426
356 410 329 399      406 318 467 407
350 384 316 272      427 420 477 392
345 455 360 431      430 339 410 326

(a) Compute five-number summaries for the weight gains of the two groups of chicks. Then make boxplots to compare the two distributions. What do the data show about the effect of the new corn?
(b) The researchers actually reported means and standard deviations for the two groups of chicks. What are they? How much larger is the mean weight gain of chicks fed the new corn?

1.129 Fuel efficiency of hatchbacks and large sedans. Let's compare the fuel efficiencies (mpg) of model year 2009 hatchbacks and large sedans.⁴⁶ Here are the data: (Data file: MPGHATCHLARGE)

Hatchbacks
30 29 28 27 27 27 27 27 26 25 25 25 24 24 24
24 24 23 23 22 22 21 21 21 21 21 21 21 20 20
20 20 20 20 20 20 19 19 19 18 16 16
Large sedans
19 19 18 18 18 18 17 17 17 17 17 17 17 17 17
17 16 16 16 16 16 16 16 16 15 15 13 13

Give graphical and numerical descriptions of the fuel efficiencies for these two types of vehicle. What are the main features of the distributions? Compare the two distributions and summarize your results in a short paragraph.

1.130 How much oil? How much oil the wells in a given field will ultimately produce is key information in deciding whether to drill more wells. The table below gives the estimated total amount of oil recovered from 64 wells in the Devonian Richmond Dolomite area of the Michigan basin.⁴⁷ (Data file: OILWELLS)

 21.7  53.2  46.4  42.7  50.4  97.7 103.1  51.9  43.4  69.5
156.5  34.6  37.9  12.9   2.5  31.4  79.5  26.9  18.5  14.7
 32.9 196.0  24.9 118.2  82.2  35.1  47.6  54.2  63.1  69.8
 57.4  65.6  56.4  49.4  44.9  34.6  92.2  37.0  58.8  21.3
 36.6  64.9  14.8  17.6  29.1  61.4  38.6  32.5  12.0  28.3
204.9  44.5  10.3  37.7  33.7  81.1  12.1  20.1  30.5   7.1
 10.1  18.0   3.0   2.0

(a) Graph the distribution and describe its main features.
(b) Find the mean and median of the amounts recovered. Explain how the relationship between the mean and the median reflects the shape of the distribution.
(c) Give the five-number summary and explain briefly how it reflects the shape of the distribution.

1.131 The 1.5 × IQR rule. Exercise 1.67 (page 39) describes the most common rule for identifying suspected outliers. Find the interquartile range IQR for the oil recovery data in the previous exercise. Are there any outliers according to the 1.5 × IQR rule?

1.132 Grading managers. Some companies grade on a bell curve to compare the performance of their managers. This forces the use of some low performance ratings, so that not all managers are graded above average. A company decides to give A's to the managers and professional workers who score in the top 15% on their performance reviews, C's to those who score in the bottom 15%, and B's to the rest. Suppose that a company's performance scores are Normally distributed. This year, managers with scores less than 25 received C's and those with scores above 475 received A's. What are the mean and standard deviation of the scores?

1.133 The Statistical Abstract of the United States. Find in the library or at the U.S. Census Bureau Web site (www.census.gov) the most recent edition of the annual Statistical Abstract of the United States. Look up data on (a) the number of businesses started (business starts) and (b) the number of business failures for the 50 states. Make graphs and numerical summaries to display the distributions, and write a brief description of the most important characteristics of each distribution. Suggest an explanation for any outliers you see.

1.134 Canada's balance of international payments. Visit the Web page www40.statcan.ca/l01/cst01/econ01a.htm, which provides data on Canada's balance of international payments. Select some data from this Web page and use the methods that you learned in this chapter to create graphical and numerical
summaries. Write a report summarizing your findings that includes supporting evidence from your analyses.

1.135 Canadian government revenue and expenditures by province and territory. Visit the Web pages www40.statcan.ca/l01/cst01/govt08a.htm, www40.statcan.ca/l01/cst01/govt08b.htm, and www40.statcan.ca/l01/cst01/govt08c.htm. You need to look at the three pages to obtain data for all provinces and territories. Select some data from these Web pages and use the methods that you learned in this chapter to create graphical and numerical summaries. Write a report summarizing your findings that includes supporting evidence from your analyses.

1.136 Simulated observations. Most statistical software packages have routines for simulating values of variables having specified distributions. Use your statistical software to generate 25 observations from the N(30, 5) distribution. Compute the mean and standard deviation x̄ and s of the 25 values you obtain. How close are x̄ and s to the μ and σ of the distribution from which the observations were drawn?
Repeat 19 more times the process of generating 25 observations from the N(30, 5) distribution and recording x̄ and s. Make a stemplot of the 20 values of x̄ and another stemplot of the 20 values of s. Make Normal quantile plots of both sets of data. Briefly describe each of these distributions. Are they symmetric or skewed? Are they roughly Normal? Where are their centers? (The distributions of measures like x̄ and s when repeated sets of observations are made from the same theoretical distribution will be very important in later chapters.)
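Readers who work in a general-purpose language rather than a statistical package can run the simulation in Exercise 1.136 with a few lines of code. The following is a minimal Python sketch under stated assumptions: the function name and the fixed random seed are hypothetical choices made for reproducibility, and N(30, 5) is read as mean 30 and standard deviation 5. The stemplots, Normal quantile plots, and interpretation are left to the reader.

```python
import random
from statistics import mean, stdev


def simulate_summaries(n_samples=20, n_obs=25, mu=30.0, sigma=5.0, seed=1):
    """Draw n_samples samples of n_obs observations each from N(mu, sigma).

    Returns a list of (x_bar, s) pairs, one pair per sample.
    The seed is fixed only so that repeated runs give the same numbers.
    """
    rng = random.Random(seed)
    summaries = []
    for _ in range(n_samples):
        sample = [rng.gauss(mu, sigma) for _ in range(n_obs)]
        summaries.append((mean(sample), stdev(sample)))
    return summaries


if __name__ == "__main__":
    for i, (x_bar, s) in enumerate(simulate_summaries(), start=1):
        print(f"sample {i:2d}: x-bar = {x_bar:6.2f}   s = {s:5.2f}")
```

Changing the seed argument gives a different set of 20 samples, which is a quick way to see how much x̄ and s vary from one set of observations to the next.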
CHAPTER 1 Case Study Exercises

CASE STUDY EXERCISE 1: What colors sell? Vehicle colors differ among types of vehicle in different regions. Here are data on the most popular colors in 2007 for several different regions of the world:⁴⁸ (Data file: VEHICLECOLORSBYCOUNTRY)

All entries are percents.

Color     North America  South America  Europe  China  South Korea  Japan
Silver               19             26      28     24           21     27
White                16             11       4     16           18     24
Gray                 13             14      16      3           19     12
Black                13             20      24     19           20     16
Blue                 11              8      13     17            9     10
Red                  11             10       6      9            6      3
Brown                 7              7       4      1            6      2
Other                10              4       5     11            1      6

Use the methods you learned in this chapter to compare the vehicle color preferences for the regions of the world presented in this table. Write a report summarizing your findings with an emphasis on similarities and differences across regions. Include recommendations related to marketing and advertising of vehicles in these regions.

CASE STUDY EXERCISE 2: The business of health. The Behavioral Risk Factor Surveillance System (BRFSS) conducts a large survey of health conditions and risk behaviors in the United States.⁴⁹ The BRFSS data set contains data on 29 demographic factors and risk factors for each state. Pick three or more variables from this data set and summarize the distributions graphically and numerically. Write a report describing your summary. Include a discussion of business opportunities that you would consider on the basis of your analysis. (Data file: BRFSS)
CHAPTER 1 Appendix
Using Software for Statistical Analysis

Good statistical analysis relies heavily on interactive statistical software. In this Appendix, we discuss the use of Minitab and Excel for conducting statistical analysis. As a specialized statistical package, Minitab is one of the most popular software choices both in industry and in colleges and schools of business. As an all-purpose spreadsheet program, Excel provides a limited set of statistical analysis options in comparison to Minitab, or to any other statistics package for that matter. However, given its pervasiveness and wide acceptance in industry and the computer world at large, we believe it is important to give Excel proper attention. It should be noted that for users who want more statistical capabilities but want to work in an Excel environment, there are a number of commercially available add-on packages.
Even though basic guidance for using Minitab and Excel is provided in this and subsequent Appendices, it should be emphasized that we are not bound to these software programs. Because computer output from statistical packages is very similar, you can feel quite comfortable using any one of a number of excellent statistical packages.

Getting Started with Minitab

In this section, we provide a basic overview of Minitab Release 15. For more instruction, Minitab provides a number of Help features found under the Help selection on the toolbar (see Figure App. 1.1). The Tutorials option, for example, introduces the user to basic Minitab features and walks the user through some example Minitab sessions. In addition, at Minitab's Web site, www.minitab.com, you can search through its knowledge base of customer support questions and their answers.
FIGURE App. 1.1 Minitab open screen shot with Help option opened. [Screenshot: the Minitab Session window and an empty worksheet, with the Help menu open.]
Minitab Windows

Upon entering Minitab, you will find the display partitioned into two windows, as seen in Figure App. 1.1. The Session window is the area where all nongraphical statistical output and Minitab commands generating statistical output (graphical and nongraphical) are displayed. The Data window displays a spreadsheet environment (known as a worksheet) where the data can be directly entered and edited. Each column represents a variable to be analyzed. Unlike Excel, cells in a Minitab worksheet are not active in that formulas cannot be embedded within the cells. A Minitab worksheet is simply an environment for data to reside within.
There is a third window, which is minimized upon entering Minitab, known as the Project Manager window. This window allows you to do a variety of housekeeping tasks such as keeping track of all commands issued or seeing the basic attributes of the worksheet.

Invoking Statistical Procedures

There are two ways to invoke procedures:
1. You can type session commands in the Session window. To do so, the command language must be enabled, which will in turn produce an MTB> prompt in the Session window. At this prompt, you can then type desired commands. For more details on enabling session commands, refer to Minitab's Help options.
2. Users can make a sequence of selections from a series of menus that all begin in the toolbar menu. For example, in this chapter, we produced a graph known as a boxplot. To create this graph, you would click Graph on the toolbar and then select Boxplot. In this book, such a sequence of selections will be presented as Graph ➤ Boxplot. Once the sequence of selections has been made, dialog and/or option boxes will be encountered that allow you to indicate which variable(s) will be part of the analysis, along with other information. If further help is needed, you can click the Help button that appears with every pop-up box. Once all appropriate information is provided, click the OK button to get the desired output.

Minitab Files

Minitab provides standard file options for retrieving (Open) and saving (Save and Save As). Within the File menu, you will notice that files can be opened or saved as worksheets or as projects. Worksheet files (.MTW extension) simply store the data found in the Data window, while project files (.MPJ extension) store all the current work, including the data, Session window output, and graphs. Thus, if you save a project prior to exiting Minitab and open the project at a later time, you can resume from where you last left off. Minitab files for selected examples and exercises provided on this book's CD are worksheet files.

Getting Started with Excel

In this section, we provide a basic overview of the statistical analysis options in Excel 2007. We assume that the reader is familiar with the basic layout and usage of Excel. As with all Microsoft products, Excel provides comprehensive support for the user in terms of the general use of its software or the more specific details of a particular procedure. As noted earlier, Excel provides a number of standard statistical analysis procedures but is not as comprehensive as a stand-alone statistical package. Therefore, for a few of the topics covered in this book, software support will be found only in a statistical package or in an enhanced add-on version of Excel rather than in standard Excel.
It should be noted that the accuracy of statistical procedures in earlier versions of Excel (2002 and earlier) has been called into question. Some of the problems revolved around Excel's use of shortcut formulas for certain statistical computations. A number of these problems have been addressed with the newest version of Excel, although a comprehensive independent study of the software has not been released at the time of the publication of this book. It is worth noting that reliability of established statistical packages should not be taken for granted. Albeit less serious than Excel's earlier problems, inaccuracies have been reported for even some well-known statistical packages.⁵⁰

Built-in Statistical Functions and Charts

Excel has a variety of built-in statistical functions that can be used to compute many common descriptive statistics for a given set of data or to compute probabilities from a number of well-known statistical distributions. To find these functions, select the Formulas tab found in the main menu. You can then click AutoSum and select the More Functions option, which allows you to select the category Statistical to reveal all the statistical functions. As
an alternative to clicking AutoSum, you can click More Functions and then move the cursor to your Statistical Functions menu choice.
In addition to the built-in statistical functions, a number of graphing options are available that may prove useful for data analysis. The available charts are found by selecting the Insert tab found in the main menu. One then finds a variety of graphing options in the Charts group. A few statistical options (for example, regression fitting) can be implemented in conjunction with the charts.

Installing Analysis ToolPak

Excel's built-in statistical functions can be useful for isolated computations. However, attempting to do a more complete statistical analysis with a collection of raw functions can be a laborious and clumsy process. Excel provides an add-on known as Analysis ToolPak that enables you to perform a more integrative statistical analysis. This add-on is not loaded with the standard installation of Excel. To install this add-on, click the Microsoft Office Button, click Excel Options, click Add-Ins, and then, in the Manage box, choose Excel Add-ins and click Go. At this point, select Analysis ToolPak in the Add-ins available box and finally click OK.

Invoking Analysis ToolPak Procedures

Once the Analysis ToolPak is installed, the statistical analysis routines are found by first selecting the Data tab found on the main toolbar. You will then see the Data Analysis command in the Analysis group. Figure App. 1.2 shows a blank Excel spreadsheet with the Data Analysis command invoked, resulting in the appearance of the Data Analysis menu box.
Within the Data Analysis menu box, there are 19 menu choices. When you select one of the menu choices, a box specific to the statistical routine will appear that calls for you to indicate where the data reside and where you want the output to be displayed. In particular, to indicate where the data for analysis reside, you specify the range of cells for the data in the Input Range box. This can be
FIGURE App. 1.2 Excel blank spreadsheet with Data Analysis menu box.
accomplished by first clicking the cursor in the Input Range box and then typing in the cell range, or more easily you can highlight the data by clicking and dragging the mouse over the cell range. The statistical output can be placed either in the current worksheet (placement indicated with the Output Range box), in a new worksheet tabbed with the current workbook (New Worksheet Ply option), or in an entirely new workbook (New Workbook option).

Excel Data Files

As noted, we assume that you are familiar with the basics of Excel, including how to save and open files. Note that files saved by Excel 2007 as an Excel Workbook cannot be opened by earlier versions of Excel. There is, however, an option to save workbooks as an Excel 97-2003 Workbook. Excel 2007 is backward compatible in terms of opening workbooks of older versions. Data files for selected examples and exercises provided on this book's CD are compatible with all versions of Excel.

Using Minitab and Excel for Examining Distributions

Now that we have provided a general overview of Minitab and Excel, we discuss more specifically how these software programs can be used to create the graphs and numerical summaries presented in this chapter.

Bar Graphs

Minitab:

Graph > Bar Chart

If the frequencies have been pretabulated, select Values from a table from the Bars represent menu. If the frequencies have not been tabulated and you want Minitab to make the counts, select Counts of unique values from the Bars represent menu. Select Simple for the type of bar graph, then click OK. For pretabulated frequencies, click-in the data column into the Graph variables box and click-in the column that has the names of the categories into the Categorical variables box. If the frequencies have not been pretabulated, click-in the column that has data on the categorical names that need to be counted into the Categorical variables box. Click OK.

Excel:

There are a few ways to create bar graphs in Excel. However, there is one particular approach that allows you to create bar graphs based on providing in the spreadsheet the total counts of each category or having Excel make the counts. For pretabulated frequencies, the spreadsheet should have two columns of information. With a column name in the top row, one column should have the names of the distinct categories. The other column, with its column name in the top row, should have the total counts of each category. If Excel needs to make the counts, there should be a column, with a column name in the top row, that has the data on the names of the categories that need to be counted. Once the one or two columns have been created, all the cells should be selected by dragging the mouse. Then click the Insert tab and click PivotTable in the Tables group and finally click PivotChart. You will then notice that Excel will produce a PivotTable Field List box. You will find that the column name(s) that you highlighted will be listed as fields. Select the field(s) presented to you by clicking a checkmark next to the name(s). For pretabulated frequencies, a bar graph will be created automatically. When you have only one column that requires counting, you will find that the field name appears in a section titled Axis Fields (Categories). You want to also have this field name in the section titled Values. To do so, click and hold the field name and then drag the field from the field section into the Values section. Excel will then automatically make the counts and create a corresponding bar graph.

Pie Charts

Minitab:

Graph > Pie Chart

Making a pie chart is quite similar to making a bar graph. If the frequencies have been pretabulated, select the Chart values from a table option. If the frequencies have not been tabulated, select the Chart counts of unique values option. For pretabulated frequencies, click-in the data column into the Summary variables box, and click-in the column that has the names of the categories into the Categorical variables box. If the frequencies have not been pretabulated, click-in the column that has data on the categorical names that need to be counted into the Categorical variables box. If you wish to have the pie slices labeled by categorical names and have percents reported (as in Figure 1.4), click the Label button and then click the Slice Labels tab and finally place checkmarks next to the desired labels.

Excel:

To make a pie chart, you should follow the exact steps for making a bar graph. You want to now simply change
the created bar graph into a pie chart. To do so, click the Design tab and then click the Change Chart Type in the Type group and finally select the Pie chart type. Alternatively, you can right-click on the bar graph and find the Change Chart Type option. To add labels to the pie slices, first right-click on one of the pie slices and then choose the Add Data Labels option. Once labels have been added, right-click again on one of the pie slices and then choose the Format Data Labels option and finally place checkmarks next to the desired labels.

Pareto Charts

Minitab:

Stat > Quality Tools > Pareto Chart

If the frequencies have been pretabulated, select the Chart defects table option. If the frequencies have not been tabulated, select the Chart defects data in option. For pretabulated frequencies, click-in the column that has the names of the categories into the Labels in box and click-in the data column into the Frequencies in box. If the frequencies have not been pretabulated, click-in the column that has data on the categorical names that need to be counted into the topmost box next to the Chart defects data in option. An alternative way to create a Pareto chart is to follow the steps for creating a bar graph but then click the Chart Options button and select the Decreasing Y option and place a checkmark next to the Show Y as Percent option.

Excel:

As a first step, create a bar graph as already described. You will find in the spreadsheet a PivotTable report made up of two columns: (1) a column labeled Row Labels and (2) a column with the frequencies. Highlight the contents of the report (that is, the cells with the category names and the cells with the frequencies). Now click the Data tab and then click Sort in the Sort & Filter group. At this point, choose the Descending (Z to A) option and select the column associated with the frequency numbers in the menu box found immediately below the option. We now want to convert the counts into percents. To do so, click the field name found in the Values section, select the Value Field Settings option, click the Show values as tab, select % of total from the Show values as menu, and then click OK.

Histograms

Minitab:

Graph > Histogram

Select Simple for the type of histogram, then click OK. Click-in the data column into the Graph Variables box and then click OK. If you wish to change the automatically selected classes, double-click on the horizontal axis to make the Edit Scale box appear. Now, click the Binning tab and then choose the Midpoint/Cutpoint positions option found in the Interval Definition section. Depending on whether you choose the Interval type as Midpoint or Cutpoint, you then give the desired values of the midpoints (that is, the middle values of the classes) or the cutpoints (that is, lower and upper values of the classes).

Excel:

Select Histogram in the Data Analysis menu box and click OK. Enter the cell range of the data into the Input Range box. If you want Excel to automatically select the classes, leave the Bin Range box empty. Place a checkmark next to the Chart Output option. Click OK. Excel will then create a histogram with gaps between the data bars. To remove these gaps, right-click on any one of the bars and then select the Format Data Series option. You will then have the opportunity to set the gap width to 0%. With the bars now closed up to each other, it is a good idea to border the bars with line edges. Before closing the Format Data Series box, click the Border Color option and select the Solid line option and finally click Close. If you wish to change the automatically selected classes, enter upper values for each class into the spreadsheet and input their cell range in the Bin Range box.

Stemplots

Minitab:

Graph > Stem-and-Leaf

Click-in the data column into the Graph Variables box and then click OK.

Excel:

Stemplots are available in neither standard Excel nor the enhanced add-on version of Excel.

Time Plots

Minitab:

Graph > Time Series Plot

Select Simple for the type of time series plot, then click OK. Click-in the data column into the Series box. In default mode, Minitab will label the time periods as 1, 2,
3, and so on. If you wish to label the time periods by year, as in Figure 1.12, then click the Time/Scale button, select the Calendar option, select the desired time periods (for example, Year) from the adjacent menu, and click OK to close the pop-up. Click OK to produce the plot.

Excel:

Click and drag the mouse to highlight the cell range of the data you wish to time plot (include the column name if you wish it to appear as a chart label). With the cell range highlighted, click the Insert tab and then click Line in the Charts group. Within the 2-D Line choices, you can choose whether to have data symbols at the data values or not.

Numerical Summaries of Distribution

Minitab:

Stat > Basic Statistics > Display Descriptive Statistics

Click-in the data column(s) for which you want to get numerical summaries into the Variables box. To choose what numerical summaries you want reported, click the Statistics button, place checkmarks next to all desired measures, and then click OK to close the pop-up. Click OK to have the summaries reported in the Session window.

Excel:

Select Descriptive Statistics in the Data Analysis menu box and click OK. Enter the cell range of the data into the Input Range box. Place a checkmark next to the Summary statistics option. Click OK. You will find that the first and third quartiles are not reported. If you wish to compute these quartiles, click an empty cell in the spreadsheet and then proceed to the Statistical function menu as described in the overview section of this Appendix. Scroll down the list of functions and double-click on the QUARTILE function choice. In the Array box, input the cell range of the data. In the Quart box, input the value 1 to get the first quartile or the value 3 to get the third quartile and then click OK.

Boxplots

Minitab:

Graph > Boxplot

If you have only one variable, select One Y Simple for the type of boxplot, then click OK. If you have multiple boxplots that you want to display together, as in Figure 1.15, select Multiple Ys Simple for the type of boxplot, then click OK. In either case, click-in the data column(s) for which you want to construct boxplots into the Graph variables box. Click OK.

Excel:

Boxplots are not available in standard Excel, but they are available in the enhanced add-on version of Excel.

Normal Distribution

Minitab:

Graph > Probability Distribution Plot

This pull-down sequence will allow you to visualize areas under the Normal curve. Select View Probability and then click OK. The standard Normal distribution is the default distribution. You can change the values for the mean and/or standard deviation. Now click the Shaded Area tab. If you want to find the area under the curve associated with a specified value, select the X Value option. You can choose to find the area to the left or right of that specified value or even between two values by clicking the appropriate picture. You then enter the specified value(s) in the X value box. Click OK. As an exercise, you should be able to reproduce Examples 1.35, 1.36, and 1.37 (pages 51-52). To do inverse Normal calculations, select the Probability option rather than the X Value option. Depending on whether you are considering the area to the left or to the right of a value, enter the desired area in the Probability box and click OK. If more accurate reporting of numbers is desired, then you can consider the following pull-down sequence:

Calc > Probability Distributions > Normal

Choose the Cumulative probability option if you wish to find the area to the left of a specified value. Choose the Inverse cumulative probability option if you wish to find the value associated with a specified area to the left of that value. You can then select the Input constant option. In the box next to this option enter the specified value of x or z or enter the specified area. Click OK to find the results reported in the Session window.

Excel:

Excel does not provide a means to visualize areas under the Normal curve, but it can compute areas under the Normal curve or work backward. In either case, click an empty cell in the spreadsheet and then proceed to the
Statistical function menu as described in the overview section of this Appendix. If you wish to find the area to the left of a specified value under the standard Normal curve, then scroll down the list of functions and double-click on the NORMSDIST function choice. Type the value of z in the Z box and click OK. To do inverse standard Normal calculations, double-click on the NORMSINV function choice. Type the specified area in the Probability box and click OK.

Normal Quantile Plots

Minitab:

Stat > Basic Statistics > Normality Test

This pull-down sequence will produce a Normal probability plot. As noted in this chapter, there is a bit of a technical distinction between a Normal quantile plot and a Normal probability plot. However, the interpretation is the same in that the closer the data points plot to a straight line, the closer is the conformity to the Normal distribution. Upon doing the noted pull-down sequence, click-in the data column of interest into the Variable box and then click OK.

Excel:

Neither Normal quantile plots nor Normal probability plots are available in standard Excel, but Normal probability plots are available in the enhanced add-on version of Excel.
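
For readers who have neither tool at hand, the points of a Normal quantile plot can also be computed directly. The Python sketch below is not part of Minitab or Excel. It assumes the plotting position (i - 0.5)/n for the i-th smallest of n observations, which is one common convention (software packages may use a slightly different formula). The function name and the sample values are made up for illustration.

```python
from statistics import NormalDist


def normal_quantile_points(data):
    """Pair each sorted observation with its standard Normal score."""
    n = len(data)
    std_normal = NormalDist()  # mean 0, standard deviation 1
    xs = sorted(data)
    # Assumed plotting position: (i - 0.5) / n for the i-th smallest value.
    return [
        (std_normal.inv_cdf((i - 0.5) / n), x)
        for i, x in enumerate(xs, start=1)
    ]


sample = [4.1, 5.3, 4.8, 6.0, 5.1, 4.6, 5.7]  # hypothetical data
points = normal_quantile_points(sample)
for z, x in points:
    print(f"{z:7.3f}  {x}")
```

Plotting the observations against their Normal scores gives the quantile plot. As described above, the closer the points lie to a straight line, the closer the data conform to a Normal distribution.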