CHAPTER 1 Examining Distributions
Some variables, like the name of a song and the artist, simply place cases into
categories. Others, like the length of a song, take numerical values for which we can do
arithmetic. It makes sense to give an average length of time for a collection of songs, but
it does not make sense to give an average album. We can, however, count the numbers
of songs for different albums, and we can do arithmetic with these counts.
An appropriate label for your cases should be chosen carefully. In our iTunes example,
a natural choice of a label would be the name of the song. However, if you have
more than one artist performing the same song, or the same artist performing the same
song on different albums, then the name of the song would not uniquely label each of
the songs in your playlist.
A quantitative variable such as the time in the iTunes playlist requires some special
attention before we can do arithmetic with its values. The first song in the playlist has
time equal to 3:29, that is, 3 minutes and 29 seconds. To do arithmetic with this variable,
we should first convert all of the values so that they have a single unit of measurement.
We could convert to seconds; 3 minutes is 180 seconds, so the total time is 180 + 29, or
209 seconds. An alternative would be to convert to minutes; 29 seconds is 0.483 minutes,
so the time calculated in this way is 3.483 minutes.
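The two conversions can be sketched in a few lines of Python. This is an illustration only; the function names are hypothetical, and the sketch assumes every time is written as minutes:seconds.

```python
def playlist_time_to_seconds(time_text):
    """Convert a 'minutes:seconds' string such as '3:29' to whole seconds."""
    minutes, seconds = time_text.split(":")
    return int(minutes) * 60 + int(seconds)


def seconds_to_minutes(total_seconds):
    """Express a time given in seconds as decimal minutes."""
    return total_seconds / 60


# The first song in the playlist has time 3:29.
first_song_seconds = playlist_time_to_seconds("3:29")
first_song_minutes = seconds_to_minutes(first_song_seconds)

print(first_song_seconds)            # 209
print(round(first_song_minutes, 3))  # 3.483
```

Either unit works for arithmetic, as long as every value in the variable uses the same one.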
1.1 Time in the iTunes playlist. In the iTunes playlist, do you prefer to convert the
time to seconds or minutes? Give a reason for your answer.
Moore-3620020 psbe August 16, 2010 23:30
Microsoft Excel
     A     B       C       D          E       F         G             H
1    ID    Exam1   Exam2   Homework   Final   Project   TotalPoints   Grade
2    101   89      94      88         87      95        899           A
3    102   78      84      90         89      94        866           B
4    103   71      80      75         79      95        780           C
5    104   95      98      97         96      93        962           A
6    105   79      88      85         88      96        861           B
1.2 Who, what, and why for the statistics class data. Answer the Who, What, and
Why questions for the statistics class data set.
1.3 Read the spreadsheet. Refer to Figure 1.2. Give the values of the variables
Exam1, Exam2, and Final for the student with ID equal to 103.
1.4 Calculate the grade. A student whose data do not appear on the spreadsheet
scored 88 on Exam1, 85 on Exam2, 77 for Homework, 90 on the Final, and 80 on the
Project. Find TotalPoints for this student and give the grade earned.
The display in Figure 1.2 is from an Excel spreadsheet. Spreadsheets are very useful
for doing the kind of simple computations that you did in Exercise 1.4. You can type in
a formula and have the same computation performed for each row.
Note that the names we have chosen for the variables in our spreadsheet do not
have spaces. For example, we could have used the name Exam 1 for the first exam
score rather than Exam1. In some statistical software packages, however, spaces are not
allowed in variable names. For this reason, when creating spreadsheets for eventual use
with statistical software, it is best to avoid spaces in variable names. Another convention
is to use an underscore ( _ ) where you would normally use a space. For our data set, we
could use Exam_1, Exam_2, and Final_Exam.
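If a spreadsheet already has column labels that contain spaces, the underscore convention can be applied mechanically. The following Python sketch is one way to do it; the helper name to_variable_name is hypothetical, and the labels are the ones mentioned above.

```python
def to_variable_name(label):
    """Replace each run of whitespace in a column label with one underscore."""
    return "_".join(label.split())


labels = ["Exam 1", "Exam 2", "Final Exam"]
names = [to_variable_name(label) for label in labels]

print(names)  # ['Exam_1', 'Exam_2', 'Final_Exam']
```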
EXAMPLE 1.4 Cases and Variables for the Statistics Class Data
The data set in Figure 1.2 was constructed to keep track of the grades for students in an introductory
statistics course. The cases are the students in the class. There are 8 variables in this data set. These
include an identifier for each student and scores for the various course requirements. There are no
units of measurement for ID and grade; they are categorical variables. The other variables all are
measured in points; since it makes sense to do arithmetic with these values, these variables are
quantitative variables.
In our example, the possible values for the grade variable are A, B, C, D, and F.
When computing grade point averages, many colleges and universities translate these
letter grades into numbers using A = 4, B = 3, C = 2, D = 1, and F = 0. The transformed
variable with numeric values is considered to be quantitative because we can average
the numerical values across different courses to obtain a grade point average.
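As a small illustration of this translation, the Python sketch below averages the numeric equivalents of a set of letter grades. The four grades are hypothetical, and the sketch assumes every course counts equally; many colleges weight courses by credit hours, which is not shown here.

```python
# Translation of letter grades to numbers, as given in the text.
GRADE_POINTS = {"A": 4, "B": 3, "C": 2, "D": 1, "F": 0}


def grade_point_average(letter_grades):
    """Average the numeric equivalents of a list of letter grades."""
    points = [GRADE_POINTS[grade] for grade in letter_grades]
    return sum(points) / len(points)


# Hypothetical grades for one student in four courses.
hypothetical_grades = ["A", "B", "B", "C"]
gpa = grade_point_average(hypothetical_grades)

print(gpa)  # 3.0
```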
Sometimes, experts argue about numerical scales such as this. They ask whether
or not the difference between an A and a B is the same as the difference between a D
and an F. Similarly, many questionnaires ask people to respond on a 1 to 5 scale with 1
representing strongly agree, 2 representing agree, etc. Again we could ask whether or not
the five possible values for this scale are equally spaced in some sense. From a practical
point of view, the averages that can be computed when we convert categorical scales such
as these to numerical values frequently provide a very useful way to summarize data.
1.5 Apartment rentals for students. A data set lists apartments available for students
to rent. Information provided includes the monthly rent, whether or not cable is included
free of charge, whether or not pets are allowed, the number of bedrooms, and the
distance to the campus. Describe the cases in the data set, give the number of variables,
and specify whether each variable is categorical or quantitative.
Begin by examining each variable by itself. Then move on to study the relationships
among the variables.
Begin with a graph or graphs. Then add numerical summaries of specific aspects of
the data.
We will follow these principles in organizing our learning. This chapter presents methods
for describing a single variable. We study relationships among two or more variables in
Chapter 2. Within each chapter, we begin with graphical displays, then add numerical
summaries for a more complete description.
GPS receivers to determine the exact location of the receiver. Here are the market shares for the
major GPS receiver brands sold in the United States.2
Company    Percent
Garmin     47
TomTom     19
Magellan   17
Mio        7
Other      10
Company is the categorical variable in this example, and the values of this variable are the names
of the companies that provide GPS receivers in this market.
Note that the last value of the variable Company is Other, which includes all
receivers sold by companies other than the four listed by name. For data sets that have a
large number of values for a categorical variable, we often create a category such as this
that includes categories that have relatively small counts or percents. Careful judgment is
needed when doing this. You don't want to cover up some important piece of information
contained in the data by combining data in this way.
When we look at the GPS market share data set, we see that Garmin dominates the
market with almost half of the sales. By using graphical methods, we can easily see this
information and other characteristics of the data. We now examine two graphical
ways to do this.
EXAMPLE 1.8 Bar Graph for the GPS Market Share Data
Figure 1.3 displays the GPS market share data using a bar graph. The heights of the five bars
show the market shares for the four companies and the Other category.
[Figure 1.3. Bar graph of the GPS market share data.]
The categories in a bar graph can be put in any order. In Figure 1.3, we ordered
the companies based on their market share, with the Other category coming last. For
other data sets, an alphabetical ordering or some other arrangement might produce a
more useful graphical display. You should always consider the best way to order the
values of the categorical variable in a bar graph. Choose an ordering that will be useful
to you. If you are uncertain, ask a friend whether your choice communicates what you
expect.
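A rough text version of such a bar graph can be produced without any graphing software. The Python sketch below uses the GPS market shares from the table above and draws one # per percentage point; the function name text_bar_graph is hypothetical. The categories are drawn in the order in which they are listed, so changing the ordering only requires listing them differently.

```python
# GPS market shares (percent) from the table in the text.
market_share = {"Garmin": 47, "TomTom": 19, "Magellan": 17, "Mio": 7, "Other": 10}


def text_bar_graph(percents):
    """Return one line per category, with one '#' per percentage point."""
    width = max(len(name) for name in percents)
    lines = []
    for name, percent in percents.items():
        lines.append(f"{name:<{width}} | {'#' * percent} {percent}%")
    return lines


bar_lines = text_bar_graph(market_share)
print("\n".join(bar_lines))
```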
EXAMPLE 1.9 Pie Chart for the GPS Market Share Data
The pie chart in Figure 1.4 helps us see what part of the whole each group forms. Even if we did
not include the percents, it would be very easy to see that Garmin has about half of the market.
[Figure 1.4. Pie chart of the GPS market share data: Garmin 47%, TomTom 19%, Magellan 17%, Mio 7%, Other 10%.]
To make a pie chart, you must include all the categories that make up a whole. A
category such as Other in this example can be used, but the sum of the percents for all
of the categories should be 100%.
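Before drawing a pie chart, it is easy to check that the categories make up a whole. The Python sketch below is one way to do this; the function name forms_a_whole and the tolerance of half a percentage point, which allows for rounded percents, are assumptions made for this illustration.

```python
def forms_a_whole(percents, tolerance=0.5):
    """Return True when the category percents add up to 100, within rounding."""
    return abs(sum(percents.values()) - 100) <= tolerance


# GPS market shares (percent) from the text.
gps_shares = {"Garmin": 47, "TomTom": 19, "Magellan": 17, "Mio": 7, "Other": 10}

# The same data without the Other category no longer make up a whole.
named_only = {name: pct for name, pct in gps_shares.items() if name != "Other"}

print(forms_a_whole(gps_shares))  # True
print(forms_a_whole(named_only))  # False
```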
Bar graphs are more flexible. For example, you can use a bar graph to compare the
numbers of students at your college majoring in biology, business, and political science.
A pie chart cannot make this comparison, because not all students fall into one of these
three majors.
We use graphical displays to help us learn things from data. Here is another example.
Auto accidents cost $164 billion each year.3 How can this enormous burden on the economy
be reduced? Let's look at some data.4 Figure 1.5 is a bar graph that gives the percents of auto
accidents for each day of the week. What do we learn from this graph? The highest percent is
on Saturday, about 17%, and the lowest is on Monday, about 10%. If we were to seek government
funding for a program to reduce accidents, we might do some research on the Saturday
accidents.
[Figure 1.5. Bar graph of the percents of auto accidents for each day of the week. Vertical axis: Percent.]
The categories in Figure 1.5 are ordered by the days of the week, Monday through
Sunday. In exploring what these data tell us about accidents, we focused on the day of the
week with the highest percent of accidents. Let's pursue this idea a little further and order
the categories from highest percent to lowest percent. A bar graph whose categories are
ordered from most frequent to least frequent is called a Pareto chart.5
Figure 1.6 displays the Pareto chart for the automobile accident data. Here it is easy to see that
Saturday is the highest. Friday, Wednesday, and Thursday are also relatively high. Tuesday and
Sunday are a bit lower. Monday is the lowest.
[Figure 1.6. Pareto chart of the percents of auto accidents for each day of the week.]
Pareto charts are frequently used in quality control settings. Here, the purpose is
often to identify common types of defects in a manufactured product. Deciding upon
strategies for corrective action can then be based on what would be most effective.
Chapter 12 gives more examples of settings where Pareto charts are used.
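The ordering that turns a bar graph into a Pareto chart is a sort from most frequent to least frequent. Because the individual accident percents are not listed in the text, the Python sketch below reuses the GPS market shares given earlier in the chapter; the text bars are only a stand-in for a real graph.

```python
# GPS market shares (percent) from earlier in the chapter.
market_share = {"Garmin": 47, "TomTom": 19, "Magellan": 17, "Mio": 7, "Other": 10}

# Order the categories from the largest percent to the smallest.
pareto_order = sorted(market_share.items(), key=lambda item: item[1], reverse=True)

for name, percent in pareto_order:
    print(f"{name:<8} | {'#' * percent} {percent}%")
```

Note that a strict ordering by size moves the Other category ahead of Mio, whereas Figure 1.3 placed Other last. Which choice is more useful is a matter of judgment.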
Bar graphs, pie charts, and Pareto charts help an audience grasp a distribution
quickly. When you prepare them, keep this purpose in mind. We will move on to quantitative
variables, where graphs are essential tools.
APPLY YOUR KNOWLEDGE
1.6 Population of Canadian provinces and territories. Here are populations of
13 Canadian provinces and territories based on the 2006 census:6
Province/territory Population
Alberta 3,290,350
British Columbia 4,113,487
Manitoba 1,148,401
New Brunswick 729,997
Newfoundland and Labrador 505,469
Northwest Territories 41,464
Nova Scotia 913,462
Nunavut 29,474
Ontario 12,160,282
Prince Edward Island 135,851
Quebec 7,546,131
Saskatchewan 968,157
Yukon 30,372
(a) Display these data in a bar graph using the alphabetical order of provinces and
territories in the table.
(b) Use a Pareto chart to display these data.
(c) Compare the two graphs. Which do you prefer? Give a reason for your answer.
1.7 GPS market share in Europe. In Examples 1.7 to 1.9 (pages 8 to 9), we examined
the U.S. market share of several companies that sell GPS receivers. Here is a similar
(a) Display the data in a bar graph. Be sure to choose the ordering for the companies
carefully. Explain why you made this choice.
(b) Compare this graph with the bar graph in Figure 1.3. Garmin has its world
headquarters in Olathe, Kansas, while TomTom's registered address is
Amsterdam, the Netherlands. Explain how this information helps you to
understand the differences between the two bar graphs.
CASE 1.1 Treasury Bills
Treasury bills, also known as T-bills, are bonds issued by the U.S. Department
of the Treasury. You buy them at a discount from their face value, and they mature in a
fixed period of time. For example, you might buy a $1000 T-bill for $980. When it matures,
six months later, you would receive $1000, that is, your original $980 investment plus $20 interest.
This interest rate is $20 divided by $980, which is 2.04% for six months. Interest is usually
reported as a rate per year, so for this example the interest rate would be 4.08%. Rates are
determined by an auction that is held every four weeks. The data set contains the interest
rates for T-bills for each auction from December 12, 1958, to October 3, 2008.8
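The arithmetic in this example can be written out as a short Python sketch. The function names are hypothetical, and the annual rate is obtained by simple multiplication (two six-month periods per year) with no compounding, which matches the 4.08% figure in the text; other conventions for quoting T-bill rates exist and are not shown here.

```python
def period_rate_percent(face_value, purchase_price):
    """Interest earned at maturity as a percent of the amount invested."""
    interest = face_value - purchase_price
    return 100 * interest / purchase_price


def annual_rate_percent(period_rate, periods_per_year):
    """Simple (non-compounded) yearly rate from a per-period rate."""
    return period_rate * periods_per_year


# A $1000 T-bill bought for $980 that matures in six months.
six_month_rate = period_rate_percent(1000, 980)
annual_rate = annual_rate_percent(six_month_rate, 2)

print(round(six_month_rate, 2))  # 2.04
print(round(annual_rate, 2))     # 4.08
```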
Our data set contains 2600 cases. The two variables in the data set are the date of
the auction and the interest rate. To learn something about T-bill interest rates, we begin
with a histogram.
CASE 1.1 To make a histogram of the T-bill interest rates, we proceed as follows.
Step 1. Divide the range of the interest rates into classes of equal width. The T-bill interest rates
range from 0.85% to 15.76%, so we choose as our classes
    0.00 ≤ rate < 2.00
    2.00 ≤ rate < 4.00
    ...
    14.00 ≤ rate < 16.00
Be sure to specify the classes precisely so that each case falls into exactly one class. An interest
rate of 1.98% would fall into the first class, but 2.00% would fall into the second.
Step 2. Count the number of cases in each class. Here are the counts:
Step 3. Draw the histogram. Mark on the horizontal axis the scale for the variable whose distribution
you are displaying. The variable is interest rate in this example. The scale runs from
0 to 16 to span the data. The vertical axis contains the scale of counts. Each bar represents a class.
The base of the bar covers the class, and the bar height is the class count. Notice that the scale on
the vertical axis runs from 0 to 1000 to accommodate the tallest bar, which has a height of 951.
There is no horizontal space between the bars unless a class is empty, so that its bar has height
zero. Figure 1.7 is our histogram.
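Steps 1 and 2 can be sketched in Python. The sketch assigns each rate to the class whose lower boundary it equals or exceeds and whose upper boundary it falls below, so that every case lands in exactly one class. The function name class_of is hypothetical, and the list holds only the handful of rates quoted in this chapter, not the full data set of 2600 auctions.

```python
import math
from collections import Counter

CLASS_WIDTH = 2.0  # classes 0 to 2, 2 to 4, ..., 14 to 16, as in Step 1


def class_of(rate, width=CLASS_WIDTH):
    """Return (lower, upper) for the class with lower <= rate < upper."""
    lower = math.floor(rate / width) * width
    return (lower, lower + width)


# A few interest rates (%) quoted in this chapter, for illustration only.
rates = [0.85, 1.98, 2.00, 15.58, 15.68, 15.72, 15.76]

# Step 2: count the number of cases in each class.
counts = Counter(class_of(rate) for rate in rates)

print(class_of(1.98))  # (0.0, 2.0)
print(class_of(2.00))  # (2.0, 4.0)
for bounds in sorted(counts):
    print(bounds, counts[bounds])
```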
[Figure 1.7. Histogram of the T-bill interest rates. Horizontal axis: Interest rate (%); vertical axis: Count.]
Our eyes respond to the area of the bars in a histogram.9 Because the classes are
all the same width, area is determined by height and all classes are fairly represented.
There is no one right choice of the classes in a histogram. Too few classes will give
a "skyscraper" graph, with all values in a few classes with tall bars. Too many will
produce a "pancake" graph, with most classes having one or no observations. Neither
choice will give a good picture of the shape of the distribution. You must always use your
judgment in choosing classes to display the shape. Statistics software will choose the
classes for you. The computer's choice is usually a good one. Sometimes, however, the
classes chosen by software differ from the natural choices that you would make. Usually,
options are available for you to change them. The next example illustrates a situation
where the wrong choice of classes will cause you to miss a very important characteristic
of a data set.
Many businesses operate call centers to serve customers who want to place an order or make an
inquiry. Customers want their requests handled thoroughly. Businesses want to treat customers
well, but they also want to avoid wasted time on the phone. They therefore monitor the length of
calls and encourage their representatives to keep calls short.
We have data on the length of all 31,492 calls made to the customer service center of a small
bank in a month. Table 1.1 displays the lengths of the first 80 calls.10
Take a look at the data in Table 1.1. In this data set the cases are calls made to the bank's call
center. The variable recorded is the length of each call. The units of measurement are seconds. We
see that the call lengths vary a great deal. The longest call lasted 2631 seconds, almost 44 minutes.
More striking is that 8 of these 80 calls lasted less than 10 seconds. What's going on?
TABLE 1.1 Service times (seconds) for calls to a customer service center
77 289 128 59 19 148 157 203
126 118 104 141 290 48 3 2
372 140 438 56 44 274 479 211
179 1 68 386 2631 90 30 57
89 116 225 700 40 73 75 51
148 9 115 19 76 138 178 76
67 102 35 80 143 951 106 55
4 54 137 367 277 201 52 9
700 182 73 199 325 75 103 64
121 11 9 88 1148 2 465 25
We started our study of the customer service center data by examining a few cases,
the ones displayed in Table 1.1. It would be very difficult to examine all 31,492 cases in
this way. We need a better method. Let's try a histogram.
Figure 1.8 is a histogram of the lengths of all 31,492 calls. We did not plot the few lengths greater
than 1200 seconds (20 minutes). As expected, the graph shows that most calls last between about 1
and 5 minutes, with some lasting much longer when customers have complicated problems. More
striking is the fact that 7.6% of all calls are no more than 10 seconds long. It turns out that the bank
penalized representatives whose average call length was too long, so some representatives just
hung up on customers in order to bring their average length down. Neither the customers nor the
bank were happy about this. The bank changed its policy, and later data showed that calls under
10 seconds had almost disappeared.
[Figure 1.8. Histogram of the lengths of all 31,492 calls. Horizontal axis: Service time (seconds).]
The choice of the classes is an important part of making a histogram. Let's look at
the customer service center call lengths again.
EXAMPLE 1.15 Another Histogram for Customer Service Center Call Lengths
Figure 1.9 is a histogram of the lengths of all 31,492 calls with class boundaries of 0, 100,
200, etc. seconds. Statistical software made this choice as a default option. Notice that the spike
representing the very brief calls that appears in Figure 1.8 is covered up in the 0 to 100 seconds
class.
[Figure 1.9. Histogram of the lengths of all 31,492 calls with class boundaries of 0, 100, 200, etc. seconds. Horizontal axis: Service time (seconds).]
If we let software choose the classes, we would miss one of the most important
features of the data, the calls of very short duration. We were alerted to this unexpected
characteristic of the data by our examination of the 80 cases displayed in Table 1.1.
Beware of letting statistical software do your thinking for you. Example 1.15 illustrates
the danger of doing this. To do an effective analysis of data, we often need to look at
data in more than one way. For histograms, looking at several choices of classes will
lead us to a good choice. Fortunately, with software, examining choices such as this is
relatively easy.
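The effect of the class width can be seen even in the 80 calls of Table 1.1. The Python sketch below counts those calls using classes that are 100 seconds wide and then classes that are 10 seconds wide. The function name class_counts is hypothetical, and the 80 calls are only the first rows of the data, so the counts illustrate the idea rather than reproduce Figures 1.8 and 1.9.

```python
from collections import Counter

# Service times (seconds) for the first 80 calls, from Table 1.1.
call_lengths = [
    77, 289, 128, 59, 19, 148, 157, 203,
    126, 118, 104, 141, 290, 48, 3, 2,
    372, 140, 438, 56, 44, 274, 479, 211,
    179, 1, 68, 386, 2631, 90, 30, 57,
    89, 116, 225, 700, 40, 73, 75, 51,
    148, 9, 115, 19, 76, 138, 178, 76,
    67, 102, 35, 80, 143, 951, 106, 55,
    4, 54, 137, 367, 277, 201, 52, 9,
    700, 182, 73, 199, 325, 75, 103, 64,
    121, 11, 9, 88, 1148, 2, 465, 25,
]


def class_counts(values, width):
    """Count values in classes lower <= value < lower + width, keyed by lower."""
    return Counter((value // width) * width for value in values)


wide = class_counts(call_lengths, 100)
narrow = class_counts(call_lengths, 10)

print(wide[0])     # 38 calls with 0 <= length < 100
print(narrow[0])   # 8 calls with 0 <= length < 10
print(narrow[10])  # 3 calls with 10 <= length < 20
```

With the wide classes, the 8 very short calls are hidden among the 38 calls in the first class. With the narrow classes, the first class stands well above its neighbor.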
1.8 Exam grades in a statistics course. The table below summarizes the exam scores
of students in an introductory statistics course. Use the summary to sketch a histogram
that shows the distribution of scores.
Class                Count
60 ≤ score < 70      11
70 ≤ score < 80      36
80 ≤ score < 90      57
90 ≤ score < 100     29
1.9 Suppose some students scored 100. No students earned a perfect score of 100
on the exam described in the previous exercise. Note that the last class included only
scores that were greater than or equal to 90 and less than 100. Explain how you would
change the class definitions for a similar exam on which some students earned a perfect
score.
Stemplot
To make a stemplot:
1. Separate each observation into a stem consisting of all but the final (rightmost)
digit and a leaf, the final digit. Stems may have as many digits as needed, but
each leaf contains only a single digit.
2. Write the stems in a vertical column with the smallest at the top, and draw a
vertical line at the right of this column.
3. Write each leaf in the row to the right of its stem, in increasing order out from
the stem.
CASE 1.1 The histogram that we produced in Example 1.12 to examine the T-bill interest rates used all 2600
cases in the data set. To illustrate the idea of a stemplot, we will take a simple random sample of
size 50 from this data set. We will learn more about how to take such samples in Chapter 3. Here
are the data:
7.2 5.7 6.0 5.0 12.8 7.8 11.6 4.6 2.7 4.9
5.8 13.8 1.5 4.6 3.7 8.3 7.0 3.2 5.8 1.0
7.2 8.0 3.2 7.5 5.4 5.3 6.9 5.8 5.0 9.4
10.4 4.3 6.8 1.0 5.5 5.1 4.6 6.6 4.7 6.1
5.7 1.0 3.8 7.3 6.5 3.0 3.9 8.0 3.0 7.9
The original data set gave the interest rates with two digits after the decimal point. To make the job
of preparing our stemplot easier, we first rounded the values to one place following the decimal.
Figure 1.10 illustrates the key steps in constructing the stemplot for these data. How does the
stemplot for this sample of size 50 compare with the histogram based on all 2600 interest rates
that we examined in Figure 1.7 (page 13)?
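The three steps for making a stemplot can also be carried out in Python. The sketch below is one possible implementation, with a hypothetical function name; it assumes the values have already been rounded to one decimal place, so that the stem is the whole-number part and the leaf is the tenths digit.

```python
# The sample of 50 T-bill interest rates (%) listed above.
rates = [
    7.2, 5.7, 6.0, 5.0, 12.8, 7.8, 11.6, 4.6, 2.7, 4.9,
    5.8, 13.8, 1.5, 4.6, 3.7, 8.3, 7.0, 3.2, 5.8, 1.0,
    7.2, 8.0, 3.2, 7.5, 5.4, 5.3, 6.9, 5.8, 5.0, 9.4,
    10.4, 4.3, 6.8, 1.0, 5.5, 5.1, 4.6, 6.6, 4.7, 6.1,
    5.7, 1.0, 3.8, 7.3, 6.5, 3.0, 3.9, 8.0, 3.0, 7.9,
]


def stemplot(values):
    """Return stemplot rows: stem = all but the final digit, leaf = final digit."""
    # Work in tenths so that 7.2 becomes the integer 72 (stem 7, leaf 2).
    tenths = sorted(round(value * 10) for value in values)
    leaves_by_stem = {}
    for number in tenths:
        stem, leaf = divmod(number, 10)
        leaves_by_stem.setdefault(stem, []).append(leaf)
    rows = []
    # List every stem from smallest to largest, even one with no leaves.
    for stem in range(tenths[0] // 10, tenths[-1] // 10 + 1):
        leaves = "".join(str(leaf) for leaf in leaves_by_stem.get(stem, []))
        rows.append(f"{stem:>2} | {leaves}")
    return rows


rows = stemplot(rates)
print("\n".join(rows))
```

Sorting the values first places the leaves in increasing order out from each stem, as step 3 requires.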
You can choose the classes in a histogram. The classes (the stems) of a stemplot are
given to you. When the observed values have many digits, it is often best to round the
numbers to just a few digits before making a stemplot, as we did in Example 1.16.
You can also split stems to double the number of stems when all the leaves would
otherwise fall on just a few stems. Each stem then appears twice. Leaves 0 to 4 go on
the upper stem and leaves 5 to 9 go on the lower stem. Rounding and splitting stems are
matters for judgment, like choosing the classes in a histogram. Stemplots work well for
small sets of data. When there are more than 100 observations, a histogram is almost
always a better choice.
Special considerations apply for very large data sets. It is often useful to take a
sample and examine it in detail as a first step. This is what we did in Example 1.16.
Sampling can be done in many different ways. A company with a very large number of
customer records, for example, might look at those from a particular region or country
for an initial analysis.
Examining a Distribution
In any graph of data, look for the overall pattern and for striking deviations from that
pattern.
You can describe the overall pattern of a histogram by its shape, center, and spread.
An important kind of deviation is an outlier, an individual value that falls outside the
overall pattern.
We will learn how to describe center and spread numerically in Section 1.2. For now,
we can describe the center of a distribution by its midpoint, the value with roughly half
the observations taking smaller values and half taking larger values. We can describe the
spread of a distribution by giving the smallest and largest values.
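These simple descriptions are easy to compute. The Python sketch below uses the five TotalPoints values from the spreadsheet in Figure 1.2. Taking the median as the midpoint is an assumption made for this illustration; it is one value that has roughly half the observations below it and half above it.

```python
from statistics import median

# TotalPoints for the five students shown in Figure 1.2.
total_points = [899, 866, 780, 962, 861]

# Center: a value with about half the observations smaller and half larger.
midpoint = median(total_points)

# Spread: the smallest and largest values.
spread = (min(total_points), max(total_points))

print(midpoint)  # 866
print(spread)    # (780, 962)
```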
CASE 1.1 Let's look again at the histogram in Figure 1.7. There appear to be some relatively large interest
rates. The largest is 15.76%. What do we think about this value? Is it so extreme relative to the
other values that we would call it an outlier? To qualify for this status an observation should stand
apart from the other observations either alone or with very few other cases. A careful examination
of the data indicates that this 15.76% does not qualify for outlier status. There are interest rates of
15.72%, 15.68%, and 15.58%. In fact, there are 15 auctions with interest rates of 15% or higher.
The distribution has a single peak at around 5%. The distribution is somewhat right-skewed;
that is, the right tail extends farther from the peak than does the left tail.
When you describe a distribution, concentrate on the main features. Look for major
peaks, not for minor ups and downs in the bars of the histogram. Look for clear outliers,
not just for the smallest and largest observations. Look for rough symmetry or clear
skewness.
Figure 1.11 displays a histogram of the IQ scores of 60 fifth-grade students. There is a single peak
around 110 and the distribution is approximately symmetric. The tails decrease smoothly as we
move away from the peak. Measures such as this are usually constructed so that they have nice
[Figure 1.11. Histogram of the IQ scores of 60 fifth-grade students. Horizontal axis: IQ score.]
1.10 Make a stemplot. Make a stemplot for a distribution that has a single peak,
approximately symmetric with one high and two low outliers.
1.11 Make another one. Make a stemplot of a distribution that is skewed toward
large values.
Time plots
Many variables are measured at intervals over time. We might, for example, measure the
cost of raw materials for a manufacturing process each month or the price of a stock at
the end of each day. In these examples, our main interest is change over time. To display
change over time, make a time plot.
Time Plot
A time plot of a variable plots each observation against the time at which it was
measured. Always put time on the horizontal scale of your plot and the variable you are
measuring on the vertical scale. Connecting the data points by lines helps emphasize any
change over time.
More details about how to analyze data that vary over time are given in Chapter 13,
Time Series Forecasting. For now, we will examine how a time plot can reveal some
additional important information about T-bill interest rates.
CASE 1.1 The Web site of the Federal Reserve Bank of St. Louis provided a very interesting graph of T-bill
interest rates.11 It is shown in Figure 1.12. A time plot shows us the relationship between two
variables, in this case interest rate and the auctions that occurred at four-week intervals. Notice
how the Federal Reserve Bank included information about a third variable in this plot. The third
variable is a categorical variable that indicates whether or not the United States was in a recession.
It is indicated by the shaded areas in the plot.
CASE 1.1 1.12 What does the time plot show? Carefully examine the time plot in
Figure 1.12.
(a) How do the T-bill interest rates vary over time?
(b) What can you say about the relationship between the rates and the recession
periods?
[Figure 1.12. Time plot of T-bill interest rates, with recession periods shaded. Vertical axis: Percent; horizontal axis: year, 1950 to 2010.]
In Example 1.12 (page 12) we examined the distribution of T-bill interest rates for
the period December 12, 1958, to October 3, 2008. The histogram in Figure 1.7 showed
us the shape of the distribution. By looking at the time plot in Figure 1.12, we now
see that there is more to this data set than is revealed by the histogram. This scenario
illustrates the types of steps used in an effective statistical analysis of data. We are rarely
able to completely plan our analysis in advance, set up the appropriate steps to be taken,
and then click on the appropriate buttons in a software package to obtain useful results.
An effective analysis requires that we proceed in an organized way, use a variety of
analytical tools as we proceed, and exercise careful judgment at each step in the process.
A data set contains information on a number of cases. Cases may be people, animals,
or things. For each case, the data give values for one or more variables. A variable
describes some characteristic of an individual, such as a person's height, gender, or
salary. Variables can have different values for different cases.
Some variables are categorical and others are quantitative. A categorical variable
places each case into a category, such as male or female. A quantitative variable has
numerical values that measure some characteristic of each case, such as height in
centimeters or salary in dollars per year.
Exploratory data analysis uses graphs and numerical summaries to describe the
variables in a data set and the relations among them.
The distribution of a variable describes what values the variable takes and how often
it takes these values.
To describe a distribution, begin with a graph. Bar graphs and pie charts describe
the distribution of a categorical variable, and Pareto charts identify the most important
categories for a categorical variable. Histograms and stemplots graph the
distributions of quantitative variables.
When examining any graph, look for an overall pattern and for notable deviations
from the pattern.
Shape, center, and spread describe the overall pattern of a distribution. Some distributions
have simple shapes, such as symmetric and skewed. Not all distributions
have a simple overall shape, especially when there are few observations.
Outliers are observations that lie outside the overall pattern of a distribution. Always
look for outliers and try to explain them.
When observations on a variable are taken over time, make a time plot that graphs
time horizontally and the values of the variable vertically. A time plot can reveal
interesting patterns in a set of data.
For Exercise 1.1, see page 4; for 1.2 to 1.4, see page 5; for 1.5,
see page 6; for 1.6 and 1.7, see pages 11-12; for 1.8 and 1.9, see
pages 15-16; for 1.10 and 1.11, see page 19; and for 1.12, see
page 19.
1.13 Employee application data. The personnel department
keeps records on all employees in a company. Here is the
information that they keep in one of their data files: employee
identification number, last name, first name, middle initial, department,
1.14 Where should you locate your business? You are
interested in choosing a new location for your business. Create a
list of criteria that you would use to rank cities. Include at least
eight variables and give reasons for your choices. Classify each
variable as quantitative or categorical.
1.15 Survey of students. A survey of students in an introductory
statistics class asked the following questions: (a) age; (b) do
you like to dance? (yes, no); (c) can you play a musical instrument
(not at all, a little, pretty well); (d) how much did you spend
on food last week? (e) height; (f) do you like broccoli? (yes, no).
Classify each of these variables as categorical or quantitative and
give reasons for your answers.
1.16 What questions would you ask? Refer to the previous
exercise. Make up your own survey questions with at least six
questions. Include at least two categorical variables and at least
two quantitative variables. Tell which variables are categorical
and which are quantitative. Give reasons for your answers.
1.17 Study habits of students. You are planning a survey
to collect information about the study habits of college
[...] quantitative variables that you might measure for each student.
Give the units of measurement for the quantitative variables.
1.18 What color should you use for your product? What
is your favorite color? One survey produced the following
summary of responses to that question: blue, 42%; green, 14%;
purple, 14%; red, 8%; black, 7%; orange, 5%; yellow, 3%; brown,
3%; gray, 2%; and white, 2%.12 Make a bar graph of the percents
and write a short summary of the major features of your graph.
Data file: FAVORITECOLORS
1.20 Market share doubles in a year. The market share of
iPhones doubled from 5.3% to 10.8% between the first quarter of
2008 and the first quarter of 2009.13 One of the attractions of the
iPhone is the Web browser, which they market as the most
advanced Web browser on a mobile device. Users of iPhones were
asked to respond to the statement "I do a lot more browsing on
the iPhone than I did on my previous mobile phone." Here are the
results:14
Response             Percent
Strongly agree          54
Mildly agree            22
Mildly disagree         16
Strongly disagree        8
(a) Make a bar graph to display the distribution of the responses.
(b) Display the distribution with a pie chart.
(c) Summarize the information in these charts.
(d) Do you prefer the bar graph or the pie chart? Give a reason
for your answer.
1.21 What did the iPhone replace? The survey in the previous
exercise also asked iPhone users what phone, if any, did the
iPhone replace. Here are the responses:
Response   Percent
Make a bar graph for these data. Carefully consider how you
will order the responses. Explain why you chose the ordering that
you did.
Data file: PHONEREPLACEMENT
1.22 Garbage is big business. The formal name for garbage
is municipal solid waste. In the United States, approximately
254 million tons of garbage are generated in a year. Below is a
breakdown of the materials that made up American municipal
Data file: GARBAGE
Search engine   Market share
(a) Use a bar graph to display the market shares.
(b) Summarize what the graph tells you about market shares for
search engines.
Data file: SEARCHENGINES
1.24 Market share for computer operating systems. The
following table gives the market share for the major computer
operating systems.17
Windows       90.29%
Mac            8.23%
iPhone         0.32%
Playstation    0.03%
SunOS          0.01%
(a) Make a bar graph of this market share data.
(b) Write a short paragraph summarizing these data.
Data file: OPERATINGSYSTEMS
[...] has been increasing rapidly. Data are available on the
increases between February 8, 2008, and September 29, 2008.20
Data file: FACEBOOKBYCOUNTRY
Color          Luxury car (%)   Intermediate-price car (%)
Black               22                  10
Silver              16                  25
White Pearl         14                   4
Gray                12                  12
White               11                   8
Blue                 7                  13
Red                  7                  10
Yellow/Gold          6                   4
Other                5                  14
(a) Make a bar graph for the luxury car percents.
(b) Make a bar graph for the intermediate-price car percents.
(c) Now, be creative: make one bar graph that compares [...]
your graph so that it is easy to compare the two types of
vehicle.
1.30 Procter & Gamble sales. The 2007 annual report of the
Procter & Gamble Company (P&G) states that global net sales
were over $76 billion. The sales information is organized into
global segments. The following summary gives the net sales for
Segment                          Net sales ($ millions)
Beauty                                 22,981
Health care                             8,964
Fabric care and home care              18,971
Baby care and family care              12,726
Snacks, coffee, and pet care            4,537
Blades and razors                       5,229
Duracell and Braun                      4,031
Summarize these data graphically and write a paragraph
describing the net sales of P&G.
Data file: PANDGSALES
1.31 Products for senior citizens. The market for products
designed for senior citizens in the United States is expanding.
Here is a stemplot of the percents of residents aged 65 and older
in the 50 states, for 2006, as estimated by the U.S. Census
Bureau.25 The stems are whole percents and the leaves are tenths of a
percent.
10 | 01
11 | 0177889
12 | 122225567799
13 | 0000112333345556699
14 | 033678
15 | 25
16 |
17 | 0
(a) There is an outlier: Florida has the highest percent of
residents aged 65 and older and clearly stands out. Alaska has the
lowest percent, but it is at the end of a relatively flat tail on the
low end of the distribution. What are the percents for these two
states?
(b) Describe the shape, center, and spread of this distribution.
Data file: POPOVER65BYSTATE
1.32 U.S. population 65 and older. Make a stemplot of the
percent of residents aged 65 and older in the states other than Alaska
and Florida by splitting stems 8 to 15 in the plot from the previous
exercise. Which plot do you prefer? Why?
Data file: POPOVER65BYSTATE
1.33 The Canadian market. Refer to Exercise 1.31. Here are
similar data for the 13 Canadian provinces and territories:26
Province/Territory            Percent over 65
Alberta                            10.7
Manitoba                           14.1
New Brunswick                      14.7
Newfoundland and Labrador          13.9
Northwest Territories               4.8
Nova Scotia                        15.1
Nunavut                             2.7
Ontario                            13.6
Prince Edward Island               14.9
Quebec                             14.3
Saskatchewan                       15.4
Yukon                               7.5
(a) Display the data graphically and describe the major features
of your plot.
(b) Explain why you chose the particular format for your
graphical display. What other types of graph could you have used? What
are the strengths and weaknesses of each for displaying this set
of data?
Data file: CANADIANPOPULATION
1.34 Left-skew. Sketch a histogram for a distribution that is
skewed to the left. Suppose that you and your friends emptied
your pockets of coins and recorded the year marked on each coin.
The distribution of dates would be skewed to the left. Explain
why.
1.35 Is the supply adequate? How much oil the wells in a
given field will ultimately produce is key information in
deciding whether to drill more wells. Here are the estimated
total amounts of oil recovered from 64 wells in the Devonian
Richmond Dolomite area of the Michigan basin, in thousands
of barrels:27
Data file: OILWELLS
(a) Because the total population in 2075 is much larger than the [...]
In particular, look at the percent of children relative to the rest of
the population.
(c) Make a histogram with vertical scale in percents of the
projected age distribution for the year 2075. Use the same scales as
in (b) for easy comparison. What are the most important changes
in the U.S. age distribution projected for the years between 1950
and 2075?
Data file: USPOPULATION
[...] sockeye salmon in runs at Bristol Bay between 1988 and 2007:29
(a) Make a graph to display the distribution of salmon run
size, then describe the pattern and any striking deviations that
you see.
(b) Make a time plot of run size and describe its pattern. As is
often the case with data measured at specific time intervals, a
time plot is needed to understand what is happening.
1.39 Watch those scales! The impression that a time plot gives
depends on the scales you use on the two axes. If you stretch the
vertical axis and compress the time axis, data appear to be more
variable. Compressing the vertical axis and stretching the time
axis make variations appear to be smaller. Make two time plots of
the data in the previous exercise to illustrate this idea. Make one
plot that makes variability appear to be larger and one plot that
makes variability appear to be smaller. The moral of this exercise
is: pay close attention to the scales when you look at a time plot.
Data file: BERINGSEAFISH
CASE 1.2 Time to Start a Business. An entrepreneur faces many bureaucratic and legal
hurdles when starting a new business. The World Bank collects information about starting
businesses throughout the world. It has determined the time, in days, to complete all of the
procedures required to start a business.30 Data for 195 countries are included in the data set.
For this section we will examine data for a sample of 24 of these countries. Here are the
data:
23 4 29 44 47 24 40 23 23 44 33 27
60 46 61 11 23 62 31 44 77 14 65 42
Data file: TIMETOSTART24
observations, the shape of the distribution is irregular. There are peaks in the 20s and the 40s. The
values range from 4 to 77 days, with a center somewhere in the middle of these two extremes.
There do not appear to be any outliers.
Data file: TIMETOSTART24
0 | 4
1 | 14
2 | 3333479
3 | 13
4 | 0244467
5 |
6 | 0125
7 | 7
FIGURE 1.13 Stemplot for sample of 24 business start times, for Example 1.20.
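The stemplot in Figure 1.13 can be rebuilt from the raw data with a few lines of code. The following is a minimal sketch, not the textbook's method: the function name `stemplot` is hypothetical, and it assumes whole-number data with the tens digit as the stem and the ones digit as the leaf.

```python
from collections import defaultdict

# Start times (days) for the 24 countries in Case 1.2.
times = [23, 4, 29, 44, 47, 24, 40, 23, 23, 44, 33, 27,
         60, 46, 61, 11, 23, 62, 31, 44, 77, 14, 65, 42]

def stemplot(values):
    """Return the rows of a stemplot (stem = tens digit, leaf = ones digit)."""
    leaves = defaultdict(list)
    for v in sorted(values):
        leaves[v // 10].append(v % 10)
    rows = []
    # Walk every stem from smallest to largest, including empty ones,
    # so that gaps in the distribution remain visible.
    for stem in range(min(leaves), max(leaves) + 1):
        row = f"{stem} | " + "".join(str(d) for d in leaves.get(stem, []))
        rows.append(row.rstrip())
    return rows

for row in stemplot(times):
    print(row)
```

Keeping the empty stem 5 in the output matters: dropping it would hide the gap between the 40s and the 60s.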
The Mean $\bar{x}$
To find the mean of a set of observations, add their values and divide by the number of
observations. If the n observations are $x_1, x_2, \ldots, x_n$, their mean is
$$\bar{x} = \frac{x_1 + x_2 + \cdots + x_n}{n}$$
or, in more compact notation,
$$\bar{x} = \frac{1}{n} \sum x_i$$
The $\Sigma$ (capital Greek sigma) in the formula for the mean is short for "add them all
up." The subscripts on the observations $x_i$ are just a way of keeping the n observations
distinct. They do not necessarily indicate order or any other special facts about the data.
The bar over the x indicates the mean of all the x-values. Pronounce the mean $\bar{x}$ as
"x-bar." This notation is very common. When writers who are discussing data use $\bar{x}$ or
$\bar{y}$, they are talking about a mean.
$$\bar{x} = \frac{23 + 4 + \cdots + 42}{24} = \frac{897}{24} = 37.375$$
Data file: TIMETOSTART24
The mean time to start a business for the 24 countries in our data set is 37.4 days. Note that we
have rounded the answer. Our goal in using the mean to describe the center of a distribution is not
to demonstrate that we can compute with great accuracy. The additional digits do not provide any
additional useful information. In fact, they distract our attention from the important digits that are
meaningful. Do you think it would be better to report the mean as 37 days?
In practice, you can key the data into your calculator and hit the Mean key. You don't
have to actually add and divide. But you should know that this is what the calculator is
doing.
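As a quick check of the arithmetic, here is a short sketch of what the calculator's Mean key does, applied to the 24 start times from Case 1.2. The variable names are illustrative only.

```python
# Start times (days) for the 24 countries in Case 1.2.
times = [23, 4, 29, 44, 47, 24, 40, 23, 23, 44, 33, 27,
         60, 46, 61, 11, 23, 62, 31, 44, 77, 14, 65, 42]

total = sum(times)   # x_1 + x_2 + ... + x_n
n = len(times)       # number of observations
x_bar = total / n    # the mean: add the values, then divide by n

print(total, n, x_bar)   # 897 24 37.375
print(round(x_bar, 1))   # 37.4, the rounded value reported in the text
```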
CASE 1.2 1.40 Include the outlier. The complete business start time data set with
195 countries has a few with very large start times. In constructing the data set for
Case 1.2 a random sample of 25 countries was selected. This sample included the
South American country of Suriname, where the start time is 694 days. This case was
deleted for Case 1.2. Reconstruct the original random sample by including Suriname.
Show that the mean has increased to 64 days. (This is a rounded number. You should
report the mean with two digits after the decimal.)
Data file: TIMETOSTART25
Exercise 1.40 illustrates an important fact about the mean as a measure of center: it
is sensitive to the influence of one or more extreme observations. These may be outliers,
but a skewed distribution that has no outliers will also pull the mean toward its long tail.
Because the mean cannot resist the influence of extreme observations, we say that it is
not a resistant measure of center.
1.41 Calls to a customer service center. The service times for 80 calls to a customer
service center are given in Table 1.1 (page 14). Use these data to compute the mean
service time.
Data file: CALLCENTER80
1.42 Find the mean of the first-exam scores. Here are the scores on the first exam
in an introductory statistics course for 10 students:
80 73 92 85 75 98 93 55 80 90
Find the mean first-exam score for these students.
Data file: STATCOURSE
The Median M
The median M is the midpoint of a distribution, the number such that half the
observations are smaller and the other half are larger. To find the median of a distribution:
1. Arrange all observations in order of size, from smallest to largest.
2. If the number of observations n is odd, the median M is the center observation in
the ordered list. Find the location of the median by counting (n + 1)/2
observations up from the bottom of the list.
3. If the number of observations n is even, the median M is the mean of the two
center observations in the ordered list. The location of the median is again
(n + 1)/2 from the bottom of the list.
Note that the formula (n + 1)/2 does not give the median, just the location of the
median in the ordered list. Medians require little arithmetic, so they are easy to find by
hand for small sets of data. Arranging even a moderate number of observations in order
is very tedious, however, so finding the median by hand for larger sets of data is
unpleasant. Even simple calculators have an $\bar{x}$ button, but you will need software or a
graphing calculator to automate finding the median.
CASE 1.2 To find the median time to start a business for our 24 countries, we first arrange the data in order
from smallest to largest:
4 11 14 23 23 23 23 24 27 29 31 33
40 42 44 44 44 46 47 60 61 62 65 77
Data file: TIMETOSTART24
The count of observations n = 24 is even. The median, then, is the average of the two center
observations in the ordered list. To find the location of the center observations, we first compute
$$\text{location of } M = \frac{n + 1}{2} = \frac{25}{2} = 12.5$$
Therefore, the center observations are the 12th and 13th observations in the ordered list. The
median is
$$M = \frac{33 + 40}{2} = 36.5$$
Note that you can use the stemplot directly to compute the median. In the stemplot
the cases are already ordered and you simply need to count from the top or the bottom
to the desired location.
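The three-step recipe for the median translates directly into code. This is a minimal sketch; the function name `median_by_location` is hypothetical, and it returns both the location (n + 1)/2 and the median so the two are not confused.

```python
def median_by_location(values):
    """Return (location, median) following the ordered-list recipe."""
    ordered = sorted(values)        # step 1: arrange from smallest to largest
    n = len(ordered)
    location = (n + 1) / 2          # position in the ordered list, counted from 1
    if n % 2 == 1:
        # step 2: n odd, so the median is the single center observation
        return location, ordered[int(location) - 1]
    # step 3: n even, so the median is the mean of the two center observations
    lower = ordered[n // 2 - 1]
    upper = ordered[n // 2]
    return location, (lower + upper) / 2

# Start times (days) for the 24 countries in Case 1.2.
times = [23, 4, 29, 44, 47, 24, 40, 23, 23, 44, 33, 27,
         60, 46, 61, 11, 23, 62, 31, 44, 77, 14, 65, 42]

location, M = median_by_location(times)
print(location, M)   # 12.5 36.5
```

Python's standard library offers `statistics.median`, which gives the same value for these data; the hand-written version is shown only to mirror the steps in the text.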
CASE 1.2 1.43 Include the outlier. Include Suriname, where the start time is
694 days, in the data set and show that the median is 40 days. Note that with this
case included, the sample size is now 25 and the median is the 13th observation in the
ordered list. Write out the ordered list and circle the outlier. Describe the effect of the
outlier on the median for this set of data.
Data file: TIMETOSTART25
1.44 Calls to a customer service center. The service times for 80 calls to a customer
service center are given in Table 1.1 (page 14). Use these data to compute the median
service time.
Data file: CALLCENTER80
1.45 Find the median of the first-exam scores. Here are the scores on the first exam
in an introductory statistics course for 10 students:
80 73 92 85 75 98 93 55 80 90
Find the median first-exam score for these students.
Data file: STATCOURSE
Consider the prices of existing single-family homes in the United States. The mean
price in 2007 was $266,200 while the median was $217,900. This distribution is strongly
skewed to the right. There are many moderately priced houses and a few very
expensive mansions. The few expensive houses pull the mean up but do not affect the
median.
Reports about house prices, incomes, and other strongly skewed distributions usually
give the median (midpoint) rather than the mean (arithmetic average). However, if
you are a tax assessor interested in the total value of houses in your area, use the mean.
The total is the mean times the number of houses, but it has no connection with the
median. The mean and median measure center in different ways, and both are useful.
1.46 Gross domestic product. The success of companies expanding to developing
regions of the world depends in part on the prosperity of the countries in those regions.
Here are World Bank data on the growth of gross domestic product (percent per year)
for the period 2000 to 2004 in countries in Asia (not including Japan):
Country          Growth
Bangladesh         5.2
China              9.4
Hong Kong          3.2
India              6.2
Indonesia          4.6
Korea (South)      4.7
Malaysia           4.4
Pakistan           4.1
Philippines        3.9
Singapore          2.9
Thailand           5.4
Vietnam            7.2
Data file: GDP12
4 to 694 days. Without Suriname, the range is 4 to 77 days. These largest and smallest
observations show the full spread of the data and are highly influenced by outliers.
We can improve our description of spread by also giving several percentiles. The
pth percentile of a distribution is the value such that p percent of the observations fall
at or below it. The median is just the 50th percentile, so the use of percentiles to report
spread is particularly appropriate when the median is our measure of center.
The most commonly used percentiles other than the median are the quartiles. The
first quartile is the 25th percentile, and the third quartile is the 75th percentile. That
is, the first and third quartiles show the spread of the middle half of the data. (The
second quartile is the median itself.) To calculate a percentile, arrange the observations
in increasing order and count up the required percent from the bottom of the list. Our
definition of percentiles is a bit inexact because there is not always a value with exactly
p percent of the data at or below it. We will be content to take the nearest observation
for most percentiles, but the quartiles are important enough to require an exact recipe.
The rule for calculating the quartiles uses the rule for the median.
Here is an example that shows how the rules for the quartiles work for both odd and
even numbers of observations.
CASE 1.2 Here is the ordered list of the times to start a business in our sample of 24 countries:
4 11 14 23 23 23 23 24 27 29 31 33
40 42 44 44 44 46 47 60 61 62 65 77
Data file: TIMETOSTART24
The count of observations n = 24 is even, so the median is at position (24 + 1)/2 = 12.5, that
is, between the 12th and the 13th observation in the ordered list. There are 12 cases above this
position and 12 below it. The first quartile is the median of the first 12 observations, and the third
quartile is the median of the last 12 observations. Check that $Q_1 = 23$ and $Q_3 = 46.5$.
Notice that the quartiles are resistant. For example, $Q_3$ would have the same value
if the highest start time were 770 days rather than 77 days.
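The quartile calculation in this example, and the resistance claim that follows it, can be checked with a short sketch. The function name `quartiles` is hypothetical. For an even count it takes the medians of the lower and upper halves exactly as the example does; for an odd count it leaves the overall median out of both halves, which is an assumption here because the rule box itself is not reproduced in this section.

```python
from statistics import median

def quartiles(values):
    """Return (Q1, Q3) as the medians of the lower and upper halves."""
    ordered = sorted(values)
    n = len(ordered)
    half = n // 2
    lower = ordered[:half]           # observations below the median's location
    upper = ordered[half + n % 2:]   # observations above it (median skipped if n is odd)
    return median(lower), median(upper)

# Start times (days) for the 24 countries in Case 1.2.
times = [23, 4, 29, 44, 47, 24, 40, 23, 23, 44, 33, 27,
         60, 46, 61, 11, 23, 62, 31, 44, 77, 14, 65, 42]

q1, q3 = quartiles(times)
print(q1, q3)   # 23.0 46.5

# Resistance check: change the largest value from 77 to 770 days.
stretched = [770 if t == 77 else t for t in times]
print(quartiles(stretched))   # (23.0, 46.5), unchanged
```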
There are slight differences in the methods used by software to compute percentiles.
However, the results will generally be quite similar except in cases where the sample
sizes are very small.
Be careful when several observations take the same numerical value. Write down
all the observations and apply the rules just as if they all had distinct values.
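To illustrate how software rules can differ slightly, the sketch below compares the hand rule used in this section with the two interpolation methods offered by `statistics.quantiles` in Python's standard library (Python 3.8 or later). The comparison is an illustration added here, not something the text itself carries out.

```python
from statistics import quantiles

# Start times (days) for the 24 countries in Case 1.2.
times = [23, 4, 29, 44, 47, 24, 40, 23, 23, 44, 33, 27,
         60, 46, 61, 11, 23, 62, 31, 44, 77, 14, 65, 42]

# n=4 asks for the three cut points Q1, median, Q3.
exclusive = quantiles(times, n=4, method="exclusive")
inclusive = quantiles(times, n=4, method="inclusive")

print(exclusive)   # [23.0, 36.5, 46.75]
print(inclusive)   # [23.0, 36.5, 46.25]
```

Both methods agree with the hand rule on the first quartile (23) and the median (36.5). For the third quartile they give 46.75 and 46.25, while the hand rule gives 46.5. All three values lie between the 18th and 19th ordered observations (46 and 47), so the practical difference is small.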
You can draw boxplots either horizontally or vertically. Be sure to include a
numerical scale in the graph. When you look at a boxplot, first locate the median, which marks
the center of the distribution. Then look at the spread. The quartiles show the spread of the
middle half of the data, and the extremes (the smallest and largest observations) show
the spread of the entire data set. We now have the tools for a preliminary examination of
the customer service center call lengths.
calls that we discussed in Example 1.13. The five-number summary for these data is 1.0, 54.4,
103.5, 200, 2631. The distribution is highly skewed. The mean is 197 seconds, a value that is very
close to the third quartile. The boxplot is displayed in Figure 1.14. The skewness of the distribution
is the major feature that we see in this plot. Note that the mean is marked with a "+" and appears
very close to the upper edge of the box.
Data file: CALLCENTER80
[Figure 1.14: boxplot of the 80 call lengths (n = 80); vertical scale marked at 0, 500, 1000, 1500, and 2000; the mean is marked with a +.]
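A boxplot is a picture of the five-number summary, so computing the five numbers is the first step in drawing one. The call-length data are not reproduced in this section, so the sketch below uses the 24 start times from Case 1.2 instead. The function name `five_number_summary` is hypothetical, and the quartiles follow the halves-of-the-ordered-list rule used in the quartile example.

```python
from statistics import median

def five_number_summary(values):
    """Return (minimum, Q1, median, Q3, maximum)."""
    ordered = sorted(values)
    n = len(ordered)
    half = n // 2
    lower = ordered[:half]           # observations below the median's location
    upper = ordered[half + n % 2:]   # observations above it
    return (ordered[0], median(lower), median(ordered),
            median(upper), ordered[-1])

# Start times (days) for the 24 countries in Case 1.2.
times = [23, 4, 29, 44, 47, 24, 40, 23, 23, 44, 33, 27,
         60, 46, 61, 11, 23, 62, 31, 44, 77, 14, 65, 42]

summary = five_number_summary(times)
print(summary)   # (4, 23.0, 36.5, 46.5, 77)
```

In a boxplot of these data the box would run from 23 to 46.5 with a line at 36.5, and the extremes would reach 4 and 77.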
The Environmental Protection Agency provides data on the fuel efficiencies of vehicles sold in the United
States each year.31 Figure 1.15 gives side-by-side boxplots of the miles per gallon (mpg) for four
vehicle classes: convertibles, pickup trucks, SUVs, and small cars. Small cars appear to have better
efficiency than the other three classes. Pickup trucks show less variation than the other classes; the
range of mpg values is less, and the first and third quartiles are closer together. The distributions
for SUVs and small cars show some skewness, with some vehicles having particularly good fuel
efficiency. However, note that the mean (marked with a "+") and the median are very close for all
four classes.
Data file: MPG
[Figure 1.15: side-by-side boxplots of mpg by car class (Convertible, Pickup truck, SUV, Small car); each mean is marked with a +.]
CASE 1.2 1.47 Time to start a business. Refer to the data on times to start a business
in 24 countries described in Case 1.2 on page 26. Use a boxplot to display the
distribution. Discuss the features of the data that you see in the boxplot, and compare it with
the stemplot in Figure 1.13. Which do you prefer? Give reasons for your answer.
Data file: TIMETOSTART24
1.48 First-exam scores. Here are the scores on the first exam in an introductory
statistics course for 10 students:
80 73 92 85 75 98 93 55 80 90
Display the distribution with a boxplot. Discuss whether or not a stemplot would
provide a better way to look at this distribution.
Data file: STATCOURSE
Notice that the average in the variance $s^2$ divides the sum by 1 less than the
number of observations, that is, $n - 1$ rather than $n$. The reason is that the deviations
$x_i - \bar{x}$ always sum to exactly 0, so that knowing $n - 1$ of them determines the last one.
Only $n - 1$ of the squared deviations can vary freely, and we average by dividing the
total by $n - 1$. The number $n - 1$ is called the degrees of freedom of the variance or
standard deviation. Many calculators offer a choice between dividing by $n$ and dividing
by $n - 1$, so be sure to use $n - 1$.
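The degrees-of-freedom argument can be verified numerically. The sketch below reuses the 24 start times from Case 1.2 (chosen because their mean, 37.375, is exactly representable in floating point, so the sums come out exact). It shows that the deviations sum to 0, that the last deviation is determined by the other n - 1, and how the two divisors differ.

```python
# Start times (days) for the 24 countries in Case 1.2.
times = [23, 4, 29, 44, 47, 24, 40, 23, 23, 44, 33, 27,
         60, 46, 61, 11, 23, 62, 31, 44, 77, 14, 65, 42]

n = len(times)
x_bar = sum(times) / n
deviations = [x - x_bar for x in times]

# The deviations from the mean always sum to zero ...
print(sum(deviations))                 # 0.0

# ... so the last one can be recovered from the other n - 1.
recovered_last = -sum(deviations[:-1])
print(recovered_last, deviations[-1])  # 4.625 4.625

sum_sq = sum(d * d for d in deviations)
variance = sum_sq / (n - 1)   # divide by the degrees of freedom, n - 1
divided_by_n = sum_sq / n     # the alternative some calculators offer

print(round(variance, 2), round(divided_by_n, 2))
```

Dividing by n - 1 always gives the larger of the two values; the text's instruction is to use that one.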
In practice, use software or your calculator to obtain the standard deviation from
keyed-in data. Doing an example step-by-step will help you understand how the variance
and standard deviation work, however.
hourly wages for 9 categories of law-related occupations (OCC Code 23-0000) (the units are
dollars per hour):32
75 38 27 48 23 23 20 20 26
Data file: BLSWAGES
We organize the rest of the arithmetic in a table. This is a good way to do calculations such as this
when you need to work through all the details.
The variance is the sum of the squared deviations divided by 1 less than the number of observations:
$$s^2 = \frac{2636.01}{8} = 329.5$$
The standard deviation is the square root of the variance:
$$s = \sqrt{329.5} = 18.15 \text{ dollars per hour}$$
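The same step-by-step calculation can be sketched in code for the nine hourly wages. Variable names are illustrative. Computed without intermediate rounding, the sum of squared deviations comes out as 2636.00; the 2636.01 shown above presumably reflects rounding of the individual squared deviations in the hand table, and the final results agree to the digits reported.

```python
from math import sqrt

# Mean hourly wages (dollars per hour) for the 9 law-related occupation categories.
wages = [75, 38, 27, 48, 23, 23, 20, 20, 26]

n = len(wages)
x_bar = sum(wages) / n                           # mean wage
squared_devs = [(x - x_bar) ** 2 for x in wages]
sum_sq = sum(squared_devs)                       # sum of squared deviations

s2 = sum_sq / (n - 1)   # variance: divide by n - 1 = 8
s = sqrt(s2)            # standard deviation, in dollars per hour

print(round(x_bar, 2), round(s2, 1), round(s, 2))   # 33.33 329.5 18.15
```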
More important than the details of hand calculation are the properties that determine
the usefulness of the standard deviation:
• s measures spread about the mean and should be used only when the mean is chosen
as the measure of center.
• s = 0 only when there is no spread. This happens only when all observations have
the same value. Otherwise, s is greater than zero. As the observations become more
spread out about their mean, s gets larger.
• s has the same units of measurement as the original observations. For example, if
you measure wages in dollars per hour, s is also in dollars per hour.
• Like the mean $\bar{x}$, s is not resistant. Strong skewness or a few outliers can greatly
increase s.
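Two of these properties are easy to check with `statistics.stdev` from Python's standard library, which divides by n - 1. The sketch below uses a made-up list of identical values for the "no spread" case and the start-time data from Case 1.2, with and without the 694-day value for Suriname, for the resistance case.

```python
from statistics import stdev

# Hypothetical data with no spread: every observation has the same value.
same = [20, 20, 20, 20]
print(stdev(same))   # 0.0

# Start times (days) for the 24 countries in Case 1.2.
times24 = [23, 4, 29, 44, 47, 24, 40, 23, 23, 44, 33, 27,
           60, 46, 61, 11, 23, 62, 31, 44, 77, 14, 65, 42]
times25 = times24 + [694]   # add the single outlier (Suriname)

s24 = stdev(times24)
s25 = stdev(times25)
print(round(s24), round(s25))   # 19 133
```

A single extreme observation multiplies s by roughly seven here, which is what "not resistant" means in practice.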
CASE 1.2 1.49 Time to start a business. Verify the statement in the last bullet
above using the data on the time to start a business. First, use the 24 cases from Case
1.2 (page 26) to calculate a standard deviation. Next, include the country Suriname,
where the time to start a business is 694 days. Show that the inclusion of this single
outlier increases the standard deviation from 19 to 133.
Data files: TIMETOSTART24, TIMETOSTART25
You may rightly feel that the importance of the standard deviation is not yet clear.
We will see in the next section that the standard deviation is the natural measure of
spread for an important class of symmetric distributions, the Normal distributions. The
usefulness of many statistical procedures is tied to distributions with particular shapes.
This is certainly true of the standard deviation.
Choosing a Summary
The five-number summary is usually better than the mean and standard deviation for
describing a skewed distribution or a distribution with extreme outliers. Use $\bar{x}$ and s only
for reasonably symmetric distributions that are free of outliers.
1.50 First-exam scores. Below are the scores on the first exam in an introductory
statistics course for 10 students. We found the mean of these scores in Exercise 1.42
(page 28) and the median in Exercise 1.45 (page 29).
80 73 92 85 75 98 93 55 80 90
(a) Make a stemplot of these data.
(b) Compute the standard deviation.
(c) Are the mean and the standard deviation effective in describing the distribution of
these scores? Explain your answer.
Data file: STATCOURSE
1.51 Calls to a customer service center. We displayed the distribution of the lengths
(a) Compute the mean and the standard deviation for these 80 calls (the data are
Data file: CALLCENTER80
Stocks are risky. They went up 14% per year on the average during this period, but they
dropped almost 28% in the worst year. The large standard deviation reflects the fact that
stocks have produced both large gains and large losses. When you buy a Treasury bill,
on the other hand, you are lending money to the government for one year. You know
that the government will pay you back with interest. That is much less risky than buying
stocks, so (on the average) you get a smaller return.
Are x̄ and s good summaries for distributions of investment returns? Figures 1.16(a)
and 1.16(b) display stemplots of the annual returns for both investments. You see that
returns on Treasury bills have a right-skewed distribution. Convention in the financial
world calls for x̄ and s because some parts of investment theory use them. For describing
this right-skewed distribution, however, the five-number summary would be more
informative.
Remember that a graph gives the best overall picture of a distribution. Numerical
measures of center and spread report specific facts about a distribution, but they do not
describe its entire shape. Numerical summaries do not disclose the presence of multiple
peaks or gaps, for example. Always plot your data.
A numerical summary of a distribution should report its center and its spread or
variability.
The mean x̄ and the median M describe the center of a distribution in different
ways. The mean is the arithmetic average of the observations, and the median is the
midpoint of the values.
When you use the median to indicate the center of the distribution, describe its spread
by giving the quartiles. The first quartile Q1 has one-fourth of the observations
below it, and the third quartile Q3 has three-fourths of the observations below it.
The five-number summary consisting of the median, the quartiles, and the high
and low extremes provides a quick overall description of a distribution. The median
describes the center, and the quartiles and extremes show the spread.
Boxplots based on the five-number summary are useful for comparing several distri-
butions. The box spans the quartiles and shows the spread of the central half of the
distribution. The median is marked within the box. Lines extend from the box to the
extremes and show the full spread of the data.
The variance s² and especially its square root, the standard deviation s, are common
measures of spread about the mean as center. The standard deviation s is zero when
there is no spread and gets larger as the spread increases.
A resistant measure of any aspect of a distribution is relatively unaffected by changes
in the numerical value of a small proportion of the total number of observations, no
matter how large these changes are. The median and quartiles are resistant, but the
mean and the standard deviation are not.
The mean and standard deviation are good descriptions for symmetric distributions
without outliers. They are most useful for the Normal distributions, introduced in the
next section. The five-number summary is a better exploratory summary for skewed
distributions.
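The point about resistant measures can be checked with a few lines of code. The sketch below uses a small made-up data set (not one of the text's data files) and Python's standard `statistics` module; it changes a single observation and compares how the mean, standard deviation, and median respond.

```python
import statistics

# Hypothetical data, chosen only for illustration.
data = [12, 15, 16, 18, 19, 21, 24]
# The same data with the largest value mistyped as 240 (a made-up outlier).
with_outlier = [12, 15, 16, 18, 19, 21, 240]


def summarize(values):
    """Return (mean, sample standard deviation s, median)."""
    return (
        statistics.mean(values),
        statistics.stdev(values),  # divides by n - 1, like s in the text
        statistics.median(values),
    )


mean_a, sd_a, median_a = summarize(data)
mean_b, sd_b, median_b = summarize(with_outlier)

print(f"original:     mean={mean_a:.2f}  s={sd_a:.2f}  median={median_a}")
print(f"with outlier: mean={mean_b:.2f}  s={sd_b:.2f}  median={median_b}")
```

In this example the median does not move at all, while the mean and especially s change a great deal. That is what "resistant" means in practice.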
For Exercises 1.41 and 1.42, see page 28; for 1.43 to 1.45,
see page 29; for 1.46, see page 30; for 1.47 and 1.48, see
pages 33-34; for 1.49, see page 35; and for 1.50 and 1.51, see
page 36.
1.52 Gross domestic product growth in 120 countries. The
gross domestic product (GDP) of a country is the total value of
all goods and services produced in the country. It is an important
measure of the health of a country's economy. For this exercise,
you will analyze the growth in GDP, expressed as a percent, for
120 countries.33
(a) Compute the mean and the standard deviation.
(b) Which two countries are outliers for this variable?
(c) Recompute the mean and standard deviation without the outliers.
Explain how the mean and standard deviation changed when
you deleted the outliers.
1.53 Use the resistant measures for GDP. Repeat parts (a)
and (c) of the previous exercise using the median and the quartiles.
Summarize your results and compare them with those of
1.54 Trade balance for 120 countries. Trade balance is an-
defined as the difference between the value of a country's exports
and its imports. A negative trade balance occurs when a country
imports more than it exports. Similarly, the trade balance will be
positive for a country that exports more than it imports. Note that
values of this variable are missing for five countries. In this data
(a) Describe the distribution of trade balance using the mean and
the standard deviation.
(b) Do the same using the median and the quartiles.
(c) Using only the information from parts (a) and (b), give a
description of the data.
Do not look at any graphical summaries or other numerical summaries
for this part of the exercise.
1.55 What do the trade balance graphical summaries show?
trade balance for these countries.
(b) Give the names of the countries that correspond to extreme
values in this distribution.
(c) Reanalyze the data without the outliers.
(d) Summarize what you have learned about the distribution of
the trade balance for these countries. Include appropriate graphical
and numerical summaries as well as comments about the
outliers.
Table 1.2 (page 23) for the U.S. unemployment rates for each of
(a) Find the mean and the standard deviation.
(b) Find the five-number summary.
(c) Draw a boxplot.
(d) How do you prefer to summarize these data? Include numerical
and graphical summaries and explain the reasons for your
1.57 Canadian unemployment rates. Unemployment rates for
10 Canadian provinces are given in Exercise 1.28 (page 23). Answer
the questions in the previous exercise for these data. The
U.S. data set has 50 cases while the Canadian data set has 10
cases. Discuss how this difference influences the way in which
you summarize the data.
Refer to the previous two exercises.
(a) Use side-by-side boxplots to give a graphical summary of
the two sets of unemployment rates.
(b) Use a back-to-back stemplot to compare the two sets of
rates. A back-to-back stemplot has a single stem with leaves on
the left for one group and leaves on the right for the other.
(c) Summarize the major differences and similarities between
the two sets of unemployment rates.
(d) Which graphical comparison do you prefer? Give reasons for
your answer.
1.59 Recoverable oil. The estimated amounts of recoverable
oil from 64 oil wells in the Devonian Richmond Dolomite area
of Michigan are given in Exercise 1.35 (page 25).
uct is one that is consistent and has very little variability in its
characteristics. Controlling variability can be more difficult with
agricultural products than with those that are manufactured. The
following table gives the weights, in ounces, of the 25 potatoes
sold in a 10-pound bag.
by IBM, at $59,031 million; Microsoft, at $59,007 million; GE,
at $53,086 million; and Toyota, at $34,050 million. For this exercise
you will use the brand values, reported in millions of dollars,
for the top 100 brands.
(a) Graphically display the distribution of the values of these
brands.
1.62 The alcohol content of beer. Brewing beer involves a variety
of steps that can affect the alcohol content. A Web site gives
the percent alcohol for 86 domestic brands of beer.35
(a) Use graphical and numerical summaries of your choice to
describe these data. Give reasons for your choice.
(b) The data set contains an outlier. Explain why this particular
beer is unusual and how its outlier status is related to how it is
marketed.
1.63 An outlier for alcohol content of beer. Refer to the previous
exercise.
(a) Calculate the mean with and without the outlier. Do the same
for the median. Explain how these values change when the outlier
data set also gives the calories per 12 ounces of beverage.
(a) Analyze the data and summarize the distribution of calories
for these 86 brands of beer.
(b) In Exercise 1.62 you identified one brand of beer as an outlier.
To what extent is this brand an outlier in the distribution of
2. An observation is a suspected outlier if it lies more than
1.5 × IQR below the first quartile Q1 or above the third quartile
Q3.
The stemplot in Exercise 1.31 (page 24) displays the distribution
of the percents of residents aged 65 and older in
the 50 states. Stemplots help you find the five-number summary
because they arrange the observations in increasing order.
the line by clicking below it.
(a) Add 1 additional observation without changing the median.
Where is your new point?
(b) Use the applet to convince yourself that when you add yet
another observation (there are now 7 in all), the median does not
change no matter where you put the 7th point. Explain why this
must be true.
1.71 x̄ and s are not enough. The mean x̄ and standard deviation
s measure center and spread but are not a complete description of
a distribution. Data sets with different shapes can have the same
mean and standard deviation. To demonstrate this fact, find x̄ and
s for these two small data sets. Then make a stemplot of each and
Data A: 9.14 8.14 8.74 8.77 9.26 8.10
6.13 3.10 9.13 7.26 4.74
Data B: 6.58 5.76 7.71 8.84 8.47 7.04
5.25 5.56 7.91 6.89 12.50
CASE 1.1 1.72 Returns on Treasury bills. Figure 1.16(a)
(page 37) is a stemplot of the annual returns on U.S. Treasury
bills for fifty years. (The entries are rounded to the nearest tenth
of a percent.)
(a) Use the stemplot to find the five-number summary of T-bill
returns.
(b) The mean of these returns is about 5.19%. Explain from the
shape of the distribution why the mean return is larger than the
median return.
1.73 Salary increase for the owners. Last year a small accounting
firm paid each of its five clerks $30,000, two junior
accountants $65,000 each, and the firm's owner $355,000.
(a) What is the mean salary paid at this firm? How many of the
employees earn less than the mean? What is the median salary?
17 6 12 14 20 23 9 12 16 21
The values for the other 10 cases are missing. One way to deal
with missing data is called imputation. The basic idea is that missing
values are replaced, or imputed, with values that are based on
an analysis of the data that are not missing. For a data set with a
single variable, the usual choice of a value for imputation is the
mean of the values that are not missing. The mean for this data
set is 15.
(a) Verify that the mean is 15 and find the standard deviation for
the 10 cases for which x is not missing.
(b) Create a new data set with 20 cases by setting the values
for the 10 missing cases to 15. Compute the mean and standard
(c) Summarize what you have learned about the possible effects
of this type of imputation on the mean and the standard deviation.
1.77 A different type of mean. The trimmed mean is a measure
of center that is more resistant than the mean but uses more
of the available information than the median. To compute the
5% trimmed mean, discard the highest 5% and the lowest 5% of
the observations and compute the mean of the remaining 90%.
Trimming eliminates the effect of a small number of outliers. Use
the data on the values of the top 100 brands that we studied in
this result with the value of the mean computed in the usual way.
Density curves
A density curve is a mathematical model for a distribution. Mathematical models are
idealized descriptions. They allow us to easily make many statements in an idealized
world. The statements are useful when the idealized world is similar to the real world.
The density curves that we will study give a compact picture of the overall pattern of
data. They ignore minor irregularities as well as outliers. For some situations, we are
able to capture all of the essential characteristics of a distribution with a density curve.
For other situations, our idealized model misses some important characteristics. As with
so many things in statistics, your careful judgment is needed to decide what is important
and how close is good enough.
Figure 1.17 is a histogram of the city gas mileage achieved by all 1140 motor vehicles (2009 model
year) listed in the government's annual fuel economy report.36 Superimposed on the histogram is
a density curve. The histogram shows that there are a few vehicles with very good fuel efficiency.
These are high outliers in the distribution. The distribution is somewhat skewed to the right,
reflecting the successful attempts of the auto industry to produce high-fuel-efficiency vehicles.
There is a single peak around 15 miles per gallon. Both tails fall off quite smoothly. The density
curve in Figure 1.17 is close to the histogram in many places but fails to capture some important
characteristics of the distribution displayed by the histogram.
[Figure 1.17: Histogram of city gas mileage with a density curve superimposed. Horizontal axis: miles per gallon.]
If we use a density curve that ignores vehicles that are outliers, we would capture the
main features of the distribution of fuel efficiency for 2009 vehicles. On the other hand,
we would miss the fact that some of these vehicles have been engineered to give excellent
fuel efficiency. A marketing campaign based on this outstanding performance could be
very effective for selling vehicles in an economy with high fuel prices. Be careful about
how you deal with outliers. They may be data errors or they may be the most important
feature of the distribution. Computer software cannot make this judgment. Only you
can.
Here are some details about density curves. We need these basic ideas to understand
the rest of this chapter.
Density Curve
A density curve is a curve that
is always on or above the horizontal axis and
has area exactly 1 underneath it.
A density curve describes the overall pattern of a distribution. The area under the curve
and above any range of values is the proportion of all observations that fall in that range.
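To make the two defining properties concrete, here is a minimal sketch that checks them numerically. The triangular density and the helper names `density` and `area` are assumptions made up for this illustration; they do not come from the text.

```python
def density(x):
    """A hypothetical triangular density on [0, 2] that peaks at x = 1."""
    if 0.0 <= x <= 1.0:
        return x
    if 1.0 < x <= 2.0:
        return 2.0 - x
    return 0.0  # the curve sits on the horizontal axis elsewhere


def area(f, a, b, steps=10_000):
    """Approximate the area under f between a and b with the midpoint rule."""
    width = (b - a) / steps
    return sum(f(a + (i + 0.5) * width) for i in range(steps)) * width


total = area(density, 0.0, 2.0)    # should be very close to 1
middle = area(density, 0.5, 1.5)   # proportion of observations between 0.5 and 1.5

print(f"total area = {total:.4f}")
print(f"proportion between 0.5 and 1.5 = {middle:.4f}")
```

The second number is read as a proportion: under this idealized model, about three-fourths of all observations fall between 0.5 and 1.5.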
[Figure 1.18: (a) a symmetric density curve and (b) a skewed density curve, with the mean and median marked.]
1.78 Another skewed curve. Sketch a curve similar to Figure 1.18(b) for a left-
skewed density curve. Be sure to mark the location of the mean and the median.
What about the mean? The mean of a set of observations is their arithmetic average.
If we think of the observations as weights strung out along a thin rod, the mean is the
point at which the rod would balance. This fact is also true of density curves. The mean
is the point at which the curve would balance if made of solid material.
mathematical ways of calculating the mean for any density curve, so we are able to mark the mean
as well as the median in Figure 1.18(b).
We can roughly locate the mean, median, and quartiles of any density curve by eye.
This is not true of the standard deviation. When necessary, we can once again call on
more advanced mathematics to learn the value of the standard deviation. The study of
mathematical methods for doing calculations with density curves is part of theoretical
statistics. Though we are concentrating on statistical practice, we often make use of the
results of mathematical study.
Because a density curve is an idealized description of the distribution of data, we
need to distinguish between the mean and standard deviation of the density curve and
the mean x̄ and standard deviation s computed from the actual observations. The usual
notation for the mean of an idealized distribution is μ (the Greek letter mu). We write
the standard deviation of a density curve as σ (the Greek letter sigma).
1.79 A symmetric curve. Sketch a density curve that is symmetric but has a shape
different from that of the curve in Figure 1.18(a).
1.80 A uniform distribution. Figure 1.20 displays the density curve of a uniform
distribution. The curve takes the constant value 1 over the interval from 0 to 1 and is
0 outside that range of values. This means that data described by this distribution take
values that are uniformly spread between 0 and 1. Use areas under this density curve
to answer the following questions.
(a) Why is the total area under this curve equal to 1?
(b) What percent of the observations lie above 0.8?
(c) What percent of the observations lie below 0.6?
(d) What percent of the observations lie between 0.25 and 0.75?
(e) What is the mean of this distribution?
1.81 Three curves. Figure 1.21 displays three density curves, each with three points
marked. At which of these points on each curve do the mean and the median fall?
Normal distributions
One particularly important class of density curves has already appeared in Figure 1.18(a).
These density curves are symmetric, single-peaked, and bell-shaped. They are called
Normal curves, and they describe Normal distributions. All Normal distributions have
the same overall shape. The exact density curve for a particular Normal distribution is
described by giving its mean μ and its standard deviation σ. The mean is located at
the center of the symmetric curve and is the same as the median. Changing μ without
changing σ moves the Normal curve along the horizontal axis without changing its
spread. The standard deviation σ controls the spread of a Normal curve. Figure 1.22
shows two Normal curves with different values of σ. The curve with the larger standard
deviation is more spread out.
FIGURE 1.22 Two Normal curves, showing the mean μ and the standard deviation σ.
The standard deviation σ is the natural measure of spread for Normal distributions.
Not only do μ and σ completely determine the shape of a Normal curve, but we can
locate σ by eye on the curve. Here's how. Imagine that you are skiing down a mountain
that has the shape of a Normal curve. At first, you descend at an ever-steeper angle as
you go out from the peak:
Fortunately, before you find yourself going straight down, the slope begins to grow flatter
rather than steeper as you go out and down:
The points at which this change of curvature takes place are located along the
horizontal axis at distance σ on either side of the mean μ. Remember that μ and σ
alone do not specify the shape of most distributions, and that the shape of density curves
in general does not reveal σ. These are special properties of Normal distributions.
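The change of curvature one standard deviation from the mean can be checked numerically. The sketch below is only an illustration: the mean 10 and standard deviation 2 are made-up values, and `curvature` is a hypothetical helper that estimates the second derivative of the density with a central difference. `statistics.NormalDist` is part of the Python standard library.

```python
from statistics import NormalDist


def curvature(dist, x, h=1e-3):
    """Central-difference estimate of the second derivative of the density at x."""
    return (dist.pdf(x + h) - 2 * dist.pdf(x) + dist.pdf(x - h)) / h**2


mu, sigma = 10.0, 2.0  # illustrative values, not taken from the text
curve = NormalDist(mu, sigma)

# Just inside one standard deviation the slope is still getting steeper
# (negative curvature); just outside it begins to flatten (positive curvature).
inside = curvature(curve, mu + 0.9 * sigma)
outside = curvature(curve, mu + 1.1 * sigma)

print(f"curvature at mean + 0.9 sd: {inside:+.5f}")
print(f"curvature at mean + 1.1 sd: {outside:+.5f}")
```

The sign flips between the two points, which places the change of curvature at a distance of one standard deviation from the mean.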
Why are the Normal distributions important in statistics? Here are three reasons.
First, Normal distributions are good descriptions for some distributions of real data.
Distributions that are often close to Normal include scores on tests taken by many people
(such as GMAT exams), repeated careful measurements of the same quantity (such as
measurements taken from a production process), and characteristics of biological
populations (such as yields of corn). Second, Normal distributions are good approximations to
the results of many kinds of chance outcomes, such as tossing a coin many times. Third,
and most important, many of the statistical inference procedures that we will study in
later chapters are based on Normal distributions.
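Figure 1.23 illustrates the 68-95-99.7 rule. By remembering these three numbers,
you can think about Normal distributions without constantly making detailed
calculations.
[Figure 1.23: The 68-95-99.7 rule for Normal distributions. Horizontal axis: standard deviations from the mean, from -3 to 3. The central regions within 1, 2, and 3 standard deviations contain 68%, 95%, and 99.7% of the data.]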
Two standard deviations is 0.3 ounces for this distribution. The 95 part of the 68-95-99.7
rule says that the middle 95% of 9-ounce bags weigh between 9.12 - 0.3 and 9.12 + 0.3 ounces,
that is, between 8.82 ounces and 9.42 ounces. This fact is exactly true for an exactly Normal
distribution. It is approximately true for the weights of 9-ounce bags of chips because the distribution
of these weights is approximately Normal.
The other 5% of bags have weights outside the range from 8.82 to 9.42 ounces. Because the
Normal distributions are symmetric, half of these bags are on the heavy side. So the heaviest 2.5%
of 9-ounce bags are heavier than 9.42 ounces.
The 99.7 part of the 68-95-99.7 rule says that almost all bags (99.7% of them) have weights
between μ - 3σ and μ + 3σ. This range of weights is 8.67 to 9.57 ounces.
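The ranges in this example can be reproduced in software. The following sketch assumes the bag weights are exactly Normal with mean 9.12 ounces and standard deviation 0.15 ounces (the values used in the example) and relies on `statistics.NormalDist` from the Python standard library; the helper name `central_proportion` is ours.

```python
from statistics import NormalDist

mu, sigma = 9.12, 0.15  # bag weights in ounces, as in the example
weights = NormalDist(mu, sigma)


def central_proportion(k):
    """Proportion of the distribution within k standard deviations of the mean."""
    return weights.cdf(mu + k * sigma) - weights.cdf(mu - k * sigma)


for k in (1, 2, 3):
    low, high = mu - k * sigma, mu + k * sigma
    print(f"within {k} sd: {low:.2f} to {high:.2f} oz, "
          f"proportion {central_proportion(k):.4f}")
```

The computed proportions are slightly more precise than 68%, 95%, and 99.7%; the rule rounds them to numbers that are easy to remember.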
1.82 Heights of young men. Product designers often must consider physical char-
acteristics of their target population. For example, the distribution of heights of men
aged 20 to 29 years is approximately Normal with mean 69 inches and standard de-
viation 2.5 inches. Draw a Normal curve on which this mean and standard deviation
are correctly located. (Hint: Draw the curve first, locate the points where the curvature
changes, then mark the horizontal axis.)
1.83 More on young men's heights. The distribution of heights of young men is
approximately Normal with mean 69 inches and standard deviation 2.5 inches. Use the
68-95-99.7 rule to answer the following questions.
(a) What percent of these men are taller than 74 inches?
(b) Between what heights do the middle 95% of young men fall?
(c) What percent of young men are shorter than 66.5 inches?
1.84 Test scores. Many states have programs for assessing the skills of students in
various grades. The Indiana Statewide Testing for Educational Progress (ISTEP) is
one such program.37 In a recent year, 76,531 tenth-grade Indiana students took the
English/language arts exam. The mean score was 572 and the standard deviation was
51. Assuming that these scores are approximately Normally distributed, N(572, 51),
use the 68-95-99.7 rule to give a range of scores that includes 95% of these students.
1.85 Use the 68-95-99.7 rule. Refer to the previous exercise. Use the 68-95-99.7
rule to give a range of scores that includes 99.7% of these students.
A z-score tells us how many standard deviations the original observation falls away
from the mean, and in which direction. Observations larger than the mean are pos-
itive when standardized, and observations smaller than the mean are negative when
standardized.
$$z = \frac{\text{weight} - 9.12}{0.15}$$
A bag's standardized weight is the number of standard deviations by which its weight differs from
the mean weight of all bags. A bag weighing 9.3 ounces, for example, has standardized weight
$$z = \frac{9.3 - 9.12}{0.15} = 1.2$$
or 1.2 standard deviations above the mean. Similarly, a bag weighing 8.7 ounces has standardized
weight
$$z = \frac{8.7 - 9.12}{0.15} = -2.8$$
or 2.8 standard deviations below the mean bag weight.
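The same standardization is a one-line function in code. This is a minimal sketch; the function name `standardize` is our own choice, and the mean 9.12 and standard deviation 0.15 are the values from the example above.

```python
def standardize(x, mu, sigma):
    """Return the z-score: how many standard deviations x lies from the mean."""
    return (x - mu) / sigma


mu, sigma = 9.12, 0.15  # mean and standard deviation of bag weights (ounces)

z_heavy = standardize(9.3, mu, sigma)  # a bag heavier than average
z_light = standardize(8.7, mu, sigma)  # a bag lighter than average

print(f"9.3 oz bag: z = {z_heavy:.1f}")
print(f"8.7 oz bag: z = {z_light:.1f}")
```

The sign of the result carries the direction: positive for observations above the mean, negative for observations below it.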
1.86 SAT versus ACT. Eleanor scores 680 on the Mathematics part of the SAT. The
distribution of SAT scores in a reference population is Normal, with mean 500 and
standard deviation 100. Gerald takes the American College Testing (ACT) Mathematics
test and scores 27. ACT scores are Normally distributed with mean 18 and standard
deviation 6. Find the standardized scores for both students. Assuming that both tests
measure the same kind of ability, who has the higher score?
[Figure: area to the right of 820 = total area under the curve - area to the left of 820]
That is, the proportion of all SAT takers who would be NCAA qualifiers is 0.8379, or about 84%.
There is no area under a smooth curve and exactly over the point 820. Consequently,
the area to the right of 820 (the proportion of scores > 820) is the same as the area at or
to the right of this point (the proportion of scores ≥ 820). The actual data may contain a
student who scored exactly 820 on the SAT. That the proportion of scores exactly equal
to 820 is 0 for a Normal distribution is a consequence of the idealized smoothing of
Normal distributions for data.
all students who take the SAT would be partial qualifiers? That is, what proportion have scores
between 720 and 820? Here are the pictures:
$$\text{area between 720 and 820} = \text{area left of 820} - \text{area left of 720}$$
$$0.0905 = 0.1621 - 0.0716$$
About 9% of all students who take the SAT have scores between 720 and 820.
How do we find the numerical values of the areas in Examples 1.33 and 1.34? If
you use software, just plug in mean 1026 and standard deviation 209. Then ask for the
cumulative proportions for 820 and for 720. (Your software will probably refer to these
as cumulative probabilities. We will learn in Chapter 4 why the language of probability
fits.) If you make a sketch of the area you want, you will rarely go wrong.
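As one possible way to "plug in" the mean and standard deviation, the sketch below uses `statistics.NormalDist` from the Python standard library. The text does not name a particular package, so the choice of tool here is an assumption; the variable names are ours.

```python
from statistics import NormalDist

# SAT scores modeled as Normal with mean 1026 and standard deviation 209.
sat = NormalDist(mu=1026, sigma=209)

below_820 = sat.cdf(820)  # cumulative proportion for 820
below_720 = sat.cdf(720)  # cumulative proportion for 720

at_least_820 = 1 - below_820     # area to the right of 820
between = below_820 - below_720  # area between 720 and 820

print(f"proportion below 820:       {below_820:.4f}")
print(f"proportion below 720:       {below_720:.4f}")
print(f"proportion 820 or higher:   {at_least_820:.4f}")
print(f"proportion between 720-820: {between:.4f}")
```

Software reports only cumulative proportions (areas to the left), so areas to the right or between two values are obtained by subtraction, just as in the pictures.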
You can use the Normal Curve applet on the text CD and Web site to find Normal
proportions. The applet is more flexible than most software: it will find any Normal
proportion, not just cumulative proportions. The applet is an excellent way to understand
Normal curves. But, because of the limitations of Web browsers, the applet is not as
accurate as statistical software.
If you are not using software, you can find cumulative proportions for Normal curves
from a table. That requires an extra step, as we now explain.
Now that you see how Table A works, let's redo the NCAA Examples 1.33 and 1.34
using the table.
The area from the table in Example 1.36 (0.8389) is slightly less accurate than the
area from software in Example 1.33 (0.8379) because we must round z to two places
when we use Table A. The difference is rarely important in practice.
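The small discrepancy can be traced to the rounding step. The sketch below mimics a table lookup by rounding z to two decimal places before asking for the cumulative proportion; using `statistics.NormalDist` in place of Table A is our assumption, made only to show where the difference comes from.

```python
from statistics import NormalDist

standard = NormalDist()  # standard Normal: mean 0, standard deviation 1

z_exact = (820 - 1026) / 209  # standardized value of 820
z_table = round(z_exact, 2)   # a table lists z only to two decimal places

from_software = 1 - standard.cdf(z_exact)
from_table = 1 - standard.cdf(z_table)

print(f"z exact = {z_exact:.4f}, z rounded = {z_table:.2f}")
print(f"area right of 820, unrounded z: {from_software:.4f}")
print(f"area right of 820, rounded z:   {from_table:.4f}")
```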
1.87 Find the proportion. Use the fact that the ISTEP scores from Exercise 1.84
(page 48) are approximately Normal, N(572, 51). Find the proportion of students who
have scores less than 600. Find the proportion of students who have scores greater than
or equal to 600. Sketch the relationship between these two calculations using pictures
of Normal curves similar to the ones given in Example 1.33.
1.88 Find another proportion. Use the fact that the ISTEP scores are approximately
Normal, N(572, 51). Find the proportion of students who have scores between 600
and 650. Use pictures of Normal curves similar to the ones given in Example 1.34 to
illustrate your calculations.
[Figure: Normal curve with the mean marked at x = 505 (z = 0) and the unknown x marked at z = 1.28.]
Without software, first find the standard score z with cumulative proportion 0.9, then
unstandardize to find x. Here is the two-step process:
1. Use the table. Look in the body of Table A for the entry closest to 0.9. It is 0.8997.
This is the entry corresponding to z = 1.28. So z = 1.28 is the standardized value
with area 0.9 to its left.
2. Unstandardize to transform the solution from z back to the original x scale. We
know that the standardized value of the unknown x is z = 1.28. So x itself satisfies
$$\frac{x - 505}{110} = 1.28$$
Solving this equation for x gives
$$x = 505 + (1.28)(110) = 645.8$$
This equation should make sense: it finds the x that lies 1.28 standard deviations
above the mean on this particular Normal curve. That is the unstandardized
meaning of z = 1.28. The general rule for unstandardizing a z-score is
$$x = \mu + z\sigma$$
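The two-step process can be mirrored in software. This sketch assumes the same Normal distribution with mean 505 and standard deviation 110 and uses `inv_cdf` from Python's `statistics.NormalDist`, which returns the value with a given cumulative proportion; the variable names are ours.

```python
from statistics import NormalDist

scores = NormalDist(mu=505, sigma=110)

# Step 1: the standard score with cumulative proportion 0.9.
z = NormalDist().inv_cdf(0.9)

# Step 2: unstandardize with x = mu + z * sigma.
x = scores.mean + z * scores.stdev

# Software can also do both steps at once.
x_direct = scores.inv_cdf(0.9)

# The table-based answer uses z rounded to 1.28.
x_table = 505 + 1.28 * 110

print(f"z = {z:.4f}")
print(f"x from two steps = {x:.1f}, x direct = {x_direct:.1f}, x from table = {x_table:.1f}")
```

The software answer differs from the table answer only in the first decimal place, because the table forces z to be rounded to two places.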
1.89 What score is needed to be in the top 5%? Consider the ISTEP scores, which
are approximately Normal, N(572, 51). How high a score is needed to be in the top
5% of students who take this exam?
1.90 Find the score that 60% of students will exceed. Consider the ISTEP scores,
which are approximately Normal, N(572, 51). Sixty percent of the students will score
above x on this exam. Find x.
Some software calls these graphs Normal probability plots. There is a technical distinction between the two
types of graphs, but the terms are often used loosely.
Here is the idea of a simple version of a Normal quantile plot. It is not feasible
to make Normal quantile plots by hand, but software makes them for us, using more
sophisticated versions of this basic idea.
1. Arrange the observed data values from smallest to largest. Record what percentile
of the data each value occupies. For example, the smallest observation in a set of 20
is at the 5% point, the second smallest is at the 10% point, and so on.
2. Find the same percentiles for the Normal distribution using Table A or statistical
software. Percentiles of the standard Normal distribution are often called Normal
scores. For example, z = -1.645 is the 5% point of the standard Normal
distribution, and z = -1.282 is the 10% point.
3. Plot each data point x against the corresponding Normal score z. If the data
distribution is close to standard Normal, the plotted points will lie close to the
45-degree line x = z. If the data distribution is close to any Normal distribution, the
plotted points will lie close to some straight line.
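The three steps above can be sketched in a few lines. Two assumptions are ours, not the text's: the percentile assigned to the i-th smallest of n observations is (i - 0.5)/n rather than i/n, a common adjustment that keeps the largest observation away from the 100% point (which has no finite Normal score), and the data are the ten first-exam scores from Exercise 1.50, used only as an illustration.

```python
from statistics import NormalDist


def normal_scores(n):
    """Normal scores for n ordered observations, using percentiles (i - 0.5) / n."""
    standard = NormalDist()
    return [standard.inv_cdf((i - 0.5) / n) for i in range(1, n + 1)]


# Step 1: arrange the data from smallest to largest.
scores = sorted([80, 73, 92, 85, 75, 98, 93, 55, 80, 90])

# Steps 2 and 3: pair each ordered observation with its Normal score.
z_values = normal_scores(len(scores))
pairs = list(zip(z_values, scores))

for z, x in pairs:
    print(f"z = {z:+.3f}   x = {x}")

# The percentile examples quoted in step 2.
five_percent_point = NormalDist().inv_cdf(0.05)
ten_percent_point = NormalDist().inv_cdf(0.10)
print(f"5% point: {five_percent_point:.3f}, 10% point: {ten_percent_point:.3f}")
```

Plotting the second column against the first gives a Normal quantile plot. Statistical software uses refinements of this idea, so its Normal scores may differ slightly.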
Any Normal distribution produces a straight line on the plot because standardizing
turns any Normal distribution into a standard Normal distribution. Standardizing is a
linear transformation that can change the slope and intercept of the line in our plot but
cannot turn a line into a curved pattern.
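A quick numerical check of this linearity is sketched below. The Normal distribution with mean 100 and standard deviation 15 is a made-up example; for any percentile, its value x should equal 100 + 15z, where z is the standard Normal score for the same percentile.

```python
from statistics import NormalDist

standard = NormalDist()               # standard Normal
other = NormalDist(mu=100, sigma=15)  # illustrative values, not from the text

percentiles = [0.05, 0.25, 0.50, 0.75, 0.95]
z_values = [standard.inv_cdf(p) for p in percentiles]
x_values = [other.inv_cdf(p) for p in percentiles]

# Largest distance of any (z, x) point from the straight line x = 100 + 15 z.
max_gap = max(abs(x - (100 + 15 * z)) for x, z in zip(x_values, z_values))

for p, z, x in zip(percentiles, z_values, x_values):
    print(f"{p:.0%} point: z = {z:+.3f}, x = {x:.2f}")
print(f"largest deviation from the line: {max_gap:.2e}")
```

The line has slope equal to the standard deviation and intercept equal to the mean, which is why the slope and intercept of a Normal quantile plot change from one Normal distribution to another while the straight-line pattern does not.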
Figures 1.28 to 1.31 are Normal quantile plots for data we have met earlier. The data
x are plotted vertically against the corresponding Normal scores z plotted horizontally.
For small data sets, the z axis extends from −3 to 3 because almost all of a standard
Normal curve lies between these values. With larger sample sizes, values in the extremes
are more likely, and the z axis will extend farther from zero. These figures show how
Normal quantile plots behave.
Data file: IQ
In Example 1.18 we examined the distribution of IQ scores for a sample of 60 fifth-grade students.
Figure 1.28 gives a Normal quantile plot for these data. Notice that the points have a pattern
that is pretty close to a straight line. This pattern indicates that the distribution is approximately
Normal. When we constructed a histogram of the data in Figure 1.11 (page 18), we noted that
the distribution has a single peak, is approximately symmetric, and has tails that decrease in a
smooth way. We can now add to that description by stating that the distribution is approximately
Normal.
Figure 1.28 does, of course, show some deviation from a straight line. Real data
almost always show some departure from the theoretical Normal model. It is important
to confine your examination of a Normal quantile plot to searching for shapes that show
clear departures from Normality. Don't overreact to minor wiggles in the plot. When we
discuss statistical methods that are based on the Normal model, we will pay attention to
the sensitivity of each method to departures from Normality. Many common methods
work well as long as the data are reasonably symmetric and outliers are not present.
[Figure 1.28: Normal quantile plot of the IQ scores. Vertical axis: IQ; horizontal axis: Normal score.]
Data file: TBILLRATES
CASE 1.1 We made a histogram for the distribution of interest rates for T-bills in Example 1.12 (page 12).
A Normal quantile plot for these data is shown in Figure 1.29. This plot shows some interesting
features of the distribution. First, in the central part, from about z = −2 to z = 1, the points fall
approximately on a straight line. This suggests that the distribution is approximately Normal in
this range. Then there is the region from slightly above z = 1 to slightly above z = 2, where
the points also fall approximately on a straight line. This line, however, has a different slope.
Combined, these features suggest that the distribution of interest rates may actually be a mixture
or a combination of two Normal populations. Finally, in both the lower and the upper extremes
the points flatten out. This occurs at an interest rate of around 1% for the lower tail and at 15% for
the upper tail. There may be some market considerations that restrain interest rates from going
outside these bounds.
[Figure 1.29: Normal quantile plot of the T-bill interest rates. Horizontal axis: Normal score.]
The idea that distributions are approximately Normal within a range of values is an
old tradition. The remark "All distributions are approximately Normal in the middle"
has been attributed to the statistician Charlie Winsor.38
Data file: TIMETOSTART25
CASE 1.2 1.91 Length of time to start a business. In Exercise 1.40 we noted that
the sample of times to start a business from 25 countries contained an outlier. For
Suriname, the reported time is 694 days. This case is the most extreme in the entire
data set, which includes 195 countries. Figure 1.30 shows the Normal quantile plot for [...]
Data file: CALLCENTER
[...] for the customer center call lengths. We looked at these data in Example 1.14, and we
examined the distribution using a histogram in Figure 1.8 (page 14). There are clearly
some very large outliers. In making the Normal quantile plot, we eliminated all calls
that lasted longer than 2 hours (7200 seconds). This distribution is strongly skewed to
the right. How does this show up in the Normal quantile plot?
[Figure 1.30: Normal quantile plot for Exercise 1.91. Horizontal axis: Normal score.]
[Figure 1.31: Normal quantile plot. Horizontal axis: Normal score.]
Data file: MPG2009
Figure 1.32 gives the histogram of the miles per gallon distribution with a density estimate produced
by software. Compare this figure with Figure 1.17 (page 41). Notice how the density estimate
captures more of the unusual features of the distribution than the Normal density curve does.
[Figure 1.32: Histogram of miles per gallon with a density estimate. Vertical axis: Percent; horizontal axis: Miles per gallon.]
Data file: STUBHUB
[...] wanting to sell their tickets provide the location of their seats and the selling price. People wanting
to buy tickets can choose from among the tickets offered for a given event.39
On Saturday, October 18, 2008, the eleventh-ranked Missouri football team was scheduled
to play the first-ranked Texas team in Austin. On Thursday, October 16, 2008, StubHub! listed
64 pairs of tickets for the game. One pair was offered at $883 per ticket. It was noted that these
seats were in a suite and that food and bar were included. We discarded this outlier and examined
the distribution of the price per ticket for the remaining 63 pairs of tickets. The histogram with
a density estimate is given in Figure 1.33. The distribution has two peaks, one around $160 and
another around $360. This is the identifying characteristic of a bimodal distribution. Since the
stadium has upper- and lower-level seats, we suspect that the difference in price between these
two types of seats is responsible for the two peaks. (Texas won 56 to 31.)
[Figure 1.33: Histogram of price per ticket with a density estimate. Horizontal axis: Price ($).]
For Exercise 1.78, see page 43; for 1.79 to 1.81, see pages 44–45; for 1.82 to 1.85, see page 48;
for 1.86, see page 49; for 1.87 and 1.88, see page 53; for 1.89 and 1.90, see page 54; and for
1.91 and 1.92, see page 57.
1.93 Sketch some Normal curves.
(a) Sketch a Normal curve that has mean 10 and standard deviation 3.
(b) On the same x axis, sketch a Normal curve that has mean 20 and standard deviation 3.
(c) How does the Normal curve change when the mean is varied but the standard deviation stays
the same?
1.94 The effect of changing the standard deviation.
(a) Sketch a Normal curve that has mean 10 and standard deviation 3.
(b) On the same x axis, sketch a Normal curve that has mean 10 and standard deviation 1.
(c) How does the Normal curve change when the standard deviation is varied but the mean stays
the same?
1.95 Know your density. Sketch density curves that might describe distributions with the
following shapes.
(a) Symmetric, but with two peaks (that is, two strong clusters of observations).
(b) Single peak and skewed to the left.
1.96 Gross domestic product. Refer to Exercise 1.52, where we examined the gross domestic
product of 120 countries.
Data file: COUNTRIES120
(a) Compute the mean and the standard deviation.
(b) Apply the 68–95–99.7 rule to this distribution.
(c) Compare the results of the rule with the actual percents within one, two, and three standard
deviations of the mean.
(d) Summarize your conclusions.
1.97 Do women talk more? Conventional wisdom suggests that women are more talkative than men.
One study designed to examine this stereotype collected data on the speech of 42 women and
37 men in the United States.40
(a) The mean number of words spoken per day by the women was 14,297 with a standard deviation
of 9065. Use the 68–95–99.7 rule to describe this distribution.
(b) Do you think that applying the rule in this situation is reasonable? Explain your answer.
(c) The men averaged 14,060 words per day with a standard deviation of 9056. Answer the
questions in parts (a) and (b) for the men.
(d) Do you think that the data support the conventional wisdom? Explain your answer. Note that
in Section 7.2 we will learn formal statistical methods to answer this type of question.
1.98 Data from Mexico. Refer to the previous exercise. A similar study in Mexico was conducted
with 31 women and 20 men. The women averaged 14,704 words per day with a standard deviation
of 6215. For men the mean was 15,022 and the standard deviation was 7864.
(a) Answer the questions from the previous exercise for the Mexican study.
(b) The means for both men and women are higher for the Mexican study than for the U.S. study.
What conclusions can you draw from this observation?
1.99 Total scores. Below are the total scores of 10 students in an introductory statistics course:
Data file: STATCOURSE
68 54 92 75 73 98 64 55 80 70
Previous experience with this course suggests that these scores should come from a distribution
that is approximately Normal with mean 70 and standard deviation 10.
(a) Using these values for μ and σ, standardize the scores of these 10 students.
(b) If the grading policy is to give a grade of A to the top 15% of scores based on the Normal
distribution with mean 70 and standard deviation 10, what is the cutoff for an A in terms of a
standardized score?
(c) Which students earned an A for this course?
1.100 Assign more grades. Refer to the previous exercise. The grading policy says that the
cutoffs for the other grades correspond to the following: the bottom 5% receive an F, the next
10% receive a D, the next 40% receive a C, and the next 30% receive a B. These cutoffs are based
on the N(70, 10) distribution.
(a) Give the cutoffs for the grades in terms of standardized scores.
(b) Give the cutoffs in terms of actual scores.
(c) Do you think that this method of assigning grades is a good one? Give reasons for your answer.
1.101 [...]
Data file: APARTMENTS
(a) Use statistical software to obtain histograms and Normal quantile plots of selling prices and
building square footages.
(b) Do either of these variables appear to be Normally distributed? Explain in what way the plots
match (or don't match) what you would expect to see for Normally distributed data.
(c) One apartment building appears to be an outlier with respect to both selling price and square
footage. Report the selling price and square footage for this apartment building.
1.102 Selling apartment buildings. Continue with the data from the previous exercise. Create a
new variable (call it Sale Price Per Sqft) by dividing the selling price for each apartment
building by the square footage for each apartment building.
Data file: APARTMENTS
(a) When plotting selling prices or building square footages, one apartment building stands out
as an outlier. Does this same apartment building stand out in terms of the new variable you
created for this exercise? Explain your response clearly.
(b) Use statistical software to obtain a histogram and a Normal quantile plot of the new variable
Sale Price Per Sqft.
(c) Does the distribution of Sale Price Per Sqft appear to be Normal? Describe precisely what
about the histogram and the Normal quantile plot leads you to your conclusion.
1.103 Selling apartment buildings. Continue with the variable Sale Price Per Sqft created in the
previous exercise.
Data file: APARTMENTS
(a) Calculate the mean and standard deviation of the Sale Price Per Sqft values.
(b) [...]
(c) Create a table that allows one to easily compare the distribution of Sale Price Per Sqft with
the 68–95–99.7 rule for the three intervals calculated in part (b).
(d) Does your table from part (c) provide a clear indication of Normality (or non-Normality) for
the data values?
1.104 Exploring Normal quantile plots.
(a) Create three data sets: one that is clearly skewed to the right, one that is clearly skewed
to the left, and one that is clearly symmetric and mound-shaped. (As an alternative to creating
data sets, you can look through this chapter and find an example of each type of data set
requested.)
(b) Using statistical software, obtain Normal quantile plots for each of your three data sets.
(c) Clearly describe the pattern of each data set in the Normal quantile plots from part (b).
The table below contains data on a random sample of 22 telecom stocks, companies that specialize
in telecommunication products. For each company, trading volume and revenue growth (over the
last year) have been reported. Exercises 1.105 to 1.108 concern these data.42
Stock    Trading volume    Revenue growth
ALLN              3,500    [...]
ATGN              5,650    0.1514
AVCI             68,482    0.2580
AXE              85,900    0.0739
CGN                 100    0.1098
COVD          2,410,204    0.0166
CTV             254,600    0.0437
CYBD              6,900    0
ETCIA             1,741    0.2391
GCOM             27,392    0.4337
HLIT            690,026    0.1765
PCTU              6,500    0.2898
QCOM          6,696,185    0.2001
SRTI              2,000    0.0006
TCCO              1,100    0.0856
TKLC            246,101    0.0009
VERA             25,000    0.0081
WJCI             59,408    0.3544
XXIA          1,750,027    0.1930
ZOOM             21,295    0.1298
1.105 Telecom shares traded.
Data file: TELECOMSTOCKS
(a) Calculate the mean and standard deviation of the 22 trading-volume values.
(b) Calculate x̄ ± 3s.
(c) Clearly explain why your calculations in part (b) show that the distribution of trading
volume is not symmetric and mound-shaped.
1.106 Telecom revenue growth. [...]
Data file: TELECOMSTOCKS
1.107 [...]
Data file: TELECOMSTOCKS
(a) Use statistical software to create a histogram of the trading volumes for these 22 telecom
stocks.
(b) The histogram shows that these data are clearly right-skewed. Sketch what you think a Normal
quantile plot of these data will look like.
(c) Use statistical software to create a Normal quantile plot of these data. How well does your
sketch from part (b) match the plot generated by your software?
1.108 [...]
Data file: TELECOMSTOCKS
(a) Construct a stemplot of the revenue growth for these 22 telecom stocks. You will need a −0
and a 0 on the stem. Use the tenths place of these values on the stem and the hundredths place
as the leaves. For example, 0.556 rounds to 0.56 and would appear as 5|6 in the stemplot.
(b) Describe the distribution of these revenue growth values. Sketch what you think a Normal
quantile plot of these data will look like.
(c) Use statistical software to create a Normal quantile plot of these data. How well does your
sketch from part (b) match the plot generated by your software?
1.109 Visualizing the standard deviation. Figure 1.34 shows two Normal curves, both with mean 0.
Approximately what is the standard deviation of each of these curves?
1.110 Length of pregnancies. Some health insurance companies treat pregnancy as a preexisting
condition when it comes to paying for maternity expenses for a new policyholder. Sometimes the
exact date of conception is unknown, so the insurance company must count back from the expected
due date to judge whether or not conception occurred before or after the new policy began. The
length of human pregnancies from conception to birth varies according to a distribution that is
approximately Normal with mean 266 days and standard deviation 16 days. Use [...] began?
1.111 Use Table A. Use Table A to find the proportion of observations from a standard Normal
distribution that falls in each of the following regions. In each case, sketch a standard Normal
curve and shade the area representing the region.
(a) z ≤ −2.30
(b) z ≥ −2.30
(c) z > 1.70
(d) −2.30 < z < 1.70
1.112 Use Table A. Use Table A to find the value of z for each of the situations below. In each
case, sketch a standard Normal curve and shade the area representing the region.
(a) Ten percent of the values of a standard Normal distribution are greater than z.
(b) Ten percent of the values of a standard Normal distribution are greater than or equal to z.
(c) Ten percent of the values of a standard Normal distribution are less than z.
(d) Fifty percent of the values of a standard Normal distribution are less than z.
1.113 Use Table A. Consider a Normal distribution with mean 100 and standard deviation 10.
(a) Find the proportion of the distribution with values between 90 and 105. Illustrate your
calculation with a sketch.
(b) Find the values of x1 and x2 such that the values between x1 and x2 make up the central
85% of the distribution. Illustrate your calculation with a sketch.
1.114 Length of pregnancies. The length of human pregnancies from conception to birth varies
according to a distribution that is approximately Normal with mean 266 days and standard
deviation 16 days.
(a) What percent of pregnancies last fewer than 240 days (that's about 8 months)?
(b) What percent of pregnancies last between 240 and 270 days (roughly between 8 and 9 months)?
(c) How long do the longest 25% of pregnancies last?
1.115 Quartiles of Normal distributions. The median of any Normal distribution is the same as
its mean. We can use Normal calculations to find the quartiles for Normal distributions.
(a) What is the area under the standard Normal curve to the left of the first quartile? Use this
to find the value of the first quartile for a standard Normal distribution. Find the third
quartile similarly.
(b) Your work in (a) gives the Normal scores z for the quartiles of any Normal distribution. What
are the quartiles for the lengths of human pregnancies? (Use the distribution given in the
previous exercise.)
1.116 Deciles of Normal distributions. The deciles of any distribution are the 10th, 20th, . . . ,
90th percentiles. The first and last deciles are the 10th and 90th percentiles, respectively.
(a) What are the first and last deciles of the standard Normal distribution?
(b) The weights of 9-ounce potato chip bags are approximately Normal with mean 9.12 ounces and
standard deviation 0.15 ounce. What are the first and last deciles of this distribution?
1.117 Normal random numbers. Use software to generate 100 observations from the standard Normal
distribution. Make a histogram of these observations. How does the shape of the histogram compare
with a Normal density curve? Make a Normal quantile plot of the data. Does the plot suggest any
important deviations from Normality? (Repeating this exercise several times is a good way to
become familiar with how Normal quantile plots look when data actually are close to Normal.)
1.118 Uniform random numbers. Use software to generate 100 observations from the distribution
described in Exercise 1.80 (page 44). (The software will probably call this a uniform
distribution.) Make a histogram of these observations. How does the histogram compare with the
density curve in Figure 1.20? Make a Normal quantile plot of your data. According to this plot,
how does the uniform distribution deviate from Normality?
STATISTICS IN SUMMARY
Data analysis is the art of describing data using graphs and numerical summaries. The
purpose of data analysis is to describe the most important features of a set of data. This
chapter introduces data analysis by presenting statistical ideas and tools for describing
the distribution of a single variable. The Statistics in Summary figure below will help
you organize the big ideas. The question marks at the last two stages remind us that the
usefulness of numerical summaries and models such as Normal distributions depends on
what we find when we examine the data using graphs. Here is a review list of the most
important skills you should have acquired from your study of this chapter.
[Statistics in Summary figure. Recoverable labels: Numerical summary? (x̄ and s, five-number summary); Mathematical model? (Normal distribution?)]
A. Data
1. Identify the cases and variables in a set of data.
2. Identify each variable as categorical or quantitative. Identify the units in which
each quantitative variable is measured.
B. Displaying Distributions
1. Make a bar graph, pie chart, and/or Pareto chart of the distribution of a
categorical variable. Interpret bar graphs, pie charts, and Pareto charts.
2. Make a histogram of the distribution of a quantitative variable.
3. Make a stemplot of the distribution of a small set of observations. Round leaves
or split stems as needed to make an effective stemplot.
C. Inspecting Distributions (Quantitative Variable)
1. Look for the overall pattern and for major deviations from the pattern.
2. Assess from a histogram or stemplot whether the shape of a distribution is
roughly symmetric, distinctly skewed, or neither. Assess whether the
distribution has one or more major peaks.
3. Describe the overall pattern by giving numerical measures of center and spread
in addition to a verbal description of shape.
4. Decide which measures of center and spread are more appropriate: the mean and
standard deviation (especially for symmetric distributions) or the five-number
summary (especially for skewed distributions).
5. Recognize outliers.
D. Time Plots
1. Make a time plot of data, with the time of each observation on the horizontal
axis and the value of the observed variable on the vertical axis.
2. Recognize patterns in a time plot.
E. Measuring Center
1. Find the mean x̄ of a set of observations.
2. Find the median M of a set of observations.
3. Understand that the median is more resistant (less affected by extreme
observations) than the mean. Recognize that skewness in a distribution moves
the mean away from the median toward the long tail.
F. Measuring Spread
1. Find the quartiles Q1 and Q3 for a set of observations.
2. Give the five-number summary and draw a boxplot; assess center, spread,
symmetry, and skewness from a boxplot.
3. Using a calculator or software, find the standard deviation s for a set of
observations.
4. Know the basic properties of s: s ≥ 0 always; s = 0 only when all observations
are identical, and s increases as the spread increases; s has the same units as the
original measurements; s is pulled strongly up by outliers or skewness.
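The skills in parts E and F can be tried out with a short Python sketch. It is only an illustration: the data values and the function name are hypothetical, and the quartiles are taken as the medians of the observations below and above the position of the overall median, which is one of several conventions that software may use.

```python
from statistics import mean, median, stdev


def five_number_summary(data):
    """Return (minimum, Q1, median, Q3, maximum).

    Assumption: Q1 and Q3 are the medians of the lower and upper halves,
    leaving out the middle observation when the count is odd.
    """
    xs = sorted(data)
    n = len(xs)
    half = n // 2
    lower = xs[:half]             # observations below the median's position
    upper = xs[half + (n % 2):]   # observations above the median's position
    return (xs[0], median(lower), median(xs), median(upper), xs[-1])


# Hypothetical observations, for illustration only.
values = [2, 4, 4, 5, 7, 9, 10]
with_outlier = values + [60]      # the same data plus one extreme value

for label, data in (("original", values), ("with outlier", with_outlier)):
    print(label)
    print("  five-number summary:", five_number_summary(data))
    print("  mean:", round(mean(data), 3), " median:", median(data))
    print("  standard deviation s:", round(stdev(data), 3))
```

Adding the single extreme value moves the mean and s far more than it moves the median and the quartiles, which is the sense in which the median is resistant.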
G. Density Curves
1. Know that areas under a density curve represent proportions of all observations
and that the total area under a density curve is 1.
2. Approximately locate the median (equal-areas point) and the mean (balance
point) on a density curve.
3. Know that the mean and median both lie at the center of a symmetric density
curve and that the mean moves farther toward the long tail of a skewed curve.
H. Normal Distributions
1. Recognize the shape of Normal curves and be able to estimate by eye both the
mean and the standard deviation from such a curve.
2. Use the 68–95–99.7 rule and symmetry to state what percent of the observations
from a Normal distribution fall between two points when the points lie one, two,
or three standard deviations on either side of the mean.
3. Find the standardized value (z-score) of an observation. Interpret z-scores and
understand that any Normal distribution becomes standard Normal N (0, 1)
when standardized.
4. Given that a variable has the Normal distribution with a stated mean μ and
standard deviation σ, calculate the proportion of values above a stated number,
below a stated number, or between two stated numbers.
5. Given that a variable has the Normal distribution with a stated mean μ and
standard deviation σ, calculate the point having a stated proportion of all values
above it. Also calculate the point having a stated proportion of all values below it.
6. Assess the Normality of a set of data by inspecting a Normal quantile plot.
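The Normal calculations in items 2 to 5 above can be checked with software as well as with Table A. The sketch below uses `NormalDist` from Python's standard library. The model N(500, 100) and the helper names are hypothetical, chosen only to illustrate the calculations.

```python
from statistics import NormalDist

# Hypothetical Normal model N(mu, sigma), for illustration only.
mu, sigma = 500.0, 100.0
model = NormalDist(mu, sigma)


def z_score(x):
    """Standardized value of x under the model."""
    return (x - mu) / sigma


def proportion_below(x):
    """Proportion of values less than x."""
    return model.cdf(x)


def proportion_above(x):
    """Proportion of values greater than x."""
    return 1 - model.cdf(x)


def proportion_between(a, b):
    """Proportion of values between a and b (a < b)."""
    return model.cdf(b) - model.cdf(a)


def value_with_proportion_above(p):
    """The point that has proportion p of all values above it."""
    return model.inv_cdf(1 - p)


# The 68-95-99.7 rule: proportions within 1, 2, and 3 standard deviations.
for k in (1, 2, 3):
    print(k, round(proportion_between(mu - k * sigma, mu + k * sigma), 4))

print("z-score of 650:", z_score(650))
print("cutoff for the top 10%:", round(value_with_proportion_above(0.10), 1))
```

Working backward from a proportion to a value, as in the last function, is the software counterpart of reading Table A from the body of the table out to z.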
1.119 Identify the histograms. A survey of a large college class asked the following questions:
(a) Are you female or male? (In the data, male = 0, female = 1.)
(b) Are you right-handed or left-handed? (In the data, right = 0, left = 1.)
(c) What is your height in inches?
(d) How many minutes do you study on a typical weeknight?
Figure 1.35 shows histograms of the student responses, in scrambled order and without scale
markings. Which histogram goes with each variable? Explain your reasoning.
1.120 How much does it cost to make a movie? Making movies is a very expensive activity and many
cost more than they earn. On the other hand, enormous profits are also a possibility. For this
exercise you will analyze the budgets for 160 films made [...]
Data file: BOXOFFICE160
(a) Examine the distribution of the budgets for these 160 films graphically. Describe key
features of the distribution.
(b) Plot the budgets versus time. Describe any patterns that you see.
(c) Provide appropriate numerical summaries for the budgets of these 160 films.
(d) Write a summary of what you learned from these data that would be useful to someone who
would like to invest in making movies.
1.121 Customers' home state. A sample of 1095 customers entering a retail store were asked to
fill out a brief survey. One question on the survey asked each person to identify his or her
current state of residency. The data from this question are sum- [...]
Data file: IOWA
State  Count    State  Count
AR         1    MI         2
AZ         1    MO         2
CA         2    MS         2
CO         1    NE         3
FL         1    NY         1
GA         2    OH         2
IA      1053    OK         1
ID         2    OR         5
IL         6    TN         1
KS         1    TX         1
LA         1    UT         1
MA         1    WI         2
(a) The state in which the retail store resides is easily deduced from the table. In which state
is this store located?
(b) One way to make a pie chart of these data would be to use one slice in the pie chart for each
state in the table. Give at least one reason why this would not result in a useful pie chart.
(c) Group all customers from states other than Iowa (IA) into a category called "Other" and make
a pie chart with an "Other" slice. Be sure to include the percent or count for each slice of
your pie chart.
1.122 Help-wanted advertising in newspapers. One source of revenue for newspapers is printing
help-wanted ads for companies that are looking for new employees. For this exercise we will use
monthly data on help-wanted advertising in newspapers from January 1951 to April 2005. The time
series uses an index value with 1987 as the base year. That is, the monthly average for [...]
only half as much help-wanted advertising in newspapers as the monthly average for 1987, while a
month with an index value of
[Figure 1.35: Four histograms, panels (a) to (d), for Exercise 1.119.]
140 had 40% more help-wanted advertising in newspapers than [...]
Data file: HELPWANTED
(a) Using statistical software, obtain a time plot of the index val- [...] line to your time plot
at the value of x̄ for these data.
(b) What do you notice about the beginning years of the time series relative to the overall
average of the time series? Which month in the time series is the first to be greater than the
overall average?
(c) Describe the trend of the index values beginning in January 2000. Which month is the last
month to be greater than the time series average?
(d) Propose at least one reasonable explanation for the observed trend in help-wanted advertising
in newspapers since January 2000.
1.123 A closer look at customer refunds. A retail store specializing in children's clothing and
toys has a relatively strict "no refunds" policy. Exceptions to this policy are sometimes granted
[...] like to look at refund activity for the year 2005. Data recorded include the date, amount,
and item count for all refund transactions in 2005. Of the 10,939 transactions conducted between
the store and customers during 2005, only 103 of these transactions were refunds (less than 1%).
Data file: REFUNDS
(a) Using statistical software, calculate the five-number summary for refund amounts. (Note: All
refunds are recorded as negative numbers.)
(b) What percent of all refunds in 2005 were $10 or less?
(c) Construct a boxplot of the refund amounts based on your five-number summary.
(d) What does your boxplot indicate about the skewness of these data?
1.124 A closer look at customer refunds. Continue with the data on refunds described in the
previous exercise. Upon inspection of the item counts for the refunds, we see that 83 of the 103
refunds were for one item. Using only this information and without using software or a
calculator, answer the following questions.
Data file: REFUNDS
(a) Provide the first four numbers of the five-number summary for the item counts. (You cannot
determine the maximum item count using only the information given in this exercise.)
(b) Construct a boxplot for the item counts using 14 as the maximum item count. How long is the
box in your boxplot? Explain why this makes sense, given the data on item counts.
(c) What does your boxplot indicate about the skewness of these data?
1.125 Telecom revenue growth. The data on revenue growth for a random sample of
telecommunications companies displayed before Exercise 1.105 (page 62) closely follow a Normal
distribution with a mean of 0.0224 and a standard deviation of 0.2180. Take as a model for
telecom revenue growth the N(0.0224, 0.2180) distribution and answer the following questions.
Data file: TELECOMSTOCKS
(a) Calculate μ + 3σ for the model for telecom revenue growth.
(b) From the population of all telecom companies, what percent should we expect to have revenue
growth greater than μ + 3σ? Explain how you arrived at your response.
(c) What percent of the telecom companies in our sample have revenue growth greater than μ + 3σ?
Is this percent different from your response to part (b)? Clearly explain why these two percents
being different is not inconsistent with our assumption of a Normal distribution for the model
for telecom revenue growth.
1.126 Telecom revenue growth. Take the N(0.0224, 0.2180) distribution as the model for telecom
revenue growth as described in the previous exercise and answer the following questions.
Data file: TELECOMSTOCKS
(a) What percent of telecom companies had negative revenue growth over the past year? Show your
work.
(b) What does negative revenue growth mean for a company?
(c) What percent of telecom companies had revenue growth greater than 0.50 (50%)? Show your work.
(d) In terms of revenue growth, the top 25% of all telecom companies had revenue growth greater than what value? Show your work.

1.127 What influences buying? Product preference depends

the distributions? Compare the two distributions and summarize your results in a short paragraph.

1.130 How much oil? How much oil the wells in a given field will ultimately produce is key information in deciding whether to drill more wells. The table below gives the estimated total amount of oil recovered from 64 wells in the Devonian Rich-
DATA FILE: OILWELLS

DATA FILE: CORN
summaries. Write a report summarizing your findings that includes supporting evidence from your analyses.

1.135 Canadian government revenue and expenditures by province and territory. Visit the Web pages www40.statcan.ca/l01/cst01/govt08a.htm, www40.statcan.ca/l01/cst01/govt08b.htm, and www40.statcan.ca/l01/cst01/govt08c.htm. You need to look at the three pages to obtain data for all provinces and territories. Select some data from these Web pages and use the methods that you learned in this chapter to create graphical and numerical summaries. Write a report summarizing your findings that includes supporting evidence from your analyses.

1.136 Simulated observations. Most statistical software packages have routines for simulating values of variables having specified distributions. Use your statistical software to generate 25 observations from the N(30, 5) distribution. Compute the mean and standard deviation x̄ and s of the 25 values you obtain. How close are x̄ and s to the μ and σ of the distribution from which the observations were drawn?
Repeat 19 more times the process of generating 25 observations from the N(30, 5) distribution and recording x̄ and s. Make a stemplot of the 20 values of x̄ and another stemplot of the 20 values of s. Make Normal quantile plots of both sets of data. Briefly describe each of these distributions. Are they symmetric or skewed? Are they roughly Normal? Where are their centers? (The distributions of measures like x̄ and s when repeated sets of observations are made from the same theoretical distribution will be very important in later chapters.)
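
For readers who have no statistical package at hand, the simulation described in Exercise 1.136 can be sketched in a general-purpose language. The Python fragment below is only an illustration, not part of the exercise: the function names `simulate_sample` and `repeat_simulation` and the seed value are assumptions made here, and the sketch reads N(30, 5) as mean 30 and standard deviation 5, as in this chapter.

```python
import random
import statistics


def simulate_sample(n=25, mu=30.0, sigma=5.0, rng=None):
    """Draw n observations from N(mu, sigma); return (x-bar, s)."""
    rng = rng or random.Random()
    values = [rng.gauss(mu, sigma) for _ in range(n)]
    # statistics.stdev divides by n - 1, matching the sample
    # standard deviation s used in this chapter.
    return statistics.mean(values), statistics.stdev(values)


def repeat_simulation(reps=20, n=25, mu=30.0, sigma=5.0, seed=1):
    """Repeat the sampling reps times; return the lists of x-bar and s."""
    rng = random.Random(seed)  # hypothetical seed, for repeatable output
    results = [simulate_sample(n, mu, sigma, rng) for _ in range(reps)]
    xbars = [xbar for xbar, _ in results]
    sds = [s for _, s in results]
    return xbars, sds


if __name__ == "__main__":
    xbars, sds = repeat_simulation()
    for xbar, s in zip(xbars, sds):
        print(f"x-bar = {xbar:6.2f}   s = {s:5.2f}")
```

The two lists returned by `repeat_simulation` are the 20 values of x̄ and the 20 values of s that the exercise asks you to display with stemplots and Normal quantile plots.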
regions of the world:48
DATA FILE: VEHICLECOLORSBYCOUNTRY
Use the methods you learned in this chapter to compare the vehicle color preferences for the regions of the world presented in this table. Write a report summarizing your findings with an emphasis on similarities and differences across regions. Include recommendations related to marketing and advertising of vehi-

in the United States.49 The BRFSS data set contains data on 29 demographic factors and risk factors for each state. Pick three or more variables from this data set and summarize the distributions graphically and numerically. Write a report describing your summary. Include a discussion of business opportunities that you would consider on the basis of your analysis.
DATA FILE: BRFSS
CHAPTER 1 Appendix
Using Software for Statistical Analysis

Good statistical analysis relies heavily on interactive statistical software. In this Appendix, we discuss the use of Minitab and Excel for conducting statistical analysis. As a specialized statistical package, Minitab is one of the most popular software choices both in industry and in colleges and schools of business. As an all-purpose spreadsheet program, Excel provides a limited set of statistical analysis options in comparison to Minitab, or to any other statistics package for that matter. However, given its pervasiveness and wide acceptance in industry and the computer world at large, we believe it is important to give Excel proper attention. It should be noted that for users who want more statistical capabilities but want to work in an Excel environment, there are a number of commercially available add-on packages.
Even though basic guidance for using Minitab and Excel is provided in this and subsequent Appendices, it should be emphasized that we are not bound to these software programs. Because computer output from statistical packages is very similar, you can feel quite comfortable using any one of a number of excellent statistical packages.

Getting Started with Minitab

In this section, we provide a basic overview of Minitab Release 15. For more instruction, Minitab provides a number of Help features found under the Help selection on the toolbar (see Figure App. 1.1). The Tutorials option, for example, introduces the user to basic Minitab features and walks the user through some example Minitab sessions. In addition, at Minitab's Web site, www.minitab.com, you can search through its knowledge base of customer support questions and their answers.
FIGURE App. 1.1 Minitab open screen shot with Help option opened.
Minitab Windows

Upon entering Minitab, you will find the display partitioned into two windows, as seen in Figure App. 1.1. The Session window is the area where all nongraphical statistical output and Minitab commands generating statistical output (graphical and nongraphical) are displayed. The Data window displays a spreadsheet environment (known as a worksheet) where the data can be directly entered and edited. Each column represents a variable to be analyzed. Unlike Excel, cells in a Minitab worksheet are not active in that formulas cannot be embedded within the cells. A Minitab worksheet is simply an environment for data to reside within.
There is a third window, which is minimized upon entering Minitab, known as the Project Manager window. This window allows you to do a variety of housekeeping tasks such as keeping track of all commands issued or seeing the basic attributes of the worksheet.

Invoking Statistical Procedures

There are two ways to invoke procedures:

1. You can type session commands in the Session window. To do so, the command language must be enabled, which will in turn produce an MTB> prompt in the Session window. At this prompt, you can then type desired commands. For more details on enabling session commands, refer to Minitab's Help options.
2. Users can make a sequence of selections from a series of menus that all begin in the toolbar menu. For example, in this chapter, we produced a graph known as a boxplot. To create this graph, you would click Graph on the toolbar and then select Boxplot. In this book, such a sequence of selections will be presented as Graph ➤ Boxplot. Once the sequence of selections has been made, dialog and/or option boxes will be encountered that allow you to indicate which variable(s) will be part of the analysis, along with other information. If further help is needed, you can click the Help button that appears with every pop-up box. Once all appropriate information is provided, click the OK button to get the desired output.

Minitab Files

Minitab provides standard file options for retrieving (Open) and saving (Save and Save As). Within the File menu, you will notice that files can be opened or saved as worksheets or as projects. Worksheet files (.MTW extension) simply store the data found in the Data window, while project files (.MPJ extension) store all the current work, including the data, Session window output, and graphs. Thus, if you save a project prior to exiting Minitab and open the project at a later time, you can resume from where you last left off. Minitab files for selected examples and exercises provided on this book's CD are worksheet files.

Getting Started with Excel

In this section, we provide a basic overview of the statistical analysis options in Excel 2007. We assume that the reader is familiar with the basic layout and usage of Excel. As with all Microsoft products, Excel provides comprehensive support for the user in terms of the general use of its software or the more specific details of a particular procedure. As noted earlier, Excel provides a number of standard statistical analysis procedures but is not as comprehensive as a stand-alone statistical package. Therefore, for a few of the topics covered in this book, software support will be found only in a statistical package or in an enhanced add-on version of Excel rather than in standard Excel.
It should be noted that the accuracy of statistical procedures in earlier versions of Excel (2002 and earlier) has been called into question. Some of the problems revolved around Excel's use of shortcut formulas for certain statistical computations. A number of these problems have been addressed with the newest version of Excel, although a comprehensive independent study of the software has not been released at the time of the publication of this book. It is worth noting that reliability of established statistical packages should not be taken for granted. Albeit less serious than Excel's earlier problems, inaccuracies have been reported for even some well-known statistical packages.50

Built-in Statistical Functions and Charts

Excel has a variety of built-in statistical functions that can be used to compute many common descriptive statistics for a given set of data or to compute probabilities from a number of well-known statistical distributions. To find these functions, select the Formulas tab found in the main menu. You can then click AutoSum and select the More Functions option, which allows you to select the category Statistical to reveal all the statistical functions. As
an alternative to clicking AutoSum, you can click More Functions and then move the cursor to your Statistical Functions menu choice.
In addition to the built-in statistical functions, a number of graphing options are available that may prove useful for data analysis. The available charts are found by selecting the Insert tab found in the main menu. One then finds a variety of graphing options in the Charts group. A few statistical options (for example, regression fitting) can be implemented in conjunction with the charts.

Installing Analysis ToolPak

Excel's built-in statistical functions can be useful for isolated computations. However, attempting to do a more complete statistical analysis with a collection of raw functions can be a laborious and clumsy process. Excel provides an add-on known as Analysis ToolPak that enables you to perform a more integrative statistical analysis. This add-on is not loaded with the standard installation of Excel. To install this add-on, click the Microsoft Office Button, click Excel Options, click Add-Ins, and then, in the Manage box, choose Excel Add-ins and click Go. At this point, select Analysis ToolPak in the Add-ins available box and finally click OK.

Invoking Analysis ToolPak Procedures

Once the Analysis ToolPak is installed, the statistical analysis routines are found by first selecting the Data tab found on the main toolbar. You will then see the Data Analysis command in the Analysis group. Figure App. 1.2 shows a blank Excel spreadsheet with the Data Analysis command invoked, resulting in the appearance of the Data Analysis menu box.
Within the Data Analysis menu box, there are 19 menu choices. When you select one of the menu choices, a box specific to the statistical routine will appear that calls for you to indicate where the data reside and where you want the output to be displayed. In particular, to indicate where the data for analysis reside, you specify the range of cells for the data in the Input Range box. This can be
FIGURE App. 1.2 Excel blank spreadsheet with Data Analysis menu box.
accomplished by first clicking the cursor in the Input Range box and then typing in the cell range, or more easily you can highlight the data by clicking and dragging the mouse over the cell range. The statistical output can be placed either in the current worksheet (placement indicated with Output Range box), in a new worksheet tabbed with the current workbook (New Output Ply option), or in an entirely new workbook (New Workbook option).

Excel Data Files

As noted, we assume that you are familiar with the basics of Excel, including how to save and open files. It should be noted that files saved by Excel 2007 as an Excel Workbook cannot be opened by earlier versions of Excel. There is, however, an option to save workbooks as an Excel 97-2003 Workbook. Excel 2007 is backward compatible in terms of opening workbooks of older versions. Data files for selected examples and exercises provided on this book's CD are compatible with all versions of Excel.

Using Minitab and Excel for Examining Distributions

Now that we have provided a general overview of Minitab and Excel, we discuss more specifically how these software programs can be used to create the graphs and numerical summaries presented in this chapter.

[...] Excel make the counts. For pretabulated frequencies, the spreadsheet should have two columns of information. With a column name in the top row, one column should have the names of the distinct categories. The other column, with its column name in the top row, should have the total counts of each category. If Excel needs to make the counts, there should be a column, with a column name in the top row, that has the data on the names of the categories that need to be counted. Once the one or two columns have been created, all the cells should be selected by dragging the mouse. Then click the Insert tab and click PivotTable in the Tables group and finally click PivotChart. You will then notice that Excel will produce a PivotTable Field List box. You will find that the column name(s) that you highlighted will be listed as fields. Select the field(s) presented to you by clicking a checkmark next to the name(s). For pretabulated frequencies, a bar graph will be created automatically. When you have only one column that requires counting, you will find that the field name appears in a section titled Axis Fields (Categories). You want to also have this field name in the section titled Values. To do so, click and hold the field name and then drag the field from the field section into the Values section. Excel will then automatically make the counts and create a corresponding bar graph.

Pie Charts
the created bar graph into a pie chart. To do so, click the Design tab and then click the Change Chart Type in the Type group and finally select the Pie chart type. Alternatively, you can right-click on the bar graph and find the Change Chart Type option. To add labels to the pie slices, first right-click on one of the pie slices and then choose the Add Data Labels option. Once labels have been added, right-click again on one of the pie slices and then choose the Format Data Labels option and finally place checkmarks next to the desired labels.

Pareto Charts

Minitab:
Stat ➤ Quality Tools ➤ Pareto Chart
If the frequencies have been pretabulated, select the Chart defects table option. If the frequencies have not been tabulated, select the Chart defects data in option. For pretabulated frequencies, click-in the column that has the names of the categories into the Labels in box and click-in the data column that has the counts into the Frequencies in box. If the frequencies have not been pretabulated, click-in the column that has data on the categorical names that need to be counted into the topmost box next to the Chart defects data in option.
An alternative way to create a Pareto chart is to follow the steps for creating a bar graph but then click the Chart Options button and select the Decreasing Y option and place a checkmark next to the Show Y as Percent option.

Excel:
As a first step, create a bar graph as already described. You will find in the spreadsheet a PivotTable report made up of two columns: (1) a column labeled Row Labels and (2) a column with the frequencies. Highlight the contents of the report (that is, the cells with the category names and the cells with the frequencies). Now click the Data tab and then click Sort in the Sort & Filter group. At this point, choose the Descending (Z to A) option and select the column associated with the frequency numbers in the menu box found immediately below the option. We now want to convert the counts into percents. To do so, click the field name found in the Values section, select the Value Field Settings option, click the Show values as tab, select % of total from the Show values as menu, and finally click OK.

Histograms

Minitab:
Graph ➤ Histogram
Select Simple for the type of histogram, then click OK. Click-in the data column into the Graph Variables box and then click OK. If you wish to change the automatically selected classes, double-click on the horizontal axis to make the Edit Scale box appear. Now, click the Binning tab and then choose the Midpoint/Cutpoint positions option found in the Interval Definition section. Depending on whether you choose the Interval type as Midpoint or Cutpoint, you then give the desired values of the midpoints (that is, the middle values of the classes) or the cutpoints (that is, lower and upper values of the classes).

Excel:
Select Histogram in the Data Analysis menu box and click OK. Enter the cell range of the data into the Input Range box. If you want Excel to automatically select the classes, leave the Bin Range box empty. Place a checkmark next to the Chart Output option. Click OK. Excel will then create a histogram with gaps between the data bars. To remove these gaps, right-click on any one of the bars and then select the Format Data Series option. You will then have the opportunity to set the gap width to 0%. With the bars now closed up to each other, it is a good idea to border the bars with line edges. Before closing the Format Data Series box, click the Border Color option and select the Solid line option and finally click Close. If you wish to change the automatically selected classes, enter upper values for each class into the spreadsheet and input their cell range in the Bin Range box.

Stemplots

Minitab:
Graph ➤ Stem-and-Leaf
Click-in the data column into the Graph Variables box and then click OK.

Excel:
Stemplots are available in neither standard Excel nor the enhanced add-on version of Excel.

Time Plots

Minitab:
Graph ➤ Time Series Plot
Select Simple for the type of time series plot, then click OK. Click-in the data column into the Series box. In default mode, Minitab will label the time periods as 1, 2,
3, and so on. If you wish to label the time periods by year, as in Figure 1.12, then click the Time/Scale button, select the Calendar option, select the desired time periods (for example, Year) from the adjacent menu, and click OK to close the pop-up. Click OK to produce the plot.

Excel:
Click and drag the mouse to highlight the cell range of the data you wish to time plot (include the column name if you wish it to appear as a chart label). With the cell range highlighted, click the Insert tab and then click Line in the Charts group. Within the 2-D Line choices, you can choose whether to have data symbols at the data values or not.

Numerical Summaries of Distribution

Minitab:
Stat ➤ Basic Statistics ➤ Display Descriptive Statistics
Click-in the data column(s) for which you want to get numerical summaries into the Variables box. To choose what numerical summaries you want reported, click the Statistics button, place checkmarks next to all desired measures, and then click OK to close the pop-up. Click OK to have the summaries reported in the Session window.

Excel:
Select Descriptive Statistics in the Data Analysis menu box and click OK. Enter the cell range of the data into the Input Range box. Place a checkmark next to the Summary statistics option. Click OK. You will find that the first and third quartiles are not reported. If you wish to compute these quartiles, click an empty cell in the spreadsheet and then proceed to the Statistical function menu as described in the overview section of this Appendix. Scroll down the list of functions and double-click on the QUARTILE function choice. In the Array box, input the cell range of the data. In the Quart box, input the value 1 to get the first quartile or the value 3 to get the third quartile and then click OK.

Boxplots

Minitab:
Graph ➤ Boxplot
If you have only one variable, select One Y Simple for the type of boxplot, then click OK. If you have multiple boxplots that you want to display together, as in Figure 1.15, select Multiple Y's Simple for the type of boxplot, then click OK. In either case, click-in the data column(s) for which you want to construct boxplots into the Graph variables box. Click OK.

Excel:
Boxplots are not available in standard Excel, but they are available in the enhanced add-on version of Excel.

Normal Distribution

Minitab:
Graph ➤ Probability Distribution Plot
This pull-down sequence will allow you to visualize areas under the Normal curve. Select View Probability and then click OK. The standard Normal distribution is the default distribution. You can change the values for the mean and/or standard deviation. Now click the Shaded Area tab. If you want to find the area under the curve associated with a specified value, select the X Value option. You can choose to find the area to the left or right of that specified value or even between two values by clicking the appropriate picture. You then enter the specified value(s) in the X value box. Click OK. As an exercise, you should be able to reproduce Examples 1.35, 1.36, and 1.37 (pages 51–52).
To do inverse Normal calculations, select the Probability option rather than the X Value option. Depending on whether you are considering the area to the left or to the right of a value, enter the desired area in the Probability box and click OK. If more accurate reporting of numbers is desired, then you can consider the following pull-down sequence:
Calc ➤ Probability Distributions ➤ Normal
Choose the Cumulative probability option if you wish to find the area to the left of a specified value. Choose the Inverse cumulative probability option if you wish to find the value associated with a specified area to the left of that value. You can then select the Input constant option. In the box next to this option enter the specified value of x or z or enter the specified area. Click OK to find the results reported in the Session window.

Excel:
Excel does not provide a means to visualize areas under the Normal curve, but it can compute areas under the Normal curve or work backward. In either case, click an empty cell in the spreadsheet and then proceed to the
Statistical function menu as described in the overview section of this Appendix. If you wish to find the area to the left of a specified value under the standard Normal curve, then scroll down the list of functions and double-click on the NORMSDIST function choice. Type the value of z in the Z box and click OK. To do inverse standard Normal calculations, double-click on the NORMSINV function choice. Type the specified area in the Probability box and click OK.

Normal Quantile Plots

Minitab:
Stat ➤ Basic Statistics ➤ Normality Test
This pull-down sequence will produce a Normal probability plot. As noted in this chapter, there is a bit of a technical distinction between a Normal quantile plot and a Normal probability plot. However, the interpretation is the same in that the closer the data points plot to a straight line, the closer is the conformity to the Normal distribution. Upon doing the noted pull-down sequence, click-in the data column of interest into the Variable box and then click OK.

Excel:
Neither Normal quantile plots nor Normal probability plots are available in standard Excel, but Normal probability plots are available in the enhanced add-on version of Excel.
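
The standard Normal calculations above, and the coordinates behind a Normal quantile plot, can also be sketched outside Minitab and Excel. The Python fragment below is a minimal illustration under stated assumptions, not a description of how either package works internally: the function names are invented for this sketch, and the plotting positions (i − 0.5)/n used for the Normal scores are one common convention that a given package may not share.

```python
from statistics import NormalDist

# Standard Normal distribution: mean 0, standard deviation 1.
STANDARD_NORMAL = NormalDist(mu=0.0, sigma=1.0)


def area_to_left(z):
    """Area under the standard Normal curve to the left of z
    (the calculation the text performs with NORMSDIST)."""
    return STANDARD_NORMAL.cdf(z)


def z_for_area(area):
    """Value z with the given area to its left
    (the calculation the text performs with NORMSINV)."""
    return STANDARD_NORMAL.inv_cdf(area)


def normal_quantile_points(data):
    """Pair each sorted observation with a Normal score.

    Assumption: the i-th smallest of n observations is matched with the
    standard Normal value that has area (i - 0.5)/n to its left.
    """
    ordered = sorted(data)
    n = len(ordered)
    return [
        (STANDARD_NORMAL.inv_cdf((i - 0.5) / n), x)
        for i, x in enumerate(ordered, start=1)
    ]


if __name__ == "__main__":
    print(round(area_to_left(1.0), 4))
    print(round(z_for_area(0.75), 4))
    for score, x in normal_quantile_points([12.1, 9.8, 10.4, 11.3, 10.9]):
        print(f"{score:7.3f}  {x}")
```

Plotting the observations against their Normal scores, for example as an Excel scatter chart, gives a display that is read the same way as the plots in this chapter: the closer the points lie to a straight line, the closer the data conform to a Normal distribution.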
Moore-3620020 psbeFM August 17, 2010 1:22