
Applied Statistics

Lecturer: Dr. Aisling Daly


Department of Data Analysis and Mathematical Modelling
2022 - 2023

Authors: Prof. dr. O. Thas, dr. Aisling Daly


Contents

1 Basic concepts 7

1.1 Introduction by example . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7

1.1.1 Background . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 8

1.1.2 Materials and methods . . . . . . . . . . . . . . . . . . . . . . . . . . 8

1.1.3 Descriptive statistics . . . . . . . . . . . . . . . . . . . . . . . . . . . 9

1.2 Some notation and formulae . . . . . . . . . . . . . . . . . . . . . . . . . . . 11

1.2.1 Notation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 11

1.2.2 Some basic formulae . . . . . . . . . . . . . . . . . . . . . . . . . . . 11

1.3 Population and sampling . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 12

1.3.1 Sampling . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 12

1.3.2 Population distribution . . . . . . . . . . . . . . . . . . . . . . . . . . 15

1.3.3 Some characteristics of a distribution . . . . . . . . . . . . . . . . . . 18

1.3.4 The normal distribution . . . . . . . . . . . . . . . . . . . . . . . . . 27

1.3.5 The t-distribution . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 28

1.3.6 Sampling variability . . . . . . . . . . . . . . . . . . . . . . . . . . . 28

1.4 Estimator of the mean . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 36

1.4.1 Random variables: estimates versus estimators . . . . . . . . . . . . . 36

1.4.2 The sampling distribution of the sample mean . . . . . . . . . . . . . 37

1.5 QQ-Plots . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 39


2 Confidence Intervals and Hypothesis Tests 43


2.1 Confidence interval of the mean . . . . . . . . . . . . . . . . . . . . . . . . . 44
2.1.1 Example: birth weights . . . . . . . . . . . . . . . . . . . . . . . . . . 44
2.1.2 The confidence interval when the variance σ 2 is known . . . . . . . . 45
2.1.3 The confidence interval when the variance σ 2 is unknown . . . . . . . 48
2.2 The one-sample t-test . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 51
2.2.1 Example: birth weight . . . . . . . . . . . . . . . . . . . . . . . . . . 51
2.2.2 The one-sided one-sample t-test . . . . . . . . . . . . . . . . . . . . . 52
2.2.3 Strong and weak conclusions . . . . . . . . . . . . . . . . . . . . . . . 55
2.2.4 The p-value . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 57
2.2.5 Another one-sided one-sample t-test . . . . . . . . . . . . . . . . . . . 59
2.2.6 The two-sided one-sample t-test . . . . . . . . . . . . . . . . . . . . . 60
2.3 The paired t-test . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 63
2.4 The two-sample t-test . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 65
2.5 Confidence interval of µ1 − µ2 . . . . . . . . . . . . . . . . . . . . . . . . . . 69

3 Analysis of Variance 73
3.1 One-way ANOVA . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 73
3.1.1 Introduction and motivation . . . . . . . . . . . . . . . . . . . . . . . 73
3.1.2 The ANOVA model . . . . . . . . . . . . . . . . . . . . . . . . . . . . 79
3.1.3 The general one-way ANOVA model . . . . . . . . . . . . . . . . . . 82
3.1.4 Multiple comparisons of means . . . . . . . . . . . . . . . . . . . . . 88
3.2 Assessment of model assumptions . . . . . . . . . . . . . . . . . . . . . . . . 94
3.3 Two-way ANOVA . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 98
3.3.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 98
3.3.2 Example . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 99
3.3.3 The additive model . . . . . . . . . . . . . . . . . . . . . . . . . . . . 102

3.3.4 Assessment of the assumptions . . . . . . . . . . . . . . . . . . . . . 104


3.3.5 The interaction model . . . . . . . . . . . . . . . . . . . . . . . . . . 106

4 Regression Analysis 115


4.1 Simple linear regression analysis . . . . . . . . . . . . . . . . . . . . . . . . . 115
4.1.1 Example and introduction . . . . . . . . . . . . . . . . . . . . . . . . 115
4.1.2 The regression model . . . . . . . . . . . . . . . . . . . . . . . . . . . 117
4.1.3 Parameter estimation . . . . . . . . . . . . . . . . . . . . . . . . . . . 118
4.1.4 Confidence intervals . . . . . . . . . . . . . . . . . . . . . . . . . . . 121
4.1.5 Hypothesis tests . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 122
4.1.6 ANOVA table, F -test and R2 . . . . . . . . . . . . . . . . . . . . . . 123
4.1.7 Prediction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 126
4.1.8 Assessment of the assumptions . . . . . . . . . . . . . . . . . . . . . 129
4.2 Multiple linear regression . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 129
4.2.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 129
4.2.2 The additive model . . . . . . . . . . . . . . . . . . . . . . . . . . . . 133
4.2.3 Example . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 135
4.2.4 The interaction model . . . . . . . . . . . . . . . . . . . . . . . . . . 139
Chapter 1

Basic concepts

In this chapter, we will learn:

• the purpose and use of summary statistics

• the concepts of a population and a sample

• the concept of a distribution

• how to recognize and summarize some of the most important types of distribution

• the powerful Central Limit Theorem, and why it’s so important in statistics

• how to use a QQ-plot to check a sample’s normality

1.1 Introduction by example

In the next few sections we describe a real study. The sections are written as if they were
meant for a paper in a scientific journal. Since this is an Applied Statistics course, the
goal is to learn through examples like the one we’re about to begin. We’ll return to this
example several times throughout the course, since it allows us to apply many of the statistical
theories and methods that we’ll learn. These fall into two types: descriptive statistics and
inferential statistics. Descriptive statistics do just what it sounds like they do: they allow us to
describe data so that we can summarize and communicate their characteristics. Inferential
statistics go further than just describing: they allow us to analyse data in order to learn
some deeper truth about the world. With that in mind, we’ll move on to our first example.

1.1.1 Background

Food waste is a serious and growing problem. In the EU alone, around 88 million tonnes of
food waste are generated each year. The annual costs associated with this waste are estimated
at 143 billion euros. There are many ways to reduce food waste; one is to ensure that food
doesn’t rot before it reaches the consumer. This includes the time spent transporting it from
the production and processing locations, and the time it will spend at the supermarket before
it’s purchased. In this example we’ll focus on the issue of reducing fruit spoilage.
Fruits are living tissues even after they’re harvested, and are expected to remain so until they
are consumed. We can improve storability and extend shelf life by controlling the respiration
rate of these living tissues. Edible coatings have long been used to retain quality and
extend shelf life of some fresh fruits and vegetables like citrus fruits, apples and
cucumbers. Fruits are usually coated by being dipped in or sprayed with a range of edible
materials so that a semi-permeable layer is formed on the surface. This suppresses respiration
and controls moisture loss. Edible coatings with an appropriate formulation can be used for
most food to address the challenges associated with stable quality, market safety, nutritional
value and economic production cost.
Like other fruits, pears are most often consumed when they are fresh, firm and crunchy. Pears
stored at lower temperature can better maintain the quality attributes desired by consumers.
Temperature elevation and fluctuation in the supermarket have tremendous effects on the
freshness of pears. High metabolic activity and respiration cause texture loss and other
undesirable quality effects during storage at higher temperatures. Edible coatings can
inhibit respiration and metabolic activity during storage, while causing only a minimal
change in the texture of the final product. A surface layer of edible coating
can also protect pears from environmental and mechanical stress.
To evaluate the suitability of the coating method for preserving pears, a scientific study was
set up. The goals of the study were to evaluate the effect of coating on the texture change of
pears, and to find out the optimum concentration of edible coating for extending shelf life.

1.1.2 Materials and methods

Materials
Conference pears (Pyrus communis) were used for this study, because this species respires
continuously during storage. This allows testing throughout this stage. Pears were harvested
at an early stage of maturity, and 20 pears were randomly selected for each level of coating.

Figure 1.1: Pears treated with 0% (upper left), 0.5% (upper right), 0.8% (lower left) and 1%
(lower right) Bio-fresh coating.
Preparation of coating solution
Bio-fresh (sucrose fatty acid ester) coating produced by Laiwu Hehui Chemical co. Ltd,
China was used. Solutions (0.5%, 0.8% and 1% concentrations) were prepared by dissolving
the coating in water, and a 0% solution was used as a control.
Fruit coating
Pears were washed, rinsed and dried, and then coatings were applied by spraying. Hot air at
temperatures of 60 - 65 degrees Celsius was blown over the fruit to dry the coating.
Firmness measurement
Firmness was calculated as the maximum force needed by the plunger to press the pears over
a certain distance with a specific speed rate. Pear firmness evaluation was performed using
a universal testing machine: measuring device R9 (Lloyd material testing system), with load
cell value 500 N, self cutting plunger surface area 1 cm2 , penetration depth 8 mm and speed
8 mm/s.
Statistical analysis
... we’ll work our way up to this!

1.1.3 Descriptive statistics

A statistical analysis usually starts with an exploratory data analysis. This includes the
calculation of descriptive statistics or summary statistics. For example, in our pear
dataset we have 20 observed firmness measurements for each coating level. Instead of looking
at the 20 observations individually, it’s probably more convenient to only look at the mean
and the standard deviation (or variance) of these observations. The summary statistics are
intended to summarise the shape of the distribution of the observations. We’ll learn shortly
what we mean by these terms.

Figure 1.2: A device to measure the firmness of fruit.

Table 1.1: Descriptive statistics of the coating experiment.

Concentration   mean (N)   standard deviation (N)   C.I. (95%)
0.0%            5.897      1.250                    (5.312, 6.482)
0.5%            4.842      1.305                    (4.232, 5.453)
0.8%            4.724      0.946                    (4.281, 5.167)
1.0%            4.632      1.527                    (3.917, 5.346)
Data exploration may serve several purposes:

• checking the data for abnormalities (either due to the sampling, measurements, data-
entry, ...)

• getting a first impression of what the conclusion of the study might be

• assessing the assumptions needed for the statistical analysis (we’ll learn more about
this later).

Table 1.1 shows the sample means and sample standard deviations for the pear dataset. Later
we’ll discuss this dataset in more detail.

1.2 Some notation and formulae

Before we dive into the interpretation of the summary statistics, we’ll introduce some notation
so that we can define the summary statistics using mathematical formulae.

1.2.1 Notation

• we use the notation y1 , y2 , . . . , yn to denote the n sample observations.


• an equivalent way to denote the sample observations is yi , i = 1, . . . , n.
• n is referred to as the sample size.
• when we have samples collected under different conditions (here: k = 4 coating levels),
we may use
$$y_{11}, y_{12}, \dots, y_{1n_1} \quad \text{and} \quad y_{21}, y_{22}, \dots, y_{2n_2} \quad \cdots \quad \text{and} \quad y_{k1}, y_{k2}, \dots, y_{kn_k},$$
or
$$y_{ji}, \quad j = 1, \dots, k; \; i = 1, \dots, n_j.$$

1.2.2 Some basic formulae

The sample mean is given by
$$\bar{y} = \frac{y_1 + y_2 + \dots + y_n}{n} = \frac{1}{n}\sum_{i=1}^{n} y_i,$$
or, when we have samples collected under different conditions,
$$\bar{y}_j = \frac{1}{n_j}\sum_{i=1}^{n_j} y_{ji} \qquad (j = 1, \dots, k).$$

The sample variance is computed as
$$s^2 = \frac{1}{n-1}\sum_{i=1}^{n} (y_i - \bar{y})^2.$$
This is the average squared deviation between the sample observations and the sample mean.
The sample standard deviation (SD) is then given by
$$s = \sqrt{s^2}.$$
Later we’ll come back to these summary statistics to give them a proper interpretation.
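As a quick illustration, these formulas can be evaluated directly in R. The firmness values
below are invented for illustration only (they are not the pear data from Table 1.1):

# invented firmness measurements (N), for illustration only
y <- c(5.1, 6.3, 4.8, 5.9, 5.5)
n <- length(y)

mean(y)                          # sample mean: (1/n) times the sum of the observations
sum((y - mean(y))^2) / (n - 1)   # sample variance, identical to var(y)
sqrt(var(y))                     # sample standard deviation, identical to sd(y)

Note that var() and sd() use the divisor n − 1, exactly as in the formula above.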

1.3 Population and sampling

1.3.1 Sampling

Let’s go back to the very start of our fruit example. The design of the experiment included
the following sentence:

20 pears were randomly selected for each level of coating

Some questions about this choice come to mind:

• Why do we need more than one pear?

• What are the meaning and consequences of the term “randomly”?

First, why do we need more than one pear? Consider the following thoughts:

• An important aspect of a scientific study is that the researchers want their conclusions
to be generally valid, i.e. the conclusions must hold for all pears that will be treated
with the Bio-fresh coating.

• Not all pears are the same (biological heterogeneity).

So if the experiment was done on only one pear then it’s very unlikely that the same firmness
would be obtained on any future pear treated with the coating. Moreover, we cannot even
predict whether a pear treated in the future will result in a smaller or a larger firmness. And
will the firmness of a future treated pear be much smaller or larger, or only slightly smaller
or larger? We cannot answer these questions when we only have data on one pear.
One of the objectives of the study is to assess whether the level of the coating affects the
firmness. Suppose that we treated only one pear with a coating of 0.5%, and also only one
pear with a coating of 1%. This study will result in two measured firmness values. If the
observed firmness of the 0.5% coating is smaller than the firmness of the 1% coating, will
this conclusion also hold for all future coated pears? Or is there actually no difference in
firmness between pears treated with 0.5% and 1% coating, and is the observed difference
only a consequence of the natural variability between pears? Again we cannot answer the
question when only one pear is used.
Second, what are the meaning and consequences of the term “randomly”, and what do they
imply for the results of our experiment? We mentioned already that not all pears are the same
and so we may expect variability among pears. To grasp the magnitude of the variability it’s
important to replicate the experiment with several pears.

Consider the following design: a helpful research assistant is sent to the supermarket to buy
20 pears for the experiment. The research assistant only selects pears that “look” firm and
fresh. Clearly, this procedure may give rise to measured firmnesses that are generally larger
than if old pears had also been selected. If we want the conclusions from our study to be
valid for all pears, both young and old, then we must be sure that such pears are present in
our sample.
Since we can’t have all possible types of pears in a sample of only 20 fruits, we must have a
procedure that assures us that the sample is representative of the population of pears to
which the conclusion should apply. Selecting the 20 pears randomly is a very good way of
guaranteeing that the sample is representative, or at least it is representative on average.
We’ll come back to this expression later.
We mentioned that by selecting only fresh looking pears we might introduce a bias in our
conclusions. This bias does not only apply to the magnitude of the firmness (i.e. we expect
fresh pears will be more firm), but the bias may also apply to the variability that is observed
between the sample observations. If our trusty research assistant selects only fresh looking
pears, then we may expect less variability among the measured firmness values because the
selected pears were all pretty similar to one another (after all, that’s how they were chosen).
This problem is illustrated in Figure 1.3.
Random sampling is closely related to the concept of the population or the scope of the study.
Defining the scope of our study comes down to asking ourselves, based on a sample of pears,
whether we want to come to conclusions that hold for:

• all kinds of pears?

• only young and fresh pears?

• only old pears?

Researchers should be clear about the scope of their study before the start. When the scope
is set at “all kinds of pears”, then the pears tested in the study should be selected from the
population of all pears. For any statistical analysis to be valid, it’s required that the pears
are selected completely at random.
Selecting completely at random from a population implies that (see the R sketch after this list):

• all pears in the population should have the same probability of being selected in the
sample

• the selection of a pear in the sample should be independent from the selection of the
other pears in the sample.
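As announced above, here is a minimal R sketch of such a completely random selection. The
population of firmness values is invented purely for illustration:

set.seed(123)                                    # only to make the example reproducible
population <- rnorm(10000, mean = 5, sd = 1.3)   # invented population of pear firmness values
my_sample <- sample(population, size = 20)       # every pear has the same chance of being selected
mean(my_sample)                                  # sample mean of the 20 randomly selected pears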
Figure 1.3: Histogram of 20 firmness measurements of fresh pears (top), old pears (middle)
and a random sample of pears.

The sample is thus supposed to be representative of the population, but still it is random.
How can these two concepts be merged?

1.3.2 Population distribution

To understand what we mean by the population underlying a sample, in this section we’ll
consider another example. The objective of this study is to describe the distribution of all
birth weights in Belgium between 1998 and 2008. We’ll use this example to introduce three
important concepts:

• the density function of a population,

• the histogram as an estimate of a population density function,

• the sampling variability.

There is an instructive reason for choosing this particular example: the population of Belgium
is completely known. Every child born in Belgium is registered and its birth weight reported.
First we’ll need some more definitions. The population of Belgian birth weights between
1998 and 2008 contains measurements of 500, 000 birth weights.

• The variable here is the birth weight. So we have 500, 000 measurements of our
variable. A variable is often denoted by a capital letter, e.g. Y .

• We will further assume that the birth weights are measured with infinite precision.
Such a variable is referred to as a continuous variable.

The population can firstly be represented by means of its distribution. We can do this by
constructing a histogram. Figure 1.4 shows a histogram in which 20 bins (or cells) have
been used. Since the total number of observations is very large, every bin corresponds to a
very large number of birth weights. So to get a more detailed description of the distribution,
we can increase the number of bins. Figure 1.5 shows histograms with 50 and 100 bins.
Thanks to the very large number of observations, the histogram with 100 bins looks very
smooth. We could increase the number of bins even further. But suppose that we did
not have 500, 000 observations, but a much much much larger number (in theory infinity).
Then we could look at a histogram with an infinite number of bins, with each bin having
an infinitely small width. This is of course not feasible in reality. However, theoretically a
histogram based on an infinite number of observations and an infinite number of bins becomes
the density function of the population. This is illustrated in Figure 1.6.
Figure 1.4: Histogram of the 500,000 birth weights (20 bins).

Figure 1.5: Histograms of the 500,000 birth weights (left: 50 bins; right: 100 bins).
Figure 1.6: Histogram of the 500,000 birth weights (100 bins) and the density function of
the distribution.

For populations with an infinite number of observations of a continuous variable, the density
function can be described by a continuous function. But even when the population contains
a finite, but very large number of elements, its distribution can be very well described by a
density function. When y represents birth weight, then it’s common to denote its density
function as f (y).
We’ll now give a slightly more formal definition. Consider the observations $y_1, y_2, \dots, y_n$,
and divide their range (the difference between the maximum and the minimum of the
observations) into c bins (cells) of width δ. A histogram based on these c bins is a plot of
$$f_{c,n}(y) = \frac{\text{number of obs in } [y - \tfrac{\delta}{2},\, y + \tfrac{\delta}{2}]}{n}$$
for $y = y_{\min},\, y_{\min} + \delta,\, y_{\min} + 2\delta,\, \dots,\, y_{\max}$.
When c becomes very large, and the sample size n is also very large, then $f_{c,n}(y)$ approximates
the density function:
$$f(y) = f_{\infty,\infty}(y) \approx \text{density function}.$$

For example, if for any number of bins c and sample size n,
$$f_{c,n}(2800) = \frac{\text{number of obs in } [2800 - \tfrac{\delta}{2},\, 2800 + \tfrac{\delta}{2}]}{n}
= \frac{1}{2} f_{c,n}(3000)
= \frac{1}{2}\, \frac{\text{number of obs in } [3000 - \tfrac{\delta}{2},\, 3000 + \tfrac{\delta}{2}]}{n},$$

then we can conclude that twice as many babies have a birth weight close to 3000g as
compared to babies with a birth weight of 2800g.
Furthermore, it also holds for c, n → ∞, i.e.
$$f(2800) = \frac{1}{2} f(3000).$$
This tells us that within the population there are twice as many babies with a birth weight
of 3000g compared to babies with a birth weight of 2800g.
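The effect of increasing the number of bins can be mimicked in R. The real register of 500,000
birth weights is not available here, so the sketch below simulates a stand-in population with a
comparable mean and standard deviation:

set.seed(1)
bw <- rnorm(500000, mean = 3342, sd = 280)   # simulated stand-in for the birth weight register

hist(bw, breaks = 20,  freq = FALSE, main = "20 bins")
hist(bw, breaks = 100, freq = FALSE, main = "100 bins")
# with many bins and many observations the histogram approaches the density function:
curve(dnorm(x, mean = 3342, sd = 280), add = TRUE, col = "red")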

1.3.3 Some characteristics of a distribution

An advantage of having a density function describing a population is that it contains all
information on the population. So if we know the density function of the birth weight
population, we no longer need to keep using the full dataset: the density function contains
all the information contained in the list of 500,000 birth weights.
Some examples of density functions are shown in Figure 1.7. Two very important density
functions are those of:

• the normal distribution:
$$f(y) = \frac{1}{\sqrt{2\pi\sigma^2}} \exp\left(-\frac{(y-\mu)^2}{2\sigma^2}\right),$$
where µ and σ² are the parameters of the normal distribution. They represent the
mean and variance of the distribution.

• the exponential distribution:
$$f(y) = \lambda \exp(-\lambda y),$$
where λ is the rate parameter.

We can see that these density functions are functions of y (an observation of the variable)
and one or more parameters that somehow characterize the distribution.
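In R, these two density functions are available directly as dnorm() and dexp(); a minimal sketch
reproducing the parameter values of Figure 1.7:

y <- seq(-4, 4, by = 0.01)
plot(y, dnorm(y, mean = 0, sd = 1), type = "l",
     ylab = "f(y)", main = "normal density (mean = 0, sd = 1)")

y2 <- seq(0, 4, by = 0.01)
plot(y2, dexp(y2, rate = 1), type = "l",
     ylab = "f(y)", main = "exponential density (rate = 1)")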
Although a density function is a very efficient object in the sense that it replaces all obser-
vations in the population, it may not always be easy to interpret. Some people prefer to
summarise the population (or the density function) using some of the distribution character-
istics such as the mean, the median, quartiles, inter-quartile range, the variance, the standard
deviation, the mode, or others.
In the following sections, we’ll introduce some of the more important characteristics. The
first of these is a characteristic that everyone will already be familiar with: the mean.
Figure 1.7: Examples of density functions: the normal (µ = 0 and σ² = 1) and exponential
(λ = 1) distributions.

The mean

The (population) mean of a variable or of a distribution is defined as
$$\mu = E[Y] = \int_{-\infty}^{\infty} y\, f(y)\, dy.$$

The mean of Y is also referred to as the expected value of Y .


When the population is finite (e.g. our birth weight dataset with 500,000 observations), then
the population mean is simply the mean of all population elements. Let N denote the
population size. The population mean can then be calculated as
$$\mu = \frac{y_1 + y_2 + \cdots + y_N}{N} = \frac{1}{N}\sum_{i=1}^{N} y_i.$$

Based on a histogram with bin centres equal to $y_{c1}, \dots, y_{cc}$, the mean can be approximated
by
$$\mu \approx \mu_c = \sum_{i=1}^{c} y_{ci}\, f_{c,N}(y_{ci}).$$
This can be thought of as the “centre of mass” of the histogram. When c and N become
very large,
$$\mu_c = \sum_{i=1}^{c} y_{ci}\, f_{c,N}(y_{ci}) \rightarrow \int_{-\infty}^{\infty} y\, f(y)\, dy = E[Y] = \mu.$$
Figure 1.8: Density function of the birth weight population, with the mean indicated by the
vertical line.

Since our birth weight dataset is finite, it’s straightforward to calculate the population mean:
it equals 3341. This is shown in Figure 1.8.
The population mean has another interpretation:

1. randomly sample one observation from the population

2. repeat step (1) many many times, say n → ∞

3. the average of the sampled observations equals E [Y ] when n → ∞.

This interpretation is related to the principle of repeated sampling, which we’ll return to
later in this chapter.
In any real setting the population mean µ is unknown. Imagine for example the population
that underlies our pear example: this is the population of all pears in Belgium. Clearly,
we have no idea what the true mean of this population is, and no possible way to learn it.
Instead, we can use an estimate of the true mean.
When a random sample of n observations is available, the population mean can be estimated
by the sample mean,
$$\hat{\mu} = \bar{y} = \frac{1}{n}\sum_{i=1}^{n} y_i.$$
The hat-notation (µ̂) is used for denoting estimators. Thus µ̂ is the estimator of µ, the
unknown population mean.
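The repeated-sampling interpretation of E[Y], and the sample mean as an estimator of µ, can be
imitated with a small simulation in R (the population below is simulated, not the real register):

set.seed(2)
population <- rnorm(500000, mean = 3341, sd = 280)   # simulated stand-in population
mean(population)                                     # the population mean mu

# repeated sampling: draw one observation many times and average the draws
draws <- sample(population, size = 100000, replace = TRUE)
mean(draws)                                          # approaches E[Y] as the number of draws grows

# a single random sample of n = 20 observations gives the estimate y-bar of mu
mean(sample(population, size = 20))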

The variance and the standard deviation

The (population) variance of a variable or of a distribution is defined as
$$\sigma^2 = \mathrm{Var}[Y] = E\left[(Y - E[Y])^2\right].$$
So the variance is the expected (or mean) squared deviation from the mean.
The standard deviation is defined as the square root of the variance, i.e.
$$\sigma = \mathrm{SD}[Y] = \sqrt{\mathrm{Var}[Y]}.$$

Compared to the other summary statistics, the standard deviation has the advantage of being
expressed in the same units as the variable under study.
Its name already suggests that the variance is some kind of measure of the variability of
a random variable. To illustrate this interpretation, let’s consider a very extreme limiting
case. Suppose that Var[Y] = 0. Then E[(Y − E[Y])²] = 0. So if, in a repeated sampling
experiment, the sampled y observations were measured an infinite number of times, and each
time the squared deviation from the mean (y − E[Y])² is computed, then in order to end up
with an average of the (y − E[Y])² equal to zero, every repetition must result in (y − E[Y])²
equal to zero. That is, every y must equal its expectation E[Y].
The reason is that when a sampled observation y ≠ E[Y], its contribution (y − E[Y])² is
positive (since it’s squared). To make the average of all (y − E[Y])² equal to zero, a positive
contribution would have to be compensated by a negative (y − E[Y])². But since a squared
quantity cannot be negative, all y’s must equal E[Y]. Hence, when Var[Y] = 0, we may be sure
that every sampled y is exactly equal to E[Y]. We say that there is no variability. Sometimes
we’ll also say that there is no uncertainty about Y.
Figure 1.9 shows two normal distributions with different variances.
In a real setting, the population variance σ² is unknown, just like the population mean. When
a random sample of n observations is available, the population variance can be estimated by
the sample variance,
$$\hat{\sigma}^2 = s^2 = \frac{1}{n-1}\sum_{i=1}^{n} (y_i - \bar{y})^2.$$
Note again the hat notation for denoting the estimator of σ².

Percentiles and quantiles

Now we know how to describe the general shape of a distribution. But can we give some more
specific descriptions? Going back to our birth weight dataset, can we for example determine
what is the probability that a baby is born with a birth weight of less than 3000g?
Figure 1.9: Density functions of normal distributions with different standard deviations:
σ = 1 (left) and σ = 0.5 (right).

When all elements of a finite population (of size N) are known, we can just compute this
probability as a frequency, i.e. how many times it was observed in the population:
$$P[Y < 3000] = \frac{\text{number of population elements} < 3000}{N},$$
or in mathematical notation:
$$P[Y < 3000] \approx \sum_{i=1}^{c} f_{c,N}(y_{ci})\, I\{y_{ci} < 3000\},$$
where
$$I\{y_{ci} < 3000\} = \begin{cases} 1 & \text{if } y_{ci} < 3000 \\ 0 & \text{if } y_{ci} \ge 3000. \end{cases}$$
So, for finite populations, when c and N become very large, we replace the histogram
representation by the density function and we have that:
$$P[Y < 3000] = \int_{0}^{3000} f(y)\, dy.$$

Figure 1.10 shows the section of the birth weight distribution that corresponds to P [Y < 3000].
This is known as the quantile y = 3000, and it corresponds to the 11.1% percentile for the
distribution, i.e. P [Y < 3000] = 11.1%.
We can also give a repeated sampling interpretation to probabilities. For example, for
P [Y < 3000]:
Figure 1.10: Density function indicating the quantile y = 3000 corresponding to the 11.1%
percentile.

1. randomly sample one observation from the population


2. repeat step (1) many many times, say n → ∞
3. the relative frequency of events y < 3000 equals P [Y < 3000] when n → ∞.

We can also determine two quantiles at once. Figure 1.11 illustrates the expressions
P [3000 < Y < 3500] = 60.3% and P [Y > 4000] = 0.9%.
To make this easier, we’ll use the following notation very often in the remainder of this course.
The 1 − α quantile or percentile of the distribution of Z is defined as the value zα for which
$$P[Z \le z_\alpha] = 1 - \alpha.$$
When the distribution of Z is symmetric about zero, then
$$z_\alpha = -z_{1-\alpha}.$$

In a real setting, the population quantiles and probabilities are once again unknown. When a
random sample of n observations is available, the quantiles and probabilities can be estimated.
We’ll illustrate this with two examples.

1. Estimating probabilities
The probability P[3000 < Y < 3500] is estimated by
$$\frac{1}{n}\left(\text{number of sample observations } y_i \text{ for which } 3000 < y_i < 3500\right).$$
Figure 1.11: Density function illustrations for P[3000 < Y < 3500] = 60.3% and
P[Y > 4000] = 0.9%.

Probabilities that are estimated in this way are referred to as empirical probabilities.
They do not rely on any assumptions about the underlying distribution.
Consider a random sample of 20 birth weights:

3468.730 3795.857 3124.882 3043.660 2972.858 3491.030
3471.799 3038.800 3383.798 3180.232 3302.670 3634.995
3442.995 3271.510 3078.688 3393.587 3509.966 3408.600
3124.979 3250.156

Of the 20 sample observations, there are 16 of them between 3000g and 3500g. Hence,
the probability P [3000 < Y < 3500] is estimated as 16/20 = 0.8.
2. Estimating quantiles/percentiles
The 10% percentile (or quantile) is estimated by the largest sample observation $y_j$ for
which
$$\frac{1}{n}\left(\text{number of sample observations } y_i \text{ for which } y_i \le y_j\right) \le 0.10.$$
Consider again the same random sample of 20 birth weights, now ordered from smallest
to largest:

2972.858 3038.800 3043.660 3078.688 3124.882 3124.979
3180.232 3250.156 3271.510 3302.670 3383.798 3393.587
3408.600 3442.995 3468.730 3471.799 3491.030 3509.966
3634.995 3795.857
Figure 1.12: Density function with the quartiles illustrated.

The estimated 10% percentile is therefore 3038.8g.

Note in the last expression that it is only required that the estimated probability is not
larger than 10%. The reason is that, for a finite sample, it is not always possible to have
an empirical probability that equals exactly 10%. For example, when n = 5, the empirical
probabilities can at most take the values 0%, 20%, 40%, 60%, 80% and 100%.
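These two estimates can be reproduced in R from the 20 birth weights listed above. R’s own
quantile() function interpolates by default, so the rule used in this course is coded explicitly:

y <- c(3468.730, 3795.857, 3124.882, 3043.660, 2972.858, 3491.030,
       3471.799, 3038.800, 3383.798, 3180.232, 3302.670, 3634.995,
       3442.995, 3271.510, 3078.688, 3393.587, 3509.966, 3408.600,
       3124.979, 3250.156)
n <- length(y)

mean(y > 3000 & y < 3500)           # empirical P[3000 < Y < 3500] = 16/20 = 0.8

y_sorted <- sort(y)
max(y_sorted[(1:n) / n <= 0.10])    # largest y_j with empirical probability <= 0.10, i.e. 3038.8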

Quartiles and median

Quartiles refer to the 25%, 50% and 75% percentiles. This is illustrated in Figure 1.12.
In our birth weight dataset, the quartiles are:

• Q1 = 3152: P[Y ≤ 3152] = 25%,

• Q2 = 3342: P[Y ≤ 3342] = 50%,

• Q3 = 3530: P[Y > 3530] = 25%.

The interquartile range (IQR) is defined as Q3 − Q1.


Since the quartiles are unknown in a real population, they may be estimated as follows, given
a sample of n observations:
Figure 1.13: Boxplot of a random sample of 100 birth weights.

• Q̂1 is the largest sample observation $y_j$ for which
$$\frac{1}{n}\left(\text{number of sample observations } y_i \text{ for which } y_i \le y_j\right) \le 0.25.$$

• Q̂2 is the largest sample observation $y_j$ for which
$$\frac{1}{n}\left(\text{number of sample observations } y_i \text{ for which } y_i \le y_j\right) \le 0.5,$$
and is also known as the sample median.

• Q̂3 is the largest sample observation $y_j$ for which
$$\frac{1}{n}\left(\text{number of sample observations } y_i \text{ for which } y_i \le y_j\right) \le 0.75.$$

The boxplot

The boxplot is a figure that gives more or less the same information as a histogram. It is
illustrated in Figure 1.13 for a random sample of 100 birth weights. The boxplot contains
the following information (an R sketch follows the list):

• Q̂1 : the lower edge of the box


• Q̂3 : the upper edge of the box
• Q̂2 (median): the middle line within the box
• the smallest sample observation (unless there are outliers; see further): the lower
whisker

• the largest sample observation (unless there are outliers; see further): the upper whisker
• individual outliers are indicated by circles. An observation is an outlier when it is
outside the interval [Q̂1 − 1.5 × IQR, Q̂3 + 1.5 × IQR], with IQR = Q̂3 − Q̂1 .
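A minimal R sketch of the quartiles, the IQR and the boxplot, using simulated birth weights as
a stand-in for real data (quantile() interpolates by default, so its values may differ slightly from
the estimation rule given earlier):

set.seed(3)
y <- rnorm(100, mean = 3341, sd = 280)      # simulated stand-in for 100 sampled birth weights

quantile(y, probs = c(0.25, 0.50, 0.75))    # estimated quartiles Q1, Q2 (median) and Q3
IQR(y)                                      # interquartile range Q3 - Q1
boxplot(y, ylab = "birth weight")           # box, whiskers and outliers as described above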

Now that we have enough tools to describe distributions in some detail, we’ll look at one of
the most important and often used distributions in the sciences: the normal distribution.

1.3.4 The normal distribution

When Y is normally distributed with mean µ and variance σ² (standard deviation σ), it is
denoted by
$$Y \sim N(\mu, \sigma^2). \qquad (1.1)$$
This notation indicates that µ and σ² are the parameters of the distribution.
When µ = 0 and σ² = 1, the corresponding normal distribution is called the standard
normal distribution. This distribution has many special properties, and so we will often
use the symbol Z to indicate a standard normally distributed random variable. This is a
distribution that is symmetric about µ = 0. Therefore,
$$P[Z \le -z] = P[Z > z], \qquad (1.2)$$
and so $z_\alpha = -z_{1-\alpha}$.
Suppose that Y is normally distributed with parameters µ and σ². Then the following
important properties hold:
$$Y \sim N(\mu, \sigma^2) \qquad (1.3)$$
$$Y - \mu \sim N(0, \sigma^2) \qquad (1.4)$$
$$\frac{Y - \mu}{\sigma} \sim N(0, 1). \qquad (1.5)$$
In other words, the standardized variable $\frac{Y-\mu}{\sigma}$ is a standard normal variable. More
generally, for constants a and b it holds that
$$Y \sim N(\mu, \sigma^2) \qquad (1.6)$$
$$Y - a \sim N(\mu - a, \sigma^2) \qquad (1.7)$$
$$\frac{Y - a}{b} \sim N\left(\frac{\mu - a}{b}, \frac{\sigma^2}{b^2}\right). \qquad (1.8)$$

Going back to our birth weight example, we have already calculated the mean µ = 3342 and
standard deviation σ = 280. So, the variable $Z = \frac{Y - 3342}{280}$ is standard normally distributed.
This is illustrated in Figure 1.14.
Figure 1.14: The standardised birth weight distribution: histogram (left) and density function
(right).
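In R, normal probabilities and quantiles are available through pnorm() and qnorm(). A short
sketch with the values µ = 3342 and σ = 280 used above:

pnorm(3000, mean = 3342, sd = 280)   # P[Y < 3000] under N(3342, 280^2), about 0.111
pnorm((3000 - 3342) / 280)           # the same probability after standardising to Z
qnorm(0.975)                         # the 97.5% quantile of the standard normal, about 1.96
qnorm(0.025)                         # equals -qnorm(0.975), by symmetry about zero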

1.3.5 The t-distribution

The t-distribution is also a very important distribution in statistics. It looks similar to the
standard normal distribution, but its tails are heavier (i.e. there is relatively more probability
mass in the tails). The t-distribution has one parameter: the degrees of freedom. In general
a t-distribution with d degrees of freedom is denoted by td . The parameter d does not affect
the mean of the t-distribution, which is always zero. The parameter d only affects the variance
and the thickness of the tails of the distribution. When d is very large (theoretically d → ∞),
td approximates a standard normal distribution.
Let Y ∼ N(µ, σ²) with Y1, . . . , Yn a random sample from this population, and let
$$S^2 = \frac{1}{n-1}\sum_{i=1}^{n} (Y_i - \bar{Y})^2.$$
Then,
$$\frac{\bar{Y} - \mu}{S/\sqrt{n}} \sim t_{n-1},$$
where $t_{n-1}$ is a t-distribution with n − 1 degrees of freedom (the parameter of the t-distribution).
So when the sample size is very large, the $t_{n-1}$ distribution is indistinguishable from a standard
normal distribution.
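The heavier tails of the t-distribution, and its convergence to the standard normal as the degrees
of freedom grow, can be checked in R by comparing quantiles:

qt(0.975, df = 5)      # 97.5% quantile of a t-distribution with 5 degrees of freedom (about 2.57)
qt(0.975, df = 19)     # with 19 degrees of freedom (about 2.09)
qt(0.975, df = 1000)   # with many degrees of freedom, very close to ...
qnorm(0.975)           # ... the standard normal quantile, about 1.96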

1.3.6 Sampling variability

We’ll illustrate the effect of sampling variability using two concepts: the sample distribution
and the sample mean.
Suppose that we take a random sample of 20 observations from the population of birth
weights in Belgium between 1998 and 2008. A histogram and boxplot are shown in Figure
1.15. For this particular random sample, we can calculate that the sample mean is ȳ = 3272.

Figure 1.15: Histogram and boxplot of a random sample of 20 birth weights. The sample
mean is ȳ = 3272.
Now suppose that the same study was also carried out by a colleague, Dr. A, who performs
the study in exactly the same way. As before, the design of the study specifies that the 20
birth weights have to be collected by sampling completely at random. This implies that this
colleague will most likely end up with a sample of 20 birth weights which are different from
those in our sample. Figure 1.16 shows the histogram and boxplot of the data that Dr. A
collected. Notably, he found a sample mean of ȳ = 3395, which is different than ours.
Suppose once again that the same study is replicated by another colleague, Dr. B, who again
performs the study in exactly the same way. The design of the study still specifies that the 20
birth weights have to be collected by sampling completely at random. This implies that this
colleague will again most likely end up with a sample of 20 birth weights which are different
from those in the previous samples. Figure 1.17 shows the histogram and boxplot of the data
that Dr. B collected. She found a sample mean of ȳ = 3479, again different from the two
previous sampling experiments.
Now, we send one more colleague, Dr. C, out to replicate the same study. Once more, the 20
birth weights have to be collected by sampling completely at random, and so our colleague
will again most likely end up with a sample of 20 birth weights which are different from those
in the previous samples. Figure 1.18 shows the histogram and boxplot of the data that Dr. C
collected, which has a sample mean of ȳ = 3292: a fourth different sample mean.
Figure 1.16: Histogram and boxplot of Dr. A’s random sample of 20 birth weights. The
sample mean is ȳ = 3395.

Figure 1.17: Histogram and boxplot of Dr. B’s random sample of 20 birth weights. The
sample mean is ȳ = 3479.

Figure 1.18: Histogram and boxplot of Dr. C’s random sample of 20 birth weights. The
sample mean is ȳ = 3292.

Having run out of colleagues who are willing to repeat this experiment, we now decide to
use a sample size of 100, and find some Masters students to run the new experiment. So the
students sample the birth weights of 100 babies completely at random. Figures 1.19, 1.20,
1.21 and 1.22 show histograms and boxplots of these random samples. The corresponding
sample means are 3359, 3325, 3347 and 3342.
Now we decide to repeat the same type of experiment, but with a sample size of 1000.
The Masters students are fed up, so we have to find some Bachelor students. We send the
Bachelor students out to sample the birth weights of 1000 babies completely at random.
Figures 1.23, 1.24, 1.25 and 1.26 show histograms and boxplots of these random samples.
The corresponding sample means are 3341, 3344, 3343 and 3341.
What have we learned from these experiments?

• Sampling variability: every sample resulted in a different empirical distribution (histogram/boxplot) and therefore also in a different sample mean.

• Sample size effect: the larger the sample size, the smaller the sampling variability.

These properties are clearly illustrated in Figure 1.27, where we show the sample mean for
1000 repeated samples, i.e. as if we had bribed 999 colleagues to join us and each repeat the
experiment.
These simulations suggest that estimators have sampling distributions. In particular,
the histograms of the estimates after repeated sampling suggest that the behaviour of the
estimates can be described by distributions, just like the behaviour of the random variable
Y (e.g. birth weight) can be described by a distribution. Later we’ll provide more precise
details of this important concept of sampling distributions.
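The repeated sampling experiments of this section can be imitated in R. The sketch below uses a
simulated stand-in for the birth weight population, so the exact numbers will differ slightly from
the figures:

set.seed(4)
population <- rnorm(500000, mean = 3341, sd = 280)   # simulated stand-in population

# 1000 repeated samples of size n, each summarised by its sample mean
means_n20   <- replicate(1000, mean(sample(population, 20)))
means_n100  <- replicate(1000, mean(sample(population, 100)))
means_n1000 <- replicate(1000, mean(sample(population, 1000)))

c(sd(means_n20), sd(means_n100), sd(means_n1000))    # sampling variability shrinks with n
hist(means_n20, freq = FALSE, main = "sample means, n = 20")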
Figure 1.19: Histogram and boxplot of a random sample of 100 birth weights. The sample
mean is ȳ = 3359.

Figure 1.20: Histogram and boxplot of a second random sample of 100 birth weights. The
sample mean is ȳ = 3325.

Figure 1.21: Histogram and boxplot of a third random sample of 100 birth weights. The
sample mean is ȳ = 3347.

Figure 1.22: Histogram and boxplot of a fourth random sample of 100 birth weights. The
sample mean is ȳ = 3342.

Figure 1.23: Histogram and boxplot of a random sample of 1000 birth weights. The sample
mean is ȳ = 3341.

Figure 1.24: Histogram and boxplot of a second random sample of 1000 birth weights. The
sample mean is ȳ = 3344.

Figure 1.25: Histogram and boxplot of a third random sample of 1000 birth weights. The
sample mean is ȳ = 3343.

Figure 1.26: Histogram and boxplot of a fourth random sample of 1000 birth weights. The
sample mean is ȳ = 3341.

Figure 1.27: Histograms of sample means of repeated experiments with sample sizes n = 20
(left), n = 100 (middle) and n = 1000 (right), based on 1000 repeated samples.

1.4 Estimator of the mean

1.4.1 Random variables: estimates versus estimators

In the previous section, we introduced sampling variability and illustrated it with examples
from the birth weight dataset, using the sample mean. Most importantly, we saw that every
sample results in a different value for the sample mean. This implies that the sample mean
itself possesses a distribution. To understand what is happening here, we step back a bit to
consider the sample observations as realisations of a random variable.
We have stressed several times that the n sample observations arise by random sampling.
Previously we denoted the n sample observations by y1 , . . . , yn . However, we now also in-
troduce the notation Y1 , . . . , Yn to denote the random sample. In particular, this means
that we use the Yi notation to stress that it’s a randomly selected observation. This also
implies that Yi takes no particular value; we only know that it is randomly selected from a
population, which is described by a distribution.
On the other hand, we use yi to denote a particular sample observation that resulted from
the random sample. So yi is not described by a distribution, because it has a value and has
been already observed (i.e. randomness is no longer associated with it).
The previous paragraph further implies that all formulae that we have introduced before
in terms of the sample observations y1 , . . . , yn now have analogues in terms of the random
observations Y1, . . . , Yn. For example, the sample mean can also be expressed as
$$\bar{Y} = \frac{1}{n}\sum_{i=1}^{n} Y_i.$$

When we use the Ȳ notation, we stress that the sample mean is a random variable, i.e. Ȳ
is a function of n randomly selected observations. These n Yi are randomly selected from a
population and so the Yi are described by the population distribution. Since Ȳ is a function
of the Yi , we can conclude that Ȳ is also random and it can therefore also be described by a
distribution and density function.
The randomness of Ȳ and its distribution can be easily seen by considering them from the
perspective of the repeated sampling concept. Let’s think of the sample observations Yi as
the result of random sampling. So every time the same experiment is repeated, and this
experiment starts with random sampling, another sample of n observations is obtained. The
sample mean, which is a function of the n sample observations, is therefore also different
for each new random sample. So Ȳ is also random, and it can therefore be described by a
density function. The density function of Ȳ depends on the distribution of the observations
Yi and on the sample size n.
Now we need some more terminology. Once a sample is observed, we use the notation
y1 , . . . , yn , because these are now known (observed) values and therefore not random. When
this sample is used to estimate the mean (or more generally, any parameter), we use the
notation ȳ, which is also not random, because it is calculated starting from a set of known
values.
However, before the experiment has been performed, we only know that the n sample ob-
servations will be obtained by random sampling from the population. Thus the sample
observations (before observing) are random, and they can only be described by a distribu-
tion (density function). At this stage, the sample mean (a function of the n observations to
be randomly selected), is also random. Therefore the sample mean is denoted by Ȳ . Since
it’s a function of the n random observations (described by a density function), the sample
mean Ȳ is also random and can therefore also be described by a density function.
This “repeated sampling” perspective has already been illustrated in Figure 1.27.
We refer to ȳ as an estimate of the population mean, and to Ȳ as an estimator of the
population mean. The former is described by a single value, while the latter is described by
a distribution (depending on the distribution of the Yi ).

1.4.2 The sampling distribution of the sample mean

The distribution of the sample mean under the normal assumption

Suppose the observations are normally distributed, i.e.
$$Y_i \sim N(\mu, \sigma^2), \qquad (1.9)$$
and that we consider a random sample Y1, . . . , Yn of size n. Further, we will assume that all n
observations are independent (e.g. by independent sampling from a lot of helpful colleagues).
Sometimes, when we want to stress that the observations Yi are independent, we will write
$$Y_i \text{ i.i.d. } N(\mu, \sigma^2), \qquad (1.10)$$
where “i.i.d.” stands for “. . . identically and independently distributed as . . .”.
Under these conditions it can be shown that the sample mean is also normally distributed:
$$\bar{Y} \sim N\left(\mu, \frac{\sigma^2}{n}\right). \qquad (1.11)$$
Hence,

• E[Ȳ] = µ, and we say that Ȳ is an unbiased estimator of µ

• Var[Ȳ] = σ²/n decreases as n increases, so we say that Ȳ is a consistent estimator of µ

• Ȳ is normally distributed

The distribution of the sample mean without the normal assumption

Even without assuming that the sample observations Yi are normally distributed, it still holds
that

• E[Ȳ] = E[Y] = µ, i.e. Ȳ is an unbiased estimator of µ

• Var[Ȳ] = Var[Y]/n = σ²/n, and hence Ȳ is a consistent estimator of µ

Note that $\sqrt{\mathrm{Var}[\bar{Y}]}$ is called the standard deviation of the mean. In some literature,
you might read standard error of the mean, or simply standard error.

The Central Limit Theorem

From the examples and discussion in the previous section, and particularly the birth weight
examples summarized in Figure 1.27, it seems that the following properties hold:

• the mean of the sampling distribution is the same as the mean of the population, i.e.
E[Ȳ] = E[Y] = µ

• the standard deviation of the sample mean distribution (i.e. the standard error) de-
creases as the sample size n increases

• as the sample size n increases, the shape of the sampling distribution of Ȳ becomes normal

In fact there is a powerful mathematical theorem that proves that all of these properties hold
true: the Central Limit Theorem.
The Central Limit Theorem is one of the most important results in probability theory, and
is useful for many applications. It tells us that for any dataset with an unknown distribution
(it could be uniform, exponential or completely random), as long as the sample size n is
sufficiently large then the sample means will approximate the normal distribution. This is
a very powerful statement, and one that forms the basis of many statistical methods like
hypothesis testing and confidence intervals (as we’ll see in the next chapter).
Since this is an applied course rather than a theoretical one, we’ll only give a “loose”
formulation of the CLT, and we won’t examine its proof. Like most introductory statistics
courses, we’ll focus on the specific consequences of the CLT rather than trying to understand
its mathematical and probabilistic foundations.
The CLT can be loosely formulated in two ways:

• the sample mean Ȳ is asymptotically normally distributed: as n → ∞,
$$\bar{Y} = \frac{Y_1 + \dots + Y_n}{n} \rightarrow N\left(\mu, \frac{\sigma^2}{n}\right) \qquad (1.12)$$

• the sample mean Ȳ is approximately normally distributed for large samples:
$$\bar{Y} \stackrel{.}{\sim} N\left(\mu, \frac{\sigma^2}{n}\right) \qquad (1.13)$$

The first formulation deals with the asymptotic limit: this is what happens as the sample
size grows so large that it approaches infinity. Of course, in practice we’ll never deal with
infinite sample sizes, so we might be more interested in the second formulation. It tells us
the same thing: if we take large enough samples from any unknown distribution, then the
distribution of the sample means will become approximately normal.
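A quick way to see the CLT at work is a small simulation in R, starting from a clearly non-normal
population such as the exponential distribution:

set.seed(5)
# 1000 sample means, each based on n = 50 observations from an exponential distribution
means <- replicate(1000, mean(rexp(50, rate = 1)))

hist(means, freq = FALSE, main = "sample means of exponential data (n = 50)")
# the CLT says this is approximately N(mu, sigma^2/n); for this exponential, mu = 1 and sigma = 1
curve(dnorm(x, mean = 1, sd = 1 / sqrt(50)), add = TRUE, col = "red")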

1.5 QQ-Plots

In the previous section we have learned that the sample mean has a normal distribution when
(1) the observations are sampled from a normal distribution, or when (2) the sample size is
very large.
Many of the results that we will discuss in the next lectures (e.g. confidence intervals and
hypothesis tests) rely on the normality of the sample mean, and therefore also on the normal-
ity of the observations in small samples. In a situation where the validity of the statistical
method relies on an assumption (here: normality of the observations), it is important to
assess whether the assumption holds true.
In this section we present a graphical tool to assess a distributional assumption of the obser-
vations: the QQ-plot. QQ stands for quantile-quantile.
To build a QQ-plot, we use the following procedure:

1. calculate the sample mean ȳ and sample variance s2

2. order the sample observations so that y1 < y2 < · · · < yn

3. note that by definition, y1 is the sample quantile that corresponds to the 1/n-th percentile,
y2 is the sample quantile that corresponds to the 2/n-th percentile, etc. For example, y1
is the largest sample observation for which
$$\frac{1}{n}\left(\text{number of sample observations} \le y_1\right) \le \frac{1}{n}$$

4. let qi denote the quantile of the i/n-th percentile of a normal distribution with mean ȳ
and variance s2

5. the QQ-plot is then the plot of yi versus qi (i = 1, . . . , n)

In the context of a QQ-plot, the observations yi are referred to as the observed (or sample)
quantiles, and the qi as the expected (or theoretical) quantiles.
When the sample observations come from a normal distribution, we expect that the points
(qi , yi ) lie closely scattered around the diagonal line. That’s because the diagonal line repre-
sents a one-to-one agreement between the yi observed quantiles and the theoretical normal
quantiles qi . The closer that the (qi , yi ) points are to the diagonal, the closer the distri-
bution is to being normal. Any systematic deviation suggests a violation of the normality
assumption. This is illustrated for a sample of 20 birth weights in Figure 1.28.
When interpreting a QQ-plot, keep in mind that you are looking at a random sample! When
the sample size is small we may still frequently expect some deviation from the diagonal, due
to the sampling variability. The larger the sample size the more stable the QQ-plot is from
random sample to random sample. This is illustrated in Figures 1.29 and 1.30.
To demonstrate what might otherwise happen, Figure 1.31 shows three QQ-plots of samples
that were taken from non-normal distributions. The systematic deviation from the diagonal
is very clear.
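In R, a QQ-plot is produced with qqnorm(), and qqline() adds a reference line. A minimal sketch,
using simulated birth weights as a stand-in for real data:

set.seed(6)
y <- rnorm(20, mean = 3341, sd = 280)   # simulated sample; replace with real data if available
qqnorm(y)                               # observed quantiles versus theoretical normal quantiles
qqline(y)                               # points close to the line support the normality assumption

# for comparison, a clearly non-normal sample shows a systematic deviation
z <- rexp(20, rate = 1)
qqnorm(z)
qqline(z)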
1.5. QQ-PLOTS 41

Normal Q−Q Plot

3800


3600

Sample Quantiles


3400

● ●


3200

3000
● ●

−2 −1 0 1 2

Theoretical Quantiles

Figure 1.28: A QQ-plot of a sample of 20 birth weights.

Normal Q−Q Plot Normal Q−Q Plot Normal Q−Q Plot

3800
● ● ● ●
3800


3800

3700

● ● ●


3600

3600



3600

● ● ●
Sample Quantiles

Sample Quantiles

Sample Quantiles

3500

3400

● ●
● ●


● ● ● ●
● ● ●
3400


3400


3200

● ●

● ●
3300

● ●
● ●
● ●



3000
3200

● ● ●

3200




3100

● ●
2800

● ● ●

−2 −1 0 1 2 −2 −1 0 1 2 −2 −1 0 1 2

Theoretical Quantiles Theoretical Quantiles Theoretical Quantiles

Figure 1.29: Three QQ-plots for three random samples of 20 birth weights (normal distribu-
tions).

Figure 1.30: Three QQ-plots for three random samples of 100 birth weights (normal distributions).
Figure 1.31: Three QQ-plots (top panels) and density functions (bottom panels) for three random samples of 100 observations from non-normal distributions (left: log-normal; middle: exponential; right: Poisson).
Chapter 2

Confidence Intervals and Hypothesis Tests

In this chapter, we will learn:

• the purpose and use of a confidence interval

• how to calculate confidence intervals in different situations

• the purpose and use of a hypothesis test

• the theoretical basis of one-sample and two-sample hypothesis tests

• the types of statistical conclusions we can make, based on the p-value

• how confidence intervals and hypothesis tests are linked


2.1 Confidence interval of the mean

2.1.1 Example: birth weights

Consider again the baby birth weight example. Suppose a sample of 20 observations was
randomly selected from the population. The histogram and boxplot of this sample are shown
in Figure 2.1. Remember that the objective of the study is to estimate the average birth
weight in Flanders for babies born between 1998 and 2008, based only on the sample data.
Based on this sample we find ȳ = 3295.6g. However, if only the sample mean is reported,
the reader would not know to what extent they can trust this value. The reader knows that
it’s only an estimate and that the estimate is not exactly equal to the population mean, but
they don’t know how far this estimate might be from the true mean.
Of course, it’s impossible to calculate exactly the difference between the sample mean and
the true mean. That calculation would require knowing the true mean, and if we knew the
true mean then we wouldn’t have to bother to estimate it! Fortunately, statistics can give
a probabilistic statement about the difference between the sample mean and the true mean.
This is the topic of this section: the confidence interval of the mean.
First we show below how this is reported in R. You’ll learn how to do this yourself in the
practical sessions, so in these notes we’ll focus on the theoretical derivation and interpretation.
The 20 sample observations are presented first, and then the results of computing the confi-
dence interval for the mean.

3437.9 3474.1 3381.2 3263.1 3632.6 3610.1 2947.6 3337.5 3360.0


2954.0 3396.5 3444.5 3026.9 3885.8 3122.1 3244.9 3166.3 2876.0
3233.0 3119.1

One Sample t-test

data: bw
t = 57.9233, df = 19, p-value < 2.2e-16
alternative hypothesis: true mean is not equal to 0
95 percent confidence interval:
3176.562 3414.735
sample estimates:
mean of x
3295.648

Figure 2.1: Histogram and boxplot of a random sample of 20 birth weights.

From this output we can read off the sample mean ȳ = 3295.6, as well as the 95% confidence interval, which is here given by [3176.6; 3414.7]. Now the question is: how should we interpret this interval?

2.1.2 The confidence interval when the variance σ² is known

Before we continue, we’ll assume that the observations are sampled from a normal distribu-
tion, i.e.
Yi ∼ N (µ, σ 2 ),
and we assume that the variance σ 2 is known. For the birth weight example, we have a
finite dataset (500,000 observations), so we can just calculate these parameters and find
σ 2 = 78497.52 or σ = 280.17.
Of course, we should note that in most realistic settings the population is unknown, and
therefore so is σ 2 . In these cases, to assess the normality assumption we should look at the
sample’s histogram and the boxplot (Figure 2.1), and at its normal QQ-plot (Figure 2.2).
The histogram and the boxplot show some minor skewness in the data, but remember that
the plots are based on only 20 observations. The QQ-plot, on the other hand, does not
show any serious deviation from normality. So we feel safe to proceed under the normality
assumption with this sample.
Figure 2.2: A normal QQ-plot of a random sample of 20 birth weights.

We'll now show briefly how the 1 − α confidence interval is constructed, using the theory on the normal distribution discussed in Section 1.3.4 of the previous chapter. We know that

Ȳ ∼ N(µ, σ²/n),

and thus also that

(Ȳ − µ)/(σ/√n) ∼ N(0, 1).

Recall that this is the standard normal distribution. So let Z = (Ȳ − µ)/(σ/√n). From the definition of the standard normal quantile zα/2 we know that

P[−zα/2 ≤ Z ≤ zα/2] = 1 − α.

Now we substitute (Ȳ − µ)/(σ/√n) for Z so that the probability statement becomes

P[−zα/2 ≤ (Ȳ − µ)/(σ/√n) ≤ zα/2] = 1 − α.

After some simple algebra this statement becomes

P[Ȳ − zα/2 σ/√n ≤ µ ≤ Ȳ + zα/2 σ/√n] = 1 − α.    (2.1)

This expression tells us that, with a probability of 1 − α, the true mean µ lies in the interval between Ȳ − zα/2 σ/√n and Ȳ + zα/2 σ/√n. This gives immediately the limits of the 1 − α confidence interval and its interpretation! We'll expand on this interpretation, since this is a key concept.
The 1 − α confidence interval (CI) has a probabilistic interpretation, as seen from Eq. (2.1). In the previous chapter we saw that probabilities can be interpreted from a repeated sampling viewpoint. In particular, suppose that we repeatedly take samples of 20 observations from the population, for example 20 times. For each sample, we compute the sample mean ȳ, as well as the limits of a 95% CI (i.e. α = 0.05), i.e. ȳ − zα/2 σ/√n and ȳ + zα/2 σ/√n.
Figure 2.3: 95% CIs for 20 repeated samples of n = 20 observations. The vertical reference line corresponds to the true population mean µ = 3342.

For the birth weight example, we have σ = 280.17, and so for α = 0.05 we have zα/2 = z0.025 = 1.96. The limits of the 95% CI for the birth weight example are therefore ȳ − 1.96 × 280.17/√20 and ȳ + 1.96 × 280.17/√20.
In these CI limits the sample mean ȳ changes from sample to sample. For 20 repeated
samples this is illustrated in Figure 2.3. This figure shows that the CI limits also change
from sample to sample, i.e. the CI limits are actually random (because they depend on the
random sample observations). Of course, the true population mean µ = 3342 is fixed. From
Figure 2.3 you can see that most of the CIs include the true mean, but some of the CIs do
not cover it.
For example, in Figure 2.3 we count 19 CIs that include µ, and only one CI that does not
include µ. Thus, 19/20 = 95% of the CIs include the true mean. This is no coincidence:
a 95% CI is constructed in such a way that on average 95% of the repeatedly calculated
intervals cover the true mean. The fact that with only 20 repeated samples the true mean
is covered exactly 19 (out of 20) times is a bit lucky. The 95% coverage is only expected
with an infinite number of repeated samples. Figure 2.4 shows the results from 100 repeated
samples. Now 93 out of the 100 repeatedly calculated intervals cover the true mean.
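This repeated sampling interpretation is easy to mimic with a small simulation. The sketch below (using the population values assumed in this example, µ = 3342 and σ = 280.17) generates 100 samples of n = 20 observations, computes the 95% CI for each, and reports the proportion of intervals that cover the true mean.

set.seed(1)                       # arbitrary seed, only for reproducibility
mu <- 3342; sigma <- 280.17; n <- 20; alpha <- 0.05
covered <- replicate(100, {
  y  <- rnorm(n, mean = mu, sd = sigma)
  lo <- mean(y) - qnorm(1 - alpha / 2) * sigma / sqrt(n)
  hi <- mean(y) + qnorm(1 - alpha / 2) * sigma / sqrt(n)
  lo <= mu && mu <= hi            # does this interval cover the true mean?
})
mean(covered)                     # close to 0.95 in the long run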
Figure 2.4: 95% CIs for 100 repeated samples of n = 20 observations. The vertical reference line corresponds to the true population mean µ = 3342.

Figure 2.5 shows results from 100 repeated samples of n = 100 observations. From this graph we can see that the width of the CIs is smaller than with n = 20 observations. What does this mean? The larger the sample size, the narrower the interval becomes.

So we've seen how the sample size can affect the confidence interval. For a different comparison, we'll look at how the choice of the significance level α (and hence the confidence level 1 − α) affects the CIs. Figure 2.6

shows the results from 100 repeated samples of n = 20 observations, but now 90% CIs are
constructed. With α = 0.10 the standard normal quantile becomes z0.10/2 = 1.64, and thus
the intervals are more narrow than with α = 0.05, but the sample mean is equally variable.
As a result, fewer CIs cover the true mean; only 90%.
Let’s go back to our one sample of 20 birth weights. For this sample we would find the 95%
CI to be
[3295.6 − 1.96 × 280.17/√20; 3295.6 + 1.96 × 280.17/√20]
i.e. [3172.81; 3418.39]. But in the software output at the start of this chapter, we read
[3176.6; 3414.7]. Why are they different?
The reason is that the CI we computed in this section is based on the true standard deviation
σ = 280.17, whereas the CI in the output of the software uses the sample standard deviation
s = 254.5, which is here (by coincidence) smaller than the true SD. This situation is discussed
in the next section.

2.1.3 The confidence interval when the variance σ² is unknown

Now we are in a much more realistic scenario: we have a sample with an unknown population mean µ and unknown population variance σ². But we can still construct a confidence interval for the mean, as long as we can rely on the assumption of the normality of the observations. When the true standard deviation σ is unknown, it can be estimated by the sample standard deviation S.
Figure 2.5: 95% CIs for 100 repeated samples of n = 100 observations. The vertical reference line corresponds to the true population mean µ = 3342.
Figure 2.6: 90% CIs for 100 repeated samples of n = 20 observations. The vertical reference line corresponds to µ = 3342.
Figure 2.7: Density functions of t-distributions with 2 (black), 3 (red), 5 (green) and 10 (blue) degrees of freedom. The dotted density is the density of a standard normal distribution. The vertical reference lines correspond to −1.96 and +1.96.

This will have an effect on how the CI must be computed.


In the previous section we started from the standard normality of (Ȳ − µ)/(σ/√n). However, when σ is replaced with S, then T = (Ȳ − µ)/(S/√n) is no longer standard normally distributed; it still has mean zero but the variance is larger than one.
This can be understood as follows: whereas σ was a constant, S is no longer constant. S is
computed from the sample observations, and in repeated sampling experiments this means
that S takes different values from sample to sample. This will obviously increase the variance
of the standardised mean.
Sometimes a difference in terminology is used to stress the difference between (Ȳ − µ)/(σ/√n) and (Ȳ − µ)/(S/√n). The former is referred to as the standardised mean, whereas the latter is known as the studentised mean.

It can be shown that T = (Ȳ − µ)/(S/√n) follows a t-distribution with n − 1 degrees of freedom. Importantly, this property only holds when the observations are normally distributed. This is another reason to always check the normality assumption before proceeding!
Figure 2.7 shows density functions of several t-distributions with different degrees of freedom.
From this plot we can see that:

• the t-distribution is symmetric

• the t-distribution has a larger variance than the standard normal distribution

• the larger the degrees of freedom, the closer the t-distribution approximates the stan-
dard normal distribution (in theory the standard normal distribution is the limit of a
t-distribution when the degrees of freedom approach infinity)

• the 1 − α quantile of a t-distribution with d degrees of freedom (denoted by td;α, the value exceeded with probability α) is larger than the 1 − α quantile of a standard normal distribution.

So now that we know that

(Ȳ − µ)/(S/√n) ∼ tn−1,

we can find the 1 − α CI in exactly the same way as in the previous section. This gives the probability statement

P[−tn−1;α/2 ≤ (Ȳ − µ)/(S/√n) ≤ tn−1;α/2] = 1 − α.

Hence,

P[Ȳ − tn−1;α/2 S/√n ≤ µ ≤ Ȳ + tn−1;α/2 S/√n] = 1 − α.

The 1 − α CI is thus given by

[Ȳ − tn−1;α/2 S/√n ; Ȳ + tn−1;α/2 S/√n].

For the birth weight example this gives [3176.6; 3414.7], exactly as provided by the R output
at the start of the chapter.
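As a sketch, this interval can also be computed directly from the formula (assuming the 20 observations are stored in the vector bw used in that output); the built-in t.test() function reports the same interval.

n <- length(bw)
mean(bw) + c(-1, 1) * qt(1 - 0.05 / 2, df = n - 1) * sd(bw) / sqrt(n)

# The same interval is part of the standard t-test output:
t.test(bw, conf.level = 0.95)$conf.int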

2.2 The one-sample t-test

2.2.1 Example: birth weight

Consider again the birth weight example. Suppose that systematic birth weight registrations
in Spain have shown that the (true) average birth weight in Spain is 3250 grams. Belgian
scientists want to study whether the same average applies to Belgium.
In contrast to the previous birth weight example, we are now in the situation that we do
not know the birth weight population in Belgium. If the mean birth weight in Belgium is
not equal to 3250 grams then scientists hypothesise that it will be larger in Belgium. In
statistical terms these hypotheses are formulated as

H0 : µ = 3250 versus H1 : µ > 3250.



H0 is referred to as the null hypothesis and H1 as the alternative hypothesis. In this


section we will construct a statistical test that can be used for this testing problem. Since
our hypotheses relate to a single sample, this is known as a one-sample test. We first show
the R output for the sample of 20 birth weights.

One Sample t-test

data: bw
t = 0.8023, df = 19, p-value = 0.2162
alternative hypothesis: true mean is greater than 3250
95 percent confidence interval:
3197.266 Inf
sample estimates:
mean of x
3295.648

This output shows again the same sample mean, but note that the CI is different. Later we'll come back to this; it is related to the settings applied to the software routine. For the interpretation of the hypothesis test, the p-value is of particular interest (p = 0.2162). But first we'll show how the test is constructed.

2.2.2 The one-sided one-sample t-test

In the example we want to test whether the true mean of the Belgian birth weights equals
3250 grams, or whether it is greater than 3250 grams. We introduce the notation µ0 for the
hypothesised mean. In this example µ0 = 3250.
Before we continue, we’ll assume that the birth weight observations are sampled from a
normal distribution, i.e.
Yi ∼ N (µ, σ 2 ).
Suppose that the null hypothesis is true, i.e. µ = µ0. From Section 2.1.3 we know that

(Ȳ − µ)/(S/√n) ∼ tn−1,

and therefore

T = (Ȳ − µ0)/(S/√n) ∼H0 tn−1

(the notation ∼H0 means that the distribution is conditional on the null hypothesis being true). The statistic T is referred to as the test statistic, and the tn−1 distribution is here referred to as the null distribution.
From the definition of the 1 − α quantiles we know that

P [T > tn−1;α ] = α

if T is tn−1 -distributed. This happens when H0 is true. To stress this condition we write

P0 [T > tn−1;α ] = α. (2.2)

How do we interpret Eq. (2.2)? Again we look at it from a repeated sampling perspective.
To make it more concrete, we take α = 0.05. This gives t20−1;0.05 = 1.73.
Suppose that there are 200 students in this statistics course and that each individual student
performs this study. Each student therefore starts with sampling 20 birth weights at random.
Each student will find a different sample mean ȳ and sample standard deviation s, and
therefore each student will find a different value for the test statistic
t = (ȳ − µ0)/(s/√n).

Some of the students will find a t value that is larger than t20−1;0.05 = 1.73, and the other
students will have t ≤ t20−1;0.05 = 1.73. Now suppose that H0 is true. From Eq. (2.2) we know
that we expect only 0.05 × 200 = 10 students will find a t value larger than t20−1;0.05 = 1.73,
but only if H0 is true.
What if H0 is not true and H1 is? In other words, what if µ > µ0 ?
When µ > µ0 , we can see in Figure 2.8 that the distribution of the test statistic T will
be shifted to the right. This means that, compared to the null distribution, there will be
relatively more probability in the tail of this distribution larger than t20−1;0.05 = 1.73. So
we expect that when the students sample repeatedly, they would often find a test statistic T = (Ȳ − µ0)/(S/√n) that's larger than t20−1;0.05. We actually even expect that many students will now find t-statistics that are larger than those found in the case where µ = µ0.
From Figure 2.8 we learn:

• when H0 is true and µ = 3250, there is 5% chance that T exceeds 1.73

• when H1 is true and µ > 3250, there is more than 5% chance that T exceeds 1.73.

This observation suggests that finding a large test statistic indicates that it is more likely
that H1 is true.
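A small simulation makes this concrete. The sketch below imitates the 200 students, once with µ = µ0 = 3250 and once with µ = 3300, and counts how many of the 200 t-values exceed the critical value (taking σ = 280.17, the value used earlier for the birth weight population).

set.seed(1)
sigma <- 280.17; n <- 20; mu0 <- 3250
crit <- qt(0.95, df = n - 1)                      # critical value, approximately 1.73
tvals <- function(mu) replicate(200, {
  y <- rnorm(n, mean = mu, sd = sigma)
  (mean(y) - mu0) / (sd(y) / sqrt(n))             # each student's test statistic
})
sum(tvals(3250) > crit)    # H0 true: roughly 10 of the 200 exceed the critical value
sum(tvals(3300) > crit)    # H1 true: considerably more exceed it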
Figure 2.8: Density functions of the distributions of the test statistic T when n = 20 and µ = 3250 (full line) and when µ = 3300 (dotted line). The vertical reference line corresponds to 1.73 = t20−1;0.05.

Now back to the repeated sampling experiment under the assumption that H0 is true. Only
10 students out of 200 have found a t value larger than 1.73. Suppose that the statistic t is
used as follows to reach a final conclusion:

• when t > tn−1;α then H0 is rejected and H1 is concluded

• when t ≤ tn−1;α then H0 is accepted.

So only 10 out of the 200 students will formulate the wrong conclusion, and 190 students will
formulate the correct conclusion. The statistical hypothesis test is therefore constructed in
such a way that it controls the probability for making a decision error when H0 is correct.
This probability (or error rate) is controlled by α.
Some important terminology:

• α is known as the significance level, or the type I error rate

• tn−1;α is known as the critical value.

Consider now again a repeated sampling experiment performed by the 200 students, but now
suppose that the true mean is larger than µ0 = 3250. Suppose that µ = 3300 grams. We see
from Figure 2.8 that we expect more students to find larger t values. In this example, theory
(details not shown) says that about 40 students (20%) will have a t-value larger than the 5%
level critical value 1.73.
Figure 2.9: Density functions of the distributions of the test statistic T when n = 100 and µ = 3250 (full line) and when µ = 3300 (dotted line). The vertical reference line corresponds to 1.66 = tn−1;α.

What happens if we change the size of our sample? Suppose that the repeated sampling experiment is repeated but now with n = 100. The 5% level critical value is now t99;0.05 = 1.66. The density functions of T under H0 and under H1 with µ = 3300 are shown in
Figure 2.9. We now expect that about 100 out of the 200 students (50%) will reject the
null hypothesis in favour of the alternative, i.e. about half of the students find the correct
conclusion. The probability of finding the correct conclusion for a particular µ > µ0 is known
as the power of the test. Clearly, we’d like our tests to have as much power as possible.
How can we achieve this? Comparing Figures 2.8 and 2.9 tells us that the power increases
with sample size.
Figure 2.10 illustrates that the power also increases with µ−µ0 , i.e. the greater the difference
between the true mean and the hypothesised mean, the larger the power. In particular, when
µ = 3400 the power is larger than 99%. In our repeated sampling experiment we expect then
that at least 198 students out of the 200 will arrive at the correct conclusion.
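These power values can be checked with the power.t.test() function (a sketch, again assuming σ ≈ 280.17 as for the birth weight population).

power.t.test(n = 20,  delta = 50,  sd = 280.17, sig.level = 0.05,
             type = "one.sample", alternative = "one.sided")   # power roughly 0.2
power.t.test(n = 100, delta = 50,  sd = 280.17, sig.level = 0.05,
             type = "one.sample", alternative = "one.sided")   # power roughly one half
power.t.test(n = 100, delta = 150, sd = 280.17, sig.level = 0.05,
             type = "one.sample", alternative = "one.sided")   # power above 0.99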

2.2.3 Strong and weak conclusions

With a finite sample size n, it is not possible to find a statistical test that never (i.e. with
probability zero) results in decision errors: if the sample size is smaller than the population
then there will always be some uncertainty around our conclusions that we can’t completely
overcome.
Figure 2.10: Density functions of the distributions of the test statistic T when n = 100 and µ = 3250 (full line) and when µ = 3300 (dotted line) and when µ = 3400 (dashed line). The vertical reference line corresponds to 1.66 = tn−1;α.

Decision errors can be split into two types. These are presented in Table 2.1, where we read
the type I error rate (significance level)

P0 [reject H0 ] = P [reject H0 |µ = µ0 ] = α.

We read this as “the probability of rejecting H0 given that µ = µ0 ”, or in other words the
probability that we reject H0 when it’s actually true. As we have seen before, statistical tests
are constructed in such a way that they guarantee that this error rate is not larger than α.
The probability β is referred to as the type II error rate. It is given by

P [accept H0 |µ] = β.

We read this as “the probability of accepting H0 given that H1 is true”. This error is related
to the power of the test, defined as

P [reject H0 |µ] = 1 − P [accept H0 |µ] = 1 − β.

It is important to see that the type II error rate (and thus also the power) is not explicitly
controlled by the statistical test. The type II error rate is affected by the true mean µ,
which is unknown to us, and on the sample size (see again Figures 2.8, 2.9 and 2.10), but
we cannot explicitly determine the type II error rate from these quantities. This brings us
to the understanding that a statistical test can result in strong and weak conclusions.
Consider the following two situations.

Table 2.1: Types of decision errors

             H0 true    H1 true
accept H0    correct    β error
reject H0    α error    correct

• Based on a random sample, we reach the conclusion to reject the null hypothesis. In
Table 2.1, we are therefore in the second row. There are now two possibilities: either
we have reached the correct conclusion, or we have reached the wrong conclusion. The
latter situation, however, happens only with a small probability (α). It is therefore
possible but unlikely that we have reached the wrong conclusion. Rejecting the null
hypothesis is therefore a strong conclusion.

• Based on a random sample, we reach the conclusion to accept the null hypothesis. In
Table 2.1, we are therefore in the first row. There are now two possibilities: either
we have reached the correct conclusion, or we have reached the wrong conclusion. The
latter situation, however, happens with a probability (β) that is not explicitly controlled
by the statistical test. When we don’t know β we must be careful! It may be a small
or a large probability: we simply don’t know. Hence accepting the null hypothesis is a
weak conclusion.
Actually, we didn’t really reach a conclusion: rather we only concluded that we are not
able to reject the null hypothesis. It’s said that there was insufficient evidence in the
data to reject the null hypothesis in favour of the alternative, which indeed sounds like
a weaker conclusion than in the previous case.

2.2.4 The p-value

In the R output of the birth weight example at the start of the section, we read

t = 0.8023, df = 19, p-value = 0.2162

Since the observed test statistic t = 0.8 is less than the critical value 1.73, we conclude that
we should accept the null hypothesis at the 5% level of significance. The output also shows
a p-value (p = 0.2162). In this section we’ll explain what this means.
For the one-sided one-sample t-test the p-value is defined as

p = P0 [T ≥ t] . (2.3)

In this expression T is the random test statistic and t is the observed test statistic, based on
one single sample. This probability statement can be read as: the probability that the test
statistic is more extreme than the one that we observed, given that the null hypothesis holds true.
Going back to the repeated sampling experiment with 200 students, in the situation where
µ = µ0 (i.e. H0 is true) the p-value can be understood as follows. Suppose that you observed
p = 0.22 with t = 0.8. Then it is expected that about 22% of the students have found a test
statistic that exceeds yours (t = 0.8). This means that your observed test statistic of t = 0.8
is not unexpectedly large; there are still many students that have found t-values larger than
yours. Remember that large t-values give more support to the alternative hypothesis, i.e.
the larger the t-value the more likely it is that H1 is true rather than H0 .
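The p-value in the output can be reproduced directly from the t-distribution; the sketch below uses the observed test statistic t = 0.8023 with 19 degrees of freedom.

pt(0.8023, df = 19, lower.tail = FALSE)   # P0[T >= t], approximately 0.216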
Suppose now that you sampled 100 birth weights instead of 20. The output of the software
could, for example, now look as follows.

One Sample t-test

data: bw
t = 2.8711, df = 99, p-value = 0.002501
alternative hypothesis: true mean is greater than 3250
95 percent confidence interval:
3284.978 Inf
sample estimates:
mean of x
3332.946

Now the observed test statistic is t = 2.87 and the p-value equals p = 0.0025. So if H0 were
true then you expect that only 0.25% of the other students (that is less than one student!)
would find a test statistic that is larger than yours. Thus if the null hypothesis were true, it is
very unlikely to find a test statistic as large as yours. However, we know that large t-statistics
are much more likely to happen when H1 is true. When the p-value is small we therefore
conclude that we should reject the null hypothesis in favour of the alternative hypothesis.
In order to make this decision, how small should the p-value be? Of course, by construction we
want our decision to control the type I error rate at the significance level α. Suppose (although
very improbable) that you found t = tn−1;α , a test statistic exactly equal to the critical value.
With n = 100 and α = 5% this is t = 1.66. Eq. (2.3) now becomes p = P0 [T > tn−1;α ], which
is exactly equal to α by definition.
The decision to accept or reject H0 can thus equivalently be formulated as

• when p < α then H0 is rejected and H1 is concluded



• when p ≥ α then H0 is accepted.

The p-value itself is very informative. In reporting the results of a statistical test the p-value
should always be reported, since it gives information about the strength of the conclusion.

2.2.5 Another one-sided one-sample t-test

Example: Calcium concentration


A food product company makes yoghurt for which they advertise that one portion con-
tains 500mg calcium. A suspicious consumer organisation sets up an experiment to test the
company’s statement. They sample 100 portions of that particular yoghurt from several
supermarkets.
The consumer organisation is actually only interested in detecting if the company puts less
calcium in the yoghurt than advertised. Therefore they formulate the hypotheses as

H0 : µ = 500 versus H1 : µ < 500.

The alternative hypothesis is again a one-sided hypothesis, but now “to the left”.
The sample data for the yoghurt calcium measurements are listed below.

424.0 474.5 424.8 450.9 466.9 429.8 478.4 445.3 448.1 472.8 501.2 436.8 484.9
473.4 443.6 427.1 419.9 476.8 430.4 499.6 445.2 473.8 464.3 459.7 453.3 448.1
458.2 402.5 449.1 468.9 464.1 424.0 411.4 425.0 452.5 388.5 426.8 509.8 427.3
476.5 477.2 450.8 479.3 418.9 416.4 406.5 396.7 456.2 465.6 423.9 454.8 477.1
512.7 493.0 449.5 423.9 453.0 468.2 445.2 459.5 403.0 451.7 468.2 475.0 435.7
463.4 491.0 395.0 468.2 426.8 412.9 463.4 474.3 479.2 452.3 453.2 465.3 445.8
421.3 444.5 386.1 480.4 372.6 416.3 424.8 405.9 435.4 447.8 460.9 450.5 381.2
469.2 509.5 400.7 429.8 489.6 455.0 432.4 422.9 498.9

The R output is shown next.

One Sample t-test

data: Ca
t = -17.0818, df = 99, p-value < 2.2e-16
alternative hypothesis: true mean is less than 500
95 percent confidence interval:
-Inf 453.2945
sample estimates:
mean of x
448.2658

This testing problem uses the same one-sample test statistic as before, but now the decision
rule becomes

• when t < tn−1;1−α then H0 is rejected and H1 is concluded

• when t ≥ tn−1;1−α then H0 is accepted.

Note that tn−1;1−α = −tn−1;α due to the symmetry of the t-distribution.


The test can also be performed by using the p-value. The p-value must measure how likely
it is to observe a test statistic “more extreme” (in the direction of the alternative) than the
one observed, given that the null hypothesis is true. However, unlike with the first example
of the one-sided test with H1 : µ > µ0 , since we now have H1 : µ < µ0 the interpretation
of “more extreme” is now “smaller than” the observed test statistic. The p-value must thus
now be defined as
p = P0 [T ≤ t] .

In the R output we find

t = -17.0818, df = 99, p-value < 2.2e-16

The p-value is very small: p < 2.2 × 10−16 . Thus, at the 5% level of significance the null
hypothesis is rejected in favour of the alternative hypothesis that the mean calcium concen-
tration in the yoghurt sample is indeed less than 500mg.
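A sketch of the corresponding R call and of the left-sided p-value (assuming the 100 measurements are stored in the vector Ca used in the output above):

t.test(Ca, mu = 500, alternative = "less")   # should reproduce output like the one shown
pt(-17.0818, df = 99)                        # p = P0[T <= t], essentially zero here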
We shouldn’t ever forget that the correctness of the t-test depends on the normality assump-
tion. Actually we should have checked this first! Figure 2.11 shows the normal QQ-plot of
the calcium data set. This plot shows no severe deviation from normality. Moreover, even if
some deviation were observed, the central limit theorem could have been used to argue that
the sample mean is approximately normally distributed. This argument is sufficient to still
correctly interpret the p-value, particularly when the p-value is not close to the significance
level α.

2.2.6 The two-sided one-sample t-test

Consider again the calcium example. Based on the previous test, the consumer organisation is even more suspicious of the yoghurt company's advertising claims, so it decides to carry out another test.
Figure 2.11: A normal QQ-plot of the calcium concentration of 100 randomly selected yoghurts.

This time it will check if there is an excess of calcium in the yoghurt
compared to the stated 500mg. Therefore the hypotheses are formulated as

H0 : µ = 500 versus H1 : µ ≠ 500.

The alternative hypothesis is referred to as a two-sided alternative, since we're now interested in learning whether the mean is less than or greater than the value stated in the null hypothesis.

One Sample t-test

data: Ca
t = -17.0804, df = 99, p-value < 2.2e-16
alternative hypothesis: true mean is not equal to 500
95 percent confidence interval:
442.2572 454.2768
sample estimates:
mean of x
448.267

Again the same test statistic is applied, but now the null hypothesis will be rejected for both
extremely large and extremely small values of the test statistic. In particular,
Figure 2.12: The density function of t99. The vertical reference lines correspond to the critical values ±t99;0.025 = ±1.98.

• when t < tn−1;1−α/2 or t > tn−1;α/2 then H0 is rejected and H1 is concluded

• when tn−1;1−α/2 ≤ t ≤ tn−1;α/2 then H0 is accepted.

Note that tn−1;1−α/2 = −tn−1;α/2 due to symmetry. Moreover, the null distribution of T is also symmetric. Hence,

• when |t| > tn−1;α/2 then H0 is rejected and H1 is concluded

• when |t| ≤ tn−1;α/2 then H0 is accepted.

Note that the critical value is now given by tn−1;α/2, i.e. with probabilities α/2 in the two tails. When n = 100 and α = 0.05, we can calculate that tn−1;α/2 = 1.98. This is illustrated in Figure 2.12.
Since the alternative hypothesis is different from those of the one-sided tests, the p-value also
requires a new definition.
In general the p-value is defined as the probability, given that H0 is true, that the test statistic
is more extreme than the observed test statistic. “More extreme” must be interpreted in the
direction of the alternative hypothesis.
For the two-sided test this becomes

p = P0 [T ≤ −|t| or T ≥ |t|].

This can be simplified because the null distribution of T is symmetric. Thus,

p = P0 [|T| ≥ |t|].
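In R the two-sided p-value is therefore simply twice the one-sided tail probability; a sketch using the observed statistic from the calcium output:

2 * pt(-abs(-17.0804), df = 99)   # two-sided p-value, here < 2.2e-16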

Finally, we show that there exists an equivalence between the two-sided one-sample t-test at the α-level of significance and the 1 − α confidence interval.

The two-sided one-sample t-test at the α-level of significance accepts H0 if

−tn−1;α/2 ≤ t ≤ tn−1;α/2.

Now we replace t with its definition t = (ȳ − µ0)/(s/√n). The condition for accepting H0 thus becomes

−tn−1;α/2 ≤ (ȳ − µ0)/(s/√n) ≤ tn−1;α/2,

or, equivalently,

ȳ − tn−1;α/2 s/√n ≤ µ0 ≤ ȳ + tn−1;α/2 s/√n.

So, if µ0 lies within the 1 − α CI, then H0 is accepted at the α level of significance (two-sided test). The reverse of that statement is also true.

2.3 The paired t-test

Example: heavy metals in soils


A company specialised in soil sanitation wants to test an experimental method to remove
heavy metals from contaminated soil. It sets up a lab experiment using 10 soil samples
randomly selected from several fields, using the following procedure:

• Measure the total heavy metal concentration in each of these n = 10 soil samples:
X11 , . . . , X1n .

• Treat the soil samples with the experimental method.

• Measure again the total heavy metal concentration in these soil samples after the treat-
ment: X21 , . . . , X2n

Note that the observations are paired! We want to compare the same soil samples before and
after treatment, so we can write the sample observations as (X11 , X21 ), (X12 , X22 ), . . . , (X1n , X2n ).

Let µ1 denote the mean of the pre-treatment response, and let µ2 denote the mean of the
post-treatment response. The company is interested in reducing the concentrations of heavy
metals, so the hypotheses of interest are:

H0 : µ1 = µ2 versus H1 : µ1 > µ2 .

The problem can be reduced to a one-sample testing problem in the following way:

• Let Yi = X2i − X1i , i = 1, . . . , n (i.e. the Yi represent the pairwise differences).

• When µ represents the mean of Yi , i.e. µ = E [Yi ] = E [X2i − X1i ] = µ2 − µ1 , then the
hypotheses can be reformulated as

H0 : µ = 0 versus H1 : µ < 0.

This problem can therefore be addressed with a one-sample t-test using the Yi as sample
observations. What are the assumptions underlying the method? Remember that we need
to check these first.

• Do the X1i need to be normally distributed? Do the X2i need to be normally dis-
tributed? No, not necessarily.

• The Yi do need to be normally distributed, because they are used in the one-sample
t-test.

Figure 2.13 shows the boxplot and the normal QQ-plot for the pairwise sample difference
variable of the metal example. None of the plots show a severe deviation from normality,
and therefore the paired t-test is valid for this sample.
The R output for this test is shown below. What do you conclude?

Paired t-test

data: post and pre


t = -3.6943, df = 9, p-value = 0.002482
alternative hypothesis: true difference in means is less than 0
95 percent confidence interval:
-Inf -11.82011
sample estimates:
mean of the differences
-23.462
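Output like this can be produced with a call along the following lines (a sketch, assuming the pre- and post-treatment concentrations are stored in vectors pre and post):

t.test(post, pre, paired = TRUE, alternative = "less")

# Equivalently, as a one-sample t-test on the pairwise differences Yi = post - pre:
t.test(post - pre, mu = 0, alternative = "less")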
Figure 2.13: Boxplot (left) and normal QQ-plot (right) of the pairwise sample difference variable Yi for the metals data set.

2.4 The two-sample t-test

Now we’ll consider an example where we want to compare two samples of data, rather than
just one as in the previous section. Unsurprisingly, this type of test is called a two-sample
test.
Example: pudding elasticity
The elasticity of puddings is an important physical property of the product, in the sense that customers don't only like puddings because of the taste, but also because of their texture. This textural property can be measured, for example, by the elasticity of the pudding, which is influenced by its ingredients. It's believed that elasticity is particularly influenced by the concentration of carrageenan, a compound derived from seaweed.
Therefore, an experiment is set up in which two pudding recipes are compared in terms of
their elasticity. In one sample, n1 = 20 puddings are prepared with a high concentration of
carrageenan, and in the other sample n2 = 20 puddings with a low concentration.
The boxplots of the samples are shown in Figure 2.14.
We are interested in testing whether the mean elasticities of both types of pudding are equal,
i.e.
H0 : µ1 = µ2 ,
where µ1 and µ2 are the mean elasticity of the high and low concentration groups, respectively.
Suppose that the researcher in charge of this experiment is certain that the elasticity is
definitely not smaller in the high concentration group. Then the alternative hypothesis of
interest is
H1 : µ1 > µ2 .

Figure 2.14: Boxplots of the elasticity of the two types of puddings.

We will make the following assumptions:

• the observations in the first sample are normally distributed, i.e.
  Y1i i.i.d. N(µ1, σ1²) (i = 1, . . . , n1)

• the observations in the second sample are normally distributed, i.e.
  Y2i i.i.d. N(µ2, σ2²) (i = 1, . . . , n2)

• we will assume that the variances are equal, i.e. σ1² = σ2².

Let Ȳ1 = (1/n1) Σ_{i=1..n1} Y1i and Ȳ2 = (1/n2) Σ_{i=1..n2} Y2i.
The QQ-plots are shown in Figure 2.15. Only a minor deviation from the normality assumption is observed. We do not expect this to have important consequences, particularly since, as we'll see later, the p-value is sufficiently far away from the significance level.
For the construction of the test statistic, we first note that Ȳ1 − Ȳ2 is an estimate of µ1 − µ2 (because E[Ȳ1 − Ȳ2] = E[Ȳ1] − E[Ȳ2] = µ1 − µ2). Therefore, we will reject H0 in favour of H1 for large values of Ȳ1 − Ȳ2. We can show that

Ȳ1 − Ȳ2 ∼ N(µ1 − µ2, σ1²/n1 + σ2²/n2),

which under H0 : µ1 = µ2 becomes

Ȳ1 − Ȳ2 ∼H0 N(0, σ1²/n1 + σ2²/n2).

And, by standardization,

(Ȳ1 − Ȳ2) / √(σ1²/n1 + σ2²/n2) ∼H0 N(0, 1).    (2.4)
Figure 2.15: Normal QQ-plots of the elasticity of the two types of puddings.

So the following statistic seems to be a good test statistic since it has a nice null distribution (easy to calculate and interpret) and it measures the deviation from the null hypothesis in the direction of the alternative:

T = (Ȳ1 − Ȳ2) / √(σ1²/n1 + σ2²/n2) ∼H0 N(0, 1).    (2.5)

Unfortunately it depends on the true variances, and these are in practice often unknown. We can instead replace σ1² and σ2² with their sample estimators S1² and S2²:

S1² = (1/(n1 − 1)) Σ_{i=1..n1} (Y1i − Ȳ1)²  and  S2² = (1/(n2 − 1)) Σ_{i=1..n2} (Y2i − Ȳ2)².

This results in the test statistic

T = (Ȳ1 − Ȳ2) / √(S1²/n1 + S2²/n2).

What are the characteristics of this null distribution?

• when the sample sizes n1 and n2 are small, the exact null distribution of T is unknown but it can be approximated by a t-distribution with ν degrees of freedom (ν has to be estimated from the data; details not shown here);

• when the sample sizes n1 and n2 are very large, it can be shown that T is approximately
standard normally distributed under H0 .

Under the additional assumption that σ1² = σ2² (known as homoscedasticity), only one variance needs to be estimated. Let σ² = σ1² = σ2² denote this variance. It can be estimated by the pooled variance estimator:

Sp² = ((n1 − 1)S1² + (n2 − 1)S2²) / (n1 + n2 − 2).

The pooled variance estimator is a weighted average of S1² and S2², and since E[S1²] = σ1² = σ² (i.e. S1² is unbiased), and since E[S2²] = σ2² = σ² (i.e. S2² is unbiased), we have E[Sp²] = σ², i.e. Sp² is an unbiased estimator of the common variance σ².

The test statistic now becomes

T = (Ȳ1 − Ȳ2) / √(Sp²/n1 + Sp²/n2),

which, under H0, follows a tn1+n2−2 distribution.


Now, back to our puddings. In this example, the alternative hypothesis is

H1 : µ1 > µ2 .

Therefore a one-sided test is required, i.e. H0 will be rejected in favour of this alternative for
large values of T .
First we have to assess the assumption of normality and the assumption of equality of vari-
ances. Normality was already concluded from the QQ-plots of Figure 2.15. From the boxplots
in Figure 2.14 we may also conclude that the variances can be considered as equal.
For a one-sided test the decision rule becomes

• if t = (ȳ1 − ȳ2) / √(sp²/n1 + sp²/n2) ≤ tn1+n2−2;α then accept H0

• if t = (ȳ1 − ȳ2) / √(sp²/n1 + sp²/n2) > tn1+n2−2;α then reject H0 in favour of H1

Finally, below we show the R output we obtain for this hypothesis test. The p-value is 0.0074. Since this is smaller than α = 5% we conclude at the 5% level of significance that the carrageenan concentration does on average affect the elasticity. Moreover, as this is concluded from a one-sided test, we conclude that a larger carrageenan concentration produces a larger mean elasticity. Remember that the normality was slightly in doubt in one of the two groups. However, as the p-value is much smaller than α, the small error in the p-value will not result in a different conclusion.

Two Sample t-test



data: G by concentration
t = 2.5554, df = 38, p-value = 0.007366
alternative hypothesis: true difference in means is greater than 0
95 percent confidence interval:
58.0111 Inf
sample estimates:
mean in group 1 mean in group 2
9156.840 8986.336
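Output like this comes from a call along the following lines (a sketch, assuming a data frame pudding containing the elasticity G and a factor concentration coding the two recipes; var.equal = TRUE requests the pooled-variance test discussed above):

t.test(G ~ concentration, data = pudding,
       var.equal = TRUE, alternative = "greater")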

Suppose that the pudding researcher did not suspect beforehand what the effect of the concentration of carrageenan might be. Since it could increase or decrease elasticity, a two-sided test would be better. The hypotheses are then

H0 : µ1 = µ2 and H1 : µ1 ≠ µ2.

The construction of the decision rule for this test is similar to the one-sample t-test, resulting
in

• if −tn1 +n2 −2,α/2 ≤ t ≤ tn1 +n2 −2,α/2 then accept H0

• if t < −tn1 +n2 −2,α/2 or t > tn1 +n2 −2,α/2 then reject H0 in favour of H1

2.5 Confidence interval of µ1 − µ2

Once again there is a theoretical equivalence between the hypothesis test that we have just
discussed and a certain confidence interval. Recall that for the two-sided one-sample t-test,
we showed that if µ0 lies within the 1 − α CI then H0 was accepted at the α significance level.
For the two-sample test, we start again from

Ȳ1 − Ȳ2 ∼ N(µ1 − µ2, σ1²/n1 + σ2²/n2).

By assuming σ1² = σ2² and using the pooled variance estimator Sp² for estimating the common variance, we know that

(Ȳ1 − Ȳ2 − (µ1 − µ2)) / √(Sp²/n1 + Sp²/n2) ∼ tn1+n2−2.

Note that this result is equivalent to Eq. (2.4) for the construction of the two-sample t-test, except that we don't impose µ1 = µ2 because we're not hypothesis testing.

We proceed as in Section 2.1.3 of this chapter. For a t-distributed random variable we know that

P[−tn1+n2−2;α/2 ≤ (Ȳ1 − Ȳ2 − (µ1 − µ2)) / √(Sp²/n1 + Sp²/n2) ≤ tn1+n2−2;α/2] = 1 − α.

Hence,

P[Ȳ1 − Ȳ2 − tn1+n2−2;α/2 √(Sp²/n1 + Sp²/n2) ≤ µ1 − µ2 ≤ Ȳ1 − Ȳ2 + tn1+n2−2;α/2 √(Sp²/n1 + Sp²/n2)] = 1 − α.

This gives the lower and upper limits of a 1 − α CI of µ1 − µ2. In particular, the interval is given by

[Ȳ1 − Ȳ2 − tn1+n2−2;α/2 √(Sp²/n1 + Sp²/n2) , Ȳ1 − Ȳ2 + tn1+n2−2;α/2 √(Sp²/n1 + Sp²/n2)].

As before, there exists an equivalence between a two-sided test at the α level of significance
and a 1 − α confidence interval. If 0 is within the 1 − α CI then H0 : µ1 = µ2 is accepted at
the α level of significance (two-sided test), and the other way around.
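As a sketch, this interval can be computed in R directly from the two samples (assuming they are stored in vectors y1 and y2); t.test() with var.equal = TRUE reports the same interval.

n1 <- length(y1); n2 <- length(y2)
sp2 <- ((n1 - 1) * var(y1) + (n2 - 1) * var(y2)) / (n1 + n2 - 2)    # pooled variance
se  <- sqrt(sp2 / n1 + sp2 / n2)
mean(y1) - mean(y2) + c(-1, 1) * qt(0.975, df = n1 + n2 - 2) * se   # 95% CI for mu1 - mu2

t.test(y1, y2, var.equal = TRUE)$conf.int                           # same interval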
Chapter 3

Analysis of Variance

In this chapter, we will learn:

• the purpose of analysis of variance (ANOVA)

• the theoretical basis of the ANOVA model

• how to interpret the results of an ANOVA model

• how to check whether we can use the ANOVA model or not

3.1 One-way ANOVA

3.1.1 Introduction and motivation

Let's go back to our very first example: treating pears with an edible coating to try to reduce the rate at which they spoil. Remember from Section 1.1 that this experiment involves applying different concentrations of coating to the pears, and then using an instrument to measure the firmness of the pears. Four treatments were tested, representing concentrations of 0.5%, 0.8% and 1%, as well as a 0% solution as a control. Figure 3.1 shows the boxplots of the measured firmness values for each of these four groups.

Figure 3.1: Boxplot of the firmness for the four coating treatments.

We would like to know if the treatment (the coating concentration) has an effect on the
firmness of the pears. If we examine the boxplots in Figure 3.1, we think we might indeed see
an effect: it appears that the different treatments result in different firmness. In particular,
the largest effect seems to be for the first treatment, while treatments 2 and 3 also seem to
result in higher firmness than the control treatment (group 4), although the effect appears
to be less strong. But our question now is: are these differences just due to chance? Or is
there really an effect?
If we were only dealing with two groups of observations, we know from the previous chapter
how to tackle this question: we could set up a two-sample t-test to decide between the null
hypothesis that the two groups have the same mean firmness, or the alternative hypothesis
that the mean firmness of the two groups is not the same. But we don’t have two groups: we
have four. How should we then proceed? We need a new kind of test, one that will allow us
to compare the means of several groups at the same time. This is the subject of this chapter.
We will learn about a tool called analysis of variance, or ANOVA for short.
Although the name refers to variance, ANOVA is actually used to test for differences between
means. We will see shortly why this is so. ANOVA is one of the most commonly used tools
in statistics, and you will come across it very often in scientific studies. Before we dive into
the details of how to perform an ANOVA analysis, in the remainder of this section we’ll go
through a general description of how ANOVA works.

Let’s think again about our pear example. We are interested in the effect of the protective
coating on the mean firmness of pears. We have four different groups that have each been
treated with a different concentration of the coating (including one control group that has a
0% coating). We want to do some kind of statistical test that allows us to compare the mean
firmness of each of these four groups. This means that we would like to investigate the null
hypothesis that the means are equal:

H0 : µ1 = µ2 = µ3 = µ4

where µi is the mean firmness of group i.


Then our alternative hypothesis is that the means are not equal. We’ll see later that the
means can be “not equal” in several ways (for example, three means can be equal and one
not, or two means can be equal and the other two not). But for now, we’ll think about the
alternative hypothesis in the most general way:

H1 : it is not true that µ1 = µ2 = µ3 = µ4

Now it's time to consider the variance referred to in ANOVA's name. From Chapter 1, we know that the sample variance of our pear dataset is given by

s²(Y) = (1/(n − 1)) Σ_{i=1..p} Σ_{j=1..ni} (Yij − Ȳ)²    (3.1)

where the index i (i = 1, . . . , p) refers to the treatment group (for the pears, p = 4), the
index j (j = 1, . . . , ni ) refers to the observations (pears) in each group, and Ȳ is the overall
sample mean, i.e. the mean of all observations.
Notice that we are calculating the sample variance while being careful to keep track of which
pear belongs to which treatment group. This is why there are two sums in our formula: we
sum over the treatment groups and over the observations within each group. Previously, we
didn’t have separate groups in our data and so we used only one index k (k = 1, . . . , n) to
sum over all the observations in our dataset:
s²(Y) = (1/(n − 1)) Σ_{k=1..n} (Yk − Ȳ)².

It should be clear that these two formulae for the sample variance will give us the same result:
in both cases we are just summing over every observation in our dataset. But even though the
second formula is easier, when doing ANOVA we have to keep in mind the different groups
present in our dataset: after all, our goal is to draw conclusions about possible differences
between them. So we will use Eq. (3.1) for the sample variance.

When we work with variance in ANOVA, we will actually use the total sum of squares (SSTot) rather than the sample variance. The total sum of squares is given by

SSTot = Σ_{i=1..p} Σ_{j=1..ni} (Yij − Ȳ)².

Compare this formula to the sample variance in Eq. (3.1): they are almost identical, but for SSTot we are only summing the squared deviations from the overall sample mean Ȳ, and not averaging them as for the sample variance.
The reason that we are interested in the total sum of squares is that we can decompose
it into two kinds of variation. The first is the within-group variation: this refers to the
differences between observations in the same group, which can be captured by their deviation
from the mean of the group. The second is the between-group variation: this refers to
the differences between the observations in different groups, which can be captured by the
differences between the means of the groups. The two types of variation are illustrated in
Figure 3.2.
The sum of squared errors (sometimes called the within-group sum of squares) measures the differences between observations and their group mean:

SSE = Σ_{i=1..p} Σ_{j=1..ni} (Yij − Ȳi)²

where Ȳi is the group mean of treatment group i. In the pear example, this would refer to
the mean firmness of pears that received treatment i. Since the SSE compares observations
to their group mean rather than the overall mean, we expect the value of SSE to be smaller
than SSTot, since we are totally ignoring any differences between the treatment groups.
The between-group sum of squares is given by

SST = Σ_{i=1..p} Σ_{j=1..ni} (Ȳ − Ȳi)² = Σ_{i=1..p} ni (Ȳ − Ȳi)²

and measures the differences between the group means Ȳi and the overall mean Ȳ . In the
pear example, this refers to the differences between the mean firmness of each treatment
group and the overall mean firmness, calculated by pooling all the observations.
It’s fairly easy to show that
SSTot = SSE + SST
What does this tell us?
Figure 3.2: Illustration of the different kinds of variation: between-group variation (differences between group means) and within-group variation (deviations from group means).

The decomposition of SSTot shows us that we can separate the total variation in our dataset
into two parts: the variation due to differences between the sample means in the different
treatment groups, and the variation within the different treatment groups due to differences
between observations and their group mean (see again Figure 3.2).
Think back to our null hypothesis: that the means of the different treatment groups are all
equal. If this is true, then we would expect the sample means Ȳi of our different groups to
be very similar. This would imply that the SST is very small, or at least much smaller than
the SSE. Finally, we can see the link between testing means and analysing variances. Even
better, this is exactly the kind of relation that we can check with a hypothesis test.
More specifically, we’ll see later in this chapter that we can test this hypothesis using an
F -test. We’ll outline here the general idea underlying the F -test, and later we’ll go through
the details.
Let’s recap what we’ve learned so far: using the decomposition of SSTot, we want to compare
the within-group variation (SSE) and the between-group variation (SST). If we find that the
SST is large compared to the SSE, then we suspect that the means of the treatment groups
are not equal, and we will reject our null hypothesis that they are equal.
In our hypothesis test, we won't directly compare the SST and the SSE. Instead we will compare the mean squared sums, obtained by dividing each sum of squares by its degrees of freedom. For the SST, we have that

MST = SST / (p − 1)

and for the SSE we have that

MSE = SSE / (n − p).

The degrees of freedom are calculated in the usual way: the number of unique data points
used in the calculation minus the constraints associated with the calculation. For the SST,
we are calculating the difference between the p group means and the overall mean, hence its
degree of freedom is equal to p−1. For the SSE, we are calculating the difference between the
n observations and their group means (there are p), hence its degree of freedom is n − p. Now
we can construct a measure of the mean square sums that will tell us about the relationship
between the two variations we are interested in: the within-group variation (SSE) and the
between-group variation (SST).
So we are ready to state the central idea of our hypothesis test: if the null hypothesis (that the group means are equal) is not true then we expect that the ratio

F = MST / MSE

is large. On the other hand, if the null hypothesis is true then we expect that F will be
small. This decision rule should remind you of our hypothesis tests from the previous chapter!

Indeed, the ratio F is the test statistic of the F -test at the heart of ANOVA. Depending on
how large F is, we will either reject or accept our null hypothesis that the means of the
different treatment groups are equal.
Now that we understand the general idea behind ANOVA and how it works, in the rest of
this chapter we’ll explain this procedure more rigorously.
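As an illustration, the sketch below verifies the decomposition SSTot = SSE + SST and computes the F ratio on a small simulated dataset (hypothetical numbers standing in for the pear measurements, which are not listed here); the same quantities appear in the ANOVA table produced by anova(lm(...)).

set.seed(1)
p <- 4; ni <- 10                                   # 4 treatments, 10 pears each (hypothetical)
group    <- factor(rep(1:p, each = ni))
firmness <- rnorm(p * ni, mean = c(7, 6, 6, 4)[group], sd = 1)   # made-up group means

ybar  <- mean(firmness)                            # overall mean
ybari <- tapply(firmness, group, mean)             # group means
SSTot <- sum((firmness - ybar)^2)
SSE   <- sum((firmness - ybari[group])^2)          # within-group sum of squares
SST   <- sum(ni * (ybari - ybar)^2)                # between-group sum of squares
all.equal(SSTot, SSE + SST)                        # TRUE: the decomposition holds
Fstat <- (SST / (p - 1)) / (SSE / (p * ni - p))    # F = MST / MSE

anova(lm(firmness ~ group))                        # same sums of squares and F value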

3.1.2 The ANOVA model

We'll start with a simple case: comparing two treatment groups. In this case, the statistical ANOVA model is given by

Yij = µ + τi + εij,  i = 1, 2,  j = 1, . . . , ni,

where εij ∼ N(0, σ²). The index i (i = 1, 2) refers to the treatment group and the index j (j = 1, . . . , ni) refers to the observations within group i.
Let µ1 = µ + τ1 and µ2 = µ + τ2 . This model implies that if i = 1,

Y1j ∼ N (µ1 , σ 2 ) j = 1, . . . , n1 .

and if i = 2,
Y2j ∼ N (µ2 , σ 2 ) j = 1, . . . , n2 .

The model is therefore completely equivalent to the assumptions made for the two-sample
t-test! Take a look again at Section 2.4 if you don’t remember what these are.
When working with ANOVA we use specific terminology:

• the populations or groups are referred to as the treatments. In our pear example, the
treatments are the different concentrations of protective coating.

• the parameters τi are referred to as the treatment effects.

• the variable that determines the treatments is known as the factor. In our example,
the factor is the coating concentration.

• the variable Yij is referred to as the response variable. In our example, the response
variable is the firmness of the pears.

• the treatments are also referred to as the levels of the factor.

• µ, τ1 and τ2 (and also σ 2 ) are known as the parameters of the statistical model.

It is also important to understand that the parameters µ and τi refer to the mean response,
because they parameterise the means of the normal distributions.

This can be stressed by writing

µ1 = µ + τ1 = E [Y1j ] = E [Y | group 1]

and
µ2 = µ + τ2 = E [Y2j ] = E [Y | group 2] .

Note that the ANOVA model has three parameters for the two means (µ, τ1 and τ2 ), whereas
the two-sample t-test requires only two parameters (µ1 and µ2 ).
As a consequence a restriction is required. In ANOVA two restrictions are common:

• the sum or sigma restriction: Σi τi = τ1 + τ2 = 0

• the treatment restriction: τ1 = 0.

These restrictions essentially tell us how we need to think of τ1 and τ2 . In the next short
sections, we’ll explain exactly how.

The sum restriction


The restriction Σi τi = τ1 + τ2 = 0 implies τ1 = −τ2 and therefore

µ = µ2 − τ2 = µ2 + τ1 = µ2 + (µ1 − µ),

from which we find the identity

µ = (µ1 + µ2) / 2.
Here, µ is the overall mean of our observations. It is sometimes also referred to as the
grand mean.
This gives us immediately an interpretation of the effect parameters τ1 and τ2 . We have that
µ1 = µ + τ1 = (µ1 + µ2)/2 + τ1    and    µ2 = µ + τ2 = (µ1 + µ2)/2 + τ2.
So we can see that

• τ1 is the effect of treatment 1 on the group mean, when compared to the overall mean

• τ2 is the effect of treatment 2 on the group mean, when compared to the overall mean

Going back to the two-sample t-test, we would use this test if we want to test the null
hypothesis
H0 : µ1 = µ2 .

In the ANOVA model under the sum restriction, this is equivalent to the null hypothesis

H0 : τ1 = τ2 = 0,

which expresses that the two treatments have no effect on the mean response.

The treatment restriction

The restriction τ1 = 0 implies that

µ1 = µ + 0 = µ,

i.e. the mean response in the first treatment group is equal to µ. This group is referred to
as the reference group.
Moreover,
µ2 = µ + τ2

and therefore
τ2 = µ2 − µ = µ2 − µ1,

i.e. τ2 is the mean difference in response between the two groups.


A two-sample t-test is applied if we want to test the null hypothesis

H0 : µ1 = µ2 .

In the ANOVA model under the treatment restriction, this is equivalent to the null hypothesis

H0 : τ2 = 0,

which expresses that the two treatments have no effect on the mean response. Since τ1 ≡ 0
by assumption, we may still write H0 : τ1 = τ2 = 0.
Now we have enough information to move to a general version of ANOVA. As we've seen in
this section, we can consider ANOVA as a generalization of the two-sample t-test to more than
two samples or treatment groups.

3.1.3 The general one-way ANOVA model

The model

We need a general formulation of the ANOVA model that allows for more than two treatment
groups, and in general p groups. We can do this very easily by taking the ANOVA model for
two treatment groups and generalizing it:

Yij = µ + τi + εij,    i = 1, . . . , p,  j = 1, . . . , ni,

where εij ∼ N(0, σ²) and either Σi τi = 0 (sum restriction) or τ1 = 0 (treatment restriction),
depending on our choice of restriction. Hence:

• if i = 1,
Y1j ∼ N (µ + τ1 , σ 2 ) for j = 1, . . . , n1 .

• if i = 2,
Y2j ∼ N (µ + τ2 , σ 2 ) for j = 1, . . . , n2 .

• ...

• if i = p,
Ypj ∼ N (µ + τp , σ 2 ) for j = 1, . . . , np .

We can still write µi = µ + τi = E [Yij ] = E [Y | group i] (for i = 1, . . . , p).


In the remainder of this chapter we’ll adopt the sum restriction when working with ANOVA.

Parameter estimation

The parameters µ, τi and σ 2 are defined at the population level. Since these are generally
unknown (as in our pear example), we need to estimate these parameters based on sample
data before we can apply the ANOVA model.
First, we’ll need some notation:

• Let µ̂, τ̂i and σ̂ 2 denote the parameter estimates.

• Let Ŷij = µ̂ + τ̂i denote the predictions.

• Let eij = Yij − Ŷij denote the residuals.



Figure 3.3: Illustration of the meaning of SSE and the residuals (the response variable is plotted against the treatment groups i = 1, 2, 3; the group means Ȳ1, Ȳ2, Ȳ3 and the residuals e11 and e12 are indicated).

• Let SSE = Σi Σj eij², with i = 1, . . . , p and j = 1, . . . , ni (recall that the SSE is the sum of squared errors).

We’ve already seen the SSE in the first section of this chapter: we used it to capture the
differences between observations and their group mean. Here, we’re using a different notation
to denote the same quantity: we’ve rewritten the definition of the SSE in terms of the
residuals. A residual eij is the difference between the observed value of the dependent variable
Yij and the predicted value Ŷij , and therefore each data point has one residual. This is
illustrated in Figure 3.3. In this figure we can also visualize the SSE: it is the sum (over
all treatment groups) of the squared differences between the observations and their group
sample mean.
The parameter estimators µ̂ and τ̂i are determined by minimizing the SSE. These estimators
are known as the least squares estimators.
In a balanced design (when the group sizes are equal: n1 = n2 = · · · = np ) the following
holds
µ̂ = Ȳ = (1/n) Σi Σj Yij    and    Ŷij = µ̂ + τ̂i = Ȳi,

where n = n1 + n2 + · · · + np denotes the total sample size.


The SSE is also related to the pooled variance estimator we saw in Chapter 2. This is
illustrated here for p = 2:


SSE = Σi Σj eij²
    = Σi Σj (Yij − Ŷij)²
    = Σj (Y1j − Ȳ1)² + Σj (Y2j − Ȳ2)²
    = (n1 − 1)S1² + (n2 − 1)S2²
    = (n1 + n2 − 2)Sp².

So we may also write

Sp² = SSE / (n1 + n2 − 2) = SSE / (n − 2).
In the first section of this chapter (Section 3.1.1), we also introduced the MSE or mean squared error,
which is given by
MSE = SSE / (n − p).
Using the link between the SSE and the pooled variance estimator, it can be shown that for
p ≥ 2, the mean squared error is an unbiased estimator of σ 2 .
So by minimizing the SSE, we can obtain estimates µ̂, τ̂i and MSE = σ̂ 2 of the population
parameters µ, τi and σ 2 . Then we are ready to apply the ANOVA model.
Below you can see the output from R for an ANOVA test on the pear coatings dataset. In
this output we can immediately read the SSE and the MSE on the line Residuals. In this
example the MSE = 1.623, which is therefore the pooled variance estimate of the firmness.
The rest of the output will be explained later.

Analysis of Variance Table

Response: firmness
Df Sum Sq Mean Sq F value Pr(>F)
coating 3 20.796 6.9321 4.2712 0.00767 **
Residuals 76 123.347 1.6230
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
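For reference, an ANOVA table of this form can be obtained in R with a few lines like the ones below. This is only a sketch: the data frame name pears is an assumption, but the column names firmness and coating match the output above.

pears$coating <- factor(pears$coating)       # make sure the treatment variable is a factor
fit <- lm(firmness ~ coating, data = pears)  # fit the one-way ANOVA model
anova(fit)                                   # ANOVA table with Df, Sum Sq, Mean Sq, F value and Pr(>F)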

Link to confidence intervals

It can be shown that



• the estimators µ̂ and τ̂i are normally distributed with mean µ and τi , respectively
(hence, µ̂ and τ̂i are unbiased estimators)

• the variances of the estimators are proportional to σ², say c²σ² (with c² depending on the parameter to be estimated and the design of the experiment). We use the notation σµ² = Var[µ̂] and στi² = Var[τ̂i].

• the variance of an estimator can be estimated by replacing σ² with the MSE, i.e. c²σ² is estimated by c²MSE. We use the notation σ̂µ² and σ̂τi² to denote the estimators of σµ² and στi², respectively.

Consequently,
(µ̂ − µ) / σµ ∼ N(0, 1)    and    (τ̂i − τi) / στi ∼ N(0, 1),
(µ̂ − µ) / σ̂µ ∼ tn−p    and    (τ̂i − τi) / σ̂τi ∼ tn−p.
Note that the degrees of freedom of the t-distribution are equal to those of the estimator
MSE.
The t-distributed studentised estimators form the basis for the construction of confidence
intervals, just as before.
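As a brief illustration: R uses the treatment restriction by default (so-called treatment contrasts), and t-based confidence intervals for the resulting parameters can be requested directly from a fitted model. A minimal sketch, again assuming a data frame called pears:

fit <- lm(firmness ~ coating, data = pears)
confint(fit, level = 0.95)   # confidence intervals based on the t-distribution with n - p degrees of freedom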

A note on the interpretation of Ŷij

Once the parameters are estimated, we can define

Ŷij = µ̂ + τ̂i for i = 1, . . . , p j = 1, . . . , ni ,

which are generally referred to as the predictions. In this section we will argue that they can
be interpreted in two ways. Remember,

Yij ∼ N (µ + τi , σ 2 ) so that µ + τi = E [Yij ] = E [Y | group i] .

Thus, Ŷij = µ̂ + τ̂i is an estimator of the conditional mean E [Y | group i] = µ + τi .


However, what is our best prediction of an individual outcome in group i?
Since the distribution of Yij within group i is symmetric about the mean µ + τi (because we
know that a normal distribution is symmetric), and since we do not have more information
about Yij , our best estimate of an individual outcome from group i is given by the estimate
of the mean of its distribution, i.e. Ŷij = µ̂ + τ̂i .

So Ŷij is commonly referred to as a prediction, but its most natural interpretation is actually
as the conditional mean of the observations in group i. Later,
when discussing regression analysis, we will have to make a similar distinction and it will
turn out to be very important to know what interpretation you want to attach to Ŷij , because
its variance depends on its interpretation (as we’ll see later in Chapter 4).

Single treatment effect: t-test

If we want to check the effect of a certain treatment i, we can compare the mean response of
this treatment to the overall mean µ. Since
(τ̂i − τi) / σ̂τi ∼ tn−p,
we know that the test statistic
T = τ̂i / σ̂τi
has a tn−p null distribution under the null hypothesis
H0 : τi = 0.
This expresses that treatment i has no effect on the mean response with respect to the overall
mean µ.

Multiple treatment effects: F -test

But in general it’s not practical to test the effect of each treatment τi individually. Instead,
we’d like to do what we discussed in Section 3.1.1 of this chapter: test all the treatment
effects at the same time.
We’d like to test our overall null hypothesis that the means of the different treatment groups
are equal. Using ANOVA terminology, this null hypothesis states that the factor has no effect
on the mean response. Recall that the factor is the variable that determines the treatments;
in the pear example, this is the coating concentration.
We can write this null hypothesis using our ANOVA notation as
H0 : τ1 = τ2 = · · · = τp = 0.
and then our alternative hypothesis (that the means of the treatment groups are different,
and that therefore the treatments have different effects) can be written generally as
H1 : not H0 .
Now let’s consider the statistics
SST = Σi ni (Ȳ − Ȳi)²    and    MST = SST / (p − 1).

Figure 3.4: An illustration of the meaning of the SST, SSE and residuals (as in Figure 3.3, the response variable is plotted against the treatment groups i = 1, 2, 3; the overall mean Ȳ, the group means Ȳ1, Ȳ2, Ȳ3 and the residuals e11 and e12 are indicated).

We've seen these already in Section 3.1.1: the SST measures the variability between the
treatment sample means, and the MST is the SST divided by its degrees of freedom, giving
the mean sum of squares. Remember that when there is no treatment effect, we expect the
treatment sample means to be more or less equal to the overall mean Ȳ, and therefore the
SST and the MST will be small.
Figure 3.4 illustrates the meaning of the SST, SSE and residuals.
From our discussion in Section 3.1.1, and also by examining Figure 3.4, we also know that
in the absence of treatment effects we expect that the MST will be small compared to the
MSE, and therefore the F -statistic
F = MST / MSE
will be small. When one or more treatments do have an effect, we expect that the MST will
be large compared to the MSE, and therefore that the F -statistic will be large.
We will therefore reject H0 in favour of H1 when we find a large value of the F -statistic. But
precisely how large? We can follow a similar reasoning as in our previous hypothesis tests
(see Chapter 2).
We won’t go into the theoretical proof, but it can be shown that

F ∼ Fp−1,n−p    under H0.

That is, if the null hypothesis is true then F follows an F -distribution with p − 1 and
n − p degrees of freedom, denoted by Fp−1,n−p . (You will learn about this distribution’s
characteristics during the practicals.) Then, at the α level of significance, we will reject H0
in favour of H1 when
F > Fp−1,n−p;α .
Since H0 is rejected for large values of F , the p-value for this test is defined as
p = P0 [F ≥ f ]
in which f denotes the observed F -statistic. Recall that the P0 notation refers to the prob-
ability given that the null hypothesis is true.
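For the pear example (p = 4 treatment groups, n = 80 observations), the critical value and the p-value can be computed directly from the F-distribution in R. A sketch using the MST and MSE reported in the output further below:

f.obs <- 6.9321 / 1.6230                                  # observed F-statistic: MST / MSE
qf(0.95, df1 = 4 - 1, df2 = 80 - 4)                       # critical value F_{p-1, n-p; alpha} for alpha = 5%
pf(f.obs, df1 = 4 - 1, df2 = 80 - 4, lower.tail = FALSE)  # p-value P0[F >= f]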
Finally, we are ready to do an ANOVA analysis of our pear experiment! You will learn in
the practicals how to do this yourself in R. For now, let’s take a look at the results when R
performs an ANOVA for us. Below you can again find the R output for an ANOVA analysis
of the pear coating dataset. We can recognize the key statistics underlying the ANOVA
analysis and F -test.
On the line coating, we can read SST = 20.796 and MST = SST/3 = 6.93. Therefore the F
statistic equals 4.27, which is shown in the next column. When this value is compared to an
F -distribution with 3 and 76 degrees of freedom, the p-value is 0.00767. Therefore at the 5%
level of significance the null hypothesis of no treatment effect is rejected, and we conclude
that the four types of coating do not result in the same firmness on average.

Analysis of Variance Table

Response: firmness
Df Sum Sq Mean Sq F value Pr(>F)
coating 3 20.796 6.9321 4.2712 0.00767 **
Residuals 76 123.347 1.6230
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Note that R does not give the SSTot (some programs do, others do not). You can of course
calculate it easily yourself:
SSTot = SST + SSE = 20.796 + 123.347 = 144.143.

3.1.4 Multiple comparisons of means

Introduction to the problem

The ANOVA F -test has rejected the null hypothesis of no treatment effect. This means that
the mean responses for the different treatments are not all equal. So we have learned that at
least two treatments result in different mean responses.


The obvious next question is therefore: which means are different? This problem is known
as multiple comparison of means.
In this section, we’ll learn about two methods that can be used for multiple comparison of
means: Bonferroni and Tukey. But first, we will explain in more detail what this problem
involves.
Suppose that we proceed with a naive approach. We want to test

H0ij : µi = µj

against the alternative hypothesis


H1ij : µi ≠ µj
for the i-th and j-th treatment groups. Since we're thinking naively, we decide to use a
two-sample t-test at the α = 5% level of significance, for every combination of i ≠ j in
our experiment. So we’re going back to our two-sided two-sample test from Chapter 2, and
doing one such test for each pair of treatment groups. In our pear example, there are four
treatments and therefore six pair combinations.
Also from Chapter 2, recall that rejecting H0ij is equivalent to observing 0 outside of the
(1 − α) = 95% confidence interval of µi − µj .
Figure 3.5 illustrates the interpretation of three simultaneous 95% confidence intervals of the
differences between three means. For each of the three plots we expect about 5% of the
confidence intervals NOT to cover µi − µj = 0. This is indeed true for the plots in Figure 3.5.
Each of the three graphs, when considered separately, gives a correct confidence interval
interpretation: each of three individual two-sample t-tests gives a type I error rate of α = 5%.
Unfortunately, the joint probability of making a type I error has increased.
The joint probability of a type I error is known as the familywise error rate (FWER) and is
defined as
FWER = P [reject any H0ij |H0 ] ,
in which H0 is the overall null hypothesis stating µ1 = µ2 = · · · = µp . We read this as “the
probability of rejecting any of the pairwise null hypotheses H0ij given that the overall null
hypothesis is actually true”.
Figure 3.5 illustrates that this FWER probability is much larger than α = 5%. The FWER
can be read (approximated) from Figure 3.5 as follows. Look at the first line of all three
graphs (i.e. of all three comparisons of means, for the same random sample). As soon as one
of the three confidence intervals does not cover 0 (i.e. results in a false rejection), we consider
the result of the multiple comparison as a false rejection. Then we proceed to the second
line in all three graphs; again, as soon as one of the three confidence intervals does not cover

Figure 3.5: An illustration of the repeated sampling interpretation of three simultaneous 95% confidence intervals of the differences between three means (100 repeated samples), with one panel each for mean 1 − mean 2, mean 1 − mean 3 and mean 2 − mean 3. The simulations are performed under the null hypothesis that the three means are equal. The dots correspond to Ȳi − Ȳj.
0, we consider the result of the multiple comparison as a false rejection. We continue this
procedure, line-by-line, for all 100 lines (100 repeated samples).
At the end we count the number of false rejections and divide this number by 100 to get
the error rate. This gives us an approximation of the FWER: if we had the results of an infinite
number of repeated samples, this procedure would give us the exact FWER, but of course in
practice we never work with infinite samples. Here we have 100 repeated samples, which gives
us a reasonable approximation of the FWER.
From Figure 3.5 we can calculate an approximate FWER of 13/100 = 13%, which is obviously
much larger than the 5% level that we aimed for. This is a serious problem for our confidence
in the results of this test, and so we have to take action to correct it.
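This inflation of the FWER can also be checked with a small simulation, in the spirit of Figure 3.5. The sketch below repeatedly draws three groups from the same normal distribution (so the overall null hypothesis holds), performs the three pairwise two-sample t-tests at α = 5%, and records how often at least one of them falsely rejects; the group size and the number of simulation runs are arbitrary choices.

set.seed(1)                          # for reproducibility
n.sim <- 10000                       # number of simulated experiments (arbitrary)
n <- 20                              # observations per group (arbitrary)
false.rejection <- replicate(n.sim, {
  y1 <- rnorm(n); y2 <- rnorm(n); y3 <- rnorm(n)     # three groups with equal means
  p12 <- t.test(y1, y2, var.equal = TRUE)$p.value
  p13 <- t.test(y1, y3, var.equal = TRUE)$p.value
  p23 <- t.test(y2, y3, var.equal = TRUE)$p.value
  any(c(p12, p13, p23) < 0.05)       # TRUE if at least one pairwise test rejects
})
mean(false.rejection)                # approximate FWER, noticeably larger than 5%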

The Bonferroni correction

It can be shown that the FWER for m simultaneous tests is bounded, meaning that it cannot
exceed a certain value. In particular, when each individual test is performed at the α∗ level
of significance, then
FWER ≤ mα∗ .
This inequality gives a first correction method: the Bonferroni correction.
This correction involves performing every individual hypothesis test at the α∗ = α/m sig-
nificance level. Then we can be sure that the familywise error rate will never exceed the α
significance level that we aim for.
A disadvantage of the Bonferroni correction is that it is conservative, i.e.

FWER ≤ α.

This means that the Bonferroni correction is safe, in the sense that it controls the error rate
at a level that is certainly not larger than the desired level α. However, particularly when m
is large, the correction becomes increasingly conservative. An important consequence is that
the power of the testing procedure decreases. This is undesirable, since we wish to keep the
test as powerful as possible.
We'll now apply the Bonferroni correction to the pear example. We want to do a multiple
comparison of means, to find out which treatment group means differ. We want to control
the familywise error rate of this procedure at the 5% level. Since we have four treatment
groups, this implies six pairwise null hypotheses

H0ij : µi = µj

for i, j = 1, 2, 3, 4 with i < j. Using the Bonferroni correction, we therefore need to perform each individual
hypothesis test at the α/6 = 0.05/6 ≈ 0.0083 significance level. Finally, the degrees of freedom of our
Figure 3.6: An illustration of the repeated sampling interpretation of three simultaneous 95% confidence intervals of the differences between three means (100 repeated samples), with one panel each for mean 1 − mean 2, mean 1 − mean 3 and mean 2 − mean 3. The simulations are performed under the null hypothesis of the three means being equal. The Bonferroni correction has been applied. The dots correspond to Ȳi − Ȳj.
dataset is 76 (the 80 observations minus the four treatments), and so the critical value to be
used with the Bonferroni correction is given by

t(α/6)/2;76 = 2.7091.

The repeated sampling interpretation of the FWER with the Bonferroni correction is illus-
trated in Figure 3.6. From the 100 simulated data sets, only 6 result in a type I error, which
is close to the desired level α = 5%. Note that the exact interpretation only appears with an
infinite number of simulations.
The R output for building these confidence intervals with the Bonferroni correction is shown
below.

95 % simultaneous confidence intervals for specified


linear combinations, by the Bonferroni method

critical point: 2.7091


response variable: firmness

intervals excluding 0 are flagged by ’****’

Estimate Std.Error Lower Bound Upper Bound


1-2 1.0500 0.403 -0.0365 2.15
1-3 1.1700 0.403 0.0823 2.27 ****
1-4 1.2700 0.403 0.1740 2.36 ****
2-3 0.1190 0.403 -0.9730 1.21
2-4 0.2110 0.403 -0.8810 1.30
3-4 0.0921 0.403 -0.9990 1.18

In this table, each row corresponds to a pairwise hypothesis test. The pair of treatments in
question is listed in the first element of each row. The final two columns give the lower and
upper bounds of the confidence interval for each pairwise test. R indicates the confidence
intervals that do not contain zero, and therefore which pairwise null hypotheses should be
rejected.
From this output we learn that at the simultaneous 5% significance level, the mean firmnesses
of treatment groups 1 and 3 and of treatment groups 1 and 4 are significantly different. In
particular, the mean firmness of coating group 1 is larger than those of coating groups 3 and
4.
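The output above was produced by a multiple-comparison routine in R. As an alternative sketch, base R's pairwise.t.test() reports Bonferroni-adjusted p-values (rather than simultaneous confidence intervals) for all pairwise comparisons; the data frame name pears is again an assumption.

with(pears, pairwise.t.test(firmness, coating,
                            p.adjust.method = "bonferroni", pool.sd = TRUE))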

Tukey’s method

We have seen that we can use Bonferroni's method to control the familywise error rate when
we are performing multiple comparisons of means. However, it comes with the drawback of
decreasing the overall power of the test as the number of treatment groups increases.
Tukey's method is a better method for controlling the familywise error rate. This is because
it controls the FWER exactly at the nominal level α (assuming normality and equality of
variances).
Figure 3.7 illustrates Tukey's method in the same repeated sampling setting as before. We now
find that only 7 out of the 100 random samples produce a type I error.
The R output for the pears example is shown next. The table has the same layout and
interpretation as for the Bonferroni correction. Checking the table, we can see that we
reach the same conclusion as with the Bonferroni correction for this dataset; this agreement
is not guaranteed in general. Note that the critical value (used for testing and CI calculation)
is now 2.6268, which is smaller than the 2.7091 we obtained using the Bonferroni correction.

95 % simultaneous confidence intervals for specified


linear combinations, by the Tukey method

critical point: 2.6268


response variable: firmness

intervals excluding 0 are flagged by ’****’

Estimate Std.Error Lower Bound Upper Bound


1-2 1.0500 0.403 -0.00339 2.11
1-3 1.1700 0.403 0.11500 2.23 ****
1-4 1.2700 0.403 0.20700 2.32 ****
2-3 0.1190 0.403 -0.93900 1.18
2-4 0.2110 0.403 -0.84700 1.27
3-4 0.0921 0.403 -0.96600 1.15
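In R, Tukey's method is available through TukeyHSD(), applied to a model fitted with aov(). A minimal sketch, again with the assumed data frame pears; it returns simultaneous confidence intervals and adjusted p-values for all pairwise differences.

fit <- aov(firmness ~ coating, data = pears)
TukeyHSD(fit, conf.level = 0.95)   # Tukey simultaneous 95% CIs for all pairwise mean differences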

3.2 Assessment of model assumptions

As with every hypothesis test, we need to assess whether the assumptions underlying the
ANOVA analysis hold true for our dataset. If they do not, then our analysis is not valid.
For ANOVA, these assumptions are

Figure 3.7: An illustration of the repeated sampling interpretation of three simultaneous 95% confidence intervals of the differences between three means (100 repeated samples), with one panel each for mean 1 − mean 2, mean 1 − mean 3 and mean 2 − mean 3. The simulations are performed under the null hypothesis of the three means being equal. Tukey's correction method has been applied. The dots correspond to Ȳi − Ȳj.

Figure 3.8: Boxplots of the firmness (N) for the four coating treatments.

• normality of the observations in each of the p treatment groups (unless all group sizes
are large),
• equality of the p group variances.

How should we assess these assumptions?


First, boxplots can help us to evaluate the normality of the observations and the equality of
the variances. Figure 3.8 shows the boxplot of the firmness of the four coating treatments.
From these plots we conclude that the four variances are likely to be approximately equal. The
four boxplots are also more or less symmetric, which agrees with the normality assumption.
No severe outliers are observed. Note, however, that symmetry does not prove normality.
Second, QQ-plots can be constructed for the observations in each of the four groups. This
is illustrated in Figure 3.9. From these graphs we see that the firmnesses of the pears in
coating groups 3 and 4 do not behave like a normal distribution. Their QQ-plots suggest
some skewness.
In the pear example, we have 20 observations in each of the four treatment groups. This is
a moderate sample size. It allows us to make plots like boxplots and QQ-plots, but we must
interpret the graphs with some care.
When the group sample sizes (ni ) are rather small, we may choose to look instead at the
distribution of the residuals (of which there are n = n1 + · · · + np ).
Consider the ANOVA model

Yij = µ + τi + εij i = 1, . . . p j = 1, . . . , ni ,

with error terms εij ∼ N (0, σ 2 ).



Figure 3.9: Normal QQ-plots of the firmness for the four coating groups (sample quantiles versus theoretical quantiles, one panel per coating group).

Figure 3.10: Normal QQ-plot of the residuals of the pears example.

The residuals,
eij = Yij − Ŷij i = 1, . . . , p j = 1, . . . , ni
can in some sense be considered as the “estimates” of the error terms εij . This suggests that
we can assess the assumption εij ∼ N (0, σ 2 ) by checking the normality of the residuals.
Figure 3.10 shows the normal QQ-plot of the residuals. This plot suggests a minor deviation
from normality, particularly in the tails of the distribution. As noted before, the actual
assumption underlying ANOVA is normality in all groups. So when a deviation is observed
in the QQ-plot of the residuals, we should actually look at the individual groups to find out
what is going on and in which group(s). But when too few replicates are available in each of
the individual groups, the residual QQ-plot may be helpful as a surrogate.
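In practice, these diagnostic plots are easy to produce once the model has been fitted. A short sketch, again assuming the data frame pears:

fit <- aov(firmness ~ coating, data = pears)
res <- residuals(fit)
qqnorm(res); qqline(res)       # normal QQ-plot of the residuals, as in Figure 3.10
boxplot(res ~ pears$coating)   # residuals per coating group, to inspect the equality of variances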

3.3 Two-way ANOVA

3.3.1 Introduction

In the previous section, we were considering the effects of treatments that could be grouped
in one variable. For example, with the pear experiment we were considering the effects
of different concentrations of coating on the firmness of the pears. The different treatments
could be combined in one group: the coating concentration. We were interested in the relation
between this one grouping variable (coating concentration) and the response variable (the
firmness). Hence the ANOVA analysis we used was called a one-way ANOVA.
But what if we had an experiment where the treatments could be classified into more than
one group? For example, imagine that during the pear experiment we ran out of the coating
chemical, and we had to use a second coating chemical from a different company. Then we
would probably want to test not only the effect of coating concentration, but also the effect
of the type of coating.


We could do this simply by running two one-way ANOVA analyses, one after the other.
But this doesn’t seem like the right approach: why are we doing the same analysis twice in
order to check the same response variable? If there was an effect of the type of coating, this
might be hidden by the effect of the coating concentration. It would be much better to test
the effects of both groups at the same time. Put differently, we would want to run a single
ANOVA analysis that includes both coating concentration and coating type as predictors.
This type of ANOVA analysis is called two-way ANOVA, since we will consider two grouping
variables simultaneously.

3.3.2 Example

To learn about two-way ANOVA, we'll again use an example from food science: we're going
to consider whipped cream.
Whipped cream is produced by whipping fresh cream until it’s light and fluffy. However,
additives are often added to the cream to improve the textural and whipping properties.
In our example, we’ll consider the additive hydroxypropyl methylcellulose (HPMC). One of
the important whipping properties is measured by the overrun. This measures the gas holdup
in the whipped cream, and is given by
overrun = (m1 − m2) / m2,
where m1 and m2 are the weights (grams) of a constant volume of unwhipped and whipped
cream, respectively.
The objectives of the study are to assess:

• the effect of the concentration of HPMC on the overrun


• the effect of the whipping time on the overrun.

So we want to test the effects on the response variable of different treatments in two groupings:
a two-way ANOVA.
For this purpose, 10 replicated experiments are performed for each combination of whipping
time (1, 2, 3, 4 and 5 minutes) and HPMC concentration (0.025%, 0.050%, 0.075%, 0.100%
and 0.125%). Each experiment consists of whipping 250ml of cream.
Figure 3.11 shows the boxplots of the overrun for each of the 25 combinations of whipping
time and HPMC concentration. Note that each boxplot is based on only 10 replicated
experiments.

Figure 3.11: Boxplots of the overrun (%) for all combinations of whipping time and HPMC concentration.

Below you can find the R output for the analysis of the data with a two-way ANOVA with
additive effects (we’ll explain in the next section what this means).

Analysis of Variance Table

Response: overrun
Df Sum Sq Mean Sq F value Pr(>F)
time 4 720694 180173 1725.885 < 2.2e-16 ***
HPMC 4 8831 2208 21.148 5.819e-15 ***
Residuals 241 25159 104
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

In the ANOVA table we can see now a line for the effect of whipping time (time) and the
effect of HPMC concentration (HPMC). Each of these lines gives degrees of freedom, sum
of squares, mean sum of squares, F values and p values, just like we had in the one-way
ANOVA.
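A table of this form comes from a model that contains both factors. The sketch below shows one way to obtain it, assuming a data frame called cream with columns overrun, time and HPMC, with the two grouping variables coded as factors.

cream$time <- factor(cream$time)                     # whipping time as a factor with 5 levels
cream$HPMC <- factor(cream$HPMC)                     # HPMC concentration as a factor with 5 levels
fit.add <- lm(overrun ~ time + HPMC, data = cream)   # additive two-way ANOVA model
anova(fit.add)                                       # table with lines for time, HPMC and Residuals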
Let’s go through the ANOVA table and interpret each part similarly as before.
On the line Residuals we read:

• the residual sum of squares (or the sum of squared errors): SSE = 25159

• the residual degrees of freedom: 241 = 250 - (5-1) - (5-1) -1, i.e. total number of
observations (n) minus degrees of freedom of the time effect (5 levels -1) minus the
degrees of freedom of the HPMC effect (5 levels -1) minus one for the overall mean
effect.

• the mean squared error: MSE = SSE/241 = 104. This is an estimate of
the residual variance, i.e. the variance of the overrun for a particular time/HPMC
combination. The residual standard deviation is therefore √MSE = 10.2.

For the effect of whipping time:

• Sum of squares = 720694. This is a measure of the variability of the overrun observa-
tions that can be attributed to the whipping time.

• Degrees of freedom = 4, i.e. the number of levels of the factor time (five) minus 1.

• Mean sum of squares = 180173 = 720694/4 .

• F value = 180173/104 = 1725.885. This is the F test statistic for testing the null
hypothesis that whipping time has no effect on the mean overrun, given a particular
HPMC concentration. The alternative is that not all whipping times result in the same
mean overrun for a particular HPMC concentration. The degrees of freedom of this
F -test are 4 and 241.

• p value < 2.2e − 16 < 0.05. This is the p-value that corresponds to the F -test. Here,
since p < 0.05 (by a lot!) we may conclude at the α = 5% level of significance that
whipping time has a significant effect on the average overrun, i.e. for a given HPMC
concentration, not all whipping times give the same overrun average.

For the effect of HPMC concentration:

• Sum of squares = 8831. This is a measure for the variability of the overrun observations
that can be attributed to the HPMC concentration.

• Degrees of freedom = 4, i.e. the number of levels of the factor HPMC (5) minus 1.

• Mean sum of squares = 2208 = 8831/4 .

• F value = 2208 / 104=21.148. This is the F test statistic for testing the null hypoth-
esis that HPMC concentration has no effect on the mean overrun, given a particular
whipping time. The alternative is that not all HPMC concentrations result in the same
mean overrun for a particular whipping time. The degrees of freedom of this F -test are
4 and 241.

• p value = 5.819e − 15 < 0.05. This is the p-value that corresponds to the F -test. Here,
since p < 0.05 (again by a lot!) we may conclude at the α = 5% level of significance
that HPMC concentration has a significant effect on the average overrun, i.e. for a
given whipping time, not all HPMC concentrations give the same overrun average.

So our two-way ANOVA table gives us the information we need to answer our research
questions: at the α = 5% significance level, we determined that both whipping time and
HPMC concentration have a significant effect on the average overrun.
In the next section we’ll go deeper into the details underlying this model.

3.3.3 The additive model

The output that we have discussed corresponds to the additive model:

Yijk = µ + τi + βj + εijk

for i = 1, . . . , t, j = 1, . . . , b, and k = 1, . . . , nij .


In this model, we have some of the same parameters as in the one-way ANOVA model, as
well as some additional parameters:

• µ is the intercept

• τi is the effect of the i-th level of factor T (e.g. whipping time)

• βj is the effect of the j-th level of factor B (e.g. HPMC concentration)

• the error terms εijk ∼ N (0, σ 2 )

As before, we need restrictions on the effect parameters. Again, these restrictions tell us how
we should interpret the effect parameters τi and βj. Two popular types of restrictions are:

• the sigma restriction: Σi τi = Σj βj = 0 (with sums over i = 1, . . . , t and j = 1, . . . , b)

• the treatment restriction: τ1 = β1 = 0.

For both restrictions, the null hypothesis of no effect of the T-factor and no effect of the
B-factor can be expressed as

H0 : τ1 = · · · = τt = 0 and H0 : β1 = · · · = βb = 0.

The alternative hypotheses are always the negations of the null hypotheses.
As for one-way ANOVA, we can estimate all the necessary parameters from the sample data,
and with these estimates we can calculate the predicted observations:
Ŷijk = µ̂ + τ̂i + β̂j .
As before, Ŷijk has two interpretations:

• an estimate of the conditional mean response in group combination i and j, i.e. an


estimate of
µ + τi + βj = E [Y | whipping time i and HPMC conc. j]

• the prediction of an individual observation in group combination i and j.

We’ll now ask R to perform a two-way ANOVA on the whipped cream dataset. In this ex-
ample, the treatment restriction has been applied. The R output below shows the parameter
estimates.

Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 50.240 1.939 25.915 < 2e-16 ***
time2 18.692 2.043 9.147 < 2e-16 ***
time3 86.942 2.043 42.546 < 2e-16 ***
time4 115.492 2.043 56.517 < 2e-16 ***
time5 137.374 2.043 67.226 < 2e-16 ***
HPMC0.05 6.654 2.043 3.256 0.00129 **
HPMC0.075 11.210 2.043 5.486 1.04e-07 ***
HPMC0.1 15.848 2.043 7.755 2.48e-13 ***
HPMC0.125 15.468 2.043 7.569 7.93e-13 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 10.22 on 241 degrees of freedom


Multiple R-squared: 0.9667,Adjusted R-squared: 0.9656
F-statistic: 873.5 on 8 and 241 DF, p-value: < 2.2e-16

Note that the parameter estimates of τ1 (time 1) and β1 (HPMC conc. 0.025%) are not listed
in the output. This is a consequence of the treatment restriction τ1 = β1 = 0.
Using the parameter estimates that R has provided, we can for example compute the predicted
mean overrun for one minute of whipping of cream with 0.025% of HPMC:
Ŷ11 = µ̂ + τ̂1 + β̂1 = 50.24 + 0 + 0 = 50.24.

Similarly, we can compute the estimated mean overrun for two minutes of whipping of cream
with 0.025% of HPMC:
Ŷ21 = µ̂ + τ̂2 + β̂1 = 50.24 + 18.692 + 0 = 68.932.
The estimated mean overrun for one minute of whipping of cream with 0.050% of HPMC:
Ŷ12 = µ̂ + τ̂1 + β̂2 = 50.24 + 0 + 6.654 = 56.894.
The estimated mean overrun for two minutes of whipping of cream with 0.050% of HPMC:
Ŷ22 = µ̂ + τ̂2 + β̂2 = 50.24 + 18.692 + 6.654 = 75.586.
The other estimated means can be calculated in a similar manner.
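Instead of adding the estimates by hand, the same predicted means can be obtained from the fitted model with predict(). A sketch under the same assumed names (the level labels must of course match the coding used in the data):

fit.add <- lm(overrun ~ time + HPMC, data = cream)
new.cells <- expand.grid(time = c("1", "2"), HPMC = c("0.025", "0.05"))   # assumed level labels
predict(fit.add, newdata = new.cells)   # estimated mean overrun for the four time/HPMC combinations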
From these estimates we can also get the interpretation of τ̂2 :
τ̂2 = Ŷ21 − Ŷ11 = (µ̂ + τ̂2 + β̂1 ) − (µ̂ + τ̂1 + β̂1 ) = 18.692,
i.e. τ̂2 is the estimated difference between the mean overrun after 2 minutes of whipping
minus the mean overrun after 1 minute of whipping of creams with 0.025% of HPMC.
However, τ̂2 may also be expressed as
τ̂2 = Ŷ22 − Ŷ12 = (µ̂ + τ̂2 + β̂2 ) − (µ̂ + τ̂1 + β̂2 ) = 18.692,
i.e. τ̂2 is the estimated difference between the mean overrun after 2 minutes of whipping
minus the mean overrun after 1 minute of whipping of creams with 0.050% of HPMC.
This is a characteristic of the additive model: the effect of whipping time is the same for all
HPMC concentrations. Similarly, it can be shown that the model implies that the effect
of HPMC concentration is the same for all whipping times. This is illustrated in Figure 3.12.
The additive model assumes that the effect of a change in the level of one treatment variable
does not depend on the level of the other variable.

3.3.4 Assessment of the assumptions

The assumptions of the additive model are implicitly expressed in the model formulation:
Yijk = µ + τi + βj + εijk
for i = 1, . . . , t, j = 1, . . . , b, and k = 1, . . . , nij . Here, we are assuming a sigma or treatment
restriction applies to the τi and βj , and that εijk ∼ N (0, σ 2 ).
Hence, for T -group i and B-group j,
Yijk ∼ N (µ + τi + βj , σ 2 ).
and therefore:

Figure 3.12: Estimated mean overruns (%) for the additive model, for all combinations of whipping time and HPMC concentration.

• the effects are additive,

• within each factor combination the overrun should be normally distributed, and

• the variances of the overruns within each factor combination should be equal (constancy
of variance).

To assess these assumptions, boxplots and QQ-plots of the overrun observations for each
factor combination can be plotted and analysed. However, this should only be done when
the number of replicates within each factor combination is sufficiently large (say nij > 15 or
20).
In the whipped cream example, we only have nij = 10 observations within each factor
combination. So Figure 3.11 should be interpreted with care.
As before, when dealing with small group sizes we may check the residuals (eijk = Yijk − Ŷijk )
to approximately assess the assumption of normality.
Figure 3.13 shows the normal QQ plot of the residuals of the additive model for the overrun.
The plot shows no severe deviations from normality. In the right tail a minor deviation is
observed, but this involves only about 10 out of the 250 observations. So we feel safe to
proceed with the normality assumption.
Residuals may also be plotted for each level of one factor at a time. Such plots are shown in
Figure 3.14. From these plots we can conclude that the variance seems to be constant over

Figure 3.13: Normal QQ-plot of the residuals of the additive model for the overrun.

the levels of time and HPMC.

3.3.5 The interaction model

Motivation

In the additive ANOVA model, we assumed that the effect of a change in the level of one
treatment variable did not depend on the level of the other variable. In other words, we
assumed that there was no interaction between the effects of the levels of the two factors.
However, we might think of many cases where this would not be a realistic assumption.
For example, imagine that we want to determine which sauce offered at a cafeteria produces
the highest enjoyment in its customers. We’ll consider two treatment variables: type of sauce
and type of food. The response variable will be the satisfaction score given by the customers.
To keep this example simple, we’ll consider only two types of sauce and two types of food:
ketchup and chocolate sauce, and French fries and ice cream.
If we asked a customer whether they prefer ketchup or chocolate sauce, the answer will surely
be: “It depends on the type of food!”. Nobody likes ketchup on ice cream (or at least very
few people do), and the same is probably true for chocolate sauce on French fries. The
fact that people’s sauce preference “depends” on the type of food tells us that there is an
interaction effect between the two treatment variables.


Figure 3.14: Boxplots of the residuals for the levels of whipping time (left) and the levels of HPMC concentration (right).

In cases like this, we should not use the additive model. Instead we need yet another type
of two-way ANOVA, that will account for the possibility of an interaction effect between the
levels of the two factors. Not surprisingly, this is called the interaction model.
In the next section we’ll apply the interaction ANOVA model to another example, again from
food science. We’re going to consider the problem of mercury in fish.

Example

Mercury accumulates in fish over their lifetimes, as they absorb it from their environment.
Unfortunately, at elevated levels mercury can be extremely toxic to humans. If we consume
an excess of fish in our diet, or fish with particularly high levels of mercury, we are at risk
of mercury poisoning. To reduce this risk, scientists wish to study mercury accumulation in
fish.
The objective of this study is to test whether mercury accumulation is the same for three
different species of fish, and whether the accumulation depends on the mercury concentration
in the environment.
It seems clear that the accumulation in each species might also be affected by the environ-
mental mercury concentration, and vice versa. Therefore we need to perform an ANOVA
analysis with interaction effects.
The experiment is set up as follows:

Figure 3.15: Boxplots of the Hg accumulation (microgram Hg/kg) in three species of fish in two environments (high and low Hg concentrations).

• two environments (lakes) are selected, with low and high levels of mercury;

• three species of fish are selected;

• for each fish/mercury level combination, five young fish are kept in a fishing net in the
water;

• the mercury (Hg) concentration in the fish is measured before the start of the study,
and then also one month later;

• the response variable is the Hg accumulation (in micrograms per kg fish).

The boxplots of the results are presented in Figure 3.15. Note that each individual boxplot
is based on only five replicates!
At this point we might be a bit tired of learning new ANOVA techniques. Can’t we just
analyse the data with the additive two-way ANOVA model? If the additive model applies,
then we expect

• the effect of Hg concentration to be the same for all fish species, and

• the effect of species to be the same for all Hg concentrations.



This is NOT confirmed by the boxplot. But can we draw any formal conclusions based on the
boxplot? No, because we are aware that the observed differences may be caused by random
sampling variability. A formal statistical test is needed.
In particular, the boxplot suggests that a low Hg concentration results in a smaller Hg
accumulation, but only in species 1 and 2. The environmental Hg concentration does not seem
to have an effect on the Hg accumulation in species 3. So this might suggest an interaction
effect between environmental Hg concentration and fish species, but only a formal statistical
test can make the distinction between a true effect and sampling variability.
The R output for a two-way ANOVA analysis with interaction effect is presented below for
this example. We are using the null hypothesis that there are no interaction effects, against
the alternative hypothesis that there is at least one non-zero interaction effect.

Analysis of Variance Table

Response: conc
Df Sum Sq Mean Sq F value Pr(>F)
species 2 110.65 55.325 2.5051 0.102780
Hg 1 302.61 302.609 13.7023 0.001116 **
species:Hg 2 250.11 125.053 5.6624 0.009673 **
Residuals 24 530.03 22.085
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

We note a difference from the ANOVA table that we obtained with an additive model: there
is now an extra line for the interaction effect, denoted by species:Hg. On this line we find:

• degrees of freedom = 2 = (3-1)*(2-1)

• sum of squares = 250.11. This is a measure of the variability that can be attributed to
the interaction effect.

• mean sum of squares = 125.053 = 250.11/2

• F = 125.053/22.085 = 5.6624 is the F-test statistic for testing the null hypothesis that there
are no interaction effects, versus the alternative that there is at least one non-zero
interaction effect.

• p-value: p = 0.009673 < 0.05

Therefore, at the α = 5% level of significance we conclude that there is at least one non-zero
interaction effect in the mercury-fish dataset. So we were right to conclude that the additive
model is not applicable!
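In R, an interaction term is added with the * operator (or with : for the interaction alone). A sketch, assuming a data frame called fish with the response conc and the factors species and Hg:

fit.int <- lm(conc ~ species * Hg, data = fish)   # main effects of species and Hg plus their interaction
anova(fit.int)                                    # ANOVA table including the species:Hg line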

The ANOVA table also gives the sums of squares for the two main effects, species and Hg, as
well as the SSE and MSE. The latter (MSE = 22.085) is an estimate of the residual variance,
i.e. the variance of the Hg accumulation within the species/Hg groups.
Note that from MSE = 22.085 we can calculate the residual standard deviation:
√22.085 = 4.7. Thus the standard deviation of the Hg accumulation is 4.7 micrograms/kg
within each of the species/Hg groups.
Now that we’ve seen how to apply the interaction model in practice, in the next section we’ll
consider some of the theoretical details underlying this model.

The model

The ANOVA statistical model with interaction effect is given by


Yijk = µ + τi + βj + (τ β)ij + εijk
for i = 1, . . . , t, j = 1, . . . , b and k = 1, . . . , nij . In this model, we again have some additional
parameters compared to the additive model:

• µ is the intercept
• τi is the main effect of the i-th level of factor T (e.g. species)
• βj is the main effect of the j-th level of factor B (e.g. Hg concentration)
• (τ β)ij is the interaction effect of the i-th level of factor T and the j-th level of factor B
• εijk ∼ N (0, σ 2 ) are the error terms

Again we need to apply either a treatment or sigma restriction to the parameters. To interpret
the parameters, we’ll proceed as we did for the additive model, adopting the same treatment
restrictions.
The estimated difference of Hg accumulation between species 2 and 1 in the high Hg concen-
tration environment is
Ŷ21 − Ŷ11 = (µ̂ + τ̂2 + β̂1 + (τ̂β)21) − (µ̂ + τ̂1 + β̂1 + (τ̂β)11) = τ̂2 + (τ̂β)21 − (τ̂β)11.

The estimated difference of Hg accumulation between species 2 and 1 in the low Hg concentration environment is

Ŷ22 − Ŷ12 = (µ̂ + τ̂2 + β̂2 + (τ̂β)22) − (µ̂ + τ̂1 + β̂2 + (τ̂β)12) = τ̂2 + (τ̂β)22 − (τ̂β)12,

where (τ̂β)ij denotes the estimate of the interaction effect (τβ)ij.

Figure 3.16: Normal QQ-plot of the residuals.

This demonstrates that the effect of species is not the same for each level of the factor Hg
concentration, unless the interaction effects are zero. It also demonstrates that the main
effect parameters are not interpretable in the presence of interaction effects.
What are the consequences of this?

• We must first test for the presence of interaction effects, i.e. test for
H0 : (τ β)ij = 0 for all i, j versus H1 : not H0 .

• When H0 cannot be rejected, we may proceed with the additive two-way ANOVA
• When H0 is rejected, the main effect parameters cannot be directly interpreted.

When we find that interaction effects are present, we cannot use an additive ANOVA ap-
proach. Instead, we can proceed by performing one-way ANOVAs. In our fish example,
this would mean using one-way ANOVAs for the species factor, but applied separately to
firstly the low Hg concentration data, and secondly the high Hg concentration data. Simi-
larly, one-way ANOVAs with the Hg concentration factor can be performed for each species
separately.

Assessment of assumptions

As before, statistical tests and their p-values can only be interpreted correctly if the as-
sumptions underlying the model hold true. In particular, we have to assess the normality
assumption and the constancy of variance assumption. Since there are only five observations
in each species/Hg group, we will use the residuals for assessing the assumptions.
Figure 3.16 shows the normal QQ plot of residuals. No severe deviations from normality are
observed. We can also make boxplots of the residuals for the levels of both factors separately.

Figure 3.17: Boxplots of the residuals for the levels of species (left) and the levels of Hg (right).

These are presented in Figure 3.17. From these graphs we may conclude that the constancy
of variance assumption holds true.
For completeness we also show the R output for the fish example, with the parameter esti-
mates for the interaction model.

Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 28.764 2.102 13.686 7.85e-13 ***
species2 5.388 2.972 1.813 0.0824 .
species3 -6.242 2.972 -2.100 0.0464 *
Hglow -5.686 2.972 -1.913 0.0677 .
species2:Hglow -8.048 4.203 -1.915 0.0675 .
species3:Hglow 6.050 4.203 1.439 0.1630
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 4.699 on 24 degrees of freedom


Multiple R-squared: 0.5559,Adjusted R-squared: 0.4633
F-statistic: 6.007 on 5 and 24 DF, p-value: 0.0009693

To close this chapter, let's briefly go back to the earlier example of the whipped cream
dataset. In a previous section, we analysed this dataset using the additive model. But
actually, we have learned in this section that we should have started from a model with
interaction effects and first tested for the absence of the interaction effects. This is shown
below.

Analysis of Variance Table

Response: overrun
Df Sum Sq Mean Sq F value Pr(>F)
time 4 720694 180173 1682.8626 < 2.2e-16 ***
HPMC 4 8831 2208 20.6213 1.715e-14 ***
time:HPMC 16 1070 67 0.6245 0.8625
Residuals 225 24089 107
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

We can see that the p-value for the interaction effect is very large, so we conclude that we
should not reject the null hypothesis. Therefore we can safely proceed with the two-way
additive ANOVA, as we did before.
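The same interaction test can also be phrased as a comparison of the additive model with the interaction model, since the former is a special case of the latter. A sketch with the assumed cream data frame:

fit.add <- lm(overrun ~ time + HPMC, data = cream)   # additive model
fit.int <- lm(overrun ~ time * HPMC, data = cream)   # model with interaction effects
anova(fit.add, fit.int)   # F-test comparing the nested models; a large p-value supports the additive model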
Chapter 4

Regression Analysis

In this chapter, we will learn:

• the purpose of linear regression

• the theoretical basis of the linear regression model, for one or more predictor variables

• how to interpret the results of a linear regression model

• how to check whether we can use the linear regression model or not

4.1 Simple linear regression analysis

4.1.1 Example and introduction

To introduce linear regression analysis, we’ll use an example from soil science.
Soil is an important source of CO2 production and emission. Incorrect assessments of these
processes can affect the calculation of the CO2 flux and carbon balance as a whole. The rate
of CO2 emission from the soil is also an important indicator of its microbial activity and
intensity of organic matter decomposition. These are both indicators of the “health” of the
soil, and therefore its productivity.




Figure 4.1: Scatterplot of soil CO2 respiration (mg CO2/m2/hr) versus soil temperature (degrees Celsius) for 44 fields in the Missouri prairie.

It has been suggested that CO2 production and emission rates are lower at low temperatures.
To verify this, an experiment is planned to study the relation between soil temperature and
CO2 respiration. For this purpose, 44 fields in the Missouri prairie (USA) are selected for
study. The CO2 respiration (evolution of mg CO2 /m2 /hr) and the average soil temperature
(degrees Celsius) are measured in each field.
A scatterplot of the data is presented in Figure 4.1. Looking at the data, we believe we can see
a relation between temperature and CO2 respiration: it appears that increased temperature
leads to increased respiration. But is this relation statistically significant, or just due to
variability in our random samples? By now we know how to answer this question: a statistical
analysis.
To guide our analysis, we can come up with some specific research questions:

• Is there a linear relation between temperature and CO2 respiration?

• Can we quantify this relation?

• Can we use this relation to predict the CO2 respiration for a given soil temperature?

Before we begin, we’ll need some terminology:

• temperature is referred to as the independent variable, the regressor, the predictor or


the covariate.

• the CO2 respiration is referred to as the dependent variable or the response variable.

We designate the variables in this way because we are interested in how CO2 changes as
a function of changes in the temperature. This is why we have plotted the temperature
observations on the x-axis in Figure 4.1, and the corresponding respiration observations on
the y-axis.
What we’d like to do is draw a line through the data in Figure 4.1 in a way that summarizes
the distribution of the sample data in the scatterplot. This seemingly simple task will actually
help us to answer all three of our research questions about the soil sample dataset.
First, if we are able to draw a line that “fits” the data well (we’ll be more specific later
about exactly what we mean here), then this will tell us that there is indeed a linear relation
between our two variables. Second, the equation of this line will quantify the relation between
the variables. And finally, we can use this line to make predictions about temperature and
respiration values that we did not observe in our dataset: for a given temperature value, we
can check what respiration rate is predicted by the line. So this theoretical line sounds very
helpful for our analysis. In statistics, this line is called a regression line.
Let’s think a bit more specifically how we can construct such a line. The formula for a
straight line is usually written as
y = mx + b.
There are two variables, x and y, and two coefficients, m and b. The coefficient b represents
the y-intercept of the line, which is interpreted as the value of y that we get when x = 0.
The coefficient m represents the slope of the line: if we increase the x-value by 1 unit, then the y-value changes by m units. If the slope is negative, this means that the y-value decreases as x increases.
Not surprisingly, we can use exactly this formula to describe a regression line. This is
called simple linear regression because it is indeed a very simple approach for predicting a
quantitative response Y based on a single predictor variable X.

4.1.2 The regression model

The simple linear regression model is given by

Yi = µ + βXi + εi i = 1, . . . , n,

where

• µ is the intercept parameter

• β is the regression coefficient or slope parameter



• µ + βXi is the regression line

• εi ∼ N (0, σ 2 ) is the error term.

We can see immediately the connection between the simple linear regression model and the equation of a line that we looked at in the previous section. The only additions are the error terms εi : we will not be able to fit a line that passes exactly through all our sample data points, so the deviations are captured by these error terms.
The model implies that for a given value Xi :

Yi |Xi ∼ N (µ + βXi , σ 2 ).

The notation Yi |Xi stresses that the distribution of Yi is conditional on Xi . Therefore

• for a given Xi = x, the point on the regression line µ + βXi has the interpretation of
being the mean of the response variable, i.e. E [Y |X] = µ + βX. This is sometimes
denoted by
µ(x) = E [Y |X = x] = µ + βx.

• the observations Yi are normally distributed about the point on the regression line,
with constant variance:
Y |X ∼ N (µ + βX, σ 2 ).

• the regression model specifies the conditional distribution of Y given X, and the µ and
β parameters refer to the conditional mean.

• the interpretation of the regression coefficient β follows from

µ(x + 1) − µ(x) = E [Y |X = x + 1] − E [Y |X = x]
= µ + β(x + 1) − (µ + βx) = β.

Hence, β is the average increase of Y when X increases by one unit.

These characteristics are illustrated in Figure 4.2.

4.1.3 Parameter estimation

The model presented in the previous section once again contains population parameters that
are usually unknown. Once we have collected a dataset, we can use the observations to
estimate these parameters. In this section, we’ll go through this procedure.
First, we’ll define some necessary notation:

Figure 4.2: Illustration of the simple linear regression model: the regression line y = µ + βx, with intercept µ = µ(0), slope β, and conditional mean µ(x0) at x = x0.

• Let µ̂, β̂ and σ̂ 2 denote the parameter estimates.

• Let Ŷi = µ̂ + β̂Xi denote the predictions.

• Let ei = Yi − Ŷi denote the residuals.


• Let the sum of squared errors be denoted by SSE = Σ_{i=1}^{n} e_i².

The parameter estimates µ̂ and β̂ are determined so that the SSE is minimised; these are the least squares estimates. The parameter σ² is estimated using the mean squared error: MSE = SSE/(n − 2).
The factor n−2 in MSE = SSE/(n−2) is referred to as the degrees of freedom of the SSE. As
we’ve seen before, this is the number of terms in the SSE minus the number of constraints,
here the number of mean-parameters to be estimated (µ and β).
To estimate the parameters for a simple linear regression on our example of Missouri soil
samples, we can ask R to do this for us. Then we obtain the following output:

Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 94.316 40.209 2.346 0.0238 *
temp 20.018 1.642 12.192 2.2e-15 ***
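As a sketch, output of this form could be produced with a call along the following lines (the data frame name soil is an assumption; the column names CO2 and temp are taken from the output):

# Simple linear regression of CO2 respiration on soil temperature
fit <- lm(CO2 ~ temp, data = soil)
summary(fit)$coefficients   # estimates, standard errors, t values and p-values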

The estimated regression line for this dataset is therefore

Ŷi = 94.3 + 20.02Xi .





Figure 4.3: Scatterplot of the CO2 respiration versus temperature for 44 fields in the Missouri prairie, and the fitted regression line.

Remember that X is temperature, our predictor variable, and Y is CO2 respiration, the response variable. So we have found a regression line linking our observed temperature values Xi to the predicted respiration values Ŷi . We can also write the regression line as

µ̂(x) = 94.3 + 20.02x.

The second equation stresses that the fitted regression line is not limited to our observations
X1 , . . . , Xn , but may also be used for other x-values that were not observed. Remember, this
was our third research question: predicting CO2 respiration for temperature values that we
did not observe.
The notation µ̂ stresses that µ̂(x) = µ̂ + β̂x is an estimate of the mean of Y for a given
X = x. It is not the estimate of an individual observation, as Ŷi might suggest. This is an
important point, so we will come back to it later in this chapter (Section 4.1.7) in a more
detailed discussion.
The fitted regression line is shown in Figure 4.3. We can read from the R output that the estimated regression coefficient is β̂ = 20.02. From this we conclude that the average CO2 respiration is expected to increase by 20.02 mg CO2/m2/hr when the temperature increases by 1 degree Celsius. So we have answered all three of our research questions from the first section of this chapter.
It can be shown that

• the estimators µ̂ and β̂ are normally distributed with mean µ and β, respectively

• the variances of the estimators are proportional to σ², say c²σ² (with c² depending on the parameter to be estimated and on the design of the experiment). We use the notation σµ² = Var[µ̂] and σβ² = Var[β̂].

• the variance of an estimator can be estimated by replacing σ 2 with the MSE, i.e. c2 σ 2
is estimated by c2 MSE. We use the notation σ̂µ2 and σ̂β2 for the estimates of σµ2 and σβ2 ,
respectively.

Consequently,

(µ̂ − µ)/σµ ∼ N(0, 1) and (β̂ − β)/σβ ∼ N(0, 1).

But of course, we do not know the population parameters σµ and σβ . Instead we must use their estimators. After this substitution, we no longer have normal distributions, but rather t-distributions:

(µ̂ − µ)/σ̂µ ∼ t_{n−2} and (β̂ − β)/σ̂β ∼ t_{n−2}.
Note that the degrees of freedom of the t-distribution coincide with those of the estimator of
the MSE. So now that we know that each estimator is distributed according to a certain t-
distribution, we can use this information to construct confidence intervals for the population
parameters, as we saw before.

4.1.4 Confidence intervals

Instead of giving the formulae for the confidence intervals, we will demonstrate their calculation on the example dataset. The formulae are entirely analogous to those presented earlier in this course (so it's left as an exercise for you to derive them yourself!).
For our dataset of soil samples, we show again the R output for a linear regression:

Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 94.316 40.209 2.346 0.0238 *
temp 20.018 1.642 12.192 2.2e-15 ***

For a 95% confidence interval, we need the quantile t_{44−2;0.025} = 2.018, and therefore the 95% CI of β is given by

[20.02 − 2.018 × 1.642, 20.02 + 2.018 × 1.642] = [16.71, 23.33].
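In practice we rarely compute this by hand. As a sketch, assuming the fitted model object fit from before, the same interval could be obtained with:

# 95% confidence intervals for the intercept and the slope
confint(fit, level = 0.95)

# or explicitly, using the t-quantile with n - 2 = 42 degrees of freedom
20.018 + c(-1, 1) * qt(0.975, df = 42) * 1.642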



How should we interpret this confidence interval? We must remember the interpretation of β:
it’s the slope of the regression line. This tells us how the response variable (CO2 respiration)
will change as a result of a change of one unit in the predictor variable (temperature).
So the interpretation of this confidence interval is that, with 95% confidence, an increase in the soil temperature of 1 degree Celsius corresponds to an increase in the average CO2 respiration of somewhere between 16.71 and 23.33 mg CO2/m2/hr.

4.1.5 Hypothesis tests

In experiments like our soil investigation, where we want to determine a quantitative rela-
tionship between the predictor and response variables, it’s often of interest to test the null
hypothesis
H0 : β = 0
versus one of the alternatives

H1 : β < 0 or H1 : β > 0 or H1 : β ≠ 0.

What is the interpretation of these hypotheses? Looking at the linear regression model,
we see that β = 0 implies that there is no relation between Xi and Yi . So assuming this
null hypothesis means we assume that there is no link between the predictor and response
variables.
On the other hand, as alternative hypotheses we can hypothesize that Xi and Yi are linked by a positive relation (β > 0), a negative relation (β < 0), or a relation whose sign we don't specify (β ≠ 0).
So let's consider the null hypothesis β = 0. From Section 4.1.3, we know that

(β̂ − β)/σ̂β ∼ t_{n−2}

and therefore the test statistic

T = β̂/σ̂β

has a t_{n−2} null distribution under H0 .
Now we’ll ask R to perform this hypothesis test for us. Below we show once more the R
output for a linear regression on the soil sample dataset:

Coefficients:

Estimate Std. Error t value Pr(>|t|)


(Intercept) 94.316 40.209 2.346 0.0238 *
temp 20.018 1.642 12.192 2.2e-15 ***

Note that the p-value given by R is always two-sided. Therefore, since p = 2.2 × 10−15 < 0.05, we reject H0 : β = 0 in favour of the alternative H1 : β ≠ 0 at the 5% level of significance. So we conclude that there is indeed a relation between average CO2 respiration and temperature. Since the estimate β̂ > 0, we may conclude that the slope of the regression line is positive. This implies that the average CO2 respiration increases with increasing temperature. Furthermore, since β̂ is large relative to its standard error, the evidence for this increase is very strong.
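If we had instead wanted a one-sided test, say of H1 : β > 0, the two-sided p-value reported by R would need to be converted. A sketch of this calculation, assuming the fitted object fit from before:

# t statistic and two-sided p-value for the slope
tval  <- summary(fit)$coefficients["temp", "t value"]
p.two <- summary(fit)$coefficients["temp", "Pr(>|t|)"]

# one-sided p-value for H1: beta > 0 (equal to p.two / 2 because tval > 0 here)
p.one <- pt(tval, df = 42, lower.tail = FALSE)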

4.1.6 ANOVA table, F -test and R2

ANOVA table

So far in this course we have seen several different sums of squares, including:

• the total sum of squares

  SSTot = Σ_{i=1}^{n} (Yi − Ȳ)²,

  which is a measure of the total variability in the Y-observations (note that SSTot/(n − 1) is the sample variance of the Yi ).

• the regression sum of squares

  SSR = Σ_{i=1}^{n} (Ŷi − Ȳ)².

  When β = 0 we expect the SSR to be small. It is associated with one degree of freedom.

• the residual sum of squares

  SSE = Σ_{i=1}^{n} (Yi − Ŷi)²,

  which has n − 2 degrees of freedom.

It's not difficult to show that

SSTot = SSR + SSE.

Table 4.1: An ANOVA table for a regression analysis

Source       SS       df      MS       F-value     p-value
Regression   SSR      1       MSR      MSR/MSE     p
Error        SSE      n − 2   MSE
Total        SSTot    n − 1   MSTot

This is a very powerful relationship: it tells us that the total variability in the response
variable can be decomposed into the variability that can be explained by the regression
relation (SSR) and the remaining residual variability (SSE).
For this reason, these quantities are reported when performing an ANOVA, by including
them in the table of results. In Table 4.1, we can see the typical organization of an ANOVA
table for a simple linear regression. We can observe the key statistics that we studied in the
chapter on ANOVA: SSE, MSE, MSR, the F -statistic and its associated p-value.
We can compare the layout of Table 4.1 with the output that R provides for ANOVA (using
our soil sample dataset), and notice that it looks just the same.

Analysis of Variance Table

Response: CO2
Df Sum Sq Mean Sq F value Pr(>F)
temp 1 1110539 1110539 148.66 2.198e-15 ***
Residuals 42 313764 7471
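As a sketch, this table can be requested directly from the fitted model, and the decomposition SSTot = SSR + SSE checked numerically (assuming the fitted object fit and the soil data frame from before):

# ANOVA table for the simple linear regression
anova(fit)

# Check the decomposition using the reported sums of squares
SSR <- 1110539; SSE <- 313764
SSTot <- sum((soil$CO2 - mean(soil$CO2))^2)   # should equal SSR + SSE = 1424303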

Why are we interested in this decomposition of variability with respect to the regression
model? You can probably guess the answer (not least from the title of this section): we want
to use these relations to construct hypothesis tests.

The F -test

Let’s quickly recap what we’ve done so far in this chapter. We have constructed a linear
regression model by assuming a linear relation between the predictor and response variable.
Based on this, we estimated the parameters of the linear regression model and constructed
confidence intervals for them. By interpreting the parameters of the regression model, we
could make inferences about the relation between our predictor and response variables.

At this point, the first hypothesis test that we are interested in should be quite obvious. It’s
probably wise to test whether it was a good idea to assume a relation between these variables
in the first place!
This hypothesis test takes as its null hypothesis that there is no relationship between the predictor and the response. The alternative hypothesis is then that there is a linear relationship, i.e. that the regression model does explain part of the variability in the response, and we were right to construct that model.
So let's look again at the SSR,

SSR = Σ_{i=1}^{n} (Ŷi − Ȳ)².

When β = 0, we expect the SSR to be close to zero. We also expect that the more β deviates from zero, the larger the SSR will be. Therefore

F = MSR/MSE

is a good test statistic for testing

H0 : β = 0 versus H1 : β ≠ 0.

These hypotheses imply that there is either no relation between the dependent and independent variables (β = 0) or a relation whose sign we don't specify (β ≠ 0). This is exactly what we want to test to ensure the validity of our regression model.
It can be shown that under H0 ,

F ∼ F_{1;n−2} .

This null distribution allows us to calculate the critical value and the p-value for this hypothesis test. We won't spend more time on this test, since it's completely equivalent to the F-test we studied in the previous chapter on ANOVA. We compute the test statistic, compare it to the critical value, examine the p-value, check that the test assumptions are satisfied, and either reject or fail to reject the null hypothesis.

The coefficient of determination (R2 )

The coefficient of determination is defined as

R² = SSR/SSTot = 1 − SSE/SSTot.

This is the fraction of the total variability of the response observations that can be explained
by the regression relation.

In practice there will always be other sources of variability in our data, like the fact that we are dealing with random samples, or the possibility of biological variability (e.g. in the fish dataset). So we would like to find an R² value that is large, but we shouldn't expect to ever find R² = 1: there is no meaningful dataset or model where 100% of the variability can be explained by the regression relation. If you do find such an R², this should be a hint that something strange is going on!
R2 is particularly important when the objective is to use the fitted regression line for pre-
dicting new observations (to be discussed shortly).
Below we show again the R output for a linear regression on the soil sample dataset, this
time including the statistics provided after the coefficients are reported:

Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 94.316 40.209 2.346 0.0238 *
temp 20.018 1.642 12.192 2.2e-15 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 86.43 on 42 degrees of freedom


Multiple R-squared: 0.7797,Adjusted R-squared: 0.7745
F-statistic: 148.7 on 1 and 42 DF, p-value: 2.198e-15

We can read off that R² = 0.7797, which tells us that about 78% of the variability of the CO2 respiration can be explained by its relation to the temperature. This is a fairly high value, which indicates that the linear relation with temperature captures most of the variability we observed.
During the practicals you will learn more about R2 and how to interpret it.
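As a quick check, R² can also be recovered from the sums of squares in the ANOVA table of the previous section:

R² = SSR/SSTot = 1110539 / (1110539 + 313764) = 1110539 / 1424303 ≈ 0.7797,

which matches the value reported by R.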

4.1.7 Prediction

Earlier in this chapter (Section 4.1.3), we argued that

µ̂(x) = µ̂ + β̂x

is an estimate of the mean response for a given x. In other words, it is an estimate of E [Y |X = x], the expected value of the response variable given the value of the predictor variable. However,

Ŷ(x) = µ̂ + β̂x

is also the prediction of an observation Y for a given X = x.

This second interpretation can be understood as follows. We know that µ̂(x) is an estimate
of the mean response, but we have assumed that the distribution of the Yi observations
is normal, which is a symmetric distribution. This means that we expect that half of the
observations are smaller than the mean and half of them are larger than the mean. Therefore
µ̂(x) is also the best prediction of a new observation for a given x.
We would like to go one step further, and obtain estimates of the variances associated with
both µ̂(x) and Ŷ (x). Then we will be able to construct confidence intervals for both of these
estimators/predictors.
It can be shown that the variance of µ̂(x) is given by

σµ̂²(x) = Var[µ̂(x)] = [ 1/n + (x − X̄)² / Σ_{i=1}^{n} (Xi − X̄)² ] σ²,

where X̄ = (1/n) Σ_{i=1}^{n} Xi is the sample mean of the observations of the predictor variable.

It can also be shown that the variance of Ŷ(x) is given by

σŶ²(x) = Var[Ŷ(x)] = [ 1 + 1/n + (x − X̄)² / Σ_{i=1}^{n} (Xi − X̄)² ] σ².

What have we found? That the difference between the variances, Var[Ŷ(x)] − Var[µ̂(x)], is just σ².
Of course, we don't know σ², the true variance of the population. We must replace it by its estimator, the mean squared error (MSE). Then the estimates of the variances are given by

σ̂µ̂²(x) = [ 1/n + (x − X̄)² / Σ_{i=1}^{n} (Xi − X̄)² ] MSE

and

σ̂Ŷ²(x) = [ 1 + 1/n + (x − X̄)² / Σ_{i=1}^{n} (Xi − X̄)² ] MSE.

Finally, we can write down the limits of the (1 − α)% confidence interval of the mean. These
are given by
µ̂(x) ± tn−2;α/2 σ̂µ̂ (x),
and the limits of the (1 − α)% prediction interval are given by

Ŷ (x) ± tn−2;α/2 σ̂Ŷ (x).

These are all the pieces we need to make predictions using the regression relation that we
have found.
These intervals are illustrated in Figures 4.4 and 4.5.
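As a sketch, both intervals can be obtained in R with predict() (assuming the fitted object fit from before; the temperature value of 25 degrees is an arbitrary illustration):

# estimate/predict the CO2 respiration at a temperature of 25 degrees Celsius
new <- data.frame(temp = 25)

# 95% confidence interval for the mean response at temp = 25
predict(fit, newdata = new, interval = "confidence", level = 0.95)

# 95% prediction interval for a new observation at temp = 25
predict(fit, newdata = new, interval = "prediction", level = 0.95)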
Figure 4.4: Scatterplot of CO2 respiration versus temperature for 44 fields in the Missouri prairie, the fitted regression line and the 95% confidence interval of the mean.




Figure 4.5: Scatterplot of CO2 respiration versus temperature for 44 fields in the Missouri prairie, the fitted regression line, the 95% confidence intervals of the mean (blue, dashed) and the 95% prediction intervals (red, dotted).
Figure 4.6: Normal QQ plot of the residuals of the regression model.

4.1.8 Assessment of the assumptions

We still need to assess the assumptions underlying the linear regression model, in a similar
way as we did in Chapter 3:

• normality of the error terms: this can be assessed by means of a normal QQ plot of the
residuals. This is shown in Figure 4.6.

• linearity of the covariate effect: this can be assessed by means of a graph of the residuals versus the covariate. This is shown in Figure 4.7.

• constancy of the variance: this can be assessed by means of the same residual plot.

In the residual plot, we would like to see that the residuals are randomly scattered and do not form any pattern. In particular, we would like them to be distributed symmetrically around the y = 0 line, with roughly constant spread.
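A sketch of how these diagnostic plots could be produced in R (assuming the fitted object fit and the soil data frame from before):

# normal QQ plot of the residuals
qqnorm(resid(fit))
qqline(resid(fit))

# residuals versus the covariate, with a horizontal reference line at zero
plot(soil$temp, resid(fit),
     xlab = "temperature (Celsius)", ylab = "residuals (mg CO2/m2/hr)")
abline(h = 0, lty = 2)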

4.2 Multiple linear regression

4.2.1 Introduction

Simple linear regression assumes that there is exactly one predictor variable that can explain
our response variable. For example, in the soil experiment we considered only temperature as
a predictor of CO2 respiration. In reality, there are likely more predictors we should consider. In the soil example, these might include the season or time of year, or the characteristics of the soil's microbial community.

Figure 4.7: Residual plot of the regression model: residuals (mg CO2/m2/hr) versus temperature (degrees Celsius).
So in this section we will consider the situation where we want to test for the relation
between more than two variables. In this case, a simple linear regression will not suffice. We
would instead like to use an extension of the simple linear regression framework with more
predictors. This model is known as multiple linear regression.
The concept is straightforward: we will extend the simple linear regression model by including
more terms in our regression equation. Instead of a single predictor variable Xi , we will have
multiple predictors Xi1 , ..., Xik .
However, before we dive into the details of the multiple linear regression model, we need to
consider a tricky point that already came up in Chapter 3 when we studied ANOVA with
more than one treatment variable. As soon as we have more than one treatment/predictor
variable, we need to consider the possibility that there might be an interaction between these
two variables in terms of their effect on the response variable.
We saw for ANOVA analysis that we could sometimes assume that there was no interaction
between the treatment/predictor variables. In this case we could use an additive model that
did not include an interaction effect. In Chapter 3, this was the example of the whipped
cream experiment where we considered the whipping time and the HPMC concentration as
the predictors of the cream overrun (the response variable), but did not include an interaction
effect.
On the other hand, if we do believe that there is interaction between our predictor variables,

then we need to account for this in our regression model. In the ANOVA interaction model,
this refers to the example of mercury in fish. In that experiment, we believed that there was
an interaction between the mercury accumulation in each fish species and the environmental
mercury concentration, so we accounted for this in our ANOVA model. We will do the same
thing in our multiple regression model.
So in this section we will first consider the additive multiple linear regression model, where we assume no interaction effect is present. Then we will make the opposite assumption, and consider the multiple linear regression model with interaction.
We’ll study the multiple linear regression model using yet another food science example:
we’re going to think about cheese.
As cheese ages, various chemical processes take place that determine the taste of the final
product. To investigate this process (and hopefully learn how to make even tastier cheese)
scientists have collected a dataset containing the concentrations of various chemicals in 30
samples of mature cheddar cheese, and a subjective measure of taste given by a panel of
professional tasters. The chemicals are lactic acid (Lactic) and hydrogen sulphide (H2 S)
(measured in mg/l).
In our regression model, the concentrations of these two chemicals will be our predictor
variables, and the response variable will be the taste score of the cheese.
A scatterplot matrix of this dataset is shown in Figure 4.8, and a three-dimensional scatterplot
is shown in Figure 4.9. Examining these plots, we think we can observe a relation between
the predictors and the response variable: it looks like the taste score increases with increased
concentrations of H2 S and lactic acid. As usual, we’ll make a statistical analysis of this
dataset to investigate our suspicion.
The objectives of the study are summarized in the following research questions:

• Do H2 S and lactic acid concentrations have an effect on the expected taste score?

• Can the relation between H2 S and lactic acid concentrations and the expected taste
score be quantified?

• Can the relation be used to estimate a mean taste score for a given H2 S and lactic acid
concentration?

• Finally, can we predict the taste scores for values of H2 S and lactic acid concentrations
that we did not observe?

Note that these research questions are completely equivalent to those we formulated for
simple linear regression in Section 4.1.1.
Figure 4.8: Scatterplot matrix of the cheese data set (taste, H2S and Lactic).

Figure 4.9: Three-dimensional scatterplot of the cheese data set: taste score versus Lactic and H2S.



4.2.2 The additive model

First we’ll address the situation where there is no interaction between our predictor variables.
Then we can use an additive model, since we assume that the effect on the response variable
is just the sum of the effects due to the two predictor variables plus some error terms.
The additive multiple linear regression model is given by

Yi = µ + β1 X1i + β2 X2i + εi i = 1, . . . , n,

where

• X1i and X2i are the two predictors

• µ, β1 , β2 and σ 2 are the parameters

• εi ∼ N (0, σ 2 ) are the error terms.

This should look very familiar: this is nothing but the simple linear regression model with
an additional predictor. This model implies that

(Yi |X1i , X2i ) ∼ N (µ + β1 X1i + β2 X2i , σ 2 ) i = 1, . . . , n.

Therefore µ + β1 X1i + β2 X2i is the conditional mean of Y given X1i and X2i , i.e. E [Y |X1i , X2i ] = µ + β1 X1i + β2 X2i .
Since there is actually no need to include the subscript i in these expressions, we may just
as well write
(Y |X1 , X2 ) ∼ N (µ + β1 X1 + β2 X2 , σ 2 )
and E [Y |X1 , X2 ] = µ + β1 X1 + β2 X2 , or even

(Y |X1 = x1 , X2 = x2 ) ∼ N (µ + β1 x1 + β2 x2 , σ 2 )

and E [Y |x1 , x2 ] = µ + β1 x1 + β2 x2 .
Note that we no longer have a regression line, but a surface: the response variable Y is a
function of both X1 and X2 . This model is shown graphically in Figure 4.10. Its interpretation
will be explained shortly.
Since this model is a straightforward extension of the simple regression model, we won’t
spend much time on its characteristics since these can easily be extrapolated from what we
learned about the simple regression model. We summarize below the main points for using
this model:

• The parameters can be estimated as before using the method of least squares.

Figure 4.10: An illustration of an additive regression model.



• The parameter estimators µ̂, β̂1 and β̂2 are normally distributed.

• The residual variance σ 2 can be estimated as MSE = SSE/(n − p), where p is the
number of mean-parameters in the model (here: p = 3).

• The total sum of squares can be decomposed in the usual way: SSTot = SSR + SSE,
where SSR has p − 1 degrees of freedom.
SSR can be further decomposed into SSR1 +SSR2 , where the terms refer to the variance
that can be attributed to the first and the second predictor, respectively.

• The F -test statistic F = MSR/MSE has p − 1 and n − p degrees of freedom and can
be used for testing the hypotheses

H0 : β1 = β2 = 0 versus H1 : not H0 .

The null hypothesis proposes that there is no relation between the predictors and the
response variable, whereas the alternative hypothesis proposes that there is.

• For each SSR an F-test can be performed. For example, F = SSR1/MSE can be used for testing

  H0 : β1 = 0 versus H1 : β1 ≠ 0.

• The coefficient of determination, R² = SSR/SSTot, has the same interpretation as before.

4.2.3 Example

We will now perform a multiple linear regression on our cheese dataset. Remember that we
wish to test whether the concentrations of the chemicals H2 S and lactic acid have an effect on
the taste score of the cheese. Our null hypothesis is that there is no effect, i.e. β1 = β2 = 0.
We show below the R output for a multiple linear regression on this dataset.

Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -27.592 8.982 -3.072 0.00481 **
Lactic 19.887 7.959 2.499 0.01885 *
H2S 3.946 1.136 3.475 0.00174 **
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 9.942 on 27 degrees of freedom


Multiple R-squared: 0.6517,Adjusted R-squared: 0.6259
F-statistic: 25.26 on 2 and 27 DF, p-value: 6.551e-07
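As a sketch, a call along the following lines could produce output of this form (the data frame name cheese is an assumption; the column names taste, Lactic and H2S are taken from the output):

# additive multiple linear regression of taste on lactic acid and H2S
fit.add <- lm(taste ~ Lactic + H2S, data = cheese)
summary(fit.add)
anova(fit.add)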

We can see immediately that the fitted regression relation is given by

µ̂(L, H) = −27.6 + 19.9L + 3.9H,

where L is the concentration of lactic acid and H is the concentration of H2S. The fit of this model to the observed data is shown in Figure 4.11.
To interpret the additive model, we can carry out a number of calculations to check the relationships between our two predictors and the response variable.
First, we can calculate the estimated difference in average taste score between cheeses with lactic acid concentrations of 4 and 3 mg/l, both with an H2S concentration of 2 mg/l:

µ̂(4, 2) − µ̂(3, 2) = (−27.5918 + 19.8872 × 4 + 3.9463 × 2) − (−27.5918 + 19.8872 × 3 + 3.9463 × 2) = 19.8872 (= β̂1).

Similarly, we can calculate the estimated increase in the mean taste score for a cheese with 4 mg/l lactic acid as compared to a cheese with 3 mg/l lactic acid, but now for both cheeses having 7 mg/l H2S:

µ̂(4, 7) − µ̂(3, 7) = (−27.5918 + 19.8872 × 4 + 3.9463 × 7) − (−27.5918 + 19.8872 × 3 + 3.9463 × 7) = 19.8872 (= β̂1).

Figure 4.11: The fit of the additive regression model: taste score versus Lactic and H2S.

Therefore, we may write

β̂1 = µ̂(4, 7) − µ̂(3, 7) = µ̂(4, 2) − µ̂(3, 2),

or, more generally,


β̂1 = µ̂(x1 , x2 ) − µ̂(x1 − 1, x2 ).
Similar calculations will show that

β̂2 = µ̂(x1 , x2 ) − µ̂(x1 , x2 − 1).

What do we learn from these calculations?

• the effect of lactic acid (β̂1 = 19.9) does not depend on the H2 S concentration

• the effect of H2 S (β̂2 = 3.9) does not depend on the lactic acid concentration

• the effect of lactic acid (β̂1 = 19.9) has to be interpreted conditionally on H2 S, but it
does not depend on the H2 S concentration

• the effect of H2S (β̂2 = 3.9) has to be interpreted conditionally on lactic acid, but it does not depend on the lactic acid concentration

• the effects of lactic acid and H2 S are therefore additive.

The latter property is further illustrated by the R output below, which shows the results of
two simple linear regression analyses: one for each of the predictor variables. The effects are

different in the simple regression models because in the simple models their interpretations
are not conditional on the other covariate (predictor) variable.
The simple linear regression model with only lactic acid as predictor is given by:

Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -29.859 10.582 -2.822 0.00869 **
Lactic 37.720 7.186 5.249 1.41e-05 ***

The simple linear regression model with only H2 S as regressor is given by:

Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -9.7868 5.9579 -1.643 0.112
H2S 5.7761 0.9458 6.107 1.37e-06 ***

We also show the ANOVA table that R produces:

Analysis of Variance Table

Response: taste
Df Sum Sq Mean Sq F value Pr(>F)
Lactic 1 3800.4 3800.4 38.446 1.25e-06 ***
H2S 1 1193.5 1193.5 12.074 0.001743 **
Residuals 27 2669.0 98.9

By checking the p-values we can determine that there are statistically significant relations
between both the cheese taste score and the H2 S concentration, and the cheese taste score
and the lactic acid concentrations. From the estimation of the regression coefficients, we
can read the sign and magnitude of these effects. In particular, we can note that lactic acid
concentration appears to have a stronger effect on the taste score than the H2 S concentration.
Finally, we must also assess the assumptions underlying the tests. In Figure 4.12, the QQ-
plot of the residuals is shown. It suggests that the error terms are normally distributed, as
we would wish to find.
It is also informative to look at a scatter plot of the residuals versus the covariates. These residual graphs are shown in Figure 4.13. They support the assumption that the effects of lactic acid and H2S are linear, and that the residual variance is constant. We can therefore consider the assumptions satisfied, and conclude that our statistical analysis is valid.
Figure 4.12: The normal QQ-plot of the residuals of the additive regression model.

Figure 4.13: The residual plots for the additive regression model: residuals versus lactic acid (left) and versus H2S (right).

Figure 4.14: An illustration of a regression model with interaction: for small values of X2 there is a positive linear effect of X1, but for large values of X2 there is a negative linear effect of X1.

4.2.4 The interaction model

In this final section, we will deal briefly with the question of what happens if the effect of one predictor on the response variable also depends on the other predictor. As we saw for the two-way ANOVA interaction model, interaction between our statistical variables can happen very often in the real world.
In our cheese example, we have to consider the possibility that the effect of H2 S concentration
on the taste score also depends on the concentration of lactic acid, and vice versa. How can
we make predictions in this case?
Consider the regression model

Yi = µ + β1 X1i + β2 X2i + β3 X1i X2i + εi ,  i = 1, . . . , n.

In this model, the term β3 X1i X2i is the interaction term.
Figure 4.14 illustrates the regression model with the interaction term: notice that the effect
of X2 on the average response now depends on the value of X1 . In other words, the regression
plane is no longer flat like in Figure 4.10, but is now curved. This tells us that the response
variable Y depends simultaneously on X1 and X2 , and we cannot study their effects separately
like we did in the additive model.
To make the interpretation of the interaction model more clear, we can rewrite the model as

Yi = µ + (β1 + β3 X2i ) X1i + β2 X2i + εi .

This representation shows that the regression coefficient of X1 is equal to β1 + β3 X2 , i.e. the regression coefficient of X1 is a function of X2 .
The model could also have been rewritten as

Yi = µ + β1 X1i + (β2 + β3 X1i ) X2i + εi ,

which illustrates that the same reasoning applies to the effect of X2 .


We will not go any deeper here into the details of this model. Instead we will see how R fits this model, using our cheese example. The results obtained from R for a multiple linear regression with interaction on the cheese dataset are shown below.

Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -23.187 27.749 -0.836 0.411
Lactic 16.725 20.479 0.817 0.422
H2S 3.236 4.382 0.738 0.467
Lactic:H2S 0.488 2.902 0.168 0.868

Residual standard error: 10.13 on 26 degrees of freedom


Multiple R-squared: 0.6521,Adjusted R-squared: 0.6119
F-statistic: 16.24 on 3 and 26 DF, p-value: 3.768e-06
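As a sketch, this model could be fitted, and compared with the additive model, along the following lines (assuming the cheese data frame and the additive fit fit.add from before):

# multiple linear regression with an interaction term between Lactic and H2S
fit.int <- lm(taste ~ Lactic * H2S, data = cheese)   # same as Lactic + H2S + Lactic:H2S
summary(fit.int)

# F-test comparing the additive and the interaction model
anova(fit.add, fit.int)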

Note that the p-value for the interaction effect Lactic:H2S is p = 0.868 > 0.05. So we conclude that we should not reject the null hypothesis that there is no interaction effect between our two predictor variables.
We are in a similar situation as for our whipped cream example for two-way ANOVA: the proper approach is to start from a model with interaction effects and first test for the absence of the interaction effect. For the cheese experiment we indeed find no evidence of interaction, and so we can safely use the additive linear regression model for this dataset.
