
School of Mathematics, Statistics and Computer Science

STAT261
Statistical Inference Notes
Printed at the University of New England, October 4, 2007

Contents

1 Estimation
  1.1  Statistics
       1.1.1  Examples of Statistics
  1.2  Estimation by the Method of Moments
  1.3  Estimation by the Method of Maximum Likelihood
  1.4  Properties of Estimators
  1.5  Examples of Estimators and their Properties
  1.6  Properties of Maximum Likelihood Estimators
  1.7  Confidence Intervals
       1.7.1  Pivotal quantity
  1.8  Bayesian estimation
       1.8.1  Bayes' theorem for random variables
       1.8.2  Posterior is prior × likelihood
       1.8.3  Likelihood
       1.8.4  Prior
       1.8.5  Posterior
  1.9  Normal Prior and Likelihood
  1.10 Bootstrap Confidence Intervals
       1.10.1 The empirical cumulative distribution function

2 Hypothesis Testing
  2.1  Introduction
  2.2  Terminology and Notation
       2.2.1  Hypotheses
       2.2.2  Tests of Hypotheses
       2.2.3  Size and Power of Tests
  2.3  Examples
  2.4  One-sided and Two-sided Tests
       2.4.1  Case (a): Alternative is one-sided
       2.4.2  Case (b): Two-sided Alternative
       2.4.3  Two Approaches to Hypothesis Testing
  2.5  Two-Sample Problems
  2.6  Connection between Hypothesis Testing and CIs
  2.7  Summary
  2.8  Bayesian Hypothesis Testing
       2.8.1  Notation
       2.8.2  Bayesian approach
  2.9  Non-Parametric Hypothesis Testing
       2.9.1  Kolmogorov-Smirnov (KS)
       2.9.2  Asymptotic distribution
       2.9.3  Bootstrap Hypothesis Tests

3 Chi-square Distribution
  3.1  Distribution of S²
  3.2  Chi-Square Distribution
  3.3  Independence of X̄ and S²
  3.4  Confidence Intervals for σ²
  3.5  Testing Hypotheses about σ²
  3.6  χ² and Inv-χ² distributions in Bayesian inference
       3.6.1  Non-informative priors
  3.7  The posterior distribution of the Normal variance
       3.7.1  Inverse Chi-squared distribution
  3.8  Relationship between χ² and Inv-χ²
       3.8.1  Gamma and Inverse Gamma
       3.8.2  Chi-squared and Inverse Chi-squared
       3.8.3  Simulating Inverse Gamma and Inverse-χ² random variables

4 F Distribution
  4.1  Derivation
  4.2  Properties of the F distribution
  4.3  Use of F-Distribution in Hypothesis Testing
  4.4  Pooling Sample Variances
  4.5  Confidence Interval for σ₁²/σ₂²
  4.6  Comparing parametric and bootstrap confidence intervals for σ₁²/σ₂²

5 t-Distribution
  5.1  Derivation
  5.2  Properties of the t-Distribution
  5.3  Use of t-Distribution in Interval Estimation
  5.4  Use of t-distribution in Hypothesis Testing
  5.5  Paired-sample t-test
  5.6  Bootstrap T-intervals

6 Analysis of Count Data
  6.1  Introduction
  6.2  Goodness-of-Fit Tests
  6.3  Contingency Tables
       6.3.1  Method
  6.4  Special Case: 2 × 2 Contingency Table
  6.5  Fisher's Exact Test
  6.6  Parametric Bootstrap-X²

7 Analysis of Variance
  7.1  Introduction
  7.2  The Basic Procedure
  7.3  Single Factor Analysis of Variance
  7.4  Estimation of Means and Confidence Intervals
  7.5  Assumptions Underlying the Analysis of Variance
       7.5.1  Tests for Equality of Variance
  7.6  Estimating the Common Mean

8 Simple Linear Regression
  8.1  Introduction
  8.2  Estimation of α and β
  8.3  Estimation of σ²
  8.4  Inference about α, β and Y
  8.5  Correlation

The notes
Material for these notes has been drawn and collated from the following sources:
Mathematical Statistics with Applications. William Mendenhall, Dennis Wackerly, Richard Scheaffer. Duxbury. ISBN 0-534-92026-8
Bayesian Statistics: An Introduction, third edition. Peter Lee. Hodder Arnold. ISBN 0-340-81405-5
Bayesian Data Analysis. Andrew Gelman, John Carlin, Hal Stern, Donald Rubin. Chapman & Hall. ISBN 1-58488-388-X
An Introduction to the Bootstrap. Bradley Efron, Robert Tibshirani. Chapman & Hall. ISBN 0-412-04231-2
Introduction to Statistics through Resampling Methods and R/S-PLUS. Phillip Good. Wiley. ISBN 0-471-71575-1

There are 3 broad categories of statistical inference:

Parametric, frequentist
Parametric, Bayesian
Non-parametric

These are not mutually exclusive, and both semi-parametric and non-parametric Bayesian models are powerful methods in modern statistics.
Statistical inference is a vast area which cannot be covered in a one-semester undergraduate course. This unit focuses mainly on frequentist parametric statistical inference, but it is not intended that this carries more weight than the others. Each has its place. However, we can introduce the rudimentary concepts about probability, intervals and the mathematics required to derive them. These mathematical techniques apply equally well in the other settings, along with the knowledge of densities, probability, etc.
A discussion of statistical inference would be most unbalanced if it were restricted to only one type of inference. Successful statistical modelling requires flexibility, and the statistician (with practice) recognises which type of model is suggested by the data and the problem at hand.
The notes are organised to introduce the alternative inference methods in sections. At this stage you may consider the different methods as alternate ways of using the information in the data.
Not all topics include the 3 categories of inference, because the Bayesian or non-parametric counterpart does not always align with the frequentist methods. However, where there is sufficient alignment, an alternate method of inference is introduced. It is hoped that this will stimulate students to explore the topics of Bayesian and non-parametric statistics more fully in later units.

Parametric, Frequentist
Both systematic and random components are represented by a mathematical model, and the model is a function of parameters which are estimated from the data. For example,

    y_ij = β₀ + β₁ xᵢ + ε_ij,   ε_ij ~ N(0, σ²)

is a parametric model where the parameters are

the coefficients of the systematic model, β₀, β₁
the variance of the random model, σ².

A rough description of frequentist methods is that population values of the parameters are unknown and, based on a sample (x, y), we obtain estimates of the true but unknown values. These are denoted β̂₀, β̂₁ and σ̂² in this case.

Bayesian
Whereas in frequentist inference the data are considered a random sample and the parameters fixed, Bayesian statistics regards the data as fixed and the parameters as random. The exercise is: given the data, what are the distributions of the parameters such that samples from those distributions could give rise to the observed data?

Non-parametric
This philosophy does not assume that a mathematical form (with parameters) should be imposed on the data; the model is determined by the data themselves. The techniques include:
permutation tests, bootstrap, Kolmogorov-Smirnov tests, etc.
kernel density estimation, kernel regression, smoothing splines, etc.
It seems a good idea not to impose any predetermined mathematical form on the data. However, the limitations are:
the data are not summarized by parameters, so interpretation of the data requires whole curves etc.; there is no ready formula into which to plug values to derive estimates.
it requires sound computing skills and numerical methods.
the statistical method may be appropriate only when there are sufficient data to reliably indicate associations etc. without the assistance of a parametric model.

Chapter 1

Estimation

The application of the methods of probability to the analysis and interpretation of data is
known as statistical inference. In particular, we wish to make an inference about a population based on information contained in a sample. Since populations are characterized by
numerical descriptive measures called parameters, the objective of many statistical investigations is to make an inference about one or more population parameters. There are two
broad areas of inference: estimation (the subject of this chapter) and hypothesis-testing
(the subject of the next chapter).
When we say that we have a random sample X₁, X₂, …, Xₙ from a random variable X, or from a population with distribution function F(x; θ), we mean that X₁, X₂, …, Xₙ are identically and independently distributed random variables, each with c.d.f. F(x; θ), that is, depending on some parameter θ. We usually assume that the form of the distribution, e.g., binomial, Poisson, Normal, etc., is known but the parameter θ is unknown. We wish to obtain information from the data (sample) to enable us to make some statement about the parameter. Note that θ may be a vector, e.g., θ = (μ, σ²). See WMS 2.12 for more detailed comments on random samples.
The general problem of estimation is to find out something about θ using the information in the observed values of the sample, x₁, x₂, …, xₙ. That is, we want to choose a function H(x₁, x₂, …, xₙ) that will give us a good estimate of the parameter θ in F(x; θ).

1.1 Statistics

We will introduce the technical meaning of the word statistic and look at some commonly
used statistics.
Definition 1.1
Any function of the elements of a random sample, which does
not depend on unknown parameters, is called a statistic.

Strictly speaking, H(X₁, X₂, …, Xₙ) is a statistic and H(x₁, x₂, …, xₙ) is the observed value of the statistic. Note that the former is a random variable, often called an estimator of θ, while H(x₁, x₂, …, xₙ) is called an estimate of θ. However, the word estimate is sometimes used for both the random variable and its observed value.

1.1.1 Examples of Statistics

Suppose that we have a random sample X₁, X₂, …, Xₙ from a distribution with mean μ and variance σ².

1. X̄ = Σᵢ₌₁ⁿ Xᵢ / n is called the sample mean.

2. S² = Σᵢ₌₁ⁿ (Xᵢ − X̄)² / (n − 1) is called the sample variance.

3. S = √S² is called the sample standard deviation.

4. Mᵣ = Σᵢ₌₁ⁿ Xᵢʳ / n is called the rth sample moment about the origin.

5. Suppose that the random variables X₁, …, Xₙ are ordered and re-written as X₍₁₎, X₍₂₎, …, X₍ₙ₎. The vector (X₍₁₎, …, X₍ₙ₎) is called the ordered sample.
   (a) X₍₁₎ is called the minimum of the sample, sometimes written Xmin or min(Xᵢ).
   (b) X₍ₙ₎ is called the maximum of the sample, sometimes written Xmax or max(Xᵢ).
   (c) Xmax − Xmin = R is called the sample range.
   (d) The sample median is X₍₍ₙ₊₁₎/₂₎ if n is odd, and ½( X₍ₙ/₂₎ + X₍ₙ/₂₊₁₎ ) if n is even.

Computer Exercise 1.1
Generate a random sample of size 100 from a normal distribution with mean 10 and standard deviation 3. Use R to find the value of the (sample) mean, variance, standard deviation, minimum, maximum, range, median, and M₂, the statistics defined above.
Repeat for a sample of size 100 from an exponential distribution with parameter 1.
Solution:
#________ SampleStats.R _________
# Generate the normal random sample
rn <- rnorm(n=100, mean=10, sd=3)
print(summary(rn))
cat("mean  = ", mean(rn), "\n")
cat("var   = ", var(rn), "\n")
cat("sd    = ", sd(rn), "\n")
cat("range = ", range(rn), "\n")
cat("median = ", median(rn), "\n")
cat("Second Moment = ", mean(rn^2), "\n")

> source("SampleStats.R")
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
    1.8     7.9     9.4     9.8    12.0    18.2
mean  =  9.9
var   =  9.5
sd    =  3.1
range =  1.8 18
median =  9.4
Second Moment =  106

If you are using Rcmdr, the menu for generating normal random variables is
Distributions → Normal distribution → Sample from a normal distribution
You must then supply a name for the data set (e.g. rn) and the parameters mu = 10, sigma = 3. Make the number of rows = 100 and the number of columns = 1.

When you click OK a data set containing the numbers is produced. Here the name of the data set is rn and this appears as "Data set: rn" in the top left of Rcmdr. Observe in the script window that Rcmdr uses the input through the menus (GUIs) to produce a script akin to that above. Rcmdr is an alternative way of computing but it is not sufficiently comprehensive to do all our computing and usually requires augmenting with other scripts.
If there is an active data set, summary statistics are derived by
Statistics → Summaries → Active data set
The summary statistics appear in the Output window.


The exponential random sample is generated using: re <- rexp(n=100, rate=1)
or the Rcmdr menus
Distributions Exponential distribution Sample from an exponential distribution
In the probability distributions considered in STAT260 the mean and variance are simple functions of the parameter(s), so when considering statistics, it is helpful to note which
ones youd expect to give information about the mean and which ones give information
about the variance. Clearly, X and the sample median give information about whereas
S 2 and R give information about 2 .

We have not previously encountered S², Mᵣ, R, etc. but we (should) already know the following facts about X̄.
(i) It is a random variable with E(X̄) = μ and Var(X̄) = σ²/n, where μ = E(Xᵢ), σ² = Var(Xᵢ).
(ii) If X₁, X₂, …, Xₙ is from a normal distribution, then X̄ is also normally distributed.
(iii) For large n and any distribution of the Xᵢ for which a mean (μ) and variance (σ²) exist, X̄ is distributed approximately normal with mean μ and variance σ²/n (by the Central Limit Theorem).
Next we will consider some general methods of estimation. Since different methods may lead to different estimators for the same parameter, we will then need to consider criteria for deciding whether one estimate is better than another.

1.2 Estimation by the Method of Moments

Recall that, for a random variable X, the rth moment about the origin is μ′ᵣ = E(Xʳ), and that for a random sample X₁, X₂, …, Xₙ, the rth sample moment about the origin is defined by

    Mᵣ = Σᵢ₌₁ⁿ Xᵢʳ / n,   r = 1, 2, 3, …

and its observed value is denoted by

    mᵣ = Σᵢ₌₁ⁿ xᵢʳ / n.

Note that the first sample moment is just the sample mean, X̄.
We will first prove a property of sample moments.
Theorem 1.1
Let X₁, X₂, …, Xₙ be a random sample of X. Then

    E(Mᵣ) = μ′ᵣ,   r = 1, 2, 3, …

Proof

    E(Mᵣ) = E( (1/n) Σᵢ₌₁ⁿ Xᵢʳ ) = (1/n) Σᵢ₌₁ⁿ E(Xᵢʳ) = (1/n) Σᵢ₌₁ⁿ μ′ᵣ = μ′ᵣ.

This theorem provides the motivation for estimation by the method of moments (with the estimator being referred to as the method of moments estimator or MME). The sample moments, M₁, M₂, …, are random variables whose means are μ′₁, μ′₂, …. Since the population moments depend on the parameters of the distribution, estimating them by the sample moments leads to estimation of the parameters.
We will consider this method of estimation by means of 2 examples, then state the general procedure.
Example 1.1
In this example, the distribution has only one parameter.
Given X₁, X₂, …, Xₙ is a random sample from a U(0, θ) distribution, find the method of moments estimator (MME) of θ.
Solution: Now, for the uniform distribution (f(x) = (1/θ) I[0,θ](x)),

    μ = E(X) = ∫₀^θ x (1/θ) dx = θ/2.

Using the Method of Moments we proceed to estimate μ = θ/2 by m₁. Thus, since m₁ = x̄, we have

    θ̃/2 = x̄   and   θ̃ = 2x̄.

Then θ̃ = 2x̄, and the MME of θ is 2X̄.
Computer Exercise 1.2
Generate 100 samples of size 10 from a uniform distribution U(0, θ) with θ = 10. Estimate the value of θ from your samples using the method of moments and plot the results. Comment on the results.
In this exercise, we know a priori that θ = 10 and have generated the random samples. The samples are analysed as if θ were unknown and θ estimated by the method of moments. Then we can compare the estimates with the known value.
Solution:

#_________ UniformMoment.R _____
theta <- 10
sampsz <- 10
nsimulations <- 100
theta.estimates <- numeric(nsimulations)
for (i in 1:nsimulations){
  ru <- runif(n=sampsz, min=0, max=theta)
  Xbar <- mean(ru)
  theta.estimates[i] <- 2*Xbar
}   # end of the i loop
plot(density(theta.estimates))

[Figure: density plot of the 100 method of moments estimates of θ; density(x = theta.estimates), N = 100, Bandwidth = 0.605.]

(You should do the exercise and obtain a plot for yourself.)
It should be clear from the plot that about 50% of the estimates are greater than 10, which is outside the parameter space for a U(0, 10) distribution. This is undesirable.
Example 1.2
In this example the distribution has two parameters.
Given X₁, …, Xₙ is a random sample from the N(μ, σ²) distribution, find the method of moments estimates of μ and σ².
Solution:
For the normal distribution, E(X) = μ and E(X²) = σ² + μ² (Theorem 2.2, STAT260).
Using the Method of Moments, equate E(X) to m₁ and E(X²) to m₂, so that

    μ̃ = x̄   and   σ̃² + μ̃² = m₂.

That is, estimate μ by x̄ and estimate σ² by m₂ − x̄². Then

    μ̃ = x̄   and   σ̃² = (1/n) Σ xᵢ² − x̄².

The latter can also be written as σ̃² = (1/n) Σᵢ₌₁ⁿ (xᵢ − x̄)².
Computer Exercise 1.3
Generate 100 samples of size 10 from a normal distribution with μ = 14 and σ = 4. Estimate μ and σ² from your samples using the method of moments. Plot the estimated values of μ and σ². Comment on your results.
Solution:

#_____________NormalMoments.R ___________
mu <- 14
sigma <- 4
sampsz <- 10
nsimulations <- 100
mu.estimates <- numeric(nsimulations)
var.estimates <- numeric(nsimulations)
for (i in 1:nsimulations){
  rn <- rnorm(mean=mu, sd=sigma, n=sampsz)
  mu.estimates[i] <- mean(rn)
  var.estimates[i] <- mean( (rn - mean(rn))^2 )
}  # end of i loop
plot(density(mu.estimates))
plot(density(var.estimates))

[Figure: density plots of the 100 estimates of μ (left) and of σ² (right).]

The plot you obtain for the means should be centred around the true mean of 14. However, you will notice that the plot of the variances is not centred about the true variance of 16 as you would like. Rather it will appear to be centred about a value less than 16. The reason for this will become evident when we study the properties of estimators in section 1.5.

General Procedure
Let X₁, X₂, …, Xₙ be a random sample from F(x; θ₁, …, θₖ). That is, suppose that there are k parameters to be estimated. Let μ′ᵣ, mᵣ (r = 1, 2, …, k) denote the first k population and sample moments respectively, and suppose that each of these population moments is a certain known function of the parameters. That is,

    μ′₁ = g₁(θ₁, …, θₖ)
    μ′₂ = g₂(θ₁, …, θₖ)
    ⋮
    μ′ₖ = gₖ(θ₁, …, θₖ).

Solving simultaneously the set of equations

    μ′ᵣ = gᵣ(θ₁, …, θₖ) = mᵣ,   r = 1, 2, …, k,

gives the required estimates θ̃₁, …, θ̃ₖ.
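To make the general procedure concrete, here is a small R sketch (an added illustration, not part of the original notes; the Gamma example and all values are chosen arbitrarily). It solves the first two moment equations for a two-parameter distribution.

# Method of moments for a Gamma(shape, scale) sample (illustrative sketch).
# Here mu'_1 = shape*scale and mu'_2 = shape*scale^2 + (shape*scale)^2,
# so scale = (m2 - m1^2)/m1 and shape = m1/scale.
set.seed(1)
x  <- rgamma(n = 50, shape = 3, scale = 2)   # "unknown" parameters 3 and 2
m1 <- mean(x)
m2 <- mean(x^2)
scale.mme <- (m2 - m1^2)/m1
shape.mme <- m1/scale.mme
cat("shape =", round(shape.mme, 2), " scale =", round(scale.mme, 2), "\n")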

1.3 Estimation by the Method of Maximum Likelihood

First the term likelihood of the sample must be defined. This has to be done separately for discrete and continuous distributions.
Definition 1.2
Let x₁, x₂, …, xₙ be sample observations taken on the random variables X₁, X₂, …, Xₙ. Then the likelihood of the sample, L(θ | x₁, x₂, …, xₙ), is defined as:
(i) the joint probability of x₁, x₂, …, xₙ if X₁, X₂, …, Xₙ are discrete, and
(ii) the joint probability density function of X₁, …, Xₙ evaluated at x₁, x₂, …, xₙ if the random variables are continuous.

In general the value of the likelihood depends not only on the (fixed) sample x₁, x₂, …, xₙ but on the value of the (unknown) parameter θ, and it can be thought of as a function of θ.
The likelihood function for a set of n identically and independently distributed (iid) random variables, X₁, X₂, …, Xₙ, can thus be written as:

    L(θ; x₁, …, xₙ) = P(X₁ = x₁)·P(X₂ = x₂)⋯P(Xₙ = xₙ)   for X discrete
    L(θ; x₁, …, xₙ) = f(x₁; θ)·f(x₂; θ)⋯f(xₙ; θ)          for X continuous.     (1.1)

For the discrete case, L(θ; x₁, …, xₙ) is the probability (or likelihood) of observing (X₁ = x₁, X₂ = x₂, …, Xₙ = xₙ). It would then seem that a sensible approach to selecting an estimate of θ would be to find the value of θ which maximizes the probability of observing (X₁ = x₁, X₂ = x₂, …, Xₙ = xₙ), the event which occurred.
The maximum likelihood estimate (MLE) of θ is defined as that value of θ which maximizes the likelihood. To state it more mathematically, the MLE of θ is that value of θ, say θ̂, such that

    L(θ̂; x₁, …, xₙ) > L(θ′; x₁, …, xₙ),

where θ′ is any other value of θ.
Before we consider particular examples of MLEs, some comments about notation and technique are needed.

Comments
1. It is customary to use θ̂ to denote both the estimator (random variable) and the estimate (its observed value). Recall that we used θ̃ for the MME.
2. Since L(θ; x₁, x₂, …, xₙ) is a product, and sums are usually more convenient to deal with than products, it is customary to maximize log L(θ; x₁, …, xₙ), which we usually abbreviate to l(θ). This has the same effect: since log L is a strictly increasing function of L, it will take on its maximum at the same point.
3. In some problems, θ will be a vector, in which case L(θ) has to be maximized by differentiating with respect to 2 (or more) variables and solving simultaneously 2 (or more) equations.
4. The method of differentiation to find a maximum only works if the function concerned actually has a turning point.
Example 1.3
Given X is distributed bin(1, p) where p ∈ (0, 1), and a random sample x₁, x₂, …, xₙ, find the maximum likelihood estimate of p.
Solution: The likelihood is

    L(p; x₁, x₂, …, xₙ) = P(X₁ = x₁)P(X₂ = x₂)⋯P(Xₙ = xₙ)
                        = ∏ᵢ₌₁ⁿ C(1, xᵢ) p^(xᵢ) (1 − p)^(1−xᵢ)
                        = p^(x₁+x₂+⋯+xₙ) (1 − p)^(n−x₁−x₂−⋯−xₙ)
                        = p^(Σxᵢ) (1 − p)^(n−Σxᵢ).

So

    log L(p) = Σxᵢ log p + (n − Σxᵢ) log(1 − p).

Differentiating with respect to p, we have

    d log L(p)/dp = Σxᵢ/p − (n − Σxᵢ)/(1 − p).

This is equal to zero when Σxᵢ (1 − p) = p(n − Σxᵢ), that is, when p = Σxᵢ/n. This estimate is denoted by p̂.
Thus, if the random variable X is distributed bin(1, p), the MLE of p derived from a sample of size n is

    p̂ = X̄.     (1.2)
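As a check on this result, the following R sketch (an illustration added here, not part of the original notes; the simulated data are arbitrary) maximizes the Bernoulli log-likelihood numerically and compares the answer with x̄.

# Numerical check of Example 1.3 (illustrative sketch).
set.seed(2)
x <- rbinom(n = 40, size = 1, prob = 0.3)
loglik <- function(p) sum(x)*log(p) + (length(x) - sum(x))*log(1 - p)
p.hat <- optimize(loglik, interval = c(0.001, 0.999), maximum = TRUE)$maximum
cat("numerical MLE =", round(p.hat, 3), " xbar =", round(mean(x), 3), "\n")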
Example 1.4
Given x₁, x₂, …, xₙ is a random sample from a N(μ, σ²) distribution, where both μ and σ² are unknown, find the maximum likelihood estimates of μ and σ².

Solution: Write the likelihood as

    L(μ, σ²; x₁, …, xₙ) = ∏ᵢ₌₁ⁿ (1/√(2πσ²)) exp( −(xᵢ − μ)²/(2σ²) )
                        = (2πσ²)^(−n/2) exp( −Σᵢ₌₁ⁿ (xᵢ − μ)²/(2σ²) ).

So

    log L(μ, σ²) = −(n/2) log(2π) − (n/2) log σ² − Σᵢ₌₁ⁿ (xᵢ − μ)²/(2σ²).

To maximize this with respect to μ and σ² we must solve simultaneously the two equations

    ∂ log L(μ, σ²)/∂μ = 0     (1.3)
    ∂ log L(μ, σ²)/∂σ² = 0.   (1.4)

These equations become, respectively,

    (1/σ²) Σᵢ₌₁ⁿ (xᵢ − μ) = 0     (1.5)

    −n/(2σ²) + Σᵢ₌₁ⁿ (xᵢ − μ)²/(2σ⁴) = 0.     (1.6)

From (1.5) we obtain Σᵢ₌₁ⁿ xᵢ = nμ, so that μ̂ = x̄. Using this in equation (1.6), we obtain

    σ̂² = Σᵢ₌₁ⁿ (xᵢ − x̄)² / n.

Thus, if X is distributed N(μ, σ²), the MLEs of μ and σ² derived from a sample of size n are

    μ̂ = X̄   and   σ̂² = Σᵢ₌₁ⁿ (Xᵢ − X̄)² / n.     (1.7)

Note that these are the same estimators as obtained by the method of moments.
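The closed-form answer can also be checked numerically. The sketch below (added for illustration, not part of the original notes; the simulated data and starting values are arbitrary) maximizes the normal log-likelihood with optim() and compares the result with x̄ and Σ(xᵢ − x̄)²/n.

# Numerical check of Example 1.4 (illustrative sketch).
set.seed(3)
x <- rnorm(n = 50, mean = 14, sd = 4)
negloglik <- function(par) {            # par = c(mu, log(sigma)) keeps sigma > 0
  mu <- par[1]; sigma <- exp(par[2])
  -sum(dnorm(x, mean = mu, sd = sigma, log = TRUE))
}
fit <- optim(par = c(0, 0), fn = negloglik)
cat("numerical:   mu =", round(fit$par[1], 3), " sigma^2 =", round(exp(fit$par[2])^2, 3), "\n")
cat("closed form: mu =", round(mean(x), 3), " sigma^2 =", round(mean((x - mean(x))^2), 3), "\n")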
Example 1.5
Given random variable X is distributed uniformly on [0, θ], find the MLE of θ based on a sample of size n.
Solution: Now f(xᵢ; θ) = 1/θ, xᵢ ∈ [0, θ], i = 1, 2, …, n. So the likelihood is

    L(θ; x₁, x₂, …, xₙ) = ∏ᵢ₌₁ⁿ (1/θ) = 1/θⁿ.

[Figure 1.1: L(θ) = 1/θⁿ, a decreasing function of θ.]

When we come to find the maximum of this function we note that the slope is not zero anywhere, so there is no use finding dL(θ)/dθ or d log L(θ)/dθ.
Note however that L(θ) increases as θ → 0. So L(θ) is maximized by setting θ equal to the smallest value it can take. If the observed values are x₁, …, xₙ then θ can be no smaller than the largest of these. This is because xᵢ ∈ [0, θ] for i = 1, …, n. That is, θ ≥ each xᵢ.
Thus, if X is distributed U(0, θ), the MLE of θ is

    θ̂ = max(Xᵢ).     (1.8)

Comment
The Method of Moments was first proposed near the turn of the century by the British statistician Karl Pearson. The Method of Maximum Likelihood goes back much further: both Gauss and Daniel Bernoulli made use of the technique, the latter as early as 1777. Fisher, though, in the early years of the twentieth century, was the first to make a thorough study of the method's properties, and the procedure is often credited to him.

1.4 Properties of Estimators

Using different methods of estimation can lead to different estimators, so criteria for deciding which are good estimators are required. Before listing the qualities of a good estimator, it is important to understand that estimators are random variables. For example, suppose that we take a sample of size 5 from a uniform distribution and calculate x̄. Each time we repeat the experiment we will probably get a different sample of 5 and therefore a different x̄. The behaviour of an estimator for different random samples will be described by a probability distribution. The actual distribution of the estimator is not a concern here and only its mean and variance will be considered. As a first condition it seems reasonable to ask that the distribution of the estimator be centred around the parameter it is estimating. If not, it will tend to overestimate or underestimate θ. A second property an estimator should possess is precision: an estimator is precise if the dispersion of its distribution is small. These two concepts are incorporated in the definitions of unbiasedness and efficiency below.
In the following, X₁, X₂, …, Xₙ is a random sample from the distribution F(x; θ) and H(X₁, …, Xₙ) = θ̂ will denote an estimator of θ (not necessarily the MLE).
Definition 1.3 Unbiasedness
An estimator θ̂ of θ is unbiased if

    E(θ̂) = θ  for all θ.     (1.9)

If an estimator is biased, the bias is given by

    b = E(θ̂) − θ.     (1.10)

There may be a large number of unbiased estimators of a parameter for any given distribution, and a further criterion for choosing between all the unbiased estimators is needed.
Definition 1.4 Efficiency
Let θ̂₁ and θ̂₂ be 2 unbiased estimators of θ with variances Var(θ̂₁), Var(θ̂₂) respectively. We say that θ̂₁ is more efficient than θ̂₂ if

    Var(θ̂₁) < Var(θ̂₂).

That is, θ̂₁ is more efficient than θ̂₂ if it has a smaller variance.
Definition 1.5 Relative Efficiency
The relative efficiency of θ̂₂ with respect to θ̂₁ is defined as

    efficiency = Var(θ̂₁)/Var(θ̂₂).     (1.11)

Computer Exercise 1.4
Generate 100 random samples of size 10 from a U(0, 10) distribution. For each of the 100 samples generated calculate the MME and MLE for θ and graph the results.
(a) From the graphs does it appear that the estimators are biased or unbiased? Explain.
(b) Estimate the variance of the two estimators by finding the sample variance of the 100 estimates (for each estimator). Which estimator appears more efficient?
Solution:
# Compare the MME and MLE of theta for U(0, theta)
theta <- 10
sampsz <- 10
nsimulations <- 100
moment.estimates <- numeric(nsimulations)
ML.estimates <- numeric(nsimulations)
for (i in 1:nsimulations){
  ru <- runif(n=sampsz, min=0, max=theta)
  moment.estimates[i] <- 2*mean(ru)
  ML.estimates[i] <- max(ru)
}
plot(density(moment.estimates),
     xlab=" ", ylab=" ", main=" ", ylim=c(0, 0.6), las=1)
abline(v=theta, lty=3)
lines(density(ML.estimates), lty=2)
legend(11, 0.5, legend=c("moment","ML"), lty=1:2, cex=0.6)

[Figure: density curves of the 100 moment estimates (solid) and maximum likelihood estimates (dashed) of θ, with a dotted vertical line at θ = 10.]

You should see that the Method of Moments gives unbiased estimates of θ, but many are not in the range space, as noted in Computer Exercise 1.2. The maximum likelihood estimates are all less than 10 and so are biased.
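As a follow-up to part (b) of the exercise, here is a short sketch (added for illustration, not part of the original notes) that approximates the bias and the relative efficiency directly from the estimates produced by the script above.

# Approximate bias and relative efficiency from the simulated estimates
# (assumes theta, moment.estimates and ML.estimates from the script above).
cat("bias of MME :", round(mean(moment.estimates) - theta, 3), "\n")
cat("bias of MLE :", round(mean(ML.estimates) - theta, 3), "\n")
cat("relative efficiency Var(MME)/Var(MLE):",
    round(var(moment.estimates)/var(ML.estimates), 2), "\n")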
It will now be useful to indicate that the estimator is based on a sample of size n by denoting it by θ̂ₙ.

Definition 1.6 Consistency
θ̂ₙ is a consistent estimator of θ if

    lim (n→∞) P( |θ̂ₙ − θ| > ε ) = 0  for all ε > 0.     (1.12)

We then say that θ̂ₙ converges in probability to θ as n → ∞. Equivalently,

    lim (n→∞) P( |θ̂ₙ − θ| < ε ) = 1.

This is a large-sample or asymptotic property. Consistency has to do only with the limiting behaviour of an estimator as the sample size increases without limit and does not imply that the observed value of θ̂ is necessarily close to θ for any specific size of sample n. If only a relatively small sample is available, it would seem immaterial whether a consistent estimator is used or not.
The following theorem (which will not be proved) gives a method of testing for consistency.
Theorem 1.2
If lim (n→∞) E(θ̂ₙ) = θ and lim (n→∞) Var(θ̂ₙ) = 0, then θ̂ₙ is a consistent estimator of θ.

Computer Exercise 1.5
Demonstrate that the MLE is consistent for estimating θ for a U(0, θ) distribution.
Method: Generate the random variables one batch at a time. After each batch is generated, calculate the MLE of θ for the sample generated to that point, and so obtain a sequence of estimators, {θ̂ₙ}. Plot the sequence.
Solution: Uniform random variables are generated and θ̂ₙ is found as the maximum of θ̂ₙ₋₁ and the uniform random variables generated at step n. The estimates are plotted in order.
#_________ UniformConsistency.R _____
theta <- 10
sampsz <- 10
nsimulations <- 100
ML.est <- numeric(nsimulations)
for (i in 1:nsimulations){
  ru <- runif(n=sampsz, min=0, max=theta)
  if(i==1) ML.est[i] <- max(ru)
  else ML.est[i] <- max(ML.est[i-1], max(ru))
}
plot(ML.est, type="l")
abline(h=theta, lty=2)

[Figure: the sequence of maximum likelihood estimates climbing towards the dashed line at θ = 10.]

As n increases, θ̂ₙ → θ.
The final concept of sufficiency requires some explanation before a formal definition is given. The random sample X₁, X₂, …, Xₙ drawn from the distribution with F(x; θ) contains information about the parameter θ. To estimate θ, this sample is first condensed to a single random variable by use of a statistic θ̂ = H(X₁, X₂, …, Xₙ). The question of interest is whether any information about θ has been lost by this condensing process. For example, a possible choice of θ̂ is H(X₁, …, Xₙ) = X₁, in which case it seems that some of the information in the sample has been lost since the observations X₂, …, Xₙ have been ignored. In many cases, the statistic θ̂ does contain all the relevant information about the parameter θ that the sample contains. This is the concept of sufficiency.
Definition 1.7 Sufficiency
Let X₁, X₂, …, Xₙ be a random sample from F(x; θ) and let θ̂ = H(X₁, X₂, …, Xₙ) be a statistic (a function of the Xᵢ only). Let θ′ = H′(X₁, X₂, …, Xₙ) be any other statistic which is not a function of θ̂. If, for each of the statistics θ′, the conditional density of θ′ given θ̂ does not involve θ, then θ̂ is called a sufficient statistic for θ. That is, if f(θ′ | θ̂) does not contain θ, then θ̂ is sufficient for θ.
Note: Application of this definition will not be required, but you should think of sufficiency in the sense of using all the relevant information in the sample. For example, to say that x̄ is sufficient for μ in a particular distribution means that knowledge of the actual observations x₁, x₂, …, xₙ gives us no more information about μ than does only knowing the average of the n observations.

1.5 Examples of Estimators and their Properties

In this section we will consider the sample mean X̄ and the sample variance S² and examine which of the above properties they have.
Theorem 1.3
Let X be a random variable with mean μ and variance σ². Let X̄ be the sample mean based on a random sample of size n. Then X̄ is an unbiased and consistent estimator of μ.
Proof
Now E(X̄) = μ, no matter what the sample size is, and Var(X̄) = σ²/n. The latter approaches 0 as n → ∞, satisfying Theorem 1.2.
It can also be shown that, of all linear functions of X₁, X₂, …, Xₙ, X̄ has minimum variance. Note that the above theorem is true no matter what distribution is sampled. Some applications are given below.
For a random sample X₁, X₂, …, Xₙ, X̄ is an unbiased and consistent estimator of:
(i) μ when the Xᵢ are distributed N(μ, σ²);
(ii) p when the Xᵢ are distributed bin(1, p);
(iii) λ when the Xᵢ are distributed Poisson(λ);
(iv) 1/θ when the Xᵢ have p.d.f. f(x) = θe^(−θx), x > 0.

Sample Variance
Recall that the sample variance is defined by

    S² = Σᵢ₌₁ⁿ (Xᵢ − X̄)² / (n − 1).

Theorem 1.4
Given X₁, X₂, …, Xₙ is a random sample from a distribution with mean μ and variance σ², then S² is an unbiased estimator of σ².
Proof

    (n − 1)E(S²) = E Σᵢ₌₁ⁿ (Xᵢ − X̄)²
                 = E Σᵢ₌₁ⁿ [ (Xᵢ − μ) − (X̄ − μ) ]²
                 = E [ Σᵢ₌₁ⁿ (Xᵢ − μ)² − 2(X̄ − μ) Σᵢ₌₁ⁿ (Xᵢ − μ) + n(X̄ − μ)² ]
                 = E [ Σᵢ₌₁ⁿ (Xᵢ − μ)² − 2n(X̄ − μ)² + n(X̄ − μ)² ]
                 = E Σᵢ₌₁ⁿ (Xᵢ − μ)² − nE(X̄ − μ)²
                 = Σᵢ₌₁ⁿ Var(Xᵢ) − nVar(X̄)
                 = nσ² − n·σ²/n
                 = (n − 1)σ².     (1.13)

So E(S²) = σ².

We make the following comments.
(i) In the special case of Theorem 1.4 where the Xᵢ are distributed N(μ, σ²) with both μ and σ² unknown, the MLE of σ² is Σᵢ₌₁ⁿ (Xᵢ − X̄)²/n, which is (n − 1)S²/n. So in this case the MLE is biased.
(ii) The number in the denominator of S², that is, n − 1, is called the number of degrees of freedom. The numerator is the sum of n deviations (from the mean) squared, but the deviations are not independent. There is one constraint on them, namely the fact that Σ(Xᵢ − X̄) = 0. As soon as n − 1 of the Xᵢ − X̄ are known, the nth one is determined.
(iii) In calculating the observed value of S², s², the following form is usually convenient:

    s² = [ Σ xᵢ² − (Σ xᵢ)²/n ] / (n − 1)     (1.14)

or, equivalently,

    s² = [ Σ xᵢ² − n x̄² ] / (n − 1).

The equivalence of the two forms is easily seen:

    Σ (xᵢ − x̄)² = Σ (xᵢ² − 2x̄xᵢ + x̄²) = Σ xᵢ² − 2x̄ Σ xᵢ + n x̄²,

where the right-hand side can readily be seen to be Σ xᵢ² − (Σ xᵢ)²/n.
(iv) For any distribution that has a fourth moment,

    Var(S²) = (μ₄ − 3μ₂²)/n + 2μ₂²/(n − 1).     (1.15)

Clearly lim (n→∞) Var(S²) = 0, so from Theorem 1.2, S² is a consistent estimator of σ².
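The consistency of S² can also be seen by simulation. The following sketch (an added illustration, not part of the original notes; the choice σ = 2 and the sample sizes are arbitrary) shows the sampling variance of S² shrinking as n grows.

# Simulation sketch: Var(S^2) shrinks as n grows, illustrating Theorem 1.2.
set.seed(4)
for (n in c(10, 40, 160, 640)) {
  s2 <- replicate(2000, var(rnorm(n, mean = 0, sd = 2)))   # true sigma^2 = 4
  cat("n =", n, "  mean(S^2) =", round(mean(s2), 2),
      "  Var(S^2) =", round(var(s2), 3), "\n")
}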

1.6 Properties of Maximum Likelihood Estimators

The following four properties are the main reasons for recommending the use of Maximum Likelihood Estimators.
(i) The MLE is consistent.
(ii) The MLE has a distribution that tends to normality as n → ∞.
(iii) If a sufficient statistic for θ exists, then the MLE is sufficient.
(iv) The MLE is invariant under functional transformations. That is, if θ̂ = H(X₁, X₂, …, Xₙ) is the MLE of θ and if u(θ) is a continuous monotone function of θ, then u(θ̂) is the MLE of u(θ). This is known as the invariance property of maximum likelihood estimators.
For example, in the normal distribution where the mean is μ and the variance is σ², (n − 1)S²/n is the MLE of σ², so the MLE of σ is √( (n − 1)S²/n ).
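A small numerical illustration of the invariance property (added here as a sketch, not part of the original notes; the simulated data are arbitrary): the MLE of σ obtained by direct maximization agrees with the square root of the MLE of σ².

# Invariance property sketch: sqrt of the MLE of sigma^2 equals the MLE of sigma.
set.seed(5)
x <- rnorm(n = 25, mean = 0, sd = 3)
sigma2.mle <- mean((x - mean(x))^2)        # (n-1)S^2/n
sigma.mle  <- optimize(function(s) -sum(dnorm(x, mean(x), s, log = TRUE)),
                       interval = c(0.01, 20))$minimum
cat("sqrt of MLE of sigma^2 =", round(sqrt(sigma2.mle), 4),
    "  direct MLE of sigma =", round(sigma.mle, 4), "\n")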

1.7 Confidence Intervals

In the earlier part of this chapter we have been considering point estimators of a parameter. By point estimator we are referring to the fact that, after the sampling has been done
and the observed value of the estimator computed, our end-product is the single number
which is hopefully a good approximation for the unknown true value of the parameter. If
the estimator is good according to some criteria, then the estimate should be reasonably
close to the unknown true value. But the single number itself does not include any indication of how high the probability might be that the estimator has taken on a value close
to the true unknown value. The method of confidence intervals gives both an idea of
the actual numerical value of the parameter, by giving it a range of possible values, and a
measure of how confident we are that the true value of the parameter is in that range. To
pursue this idea further consider the following example.
Example 1.6
Consider a random sample of size n from a normal distribution with mean μ (unknown) and known variance σ². Find a 95% confidence interval for the unknown mean, μ.
Solution: We know that the best estimator of μ is X̄ and the sampling distribution of X̄ is N(μ, σ²/n). Then from the standard normal,

    P( |X̄ − μ| / (σ/√n) < 1.96 ) = .95.

The event |X̄ − μ|/(σ/√n) < 1.96 is equivalent to the event

    μ − 1.96 σ/√n < X̄ < μ + 1.96 σ/√n,

which is equivalent to the event

    X̄ − 1.96 σ/√n < μ < X̄ + 1.96 σ/√n.

Hence

    P( X̄ − 1.96 σ/√n < μ < X̄ + 1.96 σ/√n ) = .95.     (1.16)

The two statistics X̄ − 1.96 σ/√n and X̄ + 1.96 σ/√n are the endpoints of a 95% confidence interval for μ. This is reported as:

    The 95% CI for μ is ( X̄ − 1.96 σ/√n , X̄ + 1.96 σ/√n ).
Computer Exercise 1.6
Generate 100 samples of size 9 from a N(0, 1) distribution. Find the 95% CI for μ for each of these samples and count the number that do (don't) contain zero. (You could repeat this, say, 10 times to build up the total number of CIs generated to 1000.) You should observe that about 5% of the intervals don't contain the true value of μ (= 0).
Solution: Use the commands:
#___________ ConfInt.R __________
sampsz <- 9
nsimulations <- 100
non.covered <- 0
for (i in 1:nsimulations){
  rn <- rnorm(mean=0, sd=1, n=sampsz)
  Xbar <- mean(rn)
  s <- sd(rn)
  CI <- qnorm(mean=Xbar, sd=s/sqrt(sampsz), p=c(0.025, 0.975))
  non.covered <- non.covered + (CI[1] > 0) + (CI[2] < 0)
}
cat("Rate of non covering CIs", 100*non.covered/nsimulations, "% \n")

> source("ConfInt.R")
Rate of non covering CIs 8 %

This implies that 8 of the CIs don't contain 0. With a larger sample size we would expect that about 5% of the CIs would not contain zero.
We make the following definition:
Definition 1.8
An interval, at least one of whose endpoints is a random variable, is called a random interval.
In (1.16), we are saying that the probability is 0.95 that the random interval ( X̄ − 1.96 σ/√n , X̄ + 1.96 σ/√n ) contains μ. A confidence interval (CI) has to be interpreted carefully. For a particular sample, where x̄ is the observed value of X̄, a 95% CI for μ is

    ( x̄ − 1.96 σ/√n , x̄ + 1.96 σ/√n ),     (1.17)

but the statement

    x̄ − 1.96 σ/√n < μ < x̄ + 1.96 σ/√n

is either true or false. The parameter μ is a constant, and either the interval contains it, in which case the statement is true, or it does not contain it, in which case the statement is false. How then is the probability 0.95 to be interpreted? It must be considered in terms of the relative frequency with which the indicated event will occur in the long run of similar sampling experiments.
Each time we take a sample of size n, a different x̄, and hence a different interval (1.17), would be obtained. Some of these intervals will contain μ as claimed, and some will not. In fact, if we did this many times, we'd expect that 95 times out of 100 the interval obtained would contain μ. The measure of our confidence is then 0.95 because, before a sample is drawn, there is a probability of 0.95 that the confidence interval to be constructed will cover the true mean.
A statement such as P(3.5 < μ < 4.9) = 0.95 is incorrect and should be replaced by: "A 95% confidence interval for μ is (3.5, 4.9)."
We can generalize the above as follows. Let z_(α/2) be defined by

    Φ(z_(α/2)) = 1 − (α/2).     (1.18)

That is, the area under the normal curve above z_(α/2) is α/2. Then

    P( −z_(α/2) < (X̄ − μ)/(σ/√n) < z_(α/2) ) = 1 − α.

So a 100(1 − α)% CI for μ is

    ( x̄ − z_(α/2) σ/√n , x̄ + z_(α/2) σ/√n ).     (1.19)

Commonly used values of α are 0.1, 0.05, 0.01.

Confidence intervals for a given parameter are not unique. For example, we have considered a symmetric, two-sided interval, but

    ( x̄ − z_(2α/3) σ/√n , x̄ + z_(α/3) σ/√n )

is also a 100(1 − α)% CI for μ. Likewise, we could have one-sided CIs for μ. For example,

    ( −∞ , x̄ + z_α σ/√n )   or   ( x̄ − z_α σ/√n , ∞ ).

[The second of these arises from considering P( (X̄ − μ)/(σ/√n) < z_α ) = 1 − α.]
We could also have a CI based on, say, the sample median instead of the sample mean. Methods of obtaining confidence intervals must be judged by their various statistical properties. For example, one desirable property is to have the length (or expected length) of a 100(1 − α)% CI as short as possible. Note that for the CI in (1.19), the length is constant for given n.
1.7.1 Pivotal quantity
We will describe a general method of finding a confidence interval for θ from a random sample of size n. It is known as the pivotal method as it depends on finding a pivotal quantity that has 2 characteristics:
(i) It is a function of the sample observations and the unknown parameter θ, say H(X₁, X₂, …, Xₙ; θ), where θ is the only unknown quantity.
(ii) It has a probability distribution that does not depend on θ.
Any probability statement of the form

    P( a < H(X₁, X₂, …, Xₙ; θ) < b ) = 1 − α

will give rise to a probability statement about θ.
Example 1.7
Given X₁, X₂, …, X_n₁ from N(μ₁, σ₁²) and Y₁, Y₂, …, Y_n₂ from N(μ₂, σ₂²), where σ₁², σ₂² are known, find a symmetric 95% CI for μ₁ − μ₂.
Solution: Consider μ₁ − μ₂ (= θ, say) as a single parameter. Then X̄ is distributed N(μ₁, σ₁²/n₁) and Ȳ is distributed N(μ₂, σ₂²/n₂), and further, X̄ and Ȳ are independent. It follows that X̄ − Ȳ is normally distributed, and writing it in standardized form,

    ( X̄ − Ȳ − (μ₁ − μ₂) ) / √( σ₁²/n₁ + σ₂²/n₂ )   is distributed as N(0, 1).

So we have found the pivotal quantity, which is a function of μ₁ − μ₂ but whose distribution does not depend on μ₁ − μ₂. A 95% CI for θ = μ₁ − μ₂ is found by considering

    P( −1.96 < ( X̄ − Ȳ − (μ₁ − μ₂) ) / √( σ₁²/n₁ + σ₂²/n₂ ) < 1.96 ) = .95,

which, on rearrangement, gives the appropriate CI for μ₁ − μ₂. That is,

    ( x̄ − ȳ − 1.96 √( σ₁²/n₁ + σ₂²/n₂ ) , x̄ − ȳ + 1.96 √( σ₁²/n₁ + σ₂²/n₂ ) ).     (1.20)
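A sketch of this calculation in R (added for illustration; the summary values below are hypothetical, not from the notes):

# 95% CI for mu1 - mu2 with known variances, using formula (1.20).
xbar <- 52.1; ybar <- 48.7        # observed sample means (hypothetical)
sigma1sq <- 9;  sigma2sq <- 16    # known population variances (hypothetical)
n1 <- 30; n2 <- 25
se <- sqrt(sigma1sq/n1 + sigma2sq/n2)
CI <- (xbar - ybar) + c(-1, 1) * qnorm(0.975) * se
cat("95% CI for mu1 - mu2: (", round(CI[1], 2), ",", round(CI[2], 2), ")\n")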

Example 1.8
In many problems where we need to estimate proportions, it is reasonable to assume that sampling is from a binomial population, and hence that the problem is to estimate p in the bin(n, p) distribution, where p is unknown. Find a 100(1 − α)% CI for p, making use of the fact that for large sample sizes the binomial distribution can be approximated by the normal.
Solution: Given X is distributed as bin(n, p), an unbiased estimate of p is p̂ = X/n. For n large, X/n is approximately normally distributed. Then

    E(p̂) = E(X)/n = p,

and

    Var(p̂) = Var(X)/n² = np(1 − p)/n² = p(1 − p)/n,

so that

    ( p̂ − p ) / √( p(1 − p)/n )   is distributed approximately N(0, 1).

[Note that we have found the required pivotal quantity whose distribution does not depend on p.]
An approximate 100(1 − α)% CI for p is obtained by considering

    P( −z_(α/2) < ( p̂ − p ) / √( p(1 − p)/n ) < z_(α/2) ) = 1 − α,     (1.21)

where z_(α/2) is defined in (1.18).
Rearranging (1.21), the confidence limits for p are obtained as

    ( 2np̂ + z²_(α/2) ± z_(α/2) √( 4np̂(1 − p̂) + z²_(α/2) ) ) / ( 2(n + z²_(α/2)) ).     (1.22)

A simpler expression can be found by dividing both numerator and denominator of (1.22) by 2n and neglecting terms of order 1/n. That is, a 95% CI for p is

    ( p̂ − 1.96 √( p̂(1 − p̂)/n ) , p̂ + 1.96 √( p̂(1 − p̂)/n ) ).     (1.23)

Note that this is just the expression we would have used if we replaced Var(p̂) = p(1 − p)/n in (1.21) by its estimate, p̂(1 − p̂)/n. In practice, confidence limits for p are generally obtained by means of specially constructed tables, which make it possible to find confidence intervals when n is small.
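The two sets of limits can be compared numerically. The sketch below is an added illustration (not from the notes) with hypothetical counts x and n.

# Compare the quadratic solution (1.22) with the simple approximation (1.23).
x <- 37; n <- 120; phat <- x/n; z <- qnorm(0.975)
lims.1.22 <- (2*n*phat + z^2 + c(-1, 1)*z*sqrt(4*n*phat*(1 - phat) + z^2)) /
             (2*(n + z^2))
lims.1.23 <- phat + c(-1, 1)*z*sqrt(phat*(1 - phat)/n)
cat("(1.22):", round(lims.1.22, 3), "\n")
cat("(1.23):", round(lims.1.23, 3), "\n")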

Example 1.9
Construct an appropriate 90% confidence interval for λ in the Poisson distribution. Evaluate this if a sample of size 30 yields Σ xᵢ = 240.
Solution: Now X̄ is an unbiased estimator of λ for this problem, so λ can be estimated by λ̂ = x̄, with E(λ̂) = λ and Var(λ̂) = Var(X̄) = σ²/n = λ/n. By the Central Limit Theorem, for large n, the distribution of X̄ is approximately normal, so

    ( X̄ − λ ) / √( λ/n )   is distributed approximately N(0, 1).

An approximate 90% CI for λ can be obtained from considering

    P( −1.645 < ( X̄ − λ ) / √( λ/n ) < 1.645 ) = .90.     (1.24)

Rearrangement of the inequality in (1.24) to give an inequality for λ is similar to that in Example 1.8, where it was necessary to solve a quadratic. But, noting the comment following (1.23), replace the variance of X̄ by its estimate λ̂/n = X̄/n, giving for the 90% CI for λ

    ( x̄ − 1.645 √( x̄/n ) , x̄ + 1.645 √( x̄/n ) ),

which on substitution of the observed value 240/30 = 8 for x̄ gives (7.15, 8.85).
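A quick numerical check of this interval in R (an added sketch, not part of the original notes):

# Approximate 90% CI for lambda from Example 1.9.
n <- 30; xbar <- 240/n
CI <- xbar + c(-1, 1) * qnorm(0.95) * sqrt(xbar/n)
cat("approximate 90% CI for lambda: (", round(CI[1], 2), ",", round(CI[2], 2), ")\n")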

1.8 Bayesian estimation

Fundamental results from probability
Some results from STAT260 are presented without elaboration, intended as a revision to provide the framework for Bayesian data analyses.

    P(E|F∩H) P(F|H) = P(E∩F|H)

    P(E) = Σₙ P(E|Hₙ) P(Hₙ)

    P(Hₙ|E) P(E) = P(E∩Hₙ) = P(Hₙ) P(E|Hₙ)     (1.25)

    P(Hₙ|E) ∝ P(Hₙ) P(E|Hₙ)     (1.26)

The result at (1.25) is Bayes' theorem, and in this form it shows how we can invert probabilities, getting P(Hₙ|E) from P(E|Hₙ).
When the Hₙ consist of exclusive and exhaustive events,

    P(Hₙ|E) = P(Hₙ) P(E|Hₙ) / Σₘ P(Hₘ) P(E|Hₘ).

1.8.1

Bayes theorem for random variables


p(y|x) p(y)p(x|y)

The constant of proportionality is

1.8.2

(1.27)

continuous:

1
p(x)

discrete:

1
p(x)

1
p(x|y)p(y)dy
1
=P
y p(x|y)p(y)dy
=R

Post is prior likelihood

Suppose we are interested in the values of k unknown quantities,

    θ = (θ₁, θ₂, …, θₖ),

and a priori beliefs about their values can be expressed in terms of the pdf p(θ). Then we collect data,

    X = (X₁, X₂, …, Xₙ),

which have a probability distribution that depends on θ, expressed as

    p(X|θ).     (1.28)

From (1.28),

    p(θ|X) ∝ p(X|θ) p(θ).

The term p(X|θ) may be considered as a function of X for fixed θ, i.e. a density of X which is parameterized by θ. We can also consider the same term as a function of θ for fixed X, and then it is termed the likelihood function,

    ℓ(θ|X) = p(X|θ).

These are the names given to the terms of (1.28):
p(θ) is the prior,
ℓ(θ|X) is the likelihood,
p(θ|X) is the posterior,
and Bayes' theorem is

    posterior ∝ likelihood × prior.

The function p(·) is not the same in each instance but is a generic symbol to represent the density appropriate for the prior, the density of the data given the parameters, and the posterior. The form of p is understood by considering its arguments, i.e. p(θ), p(x|θ) or p(θ|x).
A diagram depicting the relationships amongst the different densities is shown in Figure 1.2.

[Figure 1.2: Posterior distribution. The posterior p(θ|x) lies between the prior p(θ) and the likelihood p(x|θ).]

The posterior is a combination of the likelihood, where information about θ comes from the data X, and the prior p(θ), where the information is knowledge of θ independent of X. This knowledge may come from previous sampling, say. The posterior represents an update of p(θ) with the new information at hand, i.e. x.
If the likelihood is weak, due to insufficient sampling or a wrong choice of likelihood function, the prior can dominate, so that the posterior is just an adaptation of the prior. Alternatively, if the sample size is large so that the likelihood function is strong, the prior will not have much impact and the Bayesian analysis is much the same as maximum likelihood.
The output is a distribution, p(θ|X), and we may interpret it using summaries such as the median and an interval in which the true value of θ would lie with a certain probability. The interval that we shall use is the Highest Density Region, or HDR. This is the interval for which the density of any point within it is higher than the density of any point outside.
Figure 1.3 depicts a density with shaded areas of 0.9 in 2 cases. In frame (a), observe that there are quantiles outside the interval (1.37, 7.75) for which the density is greater than that of quantiles within the interval. Frame (b) depicts the HDR as (0.94, 6.96).

[Figure 1.3: Comparison of 2 types of regions for the same posterior density p(θ|X): (a) a Confidence Interval (1.37, 7.75), and (b) a Highest Density Region (0.94, 6.94).]
The following example illustrates the principles.


Computer Exercise 1.7
Generate 10 samples of size 10 from a uniform distribution U(0, θ) with θ = 10. Estimate the 90% Highest Density Region (HDR) for θ from your samples using a prior p(θ) ∝ 1/θ.
We recap that the job is to use the simulated data (X ~ U(0, θ), θ = 10) and estimate θ as if it were unknown. Then the estimate is compared with the true value.
Both the previous estimation methods have not been entirely satisfactory:
moment estimation gave too many estimates outside the parameter space;
maximum likelihood estimates were biased because we could only use the maximum value and had no information regarding future samples which might exceed the maximum.
The Bayesian philosophy attempts to address these concerns.
For this example and further work, we require the indicator function

    I_A(x) = 1 if x ∈ A,  0 if x ∉ A.

1.8.3 Likelihood
Denote the joint density of the observations by p(x|θ). For X ~ U(0, θ),

    p(x|θ) = θ⁻ⁿ  for 0 < x < θ,  and 0 otherwise.     (1.29)

The likelihood of the parameter is

    ℓ(θ|x) = θ⁻ⁿ  for θ > x,  and 0 otherwise,

and if M = max(x₁, x₂, …, xₙ),

    ℓ(θ|x) = θ⁻ⁿ  for θ > M,  and 0 otherwise
           = θ⁻ⁿ I_(M,∞)(θ).

1.8.4 Prior
When X ~ U(0, θ), a convenient prior distribution for θ is

    p(θ|α, ξ) ∝ θ^−(α+1)  for θ > ξ,  and 0 otherwise,
    i.e.  p(θ|α, ξ) ∝ θ^−(α+1) I_(ξ,∞)(θ).

This is a Pareto distribution. At this point, just accept that this prior works; later we shall consider how to choose priors. (ξ is the Greek letter pronounced xi.)

1.8.5 Posterior
By Bayes' rule,

    p(θ|x) ∝ p(θ) ℓ(θ|x)
           ∝ θ^−(α+1) I_(ξ,∞)(θ) × θ⁻ⁿ I_(M,∞)(θ)      [prior × likelihood]
           ∝ θ^−(α+n)−1 I_(θ₀,∞)(θ),

where θ₀ = max(M, ξ).
Thus we combine the information gained from the data with our prior beliefs to get a distribution for θ.
In this exercise there is a fixed lower endpoint, which is zero: X ~ U(0, θ). The prior chosen is a Pareto distribution with α = 0 and ξ = 0, so that

    p(θ) = θ⁻¹ I_(ξ,∞)(θ).

This is chosen so that the prior does not change very much over the region in which the likelihood is appreciable and does not take on large values outside that region. It is said to be locally uniform. We defer the theory about this; for now you may just accept that it is appropriate for this exercise.
The posterior density, p(θ|X), is

    p(θ|X) ∝ θ^−(n+1) I_(θ₀,∞)(θ).
The HDR will be as in Figure 1.4.

[Figure 1.4: The 90% HDR for p(θ|X).]

The lower end-point is M = max(x1 , x2 , . . . , xn ). This is the MLE and in that setting,
there was no other information that we could use to address the point that M ; M

CHAPTER 1. ESTIMATION

33

had to do the job but we were aware that it was very possible that the true value of was
greater than the maximum value of the sample.
The upper end-point is found from the distribution function. We require such that
Z
p(|X)d = 0.9
0


M n
1( )
= 0.9

M
=
1
0.1 n
Likewise, we can compute the median of the posterior distribution,
Q0.5 =

M
1

0.5 n

The following R program was used to estimate the median and HDR of p(θ|X).
#_________ UniformBayes.R _____
theta <- 10
sampsz <- 10
nsimulations <- 10
for (i in 1:nsimulations){
  xi <- max(runif(n=sampsz, min=0, max=theta))
  Q0.9 <- xi/(0.1^(1/sampsz))
  Q0.5 <- xi/(0.5^(1/sampsz))
  cat("simulation", i, "median =", round(Q0.5, 2),
      "90% HDR = (", round(xi, 2), round(Q0.9, 2), ")\n")
}

simulation  1  median =  10.65  90% HDR = ( 9.94 12.51 )
simulation  2  median =  10.09  90% HDR = ( 9.42 11.85 )
simulation  3  median =   8.92  90% HDR = ( 8.32 10.48 )
simulation  4  median =  10.64  90% HDR = ( 9.93 12.5 )
simulation  5  median =   9.86  90% HDR = ( 9.2 11.59 )
simulation  6  median =   9.88  90% HDR = ( 9.22 11.61 )
simulation  7  median =   8.4   90% HDR = ( 7.84 9.87 )
simulation  8  median =   8.66  90% HDR = ( 8.08 10.18 )
simulation  9  median =  10.41  90% HDR = ( 9.71 12.22 )
simulation 10  median =   9.19  90% HDR = ( 8.57 10.79 )

1.9 Normal Prior and Likelihood
This section is included to demonstrate the process for modelling the posterior distribution of parameters, and the notes shall refer to it in an example in Chapter 2.

    x ~ N(θ, σ²)
    θ ~ N(θ₀, σ₀²)

    p(x|θ) = (1/√(2πσ²)) exp( −(x − θ)²/(2σ²) )
    p(θ)   = (1/√(2πσ₀²)) exp( −(θ − θ₀)²/(2σ₀²) )

    p(θ|x) ∝ p(x|θ) p(θ)     (1.30)
           ∝ exp( −(x − θ)²/(2σ²) ) exp( −(θ − θ₀)²/(2σ₀²) )
           ∝ exp( −(θ²/2)(1/σ₀² + 1/σ²) + θ(θ₀/σ₀² + x/σ²) )     (1.31)

Define the precision as the reciprocal of the variance,

    τ = 1/σ²,   τ₀ = 1/σ₀².

Addressing (1.31), put

    τ₁ = τ + τ₀     (1.32)
    θ₁ = ( θ₀/σ₀² + x/σ² ) / ( 1/σ₀² + 1/σ² )     (1.33)
       = ( θ₀τ₀ + xτ ) / ( τ₀ + τ ).     (1.34)

Equation (1.32) states

    posterior precision = datum precision + prior precision,   or   1/σ₁² = 1/σ² + 1/σ₀².

Then (1.31) can be expressed as

    p(θ|x) ∝ exp( −½ τ₁ θ² + τ₁ θ₁ θ ).

Add into the exponent the term −½ τ₁ θ₁², which is a constant as far as θ is concerned. Then

    p(θ|x) ∝ exp( −½ τ₁ (θ − θ₁)² )
           = (2πσ₁²)^(−1/2) exp( −½ ((θ − θ₁)/σ₁)² ).

The last result, containing the normalising constant (2πσ₁²)^(−1/2), comes from ∫ p(θ|x) dθ = 1.
Thus the posterior density is θ|x ~ N(θ₁, σ₁²), where

    1/σ₁² = 1/σ₀² + 1/σ²
    θ₁ = σ₁² ( θ₀/σ₀² + x/σ² ) = ( θ₀τ₀ + xτ ) / ( τ₀ + τ ).

The posterior mean is a weighted mean of the prior mean and the datum value; the weights are proportional to their respective precisions.
Example 1.10
Suppose that θ ~ N(370, 20²) and that x|θ ~ N(θ, 8²), with x = 421 observed. What is p(θ|x)?

    σ₁² = ( 1/20² + 1/8² )⁻¹ ≈ 55

    θ₁ = 55 ( 370/20² + 421/8² ) ≈ 413

    θ|x ~ N(413, 55)
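The same calculation can be done in R. This sketch (added for illustration, not part of the original notes) follows the rounding used in the worked example.

# Example 1.10 computed numerically.
prior.mean <- 370; prior.var <- 20^2
x <- 421;          datum.var <- 8^2
post.prec <- 1/prior.var + 1/datum.var            # precisions add, as in (1.32)
post.var  <- round(1/post.prec)                   # ~55, rounded as in the worked example
post.mean <- post.var * (prior.mean/prior.var + x/datum.var)
cat("posterior: N(", round(post.mean), ",", post.var, ")\n")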

1.10 Bootstrap Confidence Intervals
The Bootstrap is a Monte Carlo method which uses (computer) simulation in lieu of mathematical theory. It is not necessarily simpler. Exercises with the bootstrap are mostly numerical, although the underlying theory follows much of the analytical methods.

1.10.1 The empirical cumulative distribution function
An important tool in non-parametric statistics is the empirical cumulative distribution function (acronym ecdf), which uses the ordered data as quantiles; the probabilities are steps of 1/(n+1).
We have used the word empirical for this plot because it uses only the information in the sample. The values for cumulative area that are associated with each datum are determined by the following argument.
The sample as collected is denoted by x1 , x2 , . . . , xn . The subscript represents the
position in the list or row in the data file. Bracketed subscripts denote ordering of the
data, where x(1) is the smallest, x(2) is the second smallest, x(n) is the largest. In general
xi is not the same datum x(i) but of course this correspondence could happen.
The n sample points are considered to divide the sampling interval into (n + 1) subintervals,
(0, x(1) ), (x(1) , x(2) ), (x(2) , x(3) ), . . . , (x(n1) , x(n) ), (x(n) , )
The total area under the density curve (area=1) has been subdivided into (n + 1) sub1
regions with individual areas approximated as (n+1)
. The values of the cumulative area
under the density curve is then approximated as:Interval
Cumulative area

(0, x1 )

(x1 , x2 )

0
(n+1)

1
(n+1)

...
...

(xn , )
1

(x(n1) , xn )
n
(n+1)

A diagram of this is shown in Figure 1.5


Figure 1.5: The ecdf is a step function with step size

1
(n+1)

between data points.

F (x)
1
n
n+1

..
.
4
n+1
3
n+1
2
n+1
1
n+1

x(1)

x(2)

x(3)x(4)

x(5)

...

x(n1)

If there are tied data, say k of them, the step size is

x(n)

k
.
(n+1)
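As a small illustration of these plotting positions (a sketch with made-up numbers, not data from the notes):

x <- c(5.1, 3.2, 7.4, 6.0, 4.8)                            # hypothetical sample
n <- length(x)
data.frame(ordered = sort(x), cum.area = (1:n)/(n + 1))    # cumulative areas in steps of 1/(n+1)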


In R the required function is ecdf().


The following data are the numbers of typhoons in the North Pacific Ocean over 88 years; assume that they are saved in a file called TYPHOON.txt.

13  7 14 20 13 12 12 15 20 17 11 14 16 12 17 17
16  7 14 15 16 20 17 20 15 22 26 25 27 18 23 26 18 15 20 24
19 25 23 20 24 20 16 21 20 18 20 18 24 27 27 21 21 22 28 38
39 27 26 32 19 33 23 38 30 30 27 25 33 34 16 17 22 17 26 21
30 30 31 27 43 40 28 31 24 15 22 31

A plot of the ecdf, shown in Figure 1.6, is generated with the following R code:

typhoons <- scan("TYPHOON.txt")
plot(ecdf(typhoons), las=1)

Figure 1.6: An empirical distribution function for the typhoon data.

Empirical cumulative distributions are used when calculating bootstrap probabilities.


Example
Suppose that in Example 1.9 the data were

 8  6  5 10  8 12
 9  9  8 11  7  3
 6  7  5  8 10  7
 8  8 10  8  5 10
 8  6 10  6  8 14

Denote this sample by x₁, x₂, . . . , xₙ where n = 30. The summary statistics are

\[ \sum_{i=1}^{n} x_i = 240, \qquad \bar{x} = 8. \]

We shall use this example to illustrate (a) resampling, and (b) the bootstrap distribution.
The sample x₁, x₂, . . . , xₙ are independently and identically distributed (i.i.d.) as Poisson(λ), which means that each observation is as important as any other for providing information about the population from which this sample is drawn. That implies we can replace any number by one of the others and the new sample will still convey the same information about the population.
This is demonstrated in Figure 1.7. Three new samples have been generated by taking samples of size n = 30 with replacement from x. The ecdf of x is shown in bold and

the ecdfs of the new samples are shown with different line types. There is little change
in the empirical distributions or estimates of quantiles. If a statistic (e.g. a quantile) were
estimated from this process a large number of times, it would be a reliable estimate of the
population parameter. The new samples are termed bootstrap samples.
Figure 1.7: Resampling with replacement from original sample.
This is the bootstrap procedure for the CI for λ in the current example.
1. Nominate the number of bootstrap samples that will be drawn, e.g. nBS = 99.
2. Sample with replacement from x a bootstrap sample of size n, x*₁.
3. For each bootstrap sample, calculate the statistic of interest, λ̂*₁.
4. Repeat steps 2 and 3 nBS times.
5. Use the empirical cumulative distribution function of λ̂*₁, λ̂*₂, . . . , λ̂*_nBS to get the Confidence Interval.
This is shown in Figure 1.8.

Figure 1.8: Deriving the 95% CI from the ecdf of bootstrap estimates of the mean.

The bootstrap estimate of the 95% CI for λ is (7.22, 8.73). Note that although there is a great deal of statistical theory underpinning this (the ecdf, i.i.d., a thing called order statistics etc.), there is no theoretical formula for the CI and it is determined numerically from the sample.
This is R code to generate the graph in Figure 1.8.
x <- c(8,6,5,10,8,12,9,9,8,11,7,3,6,7,5,8,10,7,8,8,10,8,5,10,8,6,10,6,8,14)
n <- length(x)
nBS <- 99                    # number of bootstrap simulations
BS.mean <- numeric(nBS)
i <- 1
while (i < (nBS+1) ){
  BS.mean[i] <- mean(sample(x, replace=TRUE, size=n))
  i <- i + 1
}                            # end of the while() loop
Quantiles <- quantile(BS.mean, p = c(0.025,0.975))
cat(" 95% CI = ", Quantiles, "\n")
plot(ecdf(BS.mean), las=1)

The boot package in R has functions for bootstrapping. The following code uses that
to get the same CI as above,
library(boot)
mnz <- function(z,id){mean(z[id])}
# user must supply this
bs.samples <- boot(data=x,statistic=mnz,R=99)
boot.ci(bs.samples,conf=0.95,type=c("perc","bca"))
BOOTSTRAP CONFIDENCE INTERVAL CALCULATIONS
Based on 99 bootstrap replicates
CALL :
boot.ci(boot.out = bs.samples, conf = 0.95)
Intervals :
Level       Percentile              BCa
95%    ( 7.206,  8.882 )    ( 7.106,  8.751 )

The user must supply a function (e.g. mnz here) that computes the statistic from a bootstrap sample. The argument id is supplied by boot() as a vector of resampled indices from 1:length(z), so that the function can compute the statistic for each bootstrap sample.

Chapter 2

Hypothesis Testing

2.1  Introduction

Consider the following problems:


(i) An engineer has to decide on the basis of sample data whether the true average
lifetime of a certain kind of tyre is at least 22000 kilometres.
(ii) An agronomist has to decide on the basis of experiments whether fertilizer A produces
a higher yield of soybeans than fertilizer B.
(iii) A manufacturer of pharmaceutical products has to decide on the basis of samples
whether 90% of all patients given a new medication will recover from a certain disease.
These problems can be translated into the language of statistical tests of hypotheses.
(i) The engineer has to test the assertion that if the lifetime of the tyre has pdf f(x) = λe^(−λx), x > 0, then the expected lifetime, 1/λ, is at least 22000.
(ii) The agronomist has to decide whether μ_A > μ_B, where μ_A and μ_B are the means of two normal distributions.
(iii) The manufacturer has to decide whether p, the parameter of a binomial distribution
is equal to .9.
In each case, it is assumed that the stated distribution correctly describes the experimental conditions, and that the hypothesis concerns the parameter(s) of that distribution. [A more general kind of hypothesis testing problem is where the form of the
distribution is unknown.]
In many ways, the formal procedure for hypothesis testing is similar to the scientific
method. The scientist formulates a theory, and then tests this theory against observation.
In our context, the scientist poses a theory concerning the value of a parameter. He then
samples the population and compares observation with theory. If the observations disagree
strongly enough with the theory the scientist would probably reject his hypothesis. If not,
the scientist concludes either that the theory is probably correct or that the sample he


considered did not detect the difference between the actual and hypothesized values of the
parameter.
Before putting hypothesis testing on a more formal basis, let us consider the following
questions. What is the role of statistics in testing hypotheses? How do we decide whether
the sample value disagrees with the scientist's hypothesis? When should we reject the
hypothesis and when should we withhold judgement? What is the probability that we
will make the wrong decision? What function of the sample measurements should be used
to reach a decision? Answers to these questions form the basis of a study of statistical
hypothesis testing.

2.2  Terminology and Notation

2.2.1  Hypotheses

A statistical hypothesis is an assertion or conjecture about the distribution of a random variable. We assume that the form of the distribution is known, so the hypothesis is a statement about the value of a parameter of a distribution.
Let X be a random variable with distribution function F(x; θ), where θ ∈ Ω. That is, Ω is the set of all possible values θ can take, and is called the parameter space. For example, for the binomial distribution, Ω = {p : p ∈ (0, 1)}. Let ω be a subset of Ω. Then a statement such as θ ∈ ω is a statistical hypothesis and is denoted by H₀. Also, the statement θ ∈ ω̄ (where ω̄ is the complement of ω with respect to Ω) is called the alternative to H₀ and is denoted by H₁. We write

H₀: θ ∈ ω   and   H₁: θ ∈ ω̄ (or θ ∉ ω).
Often hypotheses arise in the form of a claim that a new product, technique, etc. is
better than the existing one. In this context, H is a statement that nullifies the claim (or
represents the status quo) and is sometimes called a null hypothesis, but we will refer to
it as the hypothesis.
If ω contains only one point, that is, if ω = {θ : θ = θ₀}, then H₀ is called a simple hypothesis. We may write H₀: θ = θ₀. Otherwise it is called composite. The same applies to alternatives.

2.2.2  Tests of Hypotheses

A test of a statistical hypothesis is a procedure for deciding whether to accept or reject


the hypothesis. If we use the term accept it is with reservation, because it implies stronger
action than is really warranted. Alternative phrases such as reserve judgement, fail to
reject perhaps convey the meaning better. A test is a rule, or decision function, based
on a sample from the given distribution which divides the sample space into 2 regions,
commonly called

(i) the rejection region (or critical region), denoted by R;
(ii) the acceptance region (or region of indecision), denoted by R̄ (the complement of R).
If we compare two different ways of partitioning the sample space then we say we are comparing two tests (of the same hypothesis). For a sample of size n, the sample space is of course n-dimensional and, rather than consider R as a subset of n-space, it's helpful to realize that we'll condense the information in the sample by using a statistic (for example x̄), and consider the rejection region in terms of the range space of the random variable X̄.

2.2.3  Size and Power of Tests

There are two types of errors that can occur. If we reject H₀ when it is true, we commit a Type I error. If we fail to reject H₀ when it is false, we commit a Type II error. You may like to think of this in tabular form.

                                      Our decision
                           do not reject H₀       reject H₀
Actual     H₀ is true      correct decision       Type I error
situation  H₀ is not true  Type II error          correct decision

Probabilities associated with the two incorrect decisions are denoted by

α = P(H₀ is rejected when it is true) = P(Type I error)          (2.1)
β = P(H₀ is not rejected when it is false) = P(Type II error)    (2.2)

The probability α is sometimes referred to as the size of the critical region or the significance level of the test, and the probability 1 − β as the power of the test.
The roles played by H0 and H1 are not at all symmetric. From consideration of potential
losses due to wrong decisions, the decision-maker is somewhat conservative for holding the
hypothesis as true unless there is overwhelming evidence from the data that it is false. He
believes that the consequence of wrongly rejecting H is much more severe to him than of
wrongly accepting it.
For example, suppose a pharmaceutical company is considering the marketing of a
newly developed drug for treatment of a disease for which the best available drug on the
market has a cure rate of 80%. On the basis of limited experimentation, the research
division claims that the new drug is more effective. If in fact it fails to be more effective,
or if it has harmful side-effects, the loss sustained by the company due to the existing drug
becoming obsolete, decline of the company's image, etc., may be quite severe. On the other
hand, failure to market a better product may not be considered as severe a loss. In this
problem it would be appropriate to consider H0 : p = .8 and H1 : p > .8. Note that H0
is simple and H1 is composite.

Ideally, when devising a test, we should look for a decision function which makes the probabilities of Type I and Type II errors as small as possible but, as will be seen in a later example, these depend on one another. For a given sample size, altering the decision rule to decrease one error results in the other being increased. So, recalling that the Type I error is more serious, a possible procedure is to hold α fixed at a suitable level (say α = .05 or .01) and then look for a decision function which minimizes β. The first solution for this was given by Neyman and Pearson for a simple hypothesis versus a simple alternative. It's often referred to as the Neyman-Pearson fundamental lemma. While the formulation of a general theory of hypothesis testing is beyond the scope of this unit, the following examples illustrate the concepts introduced above.

2.3  Examples

Example 2.1
Suppose that the random variable X has a normal distribution with mean μ and variance 4. Test the hypothesis that μ = 1 against the alternative that μ = 2, based on a sample of size 25.
Solution: An unbiased estimate of μ is X̄, and we know that X̄ is distributed normally with mean μ and variance σ²/n, which in this example is 4/25. We note that values of x̄ close to 1 support H₀ whereas values of x̄ close to 2 support H₁. We could make up a decision rule as follows:
If x̄ > 1.6, claim that μ = 2;
if x̄ ≤ 1.6, claim that μ = 1.
The diagram in Figure 2.1 shows the sample space of x̄ partitioned into
(i) the critical region, R = {x̄ : x̄ > 1.6};
(ii) the acceptance region, R̄ = {x̄ : x̄ ≤ 1.6}.
Here, 1.6 is the critical value of x̄.
We will find the probabilities of Type I and Type II error:

α = P(X̄ > 1.6 | μ = 1, σ_X̄ = 2/5) = .0668      (pnorm(q=1.6,mean=1,sd=0.4,lower.tail=F))

This is P(H₀ is rejected | H₀ is true) = P(Type I error) = α. Also

β = P(Type II error) = P(H₀ is not rejected | H₀ is false)
  = P(X̄ ≤ 1.6 | μ = 2, σ_X̄ = 2/5)
  = .1587      (pnorm(q=1.6,mean=2,sd=0.4,lower.tail=T))
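Both error probabilities can be collected in one short sketch (just the two pnorm() calls quoted above):

alpha <- pnorm(q=1.6, mean=1, sd=0.4, lower.tail=FALSE)   # P(reject H0 | mu = 1) = 0.0668
beta  <- pnorm(q=1.6, mean=2, sd=0.4)                     # P(do not reject H0 | mu = 2) = 0.1587
c(alpha = alpha, beta = beta)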

Figure 2.1: Critical Region Upper Tail (densities of X̄ for mean = 1 and mean = 2; the critical region is x̄ > 1.6).

To see how the decision rule could be altered so that α = .05, let the critical value be c. We require

P(X̄ > c | μ = 1, σ_X̄ = 2/5) = 0.05,  giving  c = 1.658      (qnorm(p=0.05,mean=1,sd=0.4,lower.tail=F))

Then

P(X̄ < c | μ = 2, σ_X̄ = 2/5) = 0.196      (pnorm(q=1.658,mean=2,sd=0.4,lower.tail=T))

This value of c gives an α of 0.05 and a β of 0.196, illustrating that as one type of error (α) decreases, the other (β) increases.

Example 2.2
Suppose we have a random sample of size n from a N(μ, 4) distribution and wish to test H₀: μ = 10 against H₁: μ = 8. The decision rule is to reject H₀ if x̄ < c. We wish to find n and c so that α = 0.05 and β ≈ 0.1.
Solution: In Figure 2.2 below, the left curve is f(x̄|H₁) and the right curve is f(x̄|H₀). The critical region is {x̄ : x̄ < c}, so α is the left shaded area and β is the right shaded area.
Figure 2.2: Critical Region Lower Tail (curves for mean = 8 and mean = 10; the critical region lies below the critical value c).

Now

α = 0.05 = P(X̄ < c | μ = 10, σ_X̄ = 2/√n)    (2.3)
β = 0.1  = P(X̄ ≥ c | μ = 8,  σ_X̄ = 2/√n)    (2.4)

We need to solve (2.3) and (2.4) simultaneously for n as shown in Figure 2.3

Figure 2.3: Solution for size and power of test.

The R code for the above diagram is:

n <- 3:12
alpha <- 0.05
beta <- 0.1
Acrit <- qnorm(mean=10, sd=2/sqrt(n), p=alpha)
Bcrit <- qnorm(mean=8, sd=2/sqrt(n), p=beta, lower.tail=FALSE)
plot(Acrit ~ n, type="l", xlab="sample size", ylab="Critical value", las=1, ylim=c(7,10), lwd=2)
lines(n, Bcrit, lty=2, lwd=2)

A sample size n = 9 and critical value c = 8.9 gives α ≈ 0.05 and β ≈ 0.1.
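The same n and c can also be found algebraically from (2.3) and (2.4); a sketch of that calculation using standard normal quantiles is

z.alpha <- qnorm(0.95)                                   # 1.645, for alpha = 0.05
z.beta  <- qnorm(0.90)                                   # 1.282, for beta = 0.10
n <- ceiling( ((z.alpha + z.beta) * 2 / (10 - 8))^2 )    # smallest n meeting both error rates
crit <- 10 - z.alpha * 2/sqrt(n)
c(n = n, critical.value = round(crit, 1))                # n = 9, critical value about 8.9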

2.4  One-sided and Two-sided Tests

Consider the problem where the random variable X has a binomial distribution with P(Success) = p. How do we test the hypothesis p = 0.5? Firstly, note that we have an experiment where the outcome on an individual trial is success or failure with probabilities p and q respectively. Let us repeat the experiment n times and observe the number of successes.
Before continuing with this example it is useful to note that in most hypothesis testing problems we will deal with, H₀ is simple, but H₁, on the other hand, is composite, indicating that the parameter can assume a range of values. Examples 2.1 and 2.2 were more straightforward in the sense that H₁ was simple also.
If the range of possible parameter values lies entirely on one side of the hypothesized value, the alternative is said to be one-sided. For example, H₁: p > .5 is one-sided but H₁: p ≠ .5 is two-sided. In a real-life problem, the decision of whether to make the alternative one-sided or two-sided is not always clear cut. As a general rule-of-thumb, if
parameter values in only one direction are physically meaningful, or are the only ones that
are possible, the alternative should be one-sided. Otherwise, H1 should be two-sided. Not
all statisticians would agree with this rule.
The next question is what test statistic we use to base our decision on. In the above
problem, since X/n is an unbiased estimator of p, that would be a possibility. We could even
use X itself. In fact the latter is more suitable since its distribution is known. Recall that,
the principle of hypothesis testing is that we will assume H0 is correct, and our position will
change only if the data show beyond all reasonable doubt that H1 is true. The problem
then is to define in quantitative terms what reasonable doubt means. Let us suppose that
n = 18 in our problem above. Then the range space for X is RX = {0, 1, . . . , 18} and
E(X)=np= 9 if H0 is true. If the observed number of successes is close to 9 we would be
obliged to think that H was true. On the other hand, if the observed value of X was 0
or 18 we would be fairly sure that H0 was not true. Now reasonable doubt does not
have to be as extreme as 18 cases out of 18. Somewhere between x-values of 9 and 18 (or
9 and 0), there is a point, c say, when for all practical purposes the credulity of H0 ends
and reasonable doubt begins. This point is called the critical value and it completely
determines the decision-making process. We could make up a decision rule
If x ≥ c, reject H₀.
If x < c, conclude that H₀ is probably correct.      (2.6)

In this case, {x : x ≥ c} is the rejection region, R, referred to in Section 2.2.


We will consider appropriate tests for both one- and two-sided alternatives in the problem above.

2.4.1  Case (a): Alternative is one-sided

In the above problem, suppose that the alternative is H₁: p > .5. Only values of x much larger than 9 would support this alternative, and a decision rule such as (2.6) would be appropriate. The actual value of c is chosen to make α, the size of the critical region, suitably small. For example, if c = 11, then P(X ≥ 11) = .24 and this of course is too large. Clearly we should look for a value closer to 18. If c = 15,

\[ P(X \ge 15) = \sum_{x=15}^{18}\binom{18}{x}(.5)^{18} = 0.004 \]

on calculation. We may now have gone too far in the other extreme. Requiring 15 or more successes out of 18 before we reject H₀: p = 0.5 means that only 4 times in a thousand would we reject H₀ wrongly. Over the years, a reasonable consensus has been reached as to how much evidence against H₀ is enough evidence. In many situations we define the beginning of reasonable doubt as the value of the test statistic that is equalled or exceeded by chance 5% of the time when H₀ is true. According to this criterion, c should be chosen so that P(X ≥ c | H₀ is true) = 0.05. That is, c should satisfy

\[ P(X \ge c \mid p = 0.5) = 0.05 = \sum_{x=c}^{18}\binom{18}{x}(0.5)^{18}. \]

A little trial and error shows that c = 13 is the appropriate value. Of course, because of the discrete nature of X it will not be possible to obtain an α of exactly 0.05.
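These tail probabilities are easy to check in R (a small sketch of the trial and error above):

# P(X >= 11), P(X >= 13) and P(X >= 15) for X ~ Bin(18, 0.5)
pbinom(c(10, 12, 14), size = 18, prob = 0.5, lower.tail = FALSE)   # 0.240, 0.048, 0.004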
Defining the critical region in terms of the x-value that is exceeded only 5% of the
time when H0 is true is the most common way to quantify reasonable doubt, but there are
others. The figure 1% is frequently used and if the critical value is exceeded only 1% of the
time we say there is strong evidence against H0 . If the critical value is only exceeded
.1% of the time we may say that there is very strong evidence against H0 .
So far we have considered a one-sided alternative. Now we'll consider the other case where the alternative is two-sided.

2.4.2  Case (b): Two-sided Alternative

Consider now the alternative H₁: p ≠ 0.5. Values of x too large or too small would support this alternative. In this case there are two critical regions (or, more correctly, the critical region consists of two disjoint sets), one in each tail of the distribution of X. For a 5% critical region, there would be two critical values c₁ and c₂ such that
P(X ≤ c₁ | H₀ is true) ≈ 0.025 and P(X ≥ c₂ | H₀ is true) ≈ 0.025.
This can be seen in Figure 2.4 below, where the graph is of the distribution of X when H0
is true. (It can be shown that c1 = 4 and c2 = 14 are the critical values in this case.)
Tests with a one-sided critical region are called one-tailed tests, whereas those with
a two-sided critical region are called two-tailed tests.

Figure 2.4: Critical Region, Two-sided Alternative: the distribution of X when H₀ is true, with critical values c₁ and c₂ marking the two tails.

Computer Exercise 2.1 Use a simulation approach to estimate a value for c in (2.6)
above.
Solution: Use the commands
#Generate 1000 random variables from a bin(18,0.5) distribution.
rb <- rbinom(n=1000, size=18, prob=0.5)
table(rb)    #Tabulate the results

rb
  1   2   3   4   5   6   7   8   9  10  11  12  13  14  15
  1   1   3  11  28  77 125 174 187 166 126  63  22  11   5

Figure 2.5: Empirical cumulative distribution function for binomial rvs


This would indicate the one-sided critical value should be c = 13, as the estimate of P(X ≥ 13) is 0.038. For a two-sided test the estimated critical values are c₁ = 4 and c₂ = 13.
These results from simulation are in close agreement with the theoretical results obtained in 2.4.1 and 2.4.2.

2.4.3  Two Approaches to Hypothesis Testing

It is worthwhile considering a definite procedure for hypothesis testing problems. There


are two possible approaches.
(i) See how the observed value of the statistic compares with that expected if H0 is true.
Find the probability, assuming H0 to be true, of this event or others more extreme,
that is, further still from the expected value. For a two-tailed test this will involve
considering extreme values in either direction. If this probability is small (say, <
0.05), the event is an unlikely one if H0 is true. So if such an event has occurred,
doubt would be cast on the hypothesis.
(ii) Make up a decision rule by partitioning the sample space of the statistic into a critical region, R, and its complement R̄, choosing the critical value c (or two critical values in the case of a two-tailed test) in such a way that α = 0.05. We then note whether
or not the observed value lies in this critical region, and draw the corresponding
conclusion.
Example 2.3
Suppose we want to know whether a given die is biased towards 5 or 6 or whether it is
true. To examine this problem the die is tossed 9000 times and it is observed that on
3164 occasions the outcome was 5 or 6.
Solution:Let X be the number of successes (5s or 6s) in 9000 trials. Then if p = P (S),
X is distributed bin(9000,p). As is usual in hypothesis testing problems, we set up H0
as the hypothesis we wish to disprove. In this case, it is that the die is true, that is,
p = 1/3. If H0 is not true, the alternative we wish to claim is that the die is biased towards
5 or 6, that is p > 1/3. In practice, one decides on this alternative before the experiment
is carried out. We will consider the 2 approaches mentioned above.
Approach (i), probabilities
If p = 1/3 and N = 9000 then E(X) = np = 3000 and Var(X) = npq = 2000. The
observed number of successes, 3164, was greater than expected if H0 were true. So,
assuming p = 1/3, the probability of the observed event together with others more extreme (that is, further still from expectation) is
P(X ≥ 3164 | p = 1/3) = 0.0001      (pbinom(q=3163,size=9000,prob=1/3,lower.tail=F))

This probability is small, so the event X ≥ 3164 is an unlikely one if the assumption we've made (p = 1/3) is correct. Only about 1 time in 10000 would we expect such an occurrence. Hence, if such an event did occur, we'd doubt the hypothesis and conclude that there is evidence that p > 1/3.
Approach (ii), quantiles
Clearly, large values of X support H₁, so we'd want a critical region of the form x ≥ c, where c is chosen to give the desired significance level, α. That is, for α = 0.05, say, the upper-tail 5% quantile of the binomial distribution with p = 1/3 and N = 9000 is 3074.
(qbinom(p=0.05,size=9000,prob=1/3,lower.tail=F))
The observed value 3164 exceeds this and thus lies in the critical region [c, ∞). So we reject H₀ at the 5% significance level. That is, we will come to the conclusion that p > 1/3, but in so doing, we'll recognize the fact that the probability could be as large as 0.05 that we've rejected H₀ wrongly.
The two methods are really the same thing. Figure 2.6 shows the distribution function for Bin(9000, 1/3) with the observed quantile 3164 and, associated with it, P(X > 3164). The dashed lines show the upper α = 0.05 probability and the quantile C₁. The event X > C₁ has probability p < α.
The rejection region can be defined either by the probabilities or by the quantiles.
Figure 2.6: Using either quantiles or probability to test the null hypothesis.

In doing this sort of problem it helps to draw a diagram, or at least try to visualize the partitioning of the sample space as suggested in Figure 2.7. If x ∈ R it seems much more likely that the actual distribution of X is given by a curve similar to the one on the right-hand side, with mean somewhat greater than 3000.

Figure 2.7: One-Sided Alternative, Binomial: distribution functions under H₀: p = 1/3 and H₁: p > 1/3.

Computer Exercise 2.2
The following random sample was drawn from a normal distribution with σ = 5. Test the hypothesis that μ = 23.

18 14 23 23 18
21 22 16 21 28
12 19 22 15 18
28 24 22 18 13
18 16 24 26 35

Solution:
x <- c(18,14,23,23,18,21,22,16,21,28,12,19,22,15,18,28,24,22,18,13,18,16,24,26,35)
xbar <- mean(x)
n <- length(x)
> xbar
[1] 20.56
pnorm(q=xbar, mean=23,sd=5/sqrt(n))
[1] 0.007
qnorm(p=0.05,mean=23,sd=5/sqrt(n))
[1] 21
We can now use approach (i). For a two-sided alternative the calculated probability is P = 0.015 (= 2 × 0.0073), so the hypothesis is unlikely to be true.
For approach (ii) with α = 0.05 the critical value is 21. The conclusion reached would therefore be the same by both approaches.
For testing μ = 23 against the one-sided alternative μ < 23, P = 0.0073.

2.5  Two-Sample Problems

In this section we will consider problems involving sampling from two populations where
the hypothesis is a statement of equality of two parameters. The two problems are:
(i) Test H₀: μ₁ = μ₂, where μ₁ and μ₂ are the means of two normal populations.
(ii) Test H₀: p₁ = p₂, where p₁ and p₂ are the parameters of two binomial populations.
Example 2.4
Given independent random samples X₁, X₂, . . . , X_{n₁} from a normal population with unknown mean μ₁ and known variance σ₁², and Y₁, Y₂, . . . , Y_{n₂} from a normal population with unknown mean μ₂ and known variance σ₂², derive a test for the hypothesis H₀: μ₁ = μ₂ against one-sided and two-sided alternatives.
Solution: Note that the hypothesis can be written as H₀: μ₁ − μ₂ = 0. An unbiased estimator of μ₁ − μ₂ is X̄ − Ȳ, so this will be used as the test statistic. Its distribution is given by

\[ \bar X - \bar Y \sim N\left( \mu_1 - \mu_2,\ \frac{\sigma_1^2}{n_1} + \frac{\sigma_2^2}{n_2} \right) \]

or, in standardized form, if H₀ is true,

\[ \frac{\bar X - \bar Y}{\sqrt{(\sigma_1^2/n_1) + (\sigma_2^2/n_2)}} \sim N(0, 1). \]

For a two-tailed test (corresponding to H₁: μ₁ − μ₂ ≠ 0) we have a rejection region of the form

\[ \frac{|\bar x - \bar y|}{\sqrt{(\sigma_1^2/n_1) + (\sigma_2^2/n_2)}} > c   (2.7) \]

where c = 1.96 for α = .05, c = 2.58 for α = .01, etc.
For a one-tailed test we have a rejection region

\[ \frac{\bar x - \bar y}{\sqrt{(\sigma_1^2/n_1) + (\sigma_2^2/n_2)}} > c \quad\text{for } H_1: \mu_1 - \mu_2 > 0   (2.8) \]

\[ \frac{\bar x - \bar y}{\sqrt{(\sigma_1^2/n_1) + (\sigma_2^2/n_2)}} < -c \quad\text{for } H_1: \mu_1 - \mu_2 < 0   (2.9) \]

where c = 1.645 for α = .05, c = 2.326 for α = .01, etc. Can you see what modification to make to the above rejection regions for testing H₀: μ₁ − μ₂ = δ, for some specified constant δ other than zero?
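A minimal sketch of this two-sample z test in R, with made-up data and assumed known variances (none of these numbers come from the notes):

two.sample.z <- function(x, y, var1, var2) {
  # z statistic from the standardized form above
  z <- (mean(x) - mean(y)) / sqrt(var1/length(x) + var2/length(y))
  c(z = z, P.two.sided = 2 * pnorm(abs(z), lower.tail = FALSE))
}
set.seed(1)
x <- rnorm(20, mean = 10, sd = 2)   # hypothetical sample 1, sigma1^2 = 4
y <- rnorm(25, mean =  9, sd = 3)   # hypothetical sample 2, sigma2^2 = 9
two.sample.z(x, y, var1 = 4, var2 = 9)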
Example 2.5
Suppose that n₁ Bernoulli trials where P(S) = p₁ resulted in X successes and that n₂ Bernoulli trials where P(S) = p₂ resulted in Y successes. How do we test H₀: p₁ = p₂ (= p, say)?

Solution: Note that H₀ can be written H₀: p₁ − p₂ = 0. Now X is distributed as bin(n₁, p₁) and Y is distributed as bin(n₂, p₂), and we have seen earlier that unbiased estimates of p₁ and p₂ are, respectively, p̂₁ = x/n₁ and p̂₂ = y/n₂, so an appropriate statistic to use to estimate p₁ − p₂ is X/n₁ − Y/n₂.
For n₁, n₂ large, we can use the Central Limit Theorem to observe that

\[ \frac{\frac{X}{n_1} - \frac{Y}{n_2} - E\left[\frac{X}{n_1} - \frac{Y}{n_2}\right]}
        {\sqrt{\mathrm{Var}\left[\frac{X}{n_1} - \frac{Y}{n_2}\right]}}
   \ \text{ is approximately } N(0, 1),   (2.10) \]

where

\[ E\left[\frac{X}{n_1} - \frac{Y}{n_2}\right] = \frac{n_1 p_1}{n_1} - \frac{n_2 p_2}{n_2} = 0 \ \text{ under } H_0, \]

and

\[ \mathrm{Var}\left[\frac{X}{n_1} - \frac{Y}{n_2}\right] = \frac{n_1 p_1 q_1}{n_1^2} + \frac{n_2 p_2 q_2}{n_2^2}
   = p(1-p)\left(\frac{1}{n_1} + \frac{1}{n_2}\right) \ \text{ under } H_0. \]

In (2.10) the variance is unknown, but we can replace it by an estimate, and it remains to decide what is the best estimate to use. For the binomial distribution, the MLE of p is

\[ \hat p = \frac{\text{number of successes}}{\text{number of trials}} = \frac{X}{n}. \]

In our case, we have two binomial distributions with the same probability of success under H₀, so intuitively it seems reasonable to pool the two samples so that we have X + Y successes in n₁ + n₂ trials. So we will estimate p by

\[ \hat p = \frac{x + y}{n_1 + n_2}. \]

Using this in (2.10) we can say that to test H₀: p₁ = p₂ against H₁: p₁ ≠ p₂ at the 100α% significance level, H₀ is rejected if

\[ \frac{|(x/n_1) - (y/n_2)|}
        {\sqrt{\dfrac{x+y}{n_1+n_2}\left(1 - \dfrac{x+y}{n_1+n_2}\right)\left(\dfrac{1}{n_1} + \dfrac{1}{n_2}\right)}} > z_{\alpha/2}.   (2.11) \]

Of course the appropriate modification can be made for a one-sided alternative.
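A sketch of the test in (2.11), with hypothetical counts (prop.test() gives the equivalent chi-square form):

x <- 45; n1 <- 100      # hypothetical successes and trials, sample 1
y <- 60; n2 <- 110      # hypothetical successes and trials, sample 2
p.hat <- (x + y)/(n1 + n2)                                    # pooled estimate of p
z <- (x/n1 - y/n2) / sqrt(p.hat*(1 - p.hat)*(1/n1 + 1/n2))
c(z = z, P.two.sided = 2*pnorm(abs(z), lower.tail = FALSE))
prop.test(c(x, y), c(n1, n2), correct = FALSE)                # equivalent chi-square test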

2.6  Connection between Hypothesis testing and CIs

Consider the problem where we have a sample of size n from a N(μ, σ²) distribution where σ² is known and μ is unknown. An unbiased estimator of μ is x̄ = Σᵢ xᵢ/n. We can use this information either
(a) to test the hypothesis H₀: μ = μ₀; or
(b) to find a CI for μ and see if the value μ₀ is in it or not.
We will show that testing H₀ at the 5% significance level (that is, with α = .05) against a two-sided alternative is the same as finding out whether or not μ₀ lies in the 95% confidence interval.
(a) For H₁: μ ≠ μ₀ we reject H₀ at the 5% significance level if

\[ \frac{\bar x - \mu_0}{\sigma/\sqrt n} > 1.96 \quad\text{or}\quad \frac{\bar x - \mu_0}{\sigma/\sqrt n} < -1.96,   (2.12) \]

that is, if

\[ \frac{|\bar x - \mu_0|}{\sigma/\sqrt n} > 1.96. \]

Or, using the P-value, if x̄ > μ₀ we calculate the probability of a value as extreme or more extreme than this, in either direction. That is, calculate

\[ P = 2\,P(\bar X > \bar x) = 2\,P\left( Z > \frac{\bar x - \mu_0}{\sigma/\sqrt n} \right). \]

If P < .05 the result is significant at the 5% level. This will happen if (x̄ − μ₀)/(σ/√n) > 1.96, as in (2.12).
(b) A symmetric 95% confidence interval for μ is x̄ ± 1.96σ/√n, which arose from considering the inequality

\[ -1.96 < \frac{\bar x - \mu}{\sigma/\sqrt n} < 1.96, \]

which is the event complementary to that in (2.12).
So, to reject H₀ at the 5% significance level is equivalent to saying that the hypothesized value is not in the 95% CI. Likewise, to reject H₀ at the 1% significance level is equivalent to saying that the hypothesized value is not in the 99% CI, which is equivalent to saying that the P-value is less than 1%.
If 1% < P < 5%, the hypothesized value of μ will not be within the 95% CI but it will lie in the 99% CI.
This approach is illustrated for the hypothesis-testing situation and the confidence interval approach below.
Computer Exercise 2.3
Using the data in Computer Exercise 2.2, find a 99% CI for the true mean, μ.

Solution:

#Calculate the upper and lower limits for the 99% confidence interval.
CI <- qnorm(mean=xbar,sd=5/sqrt(25),p=c(0.005,0.995) )
> CI
[1] 18 23

So that the 99% CI is (18, 23).


Figure 2.8: Relationship between a significant hypothesis test and the confidence interval: test significant at the 5% level, and the 95% confidence interval doesn't include the hypothesised mean.

Figure 2.9: Relationship between a non-significant hypothesis test and the confidence interval: test non-significant at the 1% level, and the 99% confidence interval includes the hypothesised mean.

2.7  Summary

We have only considered 4 hypothesis testing problems at this stage. Further problems
will be dealt with in later chapters after more sampling distributions are introduced. The
following might be helpful as a pattern to follow in doing examples in hypothesis testing.
1. State the hypothesis and the alternative. This must always be a statement about
the unknown parameter in a distribution.
2. Select the appropriate statistic (function of the data). In the problems considered
so far this is an unbiased estimate of the parameter or a function of it. State the
distribution of the statistic and its particular form when H0 is true.
Alternative Procedures

Either:
1. Find the critical region using the appropriate value of α (.05 or .01 usually).
2. Find the observed value of the statistic (using the data).
3. Draw conclusions. If the calculated value falls in the critical region, this provides evidence against H₀. You could say that the result is significant at the 5% (or 1%, or .1%) level.

Or:
1. Find the observed value of the statistic (using the data).
2. Calculate P, the probability associated with values as extreme or more extreme than that observed. For a two-sided H₁, you'll need to double a probability such as P(X ≥ k).
3. Draw conclusions. For example, if P < .1% we say that there is very strong evidence against H₀. If .1% < P < 1% we say there is strong evidence. If 1% < P < 5% we say there is some evidence. For larger values of P we conclude that the event is not an unusual one if H₀ is true, and say that this set of data is consistent with H₀.

2.8  Bayesian Hypothesis Testing

2.8.1  Notation

The notation that we have used in classical hypothesis testing is:

θ ∈ Ω = ω ∪ ω̄,   H₀: θ ∈ ω,   H₁: θ ∈ ω̄.

There is a set of observations x₁, x₂, . . . , xₙ whose density is p(x|θ).
A test is decided by the rejection region

R = {x : observing x would lead to the rejection of H₀}.

Decisions between tests are based on

Size:    α = P(R | θ) for θ ∈ ω   (Type I error),
Power:   1 − β = P(R | θ) for θ ∈ ω̄   (β being the Type II error).

The smaller the Type I error the larger the Type II error, and vice versa. The rejection region is usually chosen as a balance between the two types of error.

2.8.2  Bayesian approach

We calculate the posterior probabilities

p₀ = P(θ ∈ ω | x),   p₁ = P(θ ∈ ω̄ | x)

and decide between H₀ and H₁ using p₀ and p₁.
Since ω ∪ ω̄ = Ω and ω ∩ ω̄ = ∅, we have p₀ + p₁ = 1.
We require prior probabilities

π₀ = P(θ ∈ ω),   π₁ = P(θ ∈ ω̄),

so that π₀ + π₁ = 1.
The prior odds on H₀ against H₁ is π₀/π₁, and the posterior odds on H₀ against H₁ is p₀/p₁.
If the prior odds is close to 1, then H₀ is approximately as likely as H₁ a priori. If the prior odds ratio is large, H₀ is relatively likely. If the prior odds ratio is small, H₀ is relatively unlikely. The same remarks apply to the posterior odds.

Bayes factor
The Bayes factor, B, is the odds in favour of H₀ against H₁:

\[ B = \frac{p_0/p_1}{\pi_0/\pi_1} = \frac{p_0\,\pi_1}{p_1\,\pi_0}.   (2.13) \]

The posterior probability p₀ of H₀ can be calculated from its prior probability and the Bayes factor,

\[ p_0 = \frac{1}{1 + (\pi_1/\pi_0)B^{-1}} = \frac{1}{1 + \{(1-\pi_0)/\pi_0\}\,B^{-1}}. \]

Simple Hypotheses
When ω = {θ₀} and ω̄ = {θ₁},

p₀ ∝ π₀ p(x|θ₀)   and   p₁ ∝ π₁ p(x|θ₁),

so that

\[ \frac{p_0}{p_1} = \frac{\pi_0\,p(x|\theta_0)}{\pi_1\,p(x|\theta_1)} \qquad\text{and}\qquad B = \frac{p(x|\theta_0)}{p(x|\theta_1)}. \]
Example 2.6
Consider the following prior distribution, likelihood and null hypothesis:

θ ∼ N(82.4, 1.1²),   x | θ ∼ N(θ, 1.7²) with observed x = 82.1,   H₀: θ < 83.0.

From the results in Section 1.9,

τ₀ = 1/σ₀² = 1/1.1² = 0.83
τ  = 1/σ²  = 1/1.7² = 0.35
τ₁ = τ₀ + τ = 1.18,   so   σ₁² = (1.18)⁻¹ = 0.85
μ₁ = (0.83/1.18) × 82.4 + (0.35/1.18) × 82.1 = 82.3

For H₀: θ < 83, and with π₀, p₀ being the prior and posterior probabilities under H₀,

π₀ = P(θ < 83 | μ₀ = 82.4, σ₀ = 1.1) = 0.71      (pnorm(mean=82.4,sd=1.1,q=83))
π₀/(1 − π₀) = 0.71/0.29 = 2.45

p₀ = P(θ < 83 | μ₁ = 82.3, σ₁² = 0.85) = 0.77
p₀/(1 − p₀) = 0.77/0.23 = 3.35

The Bayes factor is

B = (p₀ π₁)/(p₁ π₀) = 3.35/2.45 = 1.4.

The data have not altered the prior beliefs about the mean: B ≈ 1.
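The whole calculation can be reproduced in R (a sketch using only the values quoted in Example 2.6):

tau0 <- 1/1.1^2;  tau <- 1/1.7^2                  # prior and datum precisions
tau1 <- tau0 + tau
mu1  <- (tau0*82.4 + tau*82.1)/tau1               # posterior mean, about 82.3
pi0  <- pnorm(83, mean = 82.4, sd = 1.1)          # prior P(theta < 83), about 0.71
p0   <- pnorm(83, mean = mu1, sd = sqrt(1/tau1))  # posterior P(theta < 83), about 0.77
B    <- (p0/(1 - p0)) / (pi0/(1 - pi0))           # Bayes factor, about 1.4
c(mu1 = mu1, B = B)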

2.9  Non-Parametric Hypothesis Testing

Figure 2.10 shows 2 ways in which distributions differ. The difference depicted in Figure 2.10 (a) is a shift in location (mean) and in Figure 2.10 (b) there is a shift in the scale
(variance).

Figure 2.10: Distributions that differ due to shifts in (a) location and (b) scale.

2.9.1  Kolmogorov-Smirnov (KS)

The KS test is a test of whether 2 independent samples have been drawn from the same
population or from populations with the same distribution. It is concerned with the agreement between 2 cumulative distribution functions. If the 2 samples have been drawn from
the same population, then the cdfs can be expected to be close to each other and only
differ by random deviations. If they are too far apart at any point, this suggests that the
samples come from different populations.
The KS test statistic is

D = max_x |F₁(x) − F₂(x)|.    (2.14)
Exact sampling distribution
The exact sampling distribution of D under H0 : F1 = F2 can be enumerated.
If H0 is true, then [(X1 , X2 , . . . , Xm ), (Y1 , Y2 , . . . , Yn )] can be regarded as a random
sample from the same population with actual realised samples
[(x1 , x2 , . . . , xm ), (y1 , y2 , . . . , yn )]
Thus (under H0 ) an equally likely sample would be
[(y1 , x2 , . . . , xm ), (x1 , y2 , . . . , yn )]
where x1 and y1 were swapped.

There are \(\binom{m+n}{m}\) possible realisations of allocating the combined sample to two groups of sizes m and n, and under H₀ the probability of each realisation is \(1/\binom{m+n}{m}\). For each sample generated this way, a D* is observed.
Now F₁(x) has steps of 1/(m+1) and F₂(y) has steps of 1/(n+1), so for given m and n it would be possible to enumerate all D*_{m,n} if H₀ is true. From this enumeration the upper 100α% point of {D*_{m,n}}, written D_{m,n;α}, gives the critical value for the α-sized test. If the observed D_{m,n} is greater than D_{m,n;α}, reject H₀.

2.9.2  Asymptotic distribution

If m and n become even moderately large, the enumeration is huge. In that case we can utilize the large-sample approximation that

χ² = 4D²mn/(m + n)

is approximately distributed as chi-square on 2 degrees of freedom.
This was shown to be so by Goodman in 1954, Psychological Bulletin 51 160-168.


Example 2.7
These data are the energies of sway signals from 2 groups of subjects, Normal group and
Whiplash group. Whiplash injuries can lead to unsteadiness and the subject may not be
able to maintain balance. each subject had their sway pattern measured by standing on a
plate blindfolded. Does the distribution of energies differ between groups?

Table 2.1: Wavelet energies of the sway signals from normal subjects and subjects with whiplash injury.

Normal        33   211   284   545   570   591   602   786   945   951
            1161  1420  1529  1642  1994  2329  2682  2766  3025 13537
Whiplash     269   352   386  1048  1247  1276  1305  1538  2037  2241
            2462  2780  2890  4081  5358  6498  7542 13791 23862 34734

The plots of the ecdf suggest a difference. We apply the Kolmogorov-Smirnov test to these data.

Figure 2.11: The ecdfs of sway signal energies for N & W groups.

N.energy <- c(33,211,284,545,570,591,602,786,945,951,1161,1420,
              1529,1642,1994,2329,2682,2766,3025,13537)
W.energy <- c(269,352,386,1048,1247,1276,1305,1538,2037,2241,2462,2780,
              2890,4081,5358,6498,754,1379,23862,34734)
m <- length(N.energy)
n <- length(W.energy)
KS <- ks.test(N.energy, W.energy, alternative="greater")
> KS
        Two-sample Kolmogorov-Smirnov test
data:  N.energy and W.energy
D^+ = 0.35, p-value = 0.0863
alternative hypothesis: the CDF of x lies above that of y

# ______ the asymptotic distribution __________
D   <- KS$statistic
Chi <- 4*(D^2)*m*n/(m+n)
P   <- pchisq(q=Chi, df=2, lower.tail=F)
> cat("X2 = ",round(Chi,2),"  P( > X2) = ",P,"\n")
X2 = 4.9   P( > X2) = 0.08629

1. The Kolmogorov-Smirnov test of whether the null hypothesis can be rejected is a


permutation test.
2. The equality F₁ = F₂ means that F₁ and F₂ assign equal probabilities to all sets; P_{F₁}(A) = P_{F₂}(A) for any subset A of the common sample space of x and y. If H₀ is true, there is no difference between the randomness of x or y.
3. The null hypothesis is set up to be rejected. If however, the data are such that the null
hypothesis cannot be decisively rejected, then the experiment has not demonstrated
a difference.

4. A hypothesis test requires a statistic, θ̂, for comparing the distributions. In the Kolmogorov-Smirnov test θ̂ = D.
5. Having observed θ̂, the achieved significance level of the test is the probability of observing at least as large a value when H₀ is true, P_{H₀}(θ̂* ≥ θ̂). The observed statistic θ̂ is fixed, and the random variable θ̂* is distributed according to H₀.
6. The KS test enumerated all permutations of elements in the samples. This is also termed sampling without replacement. Not all permutations are necessary, but an accurate test does require a large number of permutations.
7. The permutation test applies to any test statistic. For the example in Figure 2.10(b), we might use the ratio of sample variances, θ̂ = s²_x / s²_y.

2.9.3  Bootstrap Hypothesis Tests

The link between confidence intervals and hypothesis tests also holds in a bootstrap setting.
The bootstrap is an approximation to a permutation test and a strategic difference is that
bootstrap uses sampling with replacement.
A permutation test of whether H0 : F1 (x) = F2 (y) is true relies upon the ranking of the
combined data set (x, y). The data were ordered smallest to largest and each permutation
was an allocation of the group labels to each ordered datum. In 1 permutation, the label
x was ascribed to the first number and in another, the label y is given to that number and
so on.
The test statistic can be a function of the data (it need not be an estimate of a parameter), so denote it by t(z).
The principle of bootstrap hypothesis testing is that if H₀ is true, a probability atom of 1/(m+n) can be attributed to each member of the combined data z = (x, y).
The empirical distribution function of z = (x, y), call it F₀(z), is a non-parametric estimate of the common population that gave rise to x and y, assuming that H₀ is true.
Bootstrap hypothesis testing of H₀ takes these steps:
1. Get the observed value of t, e.g. t_obs = x̄ − ȳ.
2. Nominate how many bootstrap samples (replications) will be done, e.g. B = 499.
3. For b in 1:B, draw samples of size m + n with replacement from z. Label the first m of these x*_b and the remaining n y*_b.
4. Calculate t(z*_b) for each sample. For example, t(z*_b) = x̄*_b − ȳ*_b.
5. Approximate the probability of t_obs or greater by (number of t(z*_b) ≥ t_obs) / B.

Example
The data in Table 2.1 are used to demonstrate bootstrap hypothesis testing with the test statistic

\[ t(z) = \frac{\bar y - \bar x}{s_z\sqrt{\frac{1}{m} + \frac{1}{n}}}, \]

where s_z is the standard deviation of the combined sample.
The R code is written to show the required calculations more explicitly, but a good program minimises the variables which are saved in the iterations loop.
#_____________ Bootstrap Hypothesis Test ____________________
N.energy <- c(33,211,284,545,570,591,602,786,945,951,1161,1420,
              1529,1642,1994,2329,2682,2766,3025,13537)
W.energy <- c(269,352,386,1048,1247,1276,1305,1538,2037,2241,2462,2780,
              2890,4081,5358,6498,754,1379,23862,34734)
Z <- c(N.energy, W.energy)
m <- length(N.energy)
n <- length(W.energy)
T.obs <- (mean(W.energy) - mean(N.energy))/(sd(Z)*sqrt(1/m + 1/n))
nBS <- 999
T.star <- numeric(nBS)
for (j in 1:nBS){
  z.star <- sample(Z, size=(m+n), replace=TRUE)   # resample with replacement under H0
  w.star <- z.star[(m+1):(m+n)]
  n.star <- z.star[1:m]
  T.star[j] <- ( mean(w.star) - mean(n.star) )/( sd(z.star) * sqrt(1/m + 1/n) )
}
p1 <- sum(T.star >= T.obs)/nBS
cat( "P(T > ",round(T.obs,1)," | H0) = ",round(p1,2),"\n")

The results are: T = 1.4, P(t > 1.4 | H₀) = 0.09.
Thus this statistic does not provide evidence that the two distributions are different.

Chapter 3

Chisquare Distribution

3.1  Distribution of S²

Recall that if X₁, X₂, . . . , Xₙ is a random sample from a N(μ, σ²) distribution then

\[ S^2 = \sum_{i=1}^{n} (X_i - \bar X)^2 / (n-1) \]

is an unbiased estimator of σ². We will find the probability distribution of this random variable. Firstly note that the numerator of S² is a sum of n squares, but they are not independent, as each involves X̄. This sum of squares can be rewritten as the sum of squares of n − 1 independent variables by the method which is illustrated below for the cases n = 2, 3, 4.
For n = 2,

\[ \sum_{i=1}^{2}(X_i - \bar X)^2 = Y_1^2 \quad\text{where } Y_1 = (X_1 - X_2)/\sqrt 2; \]

for n = 3,

\[ \sum_{i=1}^{3}(X_i - \bar X)^2 = \sum_{j=1}^{2} Y_j^2 \quad\text{where } Y_1 = (X_1 - X_2)/\sqrt 2,\ Y_2 = (X_1 + X_2 - 2X_3)/\sqrt 6; \]

for n = 4,

\[ \sum_{i=1}^{4}(X_i - \bar X)^2 = \sum_{j=1}^{3} Y_j^2 \quad\text{where } Y_1, Y_2 \text{ are as defined above and }
   Y_3 = (X_1 + X_2 + X_3 - 3X_4)/\sqrt{12}. \]

Note that Y₁, Y₂, Y₃ are linear functions of X₁, X₂, X₃, X₄ which are mutually orthogonal with the sum of the squares of their coefficients equal to 1.
Consider now the properties of the Xᵢ and the Yⱼ as random variables. Since Y₁, Y₂, Y₃ are mutually orthogonal linear functions of X₁, X₂, X₃, X₄ they are uncorrelated, and


since they are normally distributed (being sums of normal random variables), they are independent. Also,

E(Y₁) = 0 = E(Y₂) = E(Y₃)

and

\[ \mathrm{Var}(Y_1) = \tfrac{1}{2}\left(\mathrm{Var}(X_1) + \mathrm{Var}(X_2)\right) = \sigma^2, \qquad
   \mathrm{Var}(Y_2) = \tfrac{1}{6}\mathrm{Var}(X_1) + \tfrac{1}{6}\mathrm{Var}(X_2) + \tfrac{4}{6}\mathrm{Var}(X_3) = \sigma^2. \]

Similarly, Var(Y₃) = σ².
In general the sum of n squares involving the X's can be expressed as the sum of n − 1 squares involving the Y's. Thus Σᵢ(Xᵢ − X̄)² can be expressed as

\[ \sum_{i=1}^{n}(X_i - \bar X)^2 = \sum_{j=1}^{n-1} Y_j^2 = \sum_{j=1}^{\nu} Y_j^2 \]

where ν = n − 1 is called the number of degrees of freedom and

\[ Y_j = \frac{X_1 + X_2 + \cdots + X_j - jX_{j+1}}{\sqrt{j(j+1)}}, \qquad j = 1, 2, \ldots, n-1. \]

The random variables Y₁, Y₂, . . . , Y_ν each have mean zero and variance σ², so each Yⱼ ∼ N(0, σ²) and the Yⱼ's are independent.
Now write

\[ S^2 = \frac{\sum_{j=1}^{\nu} Y_j^2}{\nu} \]

and recall that
(i) if X ∼ N(μ, σ²) then (X − μ)²/(2σ²) ∼ Gamma(1/2)   [Statistics 260, (8.16)];
(ii) if X₁, X₂, . . . , X_ν are independent N(μ, σ²) variates, then \(\sum_{j=1}^{\nu}(X_j - \mu)^2/(2\sigma^2)\) is distributed as Gamma(ν/2)   [Statistics 260, section 7.4].
Applying this to the Yⱼ, where μ = 0, Yⱼ²/(2σ²) ∼ Gamma(1/2) and

\[ V = \frac{1}{2\sigma^2}\sum_{j=1}^{\nu} Y_j^2 \ \text{ is distributed as } \ \mathrm{Gamma}\left(\frac{\nu}{2}\right). \]

Thus the pdf of V is given by

\[ f(v) = \frac{1}{\Gamma(\nu/2)}\,v^{(\nu/2)-1} e^{-v}, \qquad v \in (0, \infty).   (3.1) \]

with V and S² being related by

\[ S^2 = \frac{\sum_{j=1}^{\nu} Y_j^2}{\nu} = \frac{2\sigma^2 V}{\nu}, \qquad\text{or}\qquad V = \frac{\nu S^2}{2\sigma^2}.   (3.2) \]

Now V is a strictly monotone function of S², so, by the change-of-variable technique, the pdf of S² is

\[ g(s^2) = f(v)\left|\frac{dv}{ds^2}\right|
 = \frac{1}{\Gamma(\nu/2)}\left(\frac{\nu s^2}{2\sigma^2}\right)^{(\nu/2)-1} e^{-\nu s^2/(2\sigma^2)}\,\frac{\nu}{2\sigma^2}
 = \left(\frac{\nu}{2\sigma^2}\right)^{\nu/2}\frac{1}{\Gamma(\nu/2)}\,(s^2)^{(\nu/2)-1}\exp\left\{-\frac{\nu s^2}{2\sigma^2}\right\},
 \qquad s^2 \in (0, \infty).   (3.3) \]

This is the pdf of S² derived from a N(μ, σ²) distribution.
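As a numerical sketch (not from the notes), (3.1) to (3.3) can be checked by simulating S² and comparing V = νS²/(2σ²) with a Gamma(ν/2) distribution:

set.seed(4)
n <- 5; sigma <- 2; nu <- n - 1
s2 <- replicate(10000, var(rnorm(n, mean = 0, sd = sigma)))   # simulated values of S^2
v  <- nu * s2 / (2 * sigma^2)                                 # V as defined in (3.2)
ks.test(v, "pgamma", shape = nu/2)                            # compare with Gamma(nu/2), rate 1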

3.2  Chi-Square Distribution

Define the random variable W as

\[ W = \nu S^2/\sigma^2 = 2V, \]

where V is defined in (3.2). Note that W is a sum of squares divided by σ², and can be thought of as a standardized sum of squares. Then the p.d.f. of W is

\[ h(w) = g(s^2)\left|\frac{ds^2}{dw}\right|, \qquad\text{where } \frac{ds^2}{dw} = \frac{\sigma^2}{\nu}, \]

\[ h(w) = \frac{e^{-w/2}\,w^{(\nu/2)-1}}{2^{\nu/2}\,\Gamma(\nu/2)}, \qquad w \in [0, \infty).   (3.4) \]

A random variable W with this pdf is said to have a chi-square distribution on ν degrees of freedom (or with parameter ν) and we write W ∼ χ²_ν.
Notes: (a) W/2 ∼ Γ(ν/2).
(b) This distribution can be thought of as a special case of the generalized gamma distribution.
(c) When ν = 2, (3.4) becomes h(w) = ½e^(−w/2), w ∈ [0, ∞), which is the exponential distribution.

Computer Exercise 3.1

Graph the chi-square distributions with 2, 3 and 4 degrees of freedom for x = 0.05, 0.1, 0.15, . . . , 10, using one set of axes.
Solution:

x <- seq(from=0.05, to=10, by=0.05)
for (d in 2:4){
  fx <- dchisq(x, df=d)
  if (d==2) plot(fx ~ x, type="l", ylim=c(0,0.5), las=1)
  else lines(x, fx, lty=(d-1))
}                              # end of d loop
legend(6, 0.4, expression(chi[2]^2, chi[3]^2, chi[4]^2), lty=1:3)
Rcmdr plots the Chi-square density or distribution function readily,

Cumulative Distribution Function

If W ∼ χ²_ν, percentiles (i.e. 100P%) of the chi-square distribution are determined by the inverse of the function

\[ \frac{P}{100} = \frac{1}{2^{\nu/2}\Gamma(\nu/2)}\int_0^{w_{1-.01P}} w^{(\nu/2)-1} e^{-w/2}\,dw
   = P(W \le w_{1-.01P}). \]
Figure 3.1 depicts the tail areas corresponding to P (lower tail) and 1 P (upper tail)
for the density function and superimposed is the distribution function. The scales for the
Y-axes of the density function (left side) and the distribution function (right side) are
different.

Figure 3.1: Area corresponding to the 100P percentile of the χ²_ν random variable w.

The R function for calculating tail area probabilities for given quantiles is
pchisq(q= , df = ,lower.tail= T (or F) )
and for calculating quantiles corresponding to a probability, qchisq(p = , df = )
These functions are included in the Rcmdr menus.
The following example requires us to find a probability.
Example 3.1
A random sample of size 6 is drawn from a N(μ, 12) distribution. Find P(2.76 < S² < 22.2).
Solution:
We wish to express this as a probability statement about the random variable W. That is,

\[ P(2.76 < S^2 < 22.2) = P\left(\frac{5 \times 2.76}{12} < \frac{\nu S^2}{\sigma^2} < \frac{5 \times 22.2}{12}\right)
   = P(1.15 < W < 9.25), \quad W \sim \chi^2_5, \]
\[ \phantom{P(2.76 < S^2 < 22.2)} = P(W < 9.25) - P(W < 1.15). \]

In R:

#___ Pint.R _______
Q <- c(2.76, 22.2)*5/12
Pint <- diff( pchisq(q=Q, df=5) )
cat("P(2.76 < S2 < 22.2) = ", Pint, "\n")

> source("Pint.R")
P(2.76 < S2 < 22.2) =  0.85

Figure 3.2: P(2.76 < S² < 22.2), shown as P(1.15 < W < 9.25) on the distribution function of W.
Moments
As V (defined in (3.2)) has a gamma distribution, its mean and variance can be written down. That is, V ∼ Γ(ν/2), so that

E(V) = ν/2   and   Var(V) = ν/2.

Then, since W is related to V by W = 2V,

E(W) = 2(ν/2) = ν,   Var(W) = 4(ν/2) = 2ν.    (3.5)

Thus, a random variable W ∼ χ²_ν has mean ν and variance 2ν.

Exercise: Find E(W) and Var(W) directly from h(w).

Moment Generating Function

The MGF of a chi-square variate can be deduced from that of a gamma variate. Let V ∼ Γ(ν/2) and let W = 2V. We know M_V(t) = (1 − t)^(−ν/2) from Statistics 260, Theorem 4.4. Hence

M_W(t) = M_{2V}(t) = M_V(2t) = (1 − 2t)^(−ν/2).

So if W ∼ χ²_ν then

M_W(t) = (1 − 2t)^(−ν/2).    (3.6)

Exercise: Find the MGF of W directly from the pdf of W. (Hint: use the substitution u = w(1 − 2t)/2 when integrating.)

To find moments, we will use the power series expansion of M_W(t):

\[ M_W(t) = 1 + \frac{\nu}{2}(2t) + \frac{\nu}{2}\left(\frac{\nu}{2}+1\right)\frac{(2t)^2}{2!}
 + \frac{\nu}{2}\left(\frac{\nu}{2}+1\right)\left(\frac{\nu}{2}+2\right)\frac{(2t)^3}{3!} + \cdots \]
\[ \phantom{M_W(t)} = 1 + \nu t + \nu(\nu+2)\frac{t^2}{2!} + \nu(\nu+2)(\nu+4)\frac{t^3}{3!} + \cdots \]

Moments can be read off as the appropriate coefficients here. Note that μ′₁ = ν and μ′₂ = ν(ν + 2). The cumulant generating function is

\[ K_W(t) = \log M_W(t) = -\frac{\nu}{2}\log(1-2t)
 = \frac{\nu}{2}\left(2t + \frac{2^2 t^2}{2} + \frac{2^3 t^3}{3} + \frac{2^4 t^4}{4} + \cdots\right)
 = \nu t + 2\nu\frac{t^2}{2!} + 8\nu\frac{t^3}{3!} + 48\nu\frac{t^4}{4!} + \cdots \]

so the cumulants are

κ₁ = ν,   κ₂ = 2ν,   κ₃ = 8ν,   κ₄ = 48ν.

We will now use these cumulants to find measures of skewness and kurtosis for the chi-square distribution.

Comparison with Normal

(i) Coefficient of skewness,

\[ \gamma_1 = \frac{\kappa_3}{\kappa_2^{3/2}} = \frac{8\nu}{(2\nu)^{3/2}} = \frac{2\sqrt 2}{\sqrt\nu} \to 0 \ \text{ as } \nu \to \infty. \]

That is, the χ²_ν distribution becomes symmetric as ν → ∞.
(ii) Coefficient of kurtosis,

\[ \gamma_2 = \frac{\kappa_4}{\kappa_2^2} = \frac{48\nu}{4\nu^2} = \frac{12}{\nu} \to 0 \ \text{ as } \nu \to \infty. \]

This is the value γ₂ has for the normal distribution.

Additive Property
Let W₁ ∼ χ²_{ν₁} and W₂ (independent of W₁) ∼ χ²_{ν₂}. Then from (3.6), W₁ + W₂ has moment generating function

\[ M_{W_1+W_2}(t) = M_{W_1}(t)\,M_{W_2}(t) = (1-2t)^{-\nu_1/2}(1-2t)^{-\nu_2/2} = (1-2t)^{-(\nu_1+\nu_2)/2}. \]

This is also of the form (3.6); that is, we recognize it as the MGF of a χ² random variable on (ν₁ + ν₂) degrees of freedom.
Thus if W₁ ∼ χ²_{ν₁} and W₂ ∼ χ²_{ν₂}, and W₁ and W₂ are independent, then

W₁ + W₂ ∼ χ²_{ν₁+ν₂}.

The result can be extended to the sum of k independent χ² random variables. If W₁, . . . , W_k are independent χ²_{ν₁}, . . . , χ²_{ν_k}, then

\[ \sum_{i=1}^{k} W_i \sim \chi^2_{\nu}   (3.7) \]

where ν = Σᵢ νᵢ. Note also that a χ²_ν variate can be decomposed into a sum of ν independent chi-squares each on 1 d.f.
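A quick simulation sketch of the additive property (not from the notes):

set.seed(2)
w <- rchisq(100000, df = 3) + rchisq(100000, df = 5)   # sum of independent chi-squares
c(mean(w), var(w))                                     # close to 8 and 16, as for a chi-square on 8 df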

Chi-square on 1 degree of freedom

For the special case ν = 1, note that from (3.1), if Y ∼ N(0, σ²) then V = Y²/(2σ²) ∼ Γ(1/2) and W = 2V = Y²/σ² ∼ χ²₁.
Thus, if Z = Y/σ, it follows that Z ∼ N(0, 1) and

Z² ∼ χ²₁.    (3.8)

(The square of a N(0, 1) random variable has a chi-square distribution on 1 df.)

Summary
You may find the following summary of relationships between the χ², gamma, S² and normal distributions useful.
Define S² = Σᵢ(Xᵢ − X̄)²/(n − 1), the Xᵢ being independent N(μ, σ²) variates. Then
(i) W = νS²/σ² ∼ χ²_ν, where ν = n − 1;
(ii) ½W = νS²/(2σ²) ∼ Γ(ν/2);
(iii) if Zᵢ = (Xᵢ − μ)/σ (that is, Zᵢ ∼ N(0, 1)), then Zᵢ² ∼ χ²₁ and Z₁² + Z₂² + ⋯ + Z_k² ∼ χ²_k.
3.3  Independence of X̄ and S²

When X̄ and S² are defined for a sample from a normal distribution, X̄ and S² are statistically independent. This may seem surprising, as the expression for S² involves X̄.
Consider again the transformation from the X's to the Y's given in Section 3.1. We've seen that (n − 1)S² = Σᵢ(Xᵢ − X̄)² can be expressed as Σⱼ Yⱼ², where the Yⱼ, defined by

\[ Y_j = \frac{X_1 + X_2 + \cdots + X_j - jX_{j+1}}{\sqrt{j(j+1)}}, \qquad j = 1, 2, \ldots, n-1, \]

have zero means and variances σ². Note also that the sample mean,

\[ \bar X = \frac{1}{n}X_1 + \frac{1}{n}X_2 + \cdots + \frac{1}{n}X_n, \]

is a linear function of X₁, . . . , Xₙ which is orthogonal to each of the Yⱼ, and hence uncorrelated with each Yⱼ. Since the Xᵢ are normally distributed, X̄ is thus independent of each of the Yⱼ and therefore independent of any function of them.
Thus when X₁, . . . , Xₙ are normally and independently distributed random variables, X̄ and S² are statistically independent.
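A simulation sketch of this independence (not from the notes): for normal data the sample correlation between X̄ and S² is close to zero.

set.seed(3)
sims <- replicate(5000, { x <- rnorm(10, mean = 5, sd = 2); c(mean(x), var(x)) })
cor(sims[1, ], sims[2, ])    # close to 0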

3.4  Confidence Intervals for σ²

We will use the method indicated in Section 1.8 to find a confidence interval for σ² in a normal distribution, based on a sample of size n. The two cases (i) μ unknown and (ii) μ known must be considered separately.

Case (i)
Let X₁, X₂, . . . , Xₙ be a random sample from N(μ, σ²) where both μ and σ² are unknown. It has been shown that S² is an unbiased estimate of σ² (Theorem 1.4) and we can find a confidence interval for σ² using the χ² distribution. Recall that W = νS²/σ² ∼ χ²_ν. By way of notation, let w_{ν,α} be defined by P(W > w_{ν,α}) = α, where W ∼ χ²_ν.

The quantile for the upper 5% region is obtained by qchisq(p=0.05, df=5, lower.tail=F) or qchisq(p=0.95, df=5).

Figure 3.3: Area α above w_{ν,α} (density of W with upper-tail area α and lower area 1 − α).

We find two values of W, w_{ν,α/2} and w_{ν,1−(α/2)}, such that

\[ P\left(w_{\nu,1-(\alpha/2)} < W < w_{\nu,\alpha/2}\right) = 1 - \alpha. \]

Figure 3.4: Upper and lower values for w (tail areas of α/2 at each end).

The event w_{ν,1−(α/2)} < W < w_{ν,α/2} occurs if and only if the events

\[ \sigma^2 < \nu S^2/w_{\nu,1-(\alpha/2)}, \qquad \sigma^2 > \nu S^2/w_{\nu,\alpha/2} \]

both occur. So

\[ P\left(w_{\nu,1-(\alpha/2)} < W < w_{\nu,\alpha/2}\right)
 = P\left(\nu S^2/w_{\nu,\alpha/2} < \sigma^2 < \nu S^2/w_{\nu,1-(\alpha/2)}\right), \]

and thus

a 100(1 − α)% CI for σ² is ( νs²/w_{ν,α/2} , νs²/w_{ν,1−(α/2)} ).    (3.9)

Example 3.2
For a sample of size n = 10 from a normal distribution, s² was calculated and found to be 6.4. Find a 95% CI for σ².
Solution: Now ν = 9, and

qchisq(p=c(0.025,0.975), df=9, lower.tail=F)
[1] 19.0  2.7

so w₉,.₀₂₅ = 19.0 and w₉,.₉₇₅ = 2.7.
Hence νs²/w₉,.₀₂₅ = 57.6/19.0 = 3.02 and νs²/w₉,.₉₇₅ = 57.6/2.7 = 21.33.
That is, the 95% CI for σ² is (3.02, 21.33).
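The whole calculation in one step (a sketch following (3.9)):

s2 <- 6.4; nu <- 9
nu * s2 / qchisq(c(0.025, 0.975), df = nu, lower.tail = FALSE)   # approximately (3.02, 21.33)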

Case (ii)
Suppose now that X₁, X₂, . . . , Xₙ is a random sample from N(μ, σ²) where μ is known, and we wish to find a CI for the unknown σ². Recall (Assignment 1, Question 4) that the maximum likelihood estimator of σ² (which we'll denote by S*²) is

\[ S^{*2} = \sum_{i=1}^{n}(X_i - \mu)^2 / n. \]

We can easily show that this is unbiased:

\[ E(S^{*2}) = \frac{1}{n}\sum_{i=1}^{n} E(X_i - \mu)^2 = \frac{1}{n}\,n\sigma^2 = \sigma^2. \]

The distribution of S*² is found by noting that nS*²/σ² = Σᵢ(Xᵢ − μ)²/σ² is the sum of squares of n independent N(0, 1) variates and is therefore distributed as χ²_n (using (3.8) and (3.7)). Proceeding in the same way as in Case (i), we find that

a 100(1 − α)% CI for σ² when μ is known is ( ns*²/w_{n,α/2} , ns*²/w_{n,1−(α/2)} ).    (3.10)
E(S ) =
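As a hedged illustration of (3.10), the sketch below computes the interval directly with qchisq. The data vector and the known mean μ = 10 are invented for the example and are not from the notes.

# Sketch: CI for sigma^2 when mu is known (hypothetical data; mu taken as known = 10)
x  <- c(9.1, 10.4, 11.2, 9.8, 10.9, 8.7, 10.2, 9.5)
mu <- 10
n  <- length(x)
S2 <- sum((x - mu)^2) / n                                      # MLE of sigma^2 (divisor n)
Q  <- qchisq(p = c(0.025, 0.975), df = n, lower.tail = FALSE)  # w_{n,.025}, w_{n,.975}
n * S2 / Q                                                     # (lower limit, upper limit) as in (3.10)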

3.5 Testing Hypotheses about σ²

Again the cases (i) μ unknown; and (ii) μ known are considered separately.

Case (i)
Let X₁, X₂, ..., X_n be a random sample from a N(μ, σ²) distribution where μ is unknown,
and suppose we wish to test the hypothesis

    H: σ² = σ₀²  against  A: σ² ≠ σ₀².

Under H, νS²/σ₀² ~ χ²_ν and values of νs²/σ₀² too large or too small would support A. For
α = .05, say, and equal-tail probabilities we have as critical region

    R = { s² : νs²/σ₀² > w_{ν,.025}  or  νs²/σ₀² < w_{ν,.975} }.

Figure 3.5: Critical Region

Consider now a one-sided alternative. Suppose we wish to test

    H: σ² = σ₀²  against  A: σ² > σ₀².

Large values of s² would support this alternative. That is, for α = .05, use as critical
region
    { s² : νs²/σ₀² > w_{ν,.05} }.
Similarly, for the alternative A: σ² < σ₀², a critical region is
    { s² : νs²/σ₀² < w_{ν,.95} }.
Example 3.3
A normal random variable has been assumed to have standard deviation σ = 7.5. If a
sample of size 25 has s² = 95.0, is there reason to believe that σ is greater than 7.5?
Solution: We wish to test H: σ² = 7.5² (= σ₀²) against A: σ² > 7.5².
Using α = .05, the rejection region is { s² : νs²/σ₀² > 36.4 }.
The calculated value of νs²/σ₀² is (24 × 95)/56.25 = 40.53.

    > pchisq(q=40.53,df=24,lower.tail=F)
    [1] 0.019

When testing at the 5% level, there is evidence that the standard deviation is greater
than 7.5.

Case (ii)
Let X₁, X₂, ..., X_n be a random sample from N(μ, σ²) where μ is known, and suppose
we wish to test H: σ² = σ₀². Again we use the fact that if H is true, nŜ²/σ₀² ~ χ²_n where
Ŝ² = Σ_{i=1}^{n} (Xᵢ − μ)²/n, and the rejection region for a size-α 2-tailed test, for example,
would be

    { s² : ns²/σ₀² > w_{n,α/2}  or  ns²/σ₀² < w_{n,1−(α/2)} }.

3.6 χ² and Inv-χ² distributions in Bayesian inference

3.6.1 Non-informative priors

A prior which does not change very much over the region in which the likelihood is appreciable,
and does not take very large values outside that region, is said to be locally uniform.
For such a prior,
    p(θ|y) ∝ p(y|θ) = ℓ(θ|y).
The term pivotal quantity was introduced in section 1.7.1 and is now defined for (i) a
location parameter and (ii) a scale parameter.
(i) If the density of y, p(y|θ), is such that p(y − θ|θ) is a function that is free of y and
θ, say f(u) where u = y − θ, then y − θ is a pivotal quantity and θ is a location
parameter.
Example. If (y|μ, σ²) ~ N(μ, σ²), then (y − μ|μ, σ²) ~ N(0, σ²) and y − μ is a
pivotal quantity.
(ii) If p(y/θ | θ) is a function free of θ and y, say g(u) where u = y/θ, then u is a pivotal
quantity and θ is a scale parameter.
Example. If (y|μ, σ²) ~ N(μ, σ²), then (y − μ)/σ ~ N(0, 1).

A non-informative prior for a location parameter, θ, would give f(y − θ) for the
posterior distribution p(y − θ|y). That is, under the posterior distribution, (y − θ) should
still be a pivotal quantity. Using Bayes' rule,

    p(y − θ|y) ∝ p(θ) p(y − θ|θ).

Thus p(θ) ∝ Constant.
For the case of a scale parameter, θ, Bayes' rule is

    p(y/θ | y) ∝ p(θ) p(y/θ | θ)                                        (3.11)

or

    p(u|y) ∝ p(θ) p(u|θ).                                               (3.12)

(The LHS of (3.11) is the posterior of the quantity y/θ and the RHS is the density of the
scaled variable u = y/θ. Both sides are free of y and θ.)
With u = y/θ, changing variables gives

    p(y|θ) = p(u|θ) |du/dy| = (1/θ) p(u|θ)

    p(θ|y) = p(u|y) |du/dθ| = (y/θ²) p(u|y)


Thus from (3.12), equating p(u|y) to p(u|θ),

    p(θ|y) = (y/θ) p(y|θ)

so that the uninformative prior is

    p(θ) ∝ 1/θ.                                                         (3.13)

3.7 The posterior distribution of the Normal variance

Consider normally distributed data,

    y|μ, σ² ~ N(μ, σ²).

The joint posterior density of the parameters μ, σ² is given by

    p(μ, σ²|y) ∝ p(y|μ, σ²) p(μ, σ²).                                    (3.14)

To get the marginal posterior distribution of the variance, integrate with respect to μ,

    p(σ²|y) = ∫ p(μ, σ²|y) dμ                                            (3.15)
            = ∫ p(σ²|μ, y) p(μ|y) dμ.                                    (3.16)

Choose the prior

    p(μ, σ²) ∝ p(μ) p(σ²) ∝ (σ²)^{−1}      (p(μ) ∝ Const., p(σ²) ∝ 1/σ²)

Write the posterior density as

    p(μ, σ²|y) ∝ σ^{−n−2} exp{ −(1/(2σ²)) Σ_{i=1}^{n} (yᵢ − μ)² }
              = σ^{−n−2} exp{ −(1/(2σ²)) [ Σ_{i=1}^{n} (yᵢ − ȳ)² + n(ȳ − μ)² ] }
              = σ^{−n−2} exp{ −(1/(2σ²)) [ (n − 1)S² + n(ȳ − μ)² ] }     (3.17)

where S² = Σ (yᵢ − ȳ)² / (n − 1).

Now integrate the joint density with respect to μ,

    p(σ²|y) ∝ ∫ σ^{−n−2} exp{ −(1/(2σ²)) [ (n − 1)S² + n(ȳ − μ)² ] } dμ
            = σ^{−n−2} exp{ −(1/(2σ²)) (n − 1)S² } ∫ exp{ −(1/(2σ²/n)) (ȳ − μ)² } dμ
            = σ^{−n−2} exp{ −(1/(2σ²)) (n − 1)S² } √(2πσ²/n)
            = (σ²)^{−(n+1)/2} exp{ −(n − 1)S² / (2σ²) }.                 (3.18)

The pdf of S² was derived at (3.3),

    g(s²) = ( ν/(2σ²) )^{ν/2} (1/Γ(ν/2)) (s²)^{(ν/2)−1} exp{ −νs²/(2σ²) }
          ∝ (s²)^{(ν/2)−1} exp{ −νs²/(2σ²) }

with ν = (n − 1), and this is a Gamma( (n − 1)/2, (n − 1)/(2σ²) ) distribution.

3.7.1 Inverse Chi-squared distribution

Its Bayesian counterpart at (3.18) is a Scaled Inverse Chi-squared distribution. Since the
prior was uninformative, similar outcomes are expected.
The inverse χ² distribution has density function

    p(σ²|ν) = (1/Γ(ν/2)) (1/2)^{ν/2} (1/σ²)^{(ν/2)+1} exp{ −1/(2σ²) } I_{(0,∞)}(σ²).

The scaled inverse chi-squared distribution has density

    p(σ²|ν, s²) = (1/Γ(ν/2)) (νs²/2)^{ν/2} (σ²)^{−((ν/2)+1)} exp{ −νs²/(2σ²) }.

The prior p(σ²) ∝ 1/σ² can be said to be an inverse chi-squared distribution on ν = 0
degrees of freedom, or sample size n = 1. Is there any value in it? Although uninformative,
it ensures a mathematical smoothness and numerical problems are reduced.
The posterior density is Scaled Inverse Chi-squared with degrees of freedom ν = (n − 1)
and scale parameter s.

3.8 Relationship between χ² and Inv-χ²

Recall that χ²_ν is Ga(ν/2, 1/2).
The Inverse-Gamma distribution is also prominent in Bayesian statistics so we examine
it first.

3.8.1 Gamma and Inverse Gamma

The densities of the Gamma and Inverse Gamma are:

    Gamma:          p(θ|α, β) = (β^α/Γ(α)) θ^{α−1} exp{−βθ} I_{(0,∞)}(θ),   α, β > 0     (3.19)
    Inverse Gamma:  p(θ|α, β) = (β^α/Γ(α)) θ^{−(α+1)} exp{−β/θ} I_{(0,∞)}(θ),  α, β > 0  (3.20)

If θ^{−1} ~ Ga(α, β), then θ ~ InvGamma(α, β).

Put φ = 1/θ. Then

    f(θ; α, β) = f(φ; α, β) |dφ/dθ|
               = (β^α/Γ(α)) φ^{α−1} exp{−βφ} (1/θ²)
               = (β^α/Γ(α)) θ^{−(α+1)} exp{−β/θ}.

3.8.2 Chi-squared and Inverse Chi-squared

If Y is such that Y⁻¹ = S⁻¹X with X ~ χ²_ν, then Y is S times an inverse-χ²_ν variate.

The Inverse-χ²(ν, s²) distribution is a special case of the Inverse Gamma distribution
with α = ν/2 and β = νs²/2.

3.8.3 Simulating Inverse Gamma and Inverse-χ² random variables

InvGamma(α, β): draw X from Ga(α, β) and invert it.

Scaled-Inv-χ²(ν, s²): draw X from χ²_ν and let Y = νs²/X.
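A minimal sketch of these two recipes follows; the parameter values (α = 3, β = 2, ν = 9, s² = 0.34) are arbitrary illustrations.

# Sketch: simulating Inverse-Gamma and Scaled-Inverse-chi-square draws
set.seed(1)
alpha <- 3; beta <- 2
inv.gamma <- 1 / rgamma(10000, shape = alpha, rate = beta)   # InvGamma(alpha, beta)
nu <- 9; s2 <- 0.34
scaled.inv.chisq <- nu * s2 / rchisq(10000, df = nu)         # Scaled-Inv-chi-square(nu, s2)
mean(scaled.inv.chisq)       # compare with the theoretical mean nu*s2/(nu - 2)
nu * s2 / (nu - 2)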


Example
Give a 90% HDR for the variance of the population from which the following sample
is drawn.

    4.17  5.58  5.18  6.11  4.50  4.61  5.17  4.53  5.33  5.14

Here S² = 0.34 and

    p(σ²|ν, S²) ~ Scaled-Inv-χ²(ν = 9, S² = 0.34).

The 90% CI for σ² is (0.18, 0.92). The mode of the posterior density of σ² is 0.28 and
the 90% HDR for σ² is (0.13, 0.75).
The HDR was calculated numerically in this fashion:
1. Calculate the posterior density, (3.20).
2. Set an initial value for the horizon, and estimate the abscissas (left and right of the
mode) whose density is at the horizon. Call these xl and xr.
3. Integrate the density function over (xl, xr).
4. Adjust the horizon until this integral is 0.9. The HDR is then (xl, xr) at the current values.

[Figure: posterior density p(σ²|ν, s²) with ν = 9, s² = 0.34, showing the horizon level and the 90% HDR between xl and xr.]


#_________________ to calculate HDR of sigma^2 ______________________
options(digits=2)
#_________________ functions to use later in the job ________________
closest <- function(s,v){
  delta <- abs(s-v)
  p <- delta==min(delta)
  return(p) }
#_____________________________________________________________________
IGamma <- function(v,a=df/2,b=0.5*df*S2){
  p <- (1/gamma(a))* (v**(-(a+1)) ) * (b**a) * exp(-b/v)
  return(p)
}
#_____________________________________________________________________
wts <- c(4.17, 5.58, 5.18, 6.11, 4.50, 4.61, 5.17, 4.53, 5.33, 5.14)  # the data
n <- length(wts);  S2 <- var(wts);  df <- n - 1                       # statistics
cat("S-sq = ",S2,"\n")
# ___________ 90% CI ______________
Q <- qchisq(p=c(0.95,0.05),df=df)
CI <- df*S2/Q
cat("CI.sigma = ",CI,"\n")
# _____________ Posterior ___________________
Ew <- df*S2/(df-2)
Vw <- (2*df^2*S2^2)/((df-2)^2*(df-4)^2)
w <- seq(0.01,(Ew+10*sqrt(Vw)),length=501)
ifw <- IGamma(v=w)
mode <- w[closest(max(ifw),ifw)]
# ________ deriving the HDR by numerical integration ___________
PHDR <- 0.9                                         # this is the level of HDR we want
step <- 0.5; convergence.test <- 1e3; prop <- 0.9   # scalar variables for the numerical steps
while (convergence.test > 1e-3 ){                   # iterate until the area is very close to 0.9
  horizon <- max(ifw)*prop
  left.ifw  <- subset(ifw,subset=w < mode); lw <- w[w < mode]
  right.ifw <- subset(ifw,subset=w > mode); rw <- w[w > mode]
  xl <- lw[closest(horizon,left.ifw)]
  xr <- rw[closest(horizon,right.ifw)]
  Pint <- integrate(f=IGamma,lower=xl,upper=xr)
  convergence.test <- abs(Pint$value - PHDR)
  adjust.direction <- 2*(0.5 - as.numeric(Pint$value < PHDR))   # -1 if < ; +1 if >
  prop <- prop + adjust.direction*step*convergence.test
}                                                   # end of while loop
HDR <- c(xl,xr)
cat("HDR = ",HDR,"\n")

Chapter 4

F Distribution

4.1 Derivation

Definition 4.1
Suppose S₁² and S₂² are the sample variances for two samples of sizes n₁, n₂
drawn from normal populations with variances σ₁² and σ₂², respectively. The
random variable F is then defined as

    F = S₁²/S₂².                                                        (4.1)

Suppose now that σ₁² = σ₂² (= σ², say); then (4.1) can be written as

    (ν₁/ν₂) F = (ν₁S₁²/σ²) / (ν₂S₂²/σ²) = (ν₁S₁²/(2σ²)) / (ν₂S₂²/(2σ²))  (4.2)

where the middle term is the ratio of 2 independent χ² variates on ν₁, ν₂ degrees of freedom,
or equivalently, the ratio of 2 independent gamma variates with parameters ½ν₁, ½ν₂.
Thus, Y = ν₁F/ν₂ has a derived beta distribution with parameters ½ν₂, ½ν₁. (Statistics
260 study guide, section 7.3.1.) Then (Example 7.5, from Statistics 260 study guide), Y
has p.d.f.

    f(y) = y^{(ν₁/2)−1} / [ (1 + y)^{(ν₁+ν₂)/2} B(½ν₁, ½ν₂) ],   y ∈ [0, ∞)

and g(F) = f(y) |dy/dF|. So

    g(F) = (ν₁F/ν₂)^{(ν₁/2)−1} / [ (1 + ν₁F/ν₂)^{(ν₁+ν₂)/2} B(½ν₁, ½ν₂) ] × (ν₁/ν₂),   F ∈ [0, ∞).

Thus

    g(F) = ν₁^{ν₁/2} ν₂^{ν₂/2} F^{(ν₁/2)−1} / [ B(½ν₁, ½ν₂) (ν₂ + ν₁F)^{(ν₁+ν₂)/2} ],   F ∈ [0, ∞).   (4.3)


This is the p.d.f. of a random variable with an F-distribution. A random variable F which
can be expressed as

    F = (W₁/ν₁) / (W₂/ν₂)                                               (4.4)

where W₁ ~ χ²_{ν₁}, W₂ ~ χ²_{ν₂} and W₁ and W₂ are independent random variables, is said to
be distributed as F(ν₁, ν₂), or sometimes as F_{ν₁,ν₂}. [Note that we have departed from the
procedure of using a capital letter for the random variable and the corresponding small
letter for its observed value, and will use F in both cases here.]

4.2 Properties of the F distribution

Mean
The mean could be found in the usual way, E(F) = ∫₀^∞ F g(F) dF, but the rearrangement
of the integrand to get an integral that can be recognized as unity is somewhat messy, so
we will use another approach.
For W ~ χ²_ν, E(W) = ν, and we will show that E(W⁻¹) = 1/(ν − 2).

    E(W⁻¹) = ∫₀^∞ w⁻¹ e^{−w/2} w^{(ν/2)−1} / [ 2^{ν/2} Γ(ν/2) ] dw
           = [ Γ(½ν − 1) 2^{(ν/2)−1} / ( 2^{ν/2} Γ(½ν) ) ] ∫₀^∞ e^{−w/2} w^{(ν/2)−1−1} / [ 2^{(ν/2)−1} Γ(½ν − 1) ] dw
           = Γ(½ν − 1) / [ 2 Γ(½ν) ]
           = Γ(½ν − 1) / [ 2 (½ν − 1) Γ(½ν − 1) ]
           = 1/(ν − 2).

For independent random variables W₁ ~ χ²_{ν₁} and W₂ ~ χ²_{ν₂}, define

    F = (W₁/ν₁) / (W₂/ν₂) = (ν₂/ν₁) (W₁/W₂).

Then,

    E(F) = (ν₂/ν₁) E(W₁) E(W₂⁻¹) = (ν₂/ν₁) ν₁ · 1/(ν₂ − 2) = ν₂/(ν₂ − 2),  for ν₂ > 2.   (4.5)


Thus if a random variable F ~ F(ν₁, ν₂) then

    E(F) = ν₂/(ν₂ − 2).                                                 (4.6)

Notes:
1. The mean is independent of the value of ν₁ and is always greater than 1.
2. As ν₂ → ∞, E(F) → 1.

Mode
By differentiating g(F) with respect to F it can be verified that the mode of the F distribution
is at

    F = ν₂(ν₁ − 2) / [ ν₁(ν₂ + 2) ]                                     (4.7)

which is always less than 1.
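A quick numerical check of (4.6) and (4.7) is sketched below; the degrees of freedom ν₁ = 5, ν₂ = 10 are chosen arbitrarily for the illustration.

# Sketch: check the mean and mode of F(5, 10) against (4.6) and (4.7)
set.seed(1)
nu1 <- 5; nu2 <- 10
mean(rf(100000, df1 = nu1, df2 = nu2))    # simulated mean, close to nu2/(nu2 - 2) = 1.25
nu2 / (nu2 - 2)
# locate the mode of the density numerically and compare with (4.7)
optimize(function(x) df(x, df1 = nu1, df2 = nu2), interval = c(0, 5), maximum = TRUE)$maximum
nu2 * (nu1 - 2) / (nu1 * (nu2 + 2))       # theoretical mode = 0.5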

Figure 4.1: pdf of F-distribution

Computer Exercise 4.1

Examine the density function of the F-distribution. To do this, plot the density function
for the F-distribution for ν₁ = 5 and ν₂ = 3, 5, 10 for x = 0, 0.01, 0.02, ..., 5. Overlay the
plots on the same axes.

Solution:

#____ Fdensity.R ___________
x <- seq(from=0,to=5,by=0.01)
l <- 1
for (d in c(3,5,10)){
  l <- l+1
  fx <- df(x=x,df1=5,df2=d)
  if(d==3) plot(fx ~ x,type="l",ylab="f(x)")
  else lines(x,fx,lty=l)
}   # end of d loop

[Figure: overlaid densities of F(5,3), F(5,5) and F(5,10).]

Now plot the density function for ν₂ = 10 and ν₁ = 3, 5, 10, again overlaying the plots
on the same axes.

#____ Fdensity.R ___________
x <- seq(from=0,to=5,by=0.01)
l <- 1
for (d in c(3,5,10)){
  l <- l+1
  fx <- df(x=x,df1=d,df2=10)
  if(d==3) plot(fx ~ x,type="l",ylab="f(x)")
  else lines(x,fx,lty=l)
}   # end of d loop

[Figure: overlaid densities of F(3,10), F(5,10) and F(10,10).]

Cumulative Distribution Function

The righthand tail areas of the distribution are tabulated for various ν₁, ν₂. For
P = 5, 2.5, 1, .1, values of F_{.01P} are given where

    P/100 = ∫_{F_{.01P}}^{∞} g(F) dF.

Reciprocal of an F-variate
Let the random variable F ~ F(ν₁, ν₂) and let Y = 1/F. Then Y has p.d.f.

    f(y) = g(F) |dF/dy|
         = ν₁^{ν₁/2} ν₂^{ν₂/2} y^{(ν₂/2)−1} / [ B(½ν₂, ½ν₁) (ν₁ + ν₂y)^{(ν₁+ν₂)/2} ],   y ∈ [0, ∞).

Thus if F ~ F(ν₁, ν₂) and Y = 1/F then Y ~ F(ν₂, ν₁).                   (4.8)
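Result (4.8) can be checked numerically with qf; the degrees of freedom below are arbitrary.

# Sketch: the lower 5% point of F(7, 11) equals the reciprocal of the upper 5% point of F(11, 7)
qf(p = 0.05, df1 = 7, df2 = 11)                          # lower 5% quantile of F(7, 11)
1 / qf(p = 0.05, df1 = 11, df2 = 7, lower.tail = FALSE)  # reciprocal of upper 5% quantile of F(11, 7)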

4.3 Use of F-Distribution in Hypothesis Testing

Let S₁² and S₂² be the sample variances of 2 samples of sizes n₁ and n₂ drawn from normal
populations with variances σ₁² and σ₂². Recall (from (4.1), (4.2)) that it is only if σ₁² = σ₂²
(= σ², say) that S₁²/S₂² has an F distribution. This fact can be used to test the hypothesis
H: σ₁² = σ₂².
If the hypothesis H is true then

    S₁²/S₂² ~ F(ν₁, ν₂)  where ν₁ = n₁ − 1, ν₂ = n₂ − 1.

For the alternative
    A : σ₁² > σ₂²
only large values of the ratio s₁²/s₂² would tend to support it, so a rejection region {F :
F > F_{.01P}} is used (Fig 4.2).
Since only the right hand tail areas of the distribution are tabulated it is convenient to
always use sᵢ²/sⱼ² > 1. That is, always put the larger sample variance in the numerator.

Figure 4.2: Critical region for F-distribution
Example 4.1
For two samples of sizes 8 and 12, the observed variances are .064 and .024 respectively.
Let s₁² = .064 and s₂² = .024.

Solution: Test H: σ₁² = σ₂² against A: σ₁² > σ₂².
Then,
    s₁²/s₂² = .064/.024 = 2.67,  and ν₁ = 7, ν₂ = 11.
The 5% rejection region for F(7,11) is {F : F > 3}. The observed value of 2.67 is less
than 3 and so supports the hypothesis being true. (The observed value is not significant
at the 5% level.)
Note: The exact P(F ≥ 2.67) can be found using R.

    > qf(p=0.05,df1=7,df2=11,lower.tail=F)
    [1] 3
    > pf(q=2.67,df1=7,df2=11,lower.tail=F)
    [1] 0.07

Thus P(F > 2.67) = 0.07, which agrees with the result above.
If the alternative is σ₁² ≠ σ₂², then both tails of the distribution could be used for
rejection regions, so it may be necessary to find the lower critical value. Let F ~ F(ν₁, ν₂).
That is, we want to find a value F₁ so that

    ∫₀^{F₁} g(F) dF = α/2.

Put Y = 1/F so that from (4.8), Y ~ F(ν₂, ν₁). Then

    ∫₀^{F₁} g(F) dF = P(F ≤ F₁) = P(Y > 1/F₁) = P(Y > F₂),  say.

Thus to find the lower α/2% critical value, F₁, first find the upper α/2% critical value, F₂,
from tables of F(ν₂, ν₁), and then calculate F₁ as F₁ = 1/F₂.

Figure 4.3: Upper α/2% point of the F-distribution (ν₂, ν₁)

Figure 4.4: Lower α/2% point of the F-distribution (ν₁, ν₂), with F₁ = 1/F₂

    To find the lower α/2% point of an F distribution with parameters
    ν₁, ν₂, take the reciprocal of the upper α/2% point of an F
    distribution with parameters ν₂, ν₁.                               (4.9)


Example 4.2
Given s₁² = 3.2², n₁ = 11, s₂² = 3.0², n₂ = 17, test the hypothesis H: σ₁² = σ₂² against A:
σ₁² ≠ σ₂², at the 5% significance level.
Solution: Under H, S₁²/S₂² ~ F(10, 16).
From tables F_{2.5%}(10, 16) = 2.99 = F₂.
The lower 2.5% critical point is then found by F₁ = 1/F_{2.5%}(16, 10) = 1/3.5 = .29.
The calculated value of the statistic is 3.2²/3.0² = 1.138, which does not lie in the rejection
region, and so is not significant at the 5% level. Thus the evidence supports the hypothesis
that σ₁² = σ₂².
Of course, so long as we take s₁²/s₂² to be greater than 1, we don't need to worry about
the lower critical value. It will certainly be less than 1.
Computer Exercise 4.2
Use R to find the critical points in Example 4.2.
Solution: We use the qf command.

    > qf(p=c(0.975,0.025),df1=10,df2=16)
    [1] 3.0 0.29
    > pf(q=1.138,df1=10,df2=16,lower.tail=F)
    [1] 0.27

4.4 Pooling Sample Variances

Given 2 unbiased estimates of σ², s₁² and s₂², it is often useful to be able to combine them
to obtain a single unbiased estimate. Assume the new estimator, S², is a linear combination
of S₁² and S₂², and require that S² has the smallest variance of all such linear, unbiased estimates
(that is, it is said to have minimum variance). Let

    S² = a₁S₁² + a₂S₂²,  where a₁, a₂ are positive constants.

Firstly, to be unbiased,

    E(S²) = a₁E(S₁²) + a₂E(S₂²) = σ²(a₁ + a₂) = σ²

which implies that

    a₁ + a₂ = 1.                                                        (4.10)

Secondly, if it is assumed that S₁² and S₂² are independent then

    Var(S²) = a₁² Var(S₁²) + a₂² Var(S₂²)
            = a₁² Var(S₁²) + (1 − a₁)² Var(S₂²)   using (4.10)

The variance of S² is minimised when (writing V(·) for Var(·)),

    dV(S²)/da₁ = 2a₁V(S₁²) − 2(1 − a₁)V(S₂²) = 0.

That is, when

    a₁ = V(S₂²) / [ V(S₁²) + V(S₂²) ],    a₂ = V(S₁²) / [ V(S₁²) + V(S₂²) ].            (4.11)

In the case where the Xᵢ are normally distributed, V(Sⱼ²) = 2σ⁴/(nⱼ − 1) (see Assignment
3, Question 1). Then the pooled sample variance is

    s² = [ (n₁ − 1)s₁²/(2σ⁴) + (n₂ − 1)s₂²/(2σ⁴) ] / [ (n₁ − 1)/(2σ⁴) + (n₂ − 1)/(2σ⁴) ]
       = (ν₁s₁² + ν₂s₂²) / (ν₁ + ν₂)                                                    (4.12)

where ν₁ = n₁ − 1, ν₂ = n₂ − 1.
The above method can be extended to pooling k unbiased estimates sᵢ² of σ². That is,

    s² = (ν₁s₁² + ν₂s₂² + ... + ν_k s_k²) / (ν₁ + ν₂ + ... + ν_k),                       (4.13)

where S² is on Σ_{i=1}^{k} νᵢ (= ν, say) degrees of freedom, and νS²/σ² is distributed as χ²_ν.
(A small R sketch implementing (4.13) is given after the notes below.)
Also the theory applies more generally to pooling unbiased estimates θ̂₁, θ̂₂, ..., θ̂_k of a
parameter θ:

    θ̂ = [ θ̂₁/V(θ̂₁) + θ̂₂/V(θ̂₂) + ... + θ̂_k/V(θ̂_k) ] / [ 1/V(θ̂₁) + 1/V(θ̂₂) + ... + 1/V(θ̂_k) ].   (4.14)

The estimator thus obtained is unbiased and has minimum variance.
Note the following:
(i) s² = ½(s₁² + s₂²) if ν₁ = ν₂;
(ii) E(S²) = σ²;
(iii) The s² in (4.12) is on ν₁ + ν₂ degrees of freedom.
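Here is a small helper implementing (4.13); the function name pool.var is ours, and the sample variances of Example 4.2 are reused purely as an illustration.

# Sketch: pool k unbiased sample variances, weighting by their degrees of freedom, as in (4.13)
pool.var <- function(s2, nu) {
  # s2: vector of sample variances; nu: vector of their degrees of freedom
  sum(nu * s2) / sum(nu)
}
pool.var(s2 = c(3.2^2, 3.0^2), nu = c(10, 16))   # pooled estimate from the Example 4.2 variances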

4.5 Confidence Interval for σ₁²/σ₂²

Given s₁², s₂² are unbiased estimates of σ₁², σ₂² derived from samples of size n₁, n₂ respectively,
from two normal populations, find a 100(1 − α)% confidence interval for σ₁²/σ₂².
Now ν₁S₁²/σ₁² and ν₂S₂²/σ₂² are distributed as independent χ²_{ν₁}, χ²_{ν₂} variates, and

    (S₂²/σ₂²) / (S₁²/σ₁²) = (W₂/ν₂) / (W₁/ν₁) ~ F(ν₂, ν₁).

So

    P( F_{1−α/2}(ν₂, ν₁) < (S₂² σ₁²)/(S₁² σ₂²) < F_{α/2}(ν₂, ν₁) ) = 1 − α.

That is

    P( (S₁²/S₂²) F_{1−α/2}(ν₂, ν₁) < σ₁²/σ₂² < (S₁²/S₂²) F_{α/2}(ν₂, ν₁) ) = 1 − α.

Thus a 100(1 − α)% confidence interval for σ₁²/σ₂² is

    ( (s₁²/s₂²) F_{1−α/2}(ν₂, ν₁),  (s₁²/s₂²) F_{α/2}(ν₂, ν₁) ).          (4.15)

4.6 Comparing parametric and bootstrap confidence intervals for σ₁²/σ₂²

Example 4.3
These data are sway signal energies (×1000) from subjects in 2 groups: Normal and
Whiplash injured. The data, and a program to calculate the confidence interval for σ₁²/σ₂²
defined at equation (4.15), are listed in Table 4.1.
Although the data are presented in 2 blocks, imagine them in a single file called
test1E.txt with the W data under the N data in columns.
We find that using (4.15)

    P( 0.52 < σ₁²/σ₂² < 3.6 ) = 0.95

The bootstrap CI is calculated in R by the script in Table 4.2.
By this method,

    P( 0.26 < σ₁²/σ₂² < 7.73 ) = 0.95

The CIs are much larger because the method has not relied upon the assumptions of
(4.15) and uses only the information contained in the data.


Table 4.1: Confidence interval for variance ratio using the F quantiles

category N, D1:  0.028  0.036  0.041  0.098  0.111  0.150  0.209  0.249  0.360  0.669
                 0.772  0.799  0.984  1.008  1.144  1.154  2.041  3.606  4.407  5.116

category W, D1:  0.048  0.057  0.113  0.159  0.214  0.511  0.527  0.635  0.702  0.823
                 0.943  1.474  1.894  2.412  2.946  3.742  3.834

E1 <- read.table("test1E.txt",header=T)
Sigmas <- tapply(E1$D1,list(E1$category),var)
nu <- table(E1$category) - 1
VR <- Sigmas[1]/Sigmas[2]
Falpha <- qf(p=c(0.975,0.025),df1=nu[1],df2=nu[2] )
CI <- VR/Falpha
> CI
[1] 0.52 3.60

Table 4.2: R code to use the boot package to calculate the CI of σ₁²/σ₂²

library(boot)
var.ratio <- function(E1,id){              # a user supplied function
  yvals <- E1[[2]][id]                     # to calculate the statistic of interest
  vr <- var(yvals[E1[[1]]=="N"]) / var(yvals[E1[[1]]=="W"])
  return(vr)
}                                          # end of the user supplied function
doBS <- boot(E1,var.ratio,999)
bCI <- boot.ci(doBS,conf=0.95,type=c("perc","bca"))
print(bCI)

> bCI
BOOTSTRAP CONFIDENCE INTERVAL CALCULATIONS
Based on 999 bootstrap replicates
CALL :
boot.ci(boot.out = boot(E1, var.ratio, 999), conf = 0.95, type = c("perc","bca") )
Intervals :
Level      Percentile          BCa
95%     ( 0.26,  7.73 )   ( 0.58, 22.24 )
Calculations and Intervals on Original Scale
Some BCa intervals may be unstable

Chapter 5

t-Distribution

5.1 Derivation

Let X₁, X₂, ..., X_n be a random sample from a N(μ, σ²) distribution. Then, provided σ²
is known, the random variable

    Z = (X̄ − μ) / (σ/√n)

is distributed N(0,1).
In practice σ² is usually not known and is replaced by its unbiased estimate S², so that in
place of Z we define

    T = (X̄ − μ) / (S/√n).

We need to find the probability distribution of this random variable. T can be written as

    T = [ (X̄ − μ)/(σ/√n) ] / √(S²/σ²) = Z / √(W/ν),                      (5.1)

where ν = n − 1, Z ~ N(0, 1), W = νS²/σ² ~ χ²_ν. Furthermore, Z and W are independent since
X̄ and S² are independent (given the Xᵢ ~ N(μ, σ²)). Now T² = Z²/(W/ν), so that its
distribution is the same as that of the ratio of two independent chi-squares (on 1 and ν degrees
of freedom, each divided by its degrees of freedom), that is, F_{1,ν}. This fact enables us to find
the pdf of the random variable T, which is said to have Student's t distribution (after W.S.
Gossett, who first derived the distribution (1908) and wrote under the pseudonym of Student).
So we have the following definition:


Definition 5.1
A random variable has a t-distribution on ν degrees of freedom (or with parameter ν)
if it can be expressed as the ratio of Z to √(W/ν) where Z ~ N(0, 1)
and W (independent of Z) ~ χ²_ν.

Theorem 5.1
A random variable T which has a t-distribution on ν d.f. has pdf

    f(t) = Γ(½(1 + ν)) / [ √(πν) Γ(ν/2) (1 + t²/ν)^{(1+ν)/2} ],   t ∈ (−∞, ∞).          (5.2)

Proof Putting ν₁ = 1, ν₂ = ν in the pdf for the F-distribution (4.3), we have

    g(F) = ν^{ν/2} F^{−1/2} / [ B(½, ½ν) (ν + F)^{(1+ν)/2} ],   F ∈ [0, ∞).

Defining a r.v. T by F = T², with inverse T = √F, it can be seen that to every value of
F in [0, ∞) there correspond 2 values of t in (−∞, ∞). So,

    g(F) = 2 f(t) |dt/dF|

and

    f(t) = ½ g(F) |dF/dt|
         = ½ g(t²) · 2t
         = ν^{ν/2} t^{−1} t / [ B(½, ½ν) (ν + t²)^{(1+ν)/2} ]
         = Γ(½(ν + 1)) / [ Γ(½) Γ(½ν) √ν (1 + t²/ν)^{(1+ν)/2} ].

But Γ(½) = √π, which completes the proof.

5.2 Properties of the t-Distribution

Graph
The graph of f(t) is symmetrical about t = 0 since f(−t) = f(t), unimodal, and f(t) → 0
as t → ±∞. It resembles the graph of the normal distribution but the tails are lower and
the central peak higher than for a normal curve of the same mean and variance. This is
illustrated in the figure below for ν = 4.
Note: The density functions in the figure were found and plotted using R.


#________ tdist.R ________
x <- seq(-5,5,length=101)
ftx <- dt(x,df=4)
fnx <- dnorm(x,mean=0,sd=sqrt(2) )
plot(ftx ~ x,type="l")
lines(x,fnx,lty=2)

[Figure: t density (ν = 4, solid line) overlaid with the N(0, 2) density (dashed line).]

Plots of the t distribution can also be done in Rcmdr:
Distributions → Continuous distributions → t distribution → Plot t distribution
You are required to enter the df and check whether to plot the density or the distribution
function.

Special Cases
(i) A special case occurs when ν = 1. This is called the Cauchy distribution and it has
pdf

    f(t) = 1 / [ π(1 + t²) ],   t ∈ (−∞, ∞).

Check that the mean and variance of this distribution do not exist.

(ii) It can be shown that as ν → ∞, f(t) → (1/√(2π)) e^{−t²/2}, the pdf of a standardized normal
distribution. To see this note that

    lim_{ν→∞} (1 + t²/ν)^{−(ν+1)/2} = lim_{ν→∞} (1 + t²/ν)^{−1/2} · lim_{ν→∞} (1 + t²/ν)^{−ν/2} = 1 · e^{−t²/2}.

Then, using Stirling's approximation for n!, that is,

    n! ≃ (2π)^{1/2} n^{n+½} e^{−n},

we have

    lim_{ν→∞} Γ(½(ν + 1)) / [ √ν Γ(ν/2) ] = (2)^{−1/2}.

Mean and Variance

Because of symmetry, the mean, median and mode coincide, with E(T) = 0. Also,
Var(T) = E(T²) = E(F) = ν/(ν − 2) for ν > 2. Note that, as ν → ∞, Var(T) → 1.


Cumulative Distribution Function

The distribution function for the t-distribution is tabulated via

    P/100 = ∫_{−∞}^{t_{1−.01P}} f(t) dt = P(T ≤ t_{1−.01P}).

Note that for ν = ∞ the t-distribution becomes the standard normal distribution.

Example 5.1
For T ~ t₁₀ find P(T > 2.23).

    pt(q=2.23,df=10,lower.tail=F)
    [1] 0.025

In Rcmdr, the menu is
Distributions → Continuous distributions → t distribution → t probabilities
Into the GUI you enter the quantile (i.e. 2.23 in this case) and the df.

Example 5.2
For T ~ t₆ find P(|T| > 1.94).

    > 2*pt(q=1.94,df=6,lower.tail=F)
    [1] 0.1

Example 5.3
For T ~ t₈ find t_c such that P(|T| > t_c) = .05.

    > qt(p=0.025,df=8,lower.tail=F)
    [1] 2.3

Distributions → Continuous distributions → t distribution → t quantiles

Sampling Distributions
The χ², F and t distributions are often referred to as sampling distributions because
they are distributions of statistics arising when sampling from a normal distribution.

5.3 Use of t-Distribution in Interval Estimation

In Chapter 2 we studied the problems of getting a confidence interval for the mean μ, and
testing hypotheses about μ, when σ² was assumed known. In practice σ² is usually not
known and must be estimated from the data, and it is the t-distribution that must be
used to find a confidence interval for μ and test hypotheses about μ.
In this section we will derive a 100(1 − α)% confidence interval for the unknown parameter μ.


One-sample Problem
Given X₁, X₂, ..., X_n is a random sample from a N(μ, σ²) distribution where σ² is unknown,
then

    T = (X̄ − μ) / (S/√n) ~ t_{n−1}.                                     (5.3)

Then, defining t_{ν,α} by P(T > t_{ν,α}) = α where T ~ t_ν, we have

    P( t_{ν,1−α/2} < (X̄ − μ)/(S/√n) < t_{ν,α/2} ) = 1 − α.              (5.4)

That is,

    P( X̄ − t_{ν,α/2} S/√n < μ < X̄ + t_{ν,α/2} S/√n ) = 1 − α.           (5.5)

Now rearrange the terms on the LHS of (5.4) as follows,

    P( t_{ν,1−α/2} < (X̄ − μ)/(S/√n) < t_{ν,α/2} )
      = P( t_{ν,1−α/2} S/√n < X̄ − μ < t_{ν,α/2} S/√n )
      = P( −X̄ + t_{ν,1−α/2} S/√n < −μ < −X̄ + t_{ν,α/2} S/√n )
      = P( X̄ − t_{ν,1−α/2} S/√n > μ > X̄ − t_{ν,α/2} S/√n )     (inequality directions not conventional)
      = P( X̄ − t_{ν,α/2} S/√n < μ < X̄ − t_{ν,1−α/2} S/√n )     (inequality directions conventional)

A 100(1 − α)% confidence interval for μ is

    ( x̄ − t_{ν,α/2} s/√n,  x̄ − t_{ν,1−α/2} s/√n ).                       (5.6)

Note how in (5.6) the upper tail quantile, t_{ν,α/2}, is subtracted from the sample mean to
calculate the lower limit, and the lower tail quantile, t_{ν,1−α/2}, is subtracted to calculate the
upper limit. This arose by reversing the inequalities when making the transform from −μ to μ.
By the symmetry of the t-distribution, t_{ν,α/2} = −t_{ν,1−α/2}: the lower tail quantile is a
negative number and the upper tail quantile has the same magnitude but is positive. So you
would get the same result as (5.6) if you calculated

    ( x̄ − t_{ν,α/2} s/√n,  x̄ + t_{ν,α/2} s/√n )


which is often how we think of it. However, it is very important that the true relationship be understood and known because it will be a critical point when we examine the
bootstrap-t where the symmetry does not hold.
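Before the worked example, here is a direct computation of (5.6) with qt; the data vector is the skull data of Example 5.4 below, used here only to illustrate the quantile bookkeeping.

# Sketch: 95% CI for mu computed directly from (5.6) (skull data of Example 5.4)
x  <- c(5.22, 5.59, 5.61, 5.17, 5.27, 6.06, 5.72, 4.77, 5.57, 6.33)
n  <- length(x)
se <- sd(x) / sqrt(n)
tq <- qt(p = c(0.975, 0.025), df = n - 1)   # upper-tail quantile first, then lower-tail quantile
mean(x) - tq * se                           # subtracting the upper quantile gives the lower limit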
Example 5.4
The length (in cm) of skulls of 10 fossil skeletons of an extinct species of bird were measured
with the following results.
5.22, 5.59, 5.61, 5.17, 5.27, 6.06, 5.72, 4.77, 5.57, 6.33.
Find a 95% CI for the true mean length of skulls of this species.
Solution: (Computer solution; ignore the t-test output except for the confidence interval.)
skulls <- data.frame(length=c(5.22,5.59,5.61,5.17,5.27,6.06,5.72,4.77,5.57,6.33) )
t.test(skulls$length,alternative="two.sided",mu=0,conf.level=0.95)
data: skulls$length
t = 38.69, df = 9, p-value = 2.559e-11
alternative hypothesis: true mean is not equal to 0
95 percent confidence interval:
5.20 5.85
sample estimates:
mean of x
5.53

A 95% CI is calculated by default. To obtain a CI with a different level of confidence


(say 99%) we would use the command: t.test(x, conf.level=0.99).
The Rcmdr menus mostly work only when there is an active data set and the data are
organised into a data frame. To do the above in Rcmdr:
(i) Make skulls the active data set, either by entering the data using Data → New data set,
or by using a script such as the above (skulls <- data.frame( .... )), Submit, and then
Data → Active data set → Select active data set.
(ii) Statistics → Means → Single-sample t-test

Two-sample Problem
Let us now consider the two-sample problem where X₁, X₂, ..., X_{n₁} and Y₁, Y₂, ..., Y_{n₂} are
independent random samples from N(μ₁, σ²) and N(μ₂, σ²) distributions respectively. Now
the random variable X̄ − Ȳ is distributed as N( μ₁ − μ₂, σ²(1/n₁ + 1/n₂) ). That is,

    [ X̄ − Ȳ − (μ₁ − μ₂) ] / √( σ²(1/n₁ + 1/n₂) ) ~ N(0, 1).

If σ² is unknown, its minimum variance unbiased estimate S² is given in (4.12). Consider
the random variable T defined by

    T = [ X̄ − Ȳ − (μ₁ − μ₂) ] / √( S²(1/n₁ + 1/n₂) )                     (5.7)

where

    S² = (ν₁S₁² + ν₂S₂²)/(ν₁ + ν₂),  and  (ν₁ + ν₂)S²/σ² ~ χ²_{ν₁+ν₂}.    (5.8)

Rewriting T with a numerator of [X̄ − Ȳ − (μ₁ − μ₂)]/√(σ²(1/n₁ + 1/n₂)) and a denominator
of √(S²/σ²), we see that T can be expressed as the ratio of a N(0,1) variate to the square
root of an independent chi-square variate divided by its degrees of freedom. Hence it has a
t-distribution with ν₁ + ν₂ = n₁ + n₂ − 2 degrees of freedom.
We will now use (5.5) to find a confidence interval for μ₁ − μ₂.
Given X₁, X₂, ..., X_{n₁} is a random sample from N(μ₁, σ²) and Y₁, Y₂, ..., Y_{n₂} is an
independent sample from N(μ₂, σ²), and with t_{ν,α} defined as in (5.3), we have

    P( −t_{ν,α/2} < [ X̄ − Ȳ − (μ₁ − μ₂) ] / √( S²(1/n₁ + 1/n₂) ) < t_{ν,α/2} ) = 1 − α.

Rearranging, the 100(1 − α)% CI for μ₁ − μ₂ is

    ( x̄ − ȳ − t_{ν,α/2} s √(1/n₁ + 1/n₂),  x̄ − ȳ + t_{ν,α/2} s √(1/n₁ + 1/n₂) )         (5.9)

where S² is defined in (5.8) and ν = ν₁ + ν₂ = n₁ + n₂ − 2.


Example 5.5
The cholesterol levels of seven male and six female turtles were found to be:

    Male    226 228 232 215 223 216 223
    Female  231 231 218 236 223 237

Find a 99% CI for μ_m − μ_f.
Solution:
It will be assumed the variances are equal. See Chapter 4 for the method of testing this using R.

x <- c(226,228,232,215,223,216,223)
y <- c(231,231,218,236,223,237)
t.test(x,y,var.equal=T,conf.level=0.99)

        Two Sample t-test
data:  x and y
t = -1.6, df = 11, p-value = 0.1369
alternative hypothesis: true difference in means is not equal to 0
99 percent confidence interval:
 -17.8   5.7
sample estimates:
mean of x mean of y
      223       229

A 99% CI for μ₁ − μ₂ is (−17.8, 5.7).

If you were to use Rcmdr, you would need to organise the data into a data frame like this:

x <- c(226,228,232,215,223,216,223)
y <- c(231,231,218,236,223,237)
turtles <- data.frame(chol=c(x,y),sex = c(rep("M",7),rep("F",6)))
> turtles
   chol sex
1   226   M
2   228   M
3   232   M
4   215   M
5   223   M
6   216   M
7   223   M
8   231   F
9   231   F
10  218   F
11  236   F
12  223   F
13  237   F

Make that data frame active and then use Statistics → Means → Independent samples t-test.

5.4 Use of t-distribution in Hypothesis Testing

One-sample Problem
Given X₁, X₂, ..., X_n is a random sample from N(μ, σ²) where both parameters are unknown,
we wish to test the hypothesis H : μ = μ₀. Using (5.2) we can see that
(a) for the alternative H₁ : μ ≠ μ₀, values of x̄ close to μ₀ support the hypothesis
being true, while if |x̄ − μ₀| is too large there is evidence the hypothesis may be
incorrect. That is, reject H₀ at the 100α% significance level if

    |x̄ − μ₀| / (s/√n) > t_{ν,α/2}.

Figure 5.1: Critical Region for t-distribution: Two Sided.  Rejection Region: |t| > t_{ν,α/2}

(b) For H₁ : μ > μ₀, only large values of (x̄ − μ₀) tend to cast doubt on the hypothesis.
That is, reject H₀ at the 100α% significance level if

    (x̄ − μ₀) / (s/√n) > t_{ν,α}.

An alternative H₁ : μ < μ₀ would be treated similarly to (b) but with lower critical
value −t_{ν,α}.

Figure 5.2: Critical Region for t-distribution: One Sided.  Rejection Region: t > t_{ν,α}
Example 5.6
A certain type of rat shows a mean weight gain of 65 gms during the first 3 months of
life. A random sample of 12 rats were fed a particular diet from birth. After 3 months the
following weight gains were recorded: 55, 62, 54, 57, 65, 64, 60, 63, 58, 67, 63, 61. Is there
any reason to believe that the diet has resulted in a change of weight gain?
Solution: Let X be the weight gain in 3 months and assume that X ~ N(μ, σ²). The
hypothesis to be tested is H : μ = 65.0 and the appropriate alternative is H₁ : μ ≠ 65.0.
Then x̄ = 60.75, s² = 16.38 and

    t = (60.75 − 65.0) / √(16.38/12) = −3.64.

For a 2-tailed test with α = .05, t₁₁,.₀₂₅ ≈ 2.20.

wt <- c(55,62,54,57,65,64,60,63,58,67,63,61)
xbar <- mean(wt);  s <- sd(wt);  n <- length(wt)
tT <- (xbar-65)/(s/sqrt(n) );  cat("t = ",tT,"\n")
t =  -3.6
qt(p=0.025,df=(n-1) )
[1] -2.2
pt(q=tT,df=(n-1))
[1] 0.0020

Our calculated value is less than −t₁₁,.₀₂₅ and so is significant at the 5% level. Furthermore,
t₁₁,.₀₀₅ ≈ 3.11 and our calculated value lies in the 1% critical region for the two-tailed
test, so H is rejected at the 1% level. A better (and more modern) way to say this is that
if the hypothesis is true then the probability of an observed t-value as extreme (in either
direction) as the one obtained is less than 1%. Thus there is strong evidence to suggest
that the hypothesis is incorrect and that this diet has resulted in a change in the mean
weight gained.

> t.test(wt,alternative="two.sided",mu=65)
        One Sample t-test
data:  wt
t = -3.6, df = 11, p-value = 0.003909
alternative hypothesis: true mean is not equal to 65
95 percent confidence interval:
 58 63
sample estimates:
mean of x
       61

Comment
The procedure adopted in the above example is a generally accepted one in hypothesis
testing problems. That is, it is customary to start with α = .05, and if the hypothesis
is rejected at the 5% level (this is equivalent to saying that the observed value of the
statistic is significant at the 5% level), then consider α = .01. If the observed value is
right out in the tail of the distribution, it may fall in the 1% critical region (one- or
two-tailed, whichever is appropriate). A conclusion claiming significance at the 1% level
carries more weight than one claiming significance at the 5% level. This is because in the
latter case we are in effect saying that, on the basis of the data we have, we will assert
that H is not correct; in making such a statement we admit that 5 times in 100 we would
reject H wrongly. In the former case however (significance at the 1% level), we realize
that there is only 1 chance in 100 that we have rejected H wrongly. The commonly accepted
values of α to consider are .05, .01, .001. For the t, F and χ² distributions, critical values
can be read from the tables for both 1- and 2-tailed tests for these values of α.

Two-sample Problem
Given X₁, X₂, ..., X_{n₁} and Y₁, Y₂, ..., Y_{n₂} are independent random samples from N(μ₁, σ²)
and N(μ₂, σ²) respectively, we may wish to test H : μ₁ − μ₂ = δ₀, say. Using (5.3) we can
see that, under H₀,

    ( X̄ − Ȳ − δ₀ ) / ( S √(1/n₁ + 1/n₂) ) ~ t_{n₁+n₂−2}.

So H₀ can be tested against one- or two-sided alternatives.
Note however, that we have assumed that both populations have the same variance
σ², and this in general is not known. More generally, let X₁, X₂, ..., X_{n₁} be a random
sample from N(μ₁, σ₁²) and Y₁, Y₂, ..., Y_{n₂} be an independent random sample from N(μ₂,
σ₂²) where μ₁, μ₂, σ₁², σ₂² are unknown, and suppose we wish to test H : μ₁ − μ₂ = δ₀. From
the samples of sizes n₁, n₂ we can determine x̄, ȳ, s₁², s₂². We first test the preliminary
hypothesis that σ₁² = σ₂², and if the evidence supports this, then we regard the populations
as having a common variance σ². So the procedure is:
(i) Test H₀ : σ₁² = σ₂² (= σ²) against H₁ : σ₁² ≠ σ₂², using the fact that under H₀,
S₁²/S₂² ~ F_{ν₁,ν₂}. [This is often referred to as testing sample variances for compatibility.]
A two-sided alternative and a two-tailed test is always appropriate here: we don't have any
prior information about the variances. If this test is survived (that is, if H₀ is not
rejected), proceed to (ii).
(ii) Pool s₁² and s₂² using s² = (ν₁s₁² + ν₂s₂²)/(ν₁ + ν₂), which is now an estimate of σ²
based on ν₁ + ν₂ degrees of freedom.
(iii) Test H₀ : μ₁ − μ₂ = δ₀ against the appropriate alternative using the fact that, under H₀,

    ( X̄ − Ȳ − δ₀ ) / ( S √(1/n₁ + 1/n₂) ) ~ t_{ν₁+ν₂}.
Example 5.7
A large corporation wishes to choose between two brands of light bulbs on the basis of
average life. Brand 1 is slightly less expensive than brand 2. The company would like to
buy brand 1 unless the average life for brand 2 is shown to be significantly greater. Samples
of 25 light bulbs from brand 1 and 17 from brand 2 were tested with the following results:
Brand 1 (X):
997, 973, 977, 1051, 1029, 934, 1007, 1020, 961, 948, 954, 939, 987, 956, 874, 1042, 1010,
942, 1011, 962, 993, 1042, 1058, 992, 979
Brand 2 (Y):
973, 970, 1018, 1019, 1004, 1009, 983, 1013, 968, 1025, 935, 1018, 1033, 992, 1037, 964, 1067
We want to test H₀ : μ₁ = μ₂ against H₁ : μ₁ < μ₂ where μ₁, μ₂ are the means from
brands 1 and 2 respectively.
Solution: For the above data, x̄ = 985.5 hours, ȳ = 1001.6 hours, s₁ = 43.2, s₂ = 32.9.
(i) Firstly test H₀ : σ₁² = σ₂² against a two-sided alternative, noting that under H₀,
S₁²/S₂² ~ F₂₄,₁₆.
Then, s₁²/s₂² = 1.72 and from the F-tables, the critical value for a two-tailed test with
α = .05 is F_{2.5%}(24, 16) = 2.63. The calculated value is not significant (that is, does
not lie in the critical region) so there is no reason to doubt H₀.

(ii) Hence, pooling sample variances,

    s² = (ν₁s₁² + ν₂s₂²)/(ν₁ + ν₂) = (24 × 1866.24 + 16 × 1082.41)/(24 + 16) = 1552.71.

(iii) Assuming the hypothesis μ₁ = μ₂ is true, (X̄ − Ȳ)/(S√(1/n₁ + 1/n₂)) ~ t_{n₁+n₂−2}. But,

    t = (985.5 − 1001.6) / √(1552.71 × .098824) = −16.1/12.387 = −1.30.

For a 1-tailed test with α = .05 (with left-hand tail critical region), the critical value
is t₄₀,.₉₅ = −t₄₀,.₀₅ = −1.68. The observed value is not in the critical region so is
not significant at the 5% level, and there is insufficient evidence to cast doubt on
the truth of the hypothesis. That is, the average life for brand 2 is not shown to be
significantly greater than that for brand 1.
Computer Solution:

x <- c(997, 973, 977, 1051, 1029, 934, 1007, 1020, 961, 948, 954, 939, 987,
       956, 874, 1042, 1010, 942, 1011, 962, 993, 1042, 1058, 992, 979)
y <- c(973, 970, 1018, 1019, 1004, 1009, 983, 1013, 968, 1025, 935, 1018, 1033,
       992, 1037, 964, 1067)
t.test(x,y,var.equal=T)

        Two Sample t-test
data:  x and y
t = -1.3, df = 40, p-value = 0.2004
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
 -41.2   8.9
sample estimates:
mean of x mean of y
      986      1002

Notice here the actual P-value is given as 20%. The 95% confidence interval for μ₁ − μ₂
is also given. Both the t-test and the confidence interval are based on the pooled standard
deviation, which is not reported and would have to be calculated separately if needed.

Comment
When the population variances are unequal and unknown, the methods above for finding
confidence intervals for μ₁ − μ₂, or for testing hypotheses concerning μ₁ − μ₂, are not
appropriate. The problem of unequal variances is known as the Behrens-Fisher problem,
and various approximate solutions have been given but are beyond the scope of this
course. One such approximation (the Welch t-test) can be obtained in R by omitting the
var.equal=T option from t.test.
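For example, reusing the light-bulb samples x and y of Example 5.7, the Welch test is obtained with:

# Welch (unequal variances) t-test: the same call as before but without var.equal = TRUE
t.test(x, y)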

5.5 Paired-sample t-test

Consider two random samples X₁, X₂, ..., X_n and Y₁, Y₂, ..., Y_n from normal distributions
with means μ₁, μ₂ respectively, but where the two samples consist of n logically paired
observations (Xᵢ, Yᵢ). For example, Xᵢ could be a person's pre-diet weight while Yᵢ could
be the weight of the same person after being on the diet for a specified period. (That is,
the two samples are not independent.)
Define the random variables

    Dᵢ = Xᵢ − Yᵢ,   i = 1, ..., n.

The hypothesis of interest is then

    H₀ : μ_diff = μ₁ − μ₂ = δ₀.

Then, under the hypothesis, D₁, D₂, ..., D_n can be regarded as a random sample from a
normal distribution with mean δ₀ and variance σ² (unknown). Let

    D̄ = Σ_{i=1}^{n} Dᵢ/n  and  s_d² = Σ_{i=1}^{n} (dᵢ − d̄)²/(n − 1).

If the hypothesis H₀ is true, E(D̄) = δ₀ and Var(D̄) = σ²/n. So

    (D̄ − δ₀) / √(S_d²/n) ~ t_{n−1}.

Thus the hypothesis is tested by comparing (d̄ − δ₀)/(s_d/√n) with the appropriate critical
value from tables.
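A minimal hand computation of this statistic, using the apple-yield differences of Example 5.8 below purely as an illustration:

# Sketch: paired t statistic computed directly from the differences d_i (data of Example 5.8)
d <- c(7, 8, -6, 5, 12, -2, 9, 3)
t.stat <- mean(d) / (sd(d) / sqrt(length(d)))
t.stat                                                   # about 2.1, as t.test(..., paired = TRUE) reports
pt(q = t.stat, df = length(d) - 1, lower.tail = FALSE)   # one-sided P-value, about 0.035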
One important reason for using paired observations is to eliminate effects in which
there is no interest. Suppose that two teaching methods are to be compared by using 50
students divided into classes with 25 in each. One way to conduct the experiment is to
assign randomly 25 students to each class and then compare average scores. But if one
group happened to have the better students, the results may not give a fair comparison
of the two methods. A better procedure is to pair the students according to ability (as
measured by IQ from some previous test) and assign at random one of each pair to each
class. The conclusions then reached are based on differences of paired scores which measure
the effect of the different teaching methods.
When extraneous effects (for example, students' ability) are eliminated, the scores on
which the test is based are less variable. If the scores measure both ability and difference
in teaching methods, the variance will be larger than if the scores reflect only teaching
method, as each score then has two sources of variation instead of one.


Example 5.8
The following table gives the yield (in kg) of two varieties of apple, planted in pairs at eight
(8) locations. Let Xᵢ and Yᵢ represent the yield for varieties 1, 2 respectively at location
i = 1, 2, ..., 8.

    i     1    2    3    4    5    6    7    8
    xᵢ  114   94   64   75  102   89   95   80
    yᵢ  107   86   70   70   90   91   86   77
    dᵢ    7    8   −6    5   12   −2    9    3

Test the hypothesis that there is no difference in mean yields between the two varieties,
that is, test H₀ : μ_X − μ_Y = 0 against H₁ : μ_X − μ_Y > 0.
Solution:

x <- c(114,94,64,75,102,89,95,80)
y <- c(107,86,70,70,90,91,86,77)
t.test(x,y,paired=T, alternative="greater")

        Paired t-test
data:  x and y
t = 2.1, df = 7, p-value = 0.03535
alternative hypothesis: true difference in means is greater than 0
95 percent confidence interval:
 0.5 Inf
sample estimates:
mean of the differences
                    4.5

The probability of observing T > 2.1 under H₀ is 0.035, which is a sufficiently small
probability to reject H₀ and conclude that the observed difference is due to variety and
not just random sampling.
In Rcmdr, the data are organised in a data frame in pairs:

> apples <- data.frame(yld1=x,yld2=y)
> apples
  yld1 yld2
1  114  107
2   94   86
3   64   70
4   75   70
5  102   90
6   89   91
7   95   86
8   80   77

Make the data set active and then use Statistics → Means → Paired t-test.


Comments
1. Note that for all the confidence intervals and tests in this chapter, it is assumed the
samples are drawn from populations that have a normal distribution.
2. Note that the violation of the assumptions underlying a test can lead to incorrect
conclusions being made.

5.6 Bootstrap T-intervals

We can make accurate intervals without depending upon the assumption of normality
made at (5.1) by using the bootstrap. The method is named bootstrap-t.
The procedure is as follows:
1. Estimate the statistic (e.g. the sample mean), θ̂, and its standard error, se, and
determine the sample size, n.
2. Nominate the number of bootstrap samples B, e.g. B = 199.
3. Loop B times:
   - Generate a bootstrap sample x*(b) by taking a sample of size n with replacement.
   - Calculate the bootstrap sample statistic, θ̂*(b), and its standard error, se*(b).
   - Calculate
         T*(b) = ( θ̂*(b) − θ̂ ) / se*(b)
     and save this result.
4. Estimate the bootstrap-t quantiles from T*(1), T*(2), ..., T*(B). Denote these as t*_{α/2}
   and t*_{1−α/2}.
5. The 100(1 − α)% CI for θ is

       ( θ̂ − t*_{1−α/2} · se,  θ̂ − t*_{α/2} · se ).

The point made at equation (5.6) about selecting the correct quantiles to make the
confidence limits now becomes important because the symmetry no longer holds.


Example 5.9
In Example 5.4 the 95% CI for the mean was calculated to be (5.2, 5.9). We now calculate
the CI using bootstrap-t.

x <- c(5.22,5.59,5.61,5.17,5.27,6.06,5.72,4.77,5.57,6.33)
n <- length(x)
skulls <- data.frame(length=c(5.22,5.59,5.61,5.17,5.27,6.06,5.72,4.77,5.57,6.33) )
theta <- mean(skulls$length)
se.theta <- sd(skulls$length)/sqrt(n)
nBS <- 199
Tstar <- numeric(nBS)
i <- 1
while( i < (nBS+1)){                     # looping 1 to nBS
  x.star <- sample(skulls$length,size=n,replace=T)
  Tstar[i] <- (mean(x.star) - theta ) / ( sd(x.star)/sqrt(n) )
  i <- i+1
}                                        # end of the while loop
bootQuantiles <- round(quantile(Tstar,p=c(0.025,0.975)),2)
cat("Bootstrap T quantiles = ",bootQuantiles,"\n")
CI <- theta - se.theta*rev(bootQuantiles)
cat("CI = ",CI,"\n")

Bootstrap T quantiles =  -2.37 2.41
CI =  5.2 5.83

Note in the code the use of the rev() function to reverse the quantiles for calculating
the CI. Also observe that the quantiles are not of the same magnitude: symmetry is absent.
However, the asymmetry is not great.
Example 5.10
In Example 5.7 the 95% CI for the mean difference was determined as
x̄ − ȳ ∓ t₄₀,.₀₂₅ √(1552.71 (1/25 + 1/17)), that is, (−41, 9). What is the bootstrap-t CI for
the mean difference?

The mndiff function in the code below computes the mean and the variance of the difference.
The user must supply the variance calculations for boot.ci to calculate the
bootstrap-t (or Studentized) CIs.
x <- c(997, 973, 977, 1051, 1029, 934, 1007, 1020, 961, 948, 954, 939, 987,
       956, 874, 1042, 1010, 942, 1011, 962, 993, 1042, 1058, 992, 979)
y <- c(973, 970, 1018, 1019, 1004, 1009, 983, 1013, 968, 1025, 935, 1018, 1033,
       992, 1037, 964, 1067)
lights <- data.frame(Brand=c(rep("X",length(x)),rep("Y",length(y))),hours=c(x,y) )

library(boot)
mndiff <- function(lights,id){
  yvals <- lights[[2]][id]
  delta <- mean(yvals[lights[[1]]=="X"]) - mean(yvals[lights[[1]]=="Y"])
  v <- var(yvals[lights[[1]]=="X"])/length(yvals[lights[[1]]=="X"]) +
       var(yvals[lights[[1]]=="Y"])/length(yvals[lights[[1]]=="Y"])
  return(c(delta,v))
}
doBS <- boot(lights,mndiff,999)
bCI <- boot.ci(doBS,conf=0.95,type=c("perc","stud"))
print(bCI)

Intervals :
Level      Studentized          Percentile
95%     (-55.81,  -7.99 )   (-24.34, 24.42 )
Calculations and Intervals on Original Scale

Collating the results for the CI of the mean difference (and including the CI for variances
not equal):

    t with variances equal      (−41, 9)
    t with variances unequal    (−40, 7.7)
    bootstrap-t                 (−56, −8)
    percentile-t                (−24, 24)

Although the findings from each technique are the same, that there is insufficient evidence
to conclude that the means are different, nevertheless there are disparities which we
may attempt to understand.
Density plots of the 2 samples give some clues as to why the results might differ,
Figure 5.3.

Figure 5.3: Densities of lifetimes of brands X & Y light bulbs. Normal densities are given by the dashed line.
Example 5.11
The densities of the data used in Example 2.7 are used to demonstrate how the assumption
of normality is a strong condition for parametric t-tests.
The densities of the energies from the 2 groups (N & W) are plotted in Figure 5.4.

Figure 5.4: Densities of energies of sway signals from Normal and whiplash subjects

The confidence intervals of the mean difference are:

    parametric t with variances unequal    (−1385, 224)
    bootstrap-t                            (−2075, −302)
    percentile-t                           (−815, 791)

Only the bootstrap-t suggests that the mean difference is unlikely to be zero.
The parametric t loses out because the underlying assumptions of normality do not hold.
The percentile bootstrap is unreliable for small sample sizes (< 100).

Chapter 6

Analysis of Count Data

6.1 Introduction

This chapter deals with hypothesis testing problems where the data collected is in the form
of frequencies or counts.
In section (6.2) we study a method for testing the very general hypothesis that a
probability distribution takes on a certain form, for example, normal, Poisson, exponential,
etc. The hypothesis may or may not completely prescribe the distribution. That is, it may
specify the value of the parameter(s), or it may not. These are called Goodness-of-Fit
Tests. Section (6.3) is then concerned with the analysis of data that is classified according
to two attributes in a Contingency Table. Of interest here is whether the two attributes
are associated.

6.2 Goodness-of-Fit Tests

Consider the problem of testing if a given die is unbiased. The first step is to conduct
an experiment such as throwing the die n times and counting how many 1s, 2s, . . . , 6s
occur. If Y is the number that shows when the die is thrown once and if the die is unbiased
then Y has a rectangular distribution. That is,
P (Y = i) = pi = 1/6, i = 1, 2, 3, 4, 5, 6.
Testing the hypothesis the die is unbiased is then equivalent to testing the hypothesis
H0 : p1 = p2 = p3 = p4 = p5 = p6 = 1/6.
Let Ai be the event: i occurs (on a given throw). Then P (Ai ) = pi = 1/6, under H0
(that is, if and only if, H0 is true). The random variables Yi , i = 1, . . . , 6 are now defined
as follows. Let Yi be the number of times in the n throws that Ai occurs. If the die is
unbiased the distribution of (Y1 , . . . , Y6 ) is then multinomial with parameters p1 , . . . , p6 all
equal to 1/6.
Example 6.1
Suppose that the die is thrown 120 times and the observed frequencies in each category
(denoted by o1 , . . . , o6 ) are

    i    1   2   3   4   5   6
    oᵢ  15  27  18  12  25  23

Find the expected frequencies.


Solution: For a multinomial distribution, the expected frequencies (which will be
denoted by ei ), are given by E(Yi ) = npi , i = 1, . . . , 6, (Theorem 5.2, STAT260), and
E(Yi ) = 120 16 = 20 if H0 is true. The question to be answered is whether the observed
frequencies are close enough to those expected under the hypothesis, to be consistent with
the given hypothesis. The following theorem provides a test to answer this question.
Theorem 6.1
Given

n!
py1 py2 . . . pykk
y1 ! . . . yk ! 1 2
where
of times in n trials that event Ai (which has probability pi ) occurs,
Pk Yi is the
Pnumber
k
2
i=1 pi = 1,
i=1 yi = n, then the random variable X defined by
P (Y1 = y1 , . . . , Yk = yk ) =

X =

k
X
(Yi npi )2
i=1

npi

is distributed approximately as 2k1 for n large. This is often expressed as


k
k
X
(observed frequencyexpected frequency)2 X (oi ei )2
X =
=
.
expected frequency
ei
i=1
i=1
2

Outline of Proof (for k = 2)

We can write X² as

    X² = (Y₁ − np₁)²/(np₁) + (Y₂ − np₂)²/(np₂)
       = (Y₁ − np₁)²/(np₁) + (n − Y₁ − n(1 − p₁))²/(n(1 − p₁))
       = [ (Y₁ − np₁)²/n ] [ 1/p₁ + 1/(1 − p₁) ]
       = (Y₁ − np₁)²/(np₁q₁).

Now Y₁ is distributed as bin(n, p₁). Thus (approximately)

    X = (Y₁ − np₁)/√(np₁q₁) ~ N(0, 1)

for n large. Hence (again approximately)

    X² = (Y₁ − np₁)²/(np₁q₁) ~ χ²₁                                      (6.1)

for n large.
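A simulation sketch of the k = 2 case (n = 200 and p₁ = 0.3 are arbitrary choices): the statistic in (6.1) should behave like χ²₁ for large n.

# Sketch: for binomial counts, (Y1 - n p1)^2 / (n p1 q1) is approximately chi-square on 1 df
set.seed(1)
n <- 200; p1 <- 0.3
y1 <- rbinom(10000, size = n, prob = p1)
x2 <- (y1 - n * p1)^2 / (n * p1 * (1 - p1))
round(quantile(x2, probs = c(0.5, 0.95)), 2)   # simulated quantiles
round(qchisq(p = c(0.5, 0.95), df = 1), 2)     # chi-square(1) quantiles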


Comments
1. Note that the multinomial distribution only arises in the above problem in a secondary
way, when we count the number of occurrences of various events A₁, A₂, etc.,
where {A₁, ..., A_k} is a partition of the sample space.
2. If the underlying distribution is discrete, the event Aᵢ usually corresponds to the
random variable taking on a particular value in the range space (see MSW, example
14.2). When the underlying distribution is continuous the events {Aᵢ} have to be
defined by subdividing the range space (see Examples 6.3 and 6.4 that follow). The
method of subdivision is not unique, but in order for the chi-square approximation to
be reasonable the cell boundaries should be chosen so that npᵢ ≥ 5 for all i. We
want enough categories to be able to see what's happening, but not so many that
npᵢ < 5 for any i.
A case can be made for choosing equal-probability categories but this is only one
possibility.
3. The fact that the X² defined in (6.1) has approximately a chi-square distribution with
k − 1 degrees of freedom under H₀ is only correct if the values of the parameters
are specified in stating the hypothesis. (See MSW, end of 14.2.) If this is not so, a
modification has to be made to the degrees of freedom. In fact, we can still say that

    X² = Σ_{i=1}^{k} (Yᵢ − npᵢ)²/(npᵢ)   is distributed approximately as χ²_ν

when the cell probabilities depend on unknown parameters θ₁, θ₂, ..., θ_r (r < k),
provided that θ₁, ..., θ_r are replaced by their maximum likelihood estimates and
provided that 1 degree of freedom is deducted for each parameter so estimated. That
is, with k categories and r parameters estimated, the df would be ν = k − r − 1.
4. Note that irrespective of the form of the alternative to the hypothesis, only large
values of X² provide evidence against the hypothesis. The closer the agreement
between the expected values and observed values, the smaller the value of X². Small
values of X² thus tend to indicate that the fit is good.
5. In R,
> obsv <- c(15,27,18,12,25,23)
> chisq.test(obsv,p=rep(1/6,6) )
Chi-squared test for given probabilities
data: obsv
X-squared = 8.8, df = 5, p-value = 0.1173


Example 6.1 (cont.) Let us return to the first example and test the hypothesis H₀ :
p₁ = p₂ = ... = p₆ = 1/6 against the alternative H₁ that pᵢ ≠ 1/6 for some i.
Solution: The observed frequencies (oᵢ) and those expected under H (eᵢ) are

    i    1   2   3   4   5   6
    oᵢ  15  27  18  12  25  23
    eᵢ  20  20  20  20  20  20

So the observed value of X² is

    x² = Σ_{i=1}^{6} (oᵢ − eᵢ)²/eᵢ = [ (−5)² + 7² + (−2)² + (−8)² + 5² + 3² ] / 20 = 8.8.

Now the parameters p₁, ..., p₆ are postulated by the hypothesis as 1/6, so the df for χ² is
6 − 1 = 5. Under H₀, X² ~ χ²₅ and the hypothesis would be rejected for large values of x².
The upper 5%ile is χ²₅,.₀₅ = 11.1 (qchisq(df=5,p=0.05,lower.tail=F)), so the
calculated value is not significant at the 5% level. There is insufficient evidence to cast
doubt on the hypothesis, so we conclude the die is most likely unbiased.


Example 6.2
Merchant vessels of a certain type were exposed to risk of accident through heavy weather,
ice, fire, grounding, breakdown of machinery, etc. for a period of 400 days. The number
of accidents to each vessel, say Y, may be considered as a random variable. For the data
reported below, is the assumption that Y has a Poisson distribution justified?

    Number of accidents (y)               0     1    2    3   4  5  6
    Number of vessels with y accidents  1448  805  206   34   4  2  1

Solution: Note that the parameter λ in the Poisson distribution is not specified and we
have to estimate it by its mle, λ̂ = ȳ, which is the average number of accidents per vessel.
Thus,

    ȳ = (total number of accidents) / (total number of vessels)
      = [ (0 × 1448) + (1 × 805) + ... + (5 × 2) + (6 × 1) ] / [ 1448 + 805 + ... + 2 + 1 ]
      = 1351/2500
      = .5404.

We now evaluate P(Y = y) = e^{−.5404}(.5404)^y/y! for y = 0, 1, ..., to obtain

    p₀ = P(Y = 0) = e^{−.5404} = .5825
    p₁ = P(Y = 1) = .5404 e^{−.5404} = .3149

Similarly, p₂ = .0851, p₃ = .0153, p₄ = .0021, p₅ = .00022, p₆ = .00002.
Recall that the χ² approximation is poor if the expected frequency of any cell is less than
about 5. In our example, E(Y₅) = 2500 × .00022 = 0.55 and E(Y₆) = 0.05. This
means that the last 3 categories should be grouped into a category called Y ≥ 4, for which
p₄ = P(Y ≥ 4) = .0022.
The expected frequencies (under H₀) are then given by E(Yᵢ) = 2500 pᵢ and are tabulated
below.

    observed  1448     805     206     34     7
    expected  1456.25  787.25  212.75  38.25  5.50

[Note: Do not round off the expected frequencies to integers.]

    x² = Σ (o − e)²/e = (1448 − 1456.25)²/1456.25 + ... + (1.5)²/5.5 = 1.54.

Since there are 5 categories and we estimated one parameter, the random variable X² is
distributed approximately as χ²₃. The upper 5% critical value is 7.81,


> qchisq(p=0.05,df=3,lower.tail=F)
[1] 7.81

so there is no reason to doubt the truth of H0 and we would conclude that a Poisson
distribution does provide a reasonable fit to the data.
Computer Solution: First enter the number of accidents into x and the observed frequencies into counts.

x <- 0:6
counts <- c(1448,805,206,34,4,2,1)
# Calculate rate
lambda <- sum(x*counts)/sum(counts)
# Merge cells with E(X) < 5
counts[5] <- sum(counts[5:7])
# Poisson probabilities
probs <- dpois(x=0:4,lambda=lambda)
# ensure that the probabilities sum to 1, no rounding error
probs[5] <- 1 - sum(probs[1:4])
# Chi-square test of frequency for Poisson probabilities
chisq.test(counts[1:5],p=probs)

        Chi-squared test for given probabilities
data:  counts[1:5]
X-squared = 1.4044, df = 4, p-value = 0.8434

Notice the value for x² is slightly different. R uses more accurate values for the probabilities
and also retains more decimal places for its calculations, and so has less rounding
error than we managed with a calculator. (Note also that chisq.test does not know that λ
was estimated from the data, so it reports df = 4; the correct df here is 3.) The conclusions
reached are however the same.


Example 6.3
Let the random variable T be the length of life of a certain brand of light bulb. It is hypothesised that T has a distribution with pdf

  f(t) = α e^(−αt),  t > 0.

Suppose that 160 bulbs are selected at random and tested, the time to failure being recorded for each. That is, we have t1, t2, ..., t160. Show how to test that the data come from an exponential distribution.
Solution: A histogram of the data might be used to give an indication of the distribution.
Suppose this is as shown below.
Figure 6.1: Time to Failure (histogram of the failure times, bins of width 50 from 0 to 500)

The time axis is divided into 10 categories, with cell boundaries at 50, 100, ..., 500, and we might ask what are the expected frequencies associated with these categories if T does have an exponential distribution. Let p1, p2, ..., p10 denote the probabilities of the categories, where

  p1 = ∫₀⁵⁰ α e^(−αt) dt = 1 − e^(−50α)
  p2 = ∫₅₀¹⁰⁰ α e^(−αt) dt = e^(−50α) − e^(−100α), etc.

A value for α is needed to evaluate these probabilities.

If α is assumed known (for example, it may be specified in the hypothesis, which may state that T has an exponential distribution with α = α₀), then the df for the χ² distribution is k − 1, where there are k categories.


If α is not known it has to be estimated from the data. The mle of α is 1/t̄ and we use this value to calculate the pi and hence the ei. The degrees of freedom now become k − 2, since one parameter, α, has been estimated.

Computer Exercise 6.1


Suppose 160 light bulbs are tested with the following results.
  Interval   0-50  50-100  100-150  150-200  200-250  250-300  300-350  350-400  400-450  >450
  Observed    60    31      19       18       9        10       3        5        2        3

Test the hypothesis that the failure times follow an exponential distribution.
Solution: For the raw data, α̂ = 1/t̄, which is the rate parameter of the exponential distribution in R. We will approximate t̄ by assuming all observations fall at the midpoint of the interval they are in, with the last three observations (greater than 450) assumed to be at 475.

#_________ ChisqTestExp.R ______________
obsv.freq <- c(60,31,19,18,9,10,3,5,2,3)
total <- sum(obsv.freq)
interval.ends <- seq(from=50,to=500,by=50)
med.times <- interval.ends - 25
tbar <- sum(med.times*obsv.freq)/sum(obsv.freq)
alpha <- 1/tbar
# ________ Cumulative probabilities at interval.ends ___________
probs <- pexp(q=interval.ends,rate=alpha)
probs[10] <- 1                               # ensure sum(probs) = 1
probs.interval <- c(probs[1],diff(probs) )   # first prob is P(0 < x < 50)
expt.freq <- total*probs.interval
# __________ bulk the low expectation intervals ________
too.low <- expt.freq < 5
probs.interval[7] <- sum(probs.interval[too.low])
obsv.freq[7] <- sum(obsv.freq[too.low])
CHI.test <- chisq.test(obsv.freq[1:7],p=probs.interval[1:7])
P.gt.X2 <- pchisq(df=5,q=CHI.test$statistic,lower.tail=F)
cat("X2 = ",CHI.test$statistic,"  P(Chi > X2) = ",P.gt.X2,"\n")

X2 =  4.2   P(Chi > X2) =  0.52

Since α was estimated, X² has a chi-square distribution on 5 df and P(X² > 4.2) = 0.52. Hence it is likely the length of life of the light bulbs is distributed exponentially.


Example 6.4
Show how to test the hypothesis that a sample comes from a normal distribution.
Solution: If the parameters are not specified they must be estimated by

  μ̂ = x̄,   σ̂² = Σ_{i=1}^{n} (xi − x̄)²/n,

where n is the size of the sample.


The x-axis is then partitioned to form k categories and the number of observations counted in each. A suggested way is as follows.
1. Define suitable probabilities for which quantiles can be calculated. These should allow for expected frequencies > 5, so if n = 100 (say), then bins corresponding to p = 0.125, 0.25, ..., 0.875 will have sufficient resolution yet retain expected frequencies > 5.
2. Calculate the quantiles for these probabilities using qnorm().
3. Calculate the observed frequencies for bins defined by the quantiles with hist(..., plot=F).
4. Do a χ² goodness-of-fit test using the observed frequencies and the proposed probabilities for each bin.

For example, Figure 6.2 depicts a histogram, observed frequencies and the postulated normal distribution. The bins are chosen such that the expected probability under the normal distribution for each bin interval is 1/8.
Figure 6.2: Partition with Equal Probabilities in Each Category (histogram of the data with the fitted normal density; vertical axis is density)


The following R script implements the above steps.


# read the data
rx <- scan()            # this bit is virtual
# proposed normal distribution
xbar <- 10; s <- 4
# select suitable probabilities
probs <- seq(0.125,0.875,0.125)
# quantiles corresponding to these probabilities
qx <- qnorm(p=probs,mean=xbar,sd=s)
# use hist(plot=F, ...) to get observed bin frequencies
# note use of min(rx) and max(rx) to complete bin definition
histo <- hist(rx,breaks=c(min(rx),qx,max(rx)),plot=F)
obsv.freq <- histo$counts
# the chi square test
CHI.test <- chisq.test(obsv.freq,p=rep(1/8,8) )
print(CHI.test)

        Chi-squared test for given probabilities
data:  obsv.freq
X-squared = 4.8, df = 7, p-value = 0.6844

Since two parameters are estimated, the degrees of freedom for the χ² should be k − r − 1 = 8 − 2 − 1 = 5.
The above output has not taken into account that 2 parameters have been estimated, so the correct χ² test needs to be done.
corrected.df <- length(obsv.freq)-3
pchisq(q=CHI.test$statistic,df=corrected.df,lower.tail=F)
X-squared
     0.52

For this example, P(χ² > 4.8) = 0.52.

6.3 Contingency Tables

Suppose a random sample of n individuals is classified according to 2 attributes, say A


(rows) and B (columns). We wish to determine whether 2 such attributes are associated
or not. Such a two-way table of frequencies is called a contingency table.
Example 6.5
Suppose a random sample of 300 oranges was classified according to colour (light, medium,
dark) and sweetness (sweet or not sweet) then the resulting table

              light  medium  dark
  sweet        115     55     30   200
  not sweet     35     45     20   100
               150    100     50   300

is an example of a contingency table.


For such a table it is of interest to consider the hypothesis H0 : colour and sweetness
are independent, against the alternative H1 : colour and sweetness are associated.

6.3.1 Method

Assume that in the population there is a probability pij that an individual selected at random will fall in both categories Ai and Bj. The probabilities are shown in the following table.

           B1        B2        B3        Sum
  A1       p11       p12       p13       p1. = P(A1)
  A2       p21       p22       p23       p2. = P(A2)
  Sums     p.1       p.2       p.3       p.. = 1
         = P(B1)   = P(B2)   = P(B3)

The probabilities in the margins can be interpreted as follows:

  pi. = P(item chosen at random will be in category Ai) = Σj pij
  p.j = P(item chosen at random will be in category Bj) = Σi pij

for i = 1, 2 and j = 1, 2, 3. If the categories are independent then

  P(item is in both categories Ai and Bj) = P(item in category Ai) × P(item in category Bj).


The hypothesis can now be written as

  H0: pij = pi. p.j,   i = 1, 2;  j = 1, 2, 3.      (6.2)

Let the random variable Nij be the number (out of n) in category Ai ∩ Bj and nij be its observed value. Let ni. = Σj nij and n.j = Σi nij, so the table of observed frequencies is

  n11   n12   n13   n1.
  n21   n22   n23   n2.
  n.1   n.2   n.3   n.. = n

Now the set of random variables {Nij} has a multinomial distribution and if the pij are postulated, the expected frequencies will be eij = E(Nij) = n pij and

  X² = Σ_{i=1}^{2} Σ_{j=1}^{3} (oij − eij)²/eij = Σ_{i=1}^{2} Σ_{j=1}^{3} (Nij − n pij)²/(n pij).

Under the hypothesis of independence, X² becomes

  Σ_{i=1}^{2} Σ_{j=1}^{3} (Nij − n pi. p.j)²/(n pi. p.j)

and will be distributed approximately as χ² on ν df, where ν = 6 − 1.

Usually the pi., p.j are not known and have to be estimated from the data. Now p1., p2., ..., p.3 are parameters in a multinomial distribution and the mles are

  p̂1. = n1./n,  p̂2. = n2./n,
  p̂.1 = n.1/n,  p̂.2 = n.2/n,  p̂.3 = n.3/n.

Under H and using the mles, the expected frequencies take the form

  êij = n p̂i. p̂.j = ni. n.j / n

and X² becomes

  X² = Σ_{i=1}^{2} Σ_{j=1}^{3} (Nij − ni. n.j/n)² / (ni. n.j/n).      (6.3)

Now consider degrees of freedom. Once three of the expected frequencies eij have been determined, the other expected frequencies can be determined from the marginal totals since they are assumed fixed. Thus the degrees of freedom is given by 6 − 1 − 3 = 2.
In the more general case of r rows and c columns, the number of parameters to be estimated is (r − 1) + (c − 1), so the degrees of freedom is rc − 1 − (r − 1 + c − 1) = (r − 1)(c − 1).
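The expected frequencies êij = ni. n.j / n and the statistic (6.3) can be computed in a couple of lines of R. This is an illustrative sketch only (the object names are ours); chisq.test() applied to the table of counts produces the same statistic.

# Expected counts and X^2 for an r x c table under independence.
tab <- matrix(c(115,35, 55,45, 30,20), nrow=2)   # e.g. the Oranges counts of Example 6.5
n   <- sum(tab)
e   <- outer(rowSums(tab), colSums(tab))/n       # e_ij = n_i. n_.j / n
X2  <- sum((tab - e)^2/e)
df  <- (nrow(tab)-1)*(ncol(tab)-1)
c(X2=X2, df=df, p=pchisq(X2, df, lower.tail=FALSE))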


Example 6.5(cont)
Test the hypothesis H0 :colour and sweetness are independent.
Solution: The expected frequencies are:

              light   medium   dark
  sweet        100    66.67    33.33
  not sweet     50    33.33    16.67

The hypothesis can be stated as: P(sweet orange) is the same whether the orange is light, medium or dark.
Then,

  x² = Σ (oi − ei)²/ei = (115 − 100)²/100 + ... + (20 − 16.67)²/16.67 = 13.88.

The probability of getting a χ² value at least as large as 13.9 is
> pchisq(q=13.9,df=2,lower.tail=F)
[1] 0.00096

which indicates that the data suggest strongly (p < 0.001) that colour and sweetness
are not independent.
Computer Solution: The observed values are entered into a data frame and the xtabs command is used to make the 2-way table.
In making this 2-way table, the χ² test of independence of rows and columns is also calculated and saved in the summary.
#__________ Oranges.R __________
Oranges <- expand.grid(sweet=c("Y","N"), colour=c("light","medium","dark"))
Oranges$frequencies <- c(115,35, 55,45, 30,20)
orange.tab <- xtabs(frequencies ~ sweet + colour ,data=Oranges)
print(summary(orange.tab))
       colour
sweet  light medium dark
    Y    115     55   30
    N     35     45   20
Call: xtabs(formula = frequencies ~ sweet + colour, data = Oranges)
Number of cases in table: 300
Number of factors: 2
Test for independence of all factors:
Chisq = 14, df = 2, p-value = 0.001


The Rcmdr menus are also very convenient for getting the χ² test of independence for factors in contingency tables.
Choose Statistics → Contingency tables → Enter and analyze two-way table.
Change the numbers of rows and columns and provide row and column names.
A script similar to the above is generated and the output is the χ² test.
            light medium dark
sweet         115     55   30
not sweet      35     45   20

> .Test <- chisq.test(.Table, correct=FALSE)

        Pearson's Chi-squared test
data:  .Table
X-squared = 13.875, df = 2, p-value = 0.0009707


6.4 Special Case: 2 × 2 Contingency Table

While the 2 × 2 table can be dealt with as indicated in Section 6.3 for an r × c contingency table, it is sometimes treated as a separate case because the X² statistic can be expressed in a simple form without having to make up a table of expected frequencies. Suppose the observed frequencies are as follows:

        B1      B2
  A1    a       b       a+b
  A2    c       d       c+d
        a+c     b+d     n

Under the hypothesis that the methods of classification are independent, the expected frequencies are

        B1              B2
  A1    (a+b)(a+c)/n    (a+b)(b+d)/n
  A2    (a+c)(c+d)/n    (c+d)(b+d)/n

The X² statistic is then given by

  x² = [a − (a+b)(a+c)/n]² / [(a+b)(a+c)/n] + [b − (a+b)(b+d)/n]² / [(a+b)(b+d)/n]
       + [c − (a+c)(c+d)/n]² / [(a+c)(c+d)/n] + [d − (c+d)(b+d)/n]² / [(c+d)(b+d)/n]

     = [(ad − bc)²/n] [ 1/((a+c)(a+b)) + 1/((a+b)(b+d)) + 1/((a+c)(c+d)) + 1/((c+d)(b+d)) ]

     = (ad − bc)² n / [(a+b)(a+c)(b+d)(c+d)],  on simplification.

[Note that the number of degrees of freedom is (r-1)(c-1) where r = c = 2.]

Yates' Correction for Continuity

The distribution of the random variables {Nij} in a 2 × 2 contingency table is necessarily discrete whereas the chi-square distribution is continuous. It has been suggested that the approximation may be improved by using as the statistic

  Xc² = (|ad − bc| − n/2)² n / [(a+b)(a+c)(c+d)(b+d)],      (6.4)

which is distributed approximately as chi-square on 1 df. The ½n is known as Yates' continuity correction and (6.4) arises by increasing or decreasing the observed values in the contingency table by ½ as follows.
If ad < bc, replace a by a + ½, d by d + ½, b by b − ½, c by c − ½. Then we have, for the table of observed frequencies:

  a + ½    b − ½
  c − ½    d + ½

If ad > bc the + and − signs are reversed.
Note that the marginal totals are as before. Writing out Σ (oij − eij)²/eij leads to (6.4). However, there is no general agreement among statisticians that Yates' continuity correction is useful, as it does not necessarily improve the approximation and may be worse.
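The shortcut formula and its Yates-corrected version are easy to check numerically. The small function below is an illustrative sketch (the name x2.2x2 is ours, not from the notes); chisq.test() with correct=FALSE or correct=TRUE gives the same values.

x2.2x2 <- function(a,b,c,d, yates=FALSE){
  n   <- a+b+c+d
  num <- abs(a*d - b*c) - if(yates) n/2 else 0
  num^2 * n / ((a+b)*(a+c)*(b+d)*(c+d))
}
# Using the class-test counts of Example 6.6 below (a=70, b=75, c=35, d=20):
x2.2x2(70,75,35,20)              # about 3.77
x2.2x2(70,75,35,20, yates=TRUE)  # about 3.2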
Example 6.6
A class of 200 students was classified as in the accompanying frequency table. Test the hypothesis that the pass rate is the same for males and females.

           Passed   Failed
  Male       70       75     145
  Female     35       20      55
            105       95     200

Solution: Now

  x² = (ad − bc)² n / [(a+b)(c+d)(a+c)(b+d)] = (1400 − 2625)² × 200 / (145 × 55 × 105 × 95) = 3.77.

But if the continuity correction is used, we get xc² = 3.2. Since P(χ²₁ > 3.2) = 0.07, our result is not significant and we conclude that there is no significant difference between the proportions of male and female students passing the examination.
Computer Solution
#______________ Class.R _________
Class <- expand.grid(Gender=c("M","F"),Grade=c("P","F") )
Class$freq <- c(70,35, 75,20)
Two.way <- xtabs(freq ~ Gender + Grade,data=Class)
print(chisq.test(Two.way,correct=F))
print(chisq.test(Two.way,correct=T))
        Pearson's Chi-squared test
data:  Two.way
X-squared = 3.8, df = 1, p-value = 0.05209

        Pearson's Chi-squared test with Yates' continuity correction
data:  Two.way
X-squared = 3.2, df = 1, p-value = 0.07446

Note that in all this we assume that n individuals are chosen at random, or we have n independent trials, and then we observe in each trial which of the r × c events has occurred.

6.5 Fisher's Exact Test

The method for 2 × 2 contingency tables in Section 6.4 is really only appropriate for n large, and the method described in this section, known as Fisher's exact test, should be used for smaller values of n, particularly if a number of the expected frequencies are less than 5. (A useful rule of thumb is that no more than 10% of expected frequencies in a table should be less than 5 and no expected frequency should be less than 1.)
Consider now all possible 2 × 2 contingency tables with the same set of marginal totals, say a + c, b + d, a + b and c + d, where a + b + c + d = n.

        B1      B2
  A1    a       b       a+b
  A2    c       d       c+d
        a+c     b+d     n

We can think of this problem in terms of the hypergeometric distribution as follows. Given n observations which result in (a + c) of type B1 [and (b + d) of type B2], and (a + b) of type A1 [and (c + d) of type A2], what is the probability that the frequencies in the 4 cells will be

  a  b
  c  d

This is equivalent to considering a population of size n consisting of 2 types: (a + c) B1's and (b + d) B2's. If we choose a sample of size a + b, we want to find the probability that the sample will consist of a B1's and b B2's. That is,

  P(a B1's, b B2's) = C(a+c, a) C(b+d, b) / C(n, a+b) = (a+c)! (b+d)! (a+b)! (c+d)! / (a! b! c! d! n!).      (6.5)

Now if the methods of classification are independent, the expected number of type A1B1 is (a+b)(a+c)/n. Fisher's exact test involves calculating the probability of the observed set of frequencies and of others more extreme, that is, further from the expected value. The hypothesis H is rejected if the sum of these probabilities is significantly small. Due to the calculations involved it is really only feasible to use this method when the numbers in the cells are small.


Example 6.7
Two batches of experimental animals were exposed to infection under comparable conditions. One batch of 7 were inoculated and the other batch of 13 were not. Of the
inoculated group, 2 died and of the other group 10 died. Does this provide evidence of the
value of inoculation in increasing the chances of survival when exposed to infection?
Solution: The table of observed frequencies is

                   Died   Survived
  Not inoculated    10        3       13
  Inoculated         2        5        7
                    12        8       20

The expected frequencies, under the hypothesis that inoculation has no effect, are

                   Died   Survived
  Not inoculated    7.8       5.2     13
  Inoculated        4.2       2.8      7
                   12         8       20

Note that e21 = 4.2 (= 12 × 7/20) and the remaining expected frequencies are calculated by subtraction from the marginal totals.
Now the number in row 1, column 1 (10) is greater than the expected value for that cell. Those more extreme in this direction are 11, 12, with the corresponding tables

  11  2        12  1
   1  6         0  7

Using (6.5) to find the probability of the observed frequencies or others more extreme in the one direction, we have

  P = Σ_{x=10}^{12} C(12, x) C(8, 13 − x) / C(20, 13)
    = [ C(12,10) C(8,3) + C(12,11) C(8,2) + C(12,12) C(8,1) ] / C(20,13)
    = 404/7752
    ≈ .052.

Thus, if H0 is true, the probability of getting the observed frequencies or others more extreme in the one direction is about 5 in 100. If we wished to consider the alternative as two-sided, we would need to double this probability.
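The same one-sided tail probability can be obtained from the hypergeometric functions in R; the line below is a small cross-check, not part of the original solution.

# P(X >= 10), where X = number of deaths among the 13 not-inoculated animals,
# sampled from a population of 12 deaths and 8 survivors.
sum(dhyper(10:12, m=12, n=8, k=13))      # about 0.052
# fisher.test() (used in the computer solution below) reports the two-sided p-value.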


Note that if the chi-square approximation (6.4) is used for a 2 × 2 contingency table, then that accounts for deviations from expectation in both directions, since the deviations are squared. If we had used (6.4) in the above example we would expect to get a probability of about 0.10. Carrying out the calculations we get x² = 2.65 and, from chi-square tables, P(W > 2.65) is slightly more than 0.10 (where W ~ χ²₁).
Computer Solution
#____________ Fisher.R ___________
Infection <- expand.grid(Inoculated=c("N","Y"),Survive=c("N","Y") )
Infection$freq <- c(10,2, 3,5)
Ftab <- xtabs(freq ~ Inoculated + Survive,data=Infection)
print(fisher.test(Ftab))
        Fisher's Exact Test for Count Data
data: Ftab
p-value = 0.06233
alternative hypothesis: true odds ratio is not equal to 1
95 percent confidence interval:
0.74 117.26
sample estimates:
odds ratio
7.3

6.6 Parametric Bootstrap X²

The theme of the previous sections was whether the distribution of observed counts could
be considered as random samples from a multinomial distribution with known probabilities
and total sample size. The test statistic was a measure of the difference between observed
counts and the counts expected from the hypothesised multinomial distribution. This
statistic was regarded as coming from a χ² distribution,

  X² = Σ (O − E)²/E ~ χ².

We can get a non-parametric estimate of the distribution of X² under H0 and then compare the observed X² to decide whether it could have arisen with a reasonable probability (p > 0.05) if H0 were true.
We term this procedure parametric bootstrap because the random sample is drawn from
a parametric distribution, multinomial in this case, although the distribution of the test
statistic X 2 is determined non-parametrically from the bootstrap samples.
The steps are:
(i) Calculate the test statistic from the observed data, X²_obs.
(ii) Determine the probabilities, π_ij, under H0.
(iii) Make bootstrap samples and calculate X² for each sample (for j in 1:nBS):
      1. Sample from the multinomial distribution with sample size N and the probabilities π_ij.
      2. Calculate X²_j and record this statistic (i.e. save it in a vector).
(iv) Plot the empirical distribution function of X² and estimate the 95% quantile.
(v) Compare X²_obs with the distribution of X² | H0.


Example 6.8
Revisit Example 6.1 where the observed counts of faces from 120 throws were:

  i    1   2   3   4   5   6
  oi  15  27  18  12  25  23

H0: π1 = π2 = π3 = π4 = π5 = π6 = 1/6

The bootstrap distribution of X 2 is found by


(i) sampling from a multinomial(size=120,p=rep(1,6)/6) distribution and
(ii) calculating X 2 for each sample.
Obs <- c(15,27,18,12,25,23)
Total <- sum(Obs)
p0 <- rep(1,6)/6
X2.obs <- as.numeric(chisq.test(Obs,p=p0)$statistic)
nBS <- 100
X2 <- numeric(nBS)
for (j in 1:nBS){
simdata <- rmultinom(size=Total,p=p0,n=1)
X2[j] <- chisq.test(simdata,p=p0)$statistic
}
# end of the j loop
Q <- quantile(X2,prob=0.95 )
plot(ecdf(X2),las=1)

The results are shown in Figure 6.3. The plot indicates that P(X² > 8.8 | H0) = 0.15, compared with 0.12 for the χ² test.
Figure 6.3: Bootstrap distribution of X² for testing unbiasedness of die throws (empirical cdf of X², with X²_obs and the 0.95 quantile marked).


Example 6.9
The bootstrap test of independence of factors in a contingency table is illustrated using
the data in Example 6.5.

              light  medium  dark
  sweet        115     55     30
  not sweet     35     45     20

The hypothesis of independence is H0: pij = pi. p.j and the marginal probabilities are estimated from the data,

  p̂i. = ni./N  (row probabilities),        p̂.j = n.j/N  (column probabilities).

The cell probabilities can be calculated by matrix multiplication,

  [ p11 p12 p13 ]   [ p1. ]
  [ p21 p22 p23 ] = [ p2. ] [ p.1  p.2  p.3 ]
rnames <- c("sweet","not sweet")
cnames <- c("light","medium","dark")
Oranges <- expand.grid(Sweetness=rnames,Colour=cnames)
Oranges$counts <- c(115,35, 55,45, 30,20)
Ftab <- xtabs(counts ~ Sweetness + Colour ,data=Oranges)
X2.obs <- summary(Ftab)$statistic
#____________ bootstrap ___________
nBS <- 1000
X2 <- numeric(nBS)
Total <- sum(Ftab)
col.p <- margin.table(Ftab,margin=2)/Total
row.p <- margin.table(Ftab,margin=1)/Total
prob.indep <- row.p %*% t(col.p)
# matrix multiplication of row & column probabilities
for (j in 1:nBS){
simdata <- matrix(rmultinom(prob=prob.indep,size=Total,n=1),nrow=length(rnames),ncol=length(cnames) )
X2[j] <- chisq.test(simdata)$statistic }
# end of j loop
Q <- quantile(X2,prob=0.95 )
plot(ecdf(X2),las=1)

Figure 6.4 displays the bootstrap distribution function of X 2 |H0 with the observed
value of X 2 and the 95%ile. This shows that P (X 2 > 14|H0 ) < 0.01 as before.
Comment: These examples do not show any advantage of the bootstrap over the parametric χ² tests. However, understanding of the technique is a platform for Bayesian Markov Chain Monte Carlo methods (later on).
A Bayesian analysis is not presented here because the setting for that is called log-linear models, which requires some more statistical machinery. This will be encountered in the unit on Linear Models.


Figure 6.4: Bootstrap distribution of X² for testing independence of factors in the Oranges contingency table (empirical cdf of X², with the 0.95 quantile and X²_obs marked).

Chapter 7  Analysis of Variance

7.1 Introduction

In chapter 5 the problem of comparing the population means of two normal distributions was considered when it was assumed they had a common (but unknown) variance σ². The hypothesis that μ1 = μ2 was tested using the two-sample t-test. Frequently, experimenters face the problem of comparing more than two means and need to decide whether the observed differences among the sample means can be attributed to chance, or whether they are indicative of real differences between the true means of the corresponding populations. The following example is typical of the type of problem we wish to address in this chapter.
Example 7.1
Suppose that random samples of size 4 are taken from three (3) large groups of students
studying Computer Science, each group being taught by a different method, and that these
students then obtain the following scores in an appropriate test.
  Method A   71  75  65  69
  Method B   90  80  86  84
  Method C   72  77  76  79

The means of these 3 samples are respectively 70, 85 and 76, but the sample sizes are
very small. Does this data indicate a real difference in effectiveness of the three teaching
methods or can the observed differences be regarded as due to chance alone?
Answering this and similar questions is the object of this chapter.

7.2 The Basic Procedure

Let μ1, μ2, μ3 be the true average scores which students taught by the 3 methods should get on the test. We want to decide on the basis of the given data whether or not the hypothesis

  H: μ1 = μ2 = μ3   against   A: the μi are not all equal

is reasonable.


The three samples can be regarded as being drawn from three (possibly different) populations. It will be assumed in this chapter that the populations are normally distributed and have a common variance σ². The hypothesis will be supported if the sample means are all nearly the same, and the alternative will be supported if the differences among the sample means are large. A precise measure of the discrepancies among the X̄s is required and the most obvious measure is their variance.
Two Estimates of σ²
Since each population is assumed to have a common variance, the first estimate of σ² is obtained by pooling s1², s2², s3², where si² is the ith sample variance. Recalling that we have x̄1 = 70, x̄2 = 85, x̄3 = 76, then

  s1² = [(71 − 70)² + (75 − 70)² + (65 − 70)² + (69 − 70)²]/3 = 52/3.

Similarly, s2² = 52/3 and s3² = 26/3.
Pooling sample variances, we then have

  s² = (ν1 s1² + ν2 s2² + ν3 s3²)/(ν1 + ν2 + ν3) = 14.444.

Since this estimate is obtained from within each individual sample it will provide an unbiased estimate of σ² whether the hypothesis of equal means is true or false, since it measures variation only within each population.
The second estimate of σ² is found using the sample means. If the hypothesis that μ1 = μ2 = μ3 is true then the sample means can be regarded as a random sample from a normally distributed population with common mean μ and variance σ²/4 (since X̄ ~ N(μ, σ²/n), where n is the sample size). Then if H is true we obtain

  s_x̄² = V̂ar(X̄) = Σ_{i=1}^{3} (x̄i − x̄)²/(3 − 1) = [(70 − 77)² + (85 − 77)² + (76 − 77)²]/2,

which on evaluation is 57.
But s_x̄² = 57 is an estimate of σ²/4, so 4 × 57 = 228 is an estimate of σ², the common variance of the 3 populations (provided the hypothesis is true). If the hypothesis is not true and the means are different, then this estimate of σ² will be inflated as it will also be affected by the difference (spread) between the true means. (The further apart the true means are, the larger we expect the estimate to be.)
We now have 2 estimates of σ²,

  Σᵢ νᵢ sᵢ² / Σᵢ νᵢ = 14.444   and   n s_x̄² = 228.

If the second estimate (based on variation between the sample means) is much larger than the first estimate (which is based on variation within the samples, and measures


variation that is due to chance alone), then it provides evidence that the means do differ and H should be rejected. In that case, the variation between the sample means would be greater than would be expected if it were due only to chance.
The comparison of these 2 estimates of σ² will now be put on a rigorous basis in Section 7.3, where it is shown that the two estimates of σ² can be compared by an F-test. The method developed here for testing H: μ1 = μ2 = μ3 is known as Analysis of Variance (often abbreviated to AOV).

7.3 Single Factor Analysis of Variance

Consider the set of random variables {Xij}, where j = 1, 2, ..., ni and i = 1, 2, ..., k, and their observed values below, where xij is the jth observation in the ith group.

  Group   Observations          Totals       Means   No. of observations
  1       x11 x12 ... x1n1      T1 = x1.     x̄1.     n1
  2       x21 x22 ... x2n2      T2 = x2.     x̄2.     n2
  ...
  k       xk1 xk2 ... xknk      Tk = xk.     x̄k.     nk

Notation

  xi. = Σ_{j=1}^{ni} xij = Ti = total for the ith group
  x̄i. = Ti/ni = mean of the ith group
  T   = Σ_{i=1}^{k} Ti = Σ_{i,j} xij
  n   = Σ_{i=1}^{k} ni = total number of observations
  x̄.. = Σ_{i=1}^{k} Σ_{j=1}^{ni} xij / n = T/n = grand mean.

Note: When a dot replaces a subscript it indicates summation over that subscript.

In Example 7.1 the variation was measured in two ways. In order to do this, the total sum of squares of deviations from the (grand) mean must be partitioned (that is, split up) appropriately. Theorem 7.1 shows how the total sum of squares can be partitioned.


Theorem 7.1 (Partitioning of the Sum of Squares)

  Σ_{i=1}^{k} Σ_{j=1}^{ni} (xij − x̄..)² = Σ_{i=1}^{k} Σ_{j=1}^{ni} (xij − x̄i.)² + Σ_{i=1}^{k} ni (x̄i. − x̄..)²      (7.1)

Proof

  Σ_{i,j} (xij − x̄..)² = Σ_{i,j} (xij − x̄i. + x̄i. − x̄..)²
                        = Σ_{i,j} [ (xij − x̄i.)² + (x̄i. − x̄..)² + 2(xij − x̄i.)(x̄i. − x̄..) ]
                        = Σ_{i,j} (xij − x̄i.)² + Σ_{i} ni (x̄i. − x̄..)² + 2 Σ_{i} (x̄i. − x̄..) Σ_{j} (xij − x̄i.),

and the last term vanishes since Σ_{j} (xij − x̄i.) = 0 for each i.

Notes on the Partitioning:

1. SST = Σ_{i,j} (xij − x̄..)² = total sum of squares (of deviations from the grand mean).

2. Notice that Σ_{j} (xij − x̄i.)² is just the total sum of squares of deviations from the mean in the ith sample, and summing these over all k groups gives
   SSW = Σ_{i,j} (xij − x̄i.)².
   This sum of squares is only affected by the variability of observations within each sample and so is called the within subgroups sum of squares.

3. The third term is the sum of squares obtained from the deviations of the sample means from the overall (grand) mean and depends on the variability between the sample means. That is,
   SSB = Σ_{i} ni (x̄i. − x̄..)²,
   and is called the between subgroups sum of squares.

4. We can think of Theorem 7.1 as a decomposition of a sum of squares into 2 parts, that is, SST = SSB + SSW, where SSB results from variation between the groups and SSW results from variation within the groups. It is these parts or sources of variation that will be compared with each other.

Our aim now is to relate equation (7.1) to random variables and the hypothesis-testing problem.


Assume {Xij, j = 1, ..., ni} are distributed N(μi, σ²) and are mutually independent, where i = 1, 2, ..., k. Consider the hypothesis

  H: μ1 = μ2 = ... = μk (= μ, say),      (7.2)

(that is, there is no difference between group means) and the alternative,

  A: the μi are not all equal.

Note, this is not the same as saying μ1 ≠ μ2 ≠ ... ≠ μk.
We will now find the probability distributions of SST , SSB , SSW under H.

Distributions of SST , SSW and SSB under H


Assume that H is true and that all n (= Σ_{i=1}^{k} ni) random variables are normally distributed with mean μ and variance σ² (that is, Xij ~ N(μ, σ²) for all i and j).

(i) If all the data are treated as a single group coming from the same population, then Σ_{i,j} (xij − x̄..)²/(n − 1) is an unbiased estimate of σ². Then, using the result from chapter 3 that νS²/σ² ~ χ²_ν,

  (1/σ²) Σ_{i,j} (Xij − X̄..)² ~ χ²_{n−1}.      (7.3)

(ii) If only the ith group is considered, then si² = Σ_{j} (xij − x̄i.)²/(ni − 1) is also an unbiased estimate of σ². The k unbiased estimates, s1², s2², ..., sk², can then be pooled to obtain another unbiased estimate of σ², that is

  s² = Σ_{i} Σ_{j} (xij − x̄i.)² / (Σ_{i} ni − k).

Thus, as in (i) above,

  (1/σ²) Σ_{i,j} (Xij − X̄i.)² ~ χ²_{n−k}.      (7.4)

(iii) Now X̄i. ~ N(μ, σ²/ni), i = 1, 2, ..., k, so that √ni X̄i. ~ N(√ni μ, σ²). Regarding √n1 X̄1., √n2 X̄2., ..., √nk X̄k. as a random sample from N(√ni μ, σ²), the sample variance is Σ_{i=1}^{k} ni (X̄i. − X̄..)²/(k − 1) and is another unbiased estimate of σ². Then, again as in (i) above,

  (1/σ²) Σ_{i=1}^{k} ni (X̄i. − X̄..)² ~ χ²_{k−1}.      (7.5)


It can be shown that the random variables in (7.4) and (7.5) are independent. (The proof is not given as it is beyond the scope of this unit.) Thus from (7.1) we have

  (1/σ²) Σ_{i,j} (Xij − X̄..)² = (1/σ²) Σ_{i,j} (Xij − X̄i.)² + (1/σ²) Σ_{i=1}^{k} ni (X̄i. − X̄..)²,

which means we have expressed a chi-square rv on n − 1 df as the sum of two independent chi-square rvs on n − k and k − 1 dfs respectively. Note that the degrees of freedom of the rvs on the RHS add to the degrees of freedom of the rv on the LHS.
Thus, from (7.3), (7.4), (7.5) it follows that, if H is true, we have 3 estimates of σ²:

(i) Σ_{i,j} (xij − x̄..)²/(n − 1), which is based on the total variation in the whole sample;
(ii) Σ_{i,j} (xij − x̄i.)²/(n − k), which is based on the variation occurring within each subsample;
(iii) Σ_{i} ni (x̄i. − x̄..)²/(k − 1), which is based on the variation of the subsample means from the whole-sample mean.

If H is true, from (7.4) and (7.5) and the definition of F (equation 4.4),

  [ Σ_{i} ni (X̄i. − X̄..)²/(k − 1) ] / [ Σ_{i,j} (Xij − X̄i.)²/(n − k) ] ~ F_{k−1, n−k}.

That is,

  [ SSB/(k − 1) ] / [ SSW/(n − k) ] ~ F_{k−1, n−k}.      (7.6)

We can summarize the results in an Analysis of Variance Table as follows.

  Source of Variation   Sum of Squares              df      Mean Square     F
  Between gps           Σ_{i} ni (X̄i. − X̄..)²      k − 1   SSB/(k − 1)     [SSB/(k−1)] / [SSW/(n−k)]
  Within gps            Σ_{i,j} (Xij − X̄i.)²        n − k   SSW/(n − k)
  Total                 Σ_{i,j} (Xij − X̄..)²        n − 1

Note that the term mean square (MS) is used to denote (sum of squares)/df.

Method of Computation
In order to calculate the sums of squares in the AOV table it is convenient to express the
sums of squares in a different form.


Total SS

  SST = Σ_{i,j} (xij − x̄..)² = Σ_{i,j} xij² − T²/n,      (7.7)

where Σ_{i,j} xij² is called the raw sum of squares and T²/n is called the correction term.

Between Groups SS

  SSB = Σ_{i} Ti²/ni − T²/n,      (7.8)

since

  SSB = Σ_{i} ni (x̄i. − x̄..)²
      = Σ_{i} ni x̄i.² − 2 x̄.. Σ_{i} ni x̄i. + x̄..² Σ_{i} ni
      = Σ_{i} Ti²/ni − 2 (T/n) T + n (T/n)²
      = Σ_{i} Ti²/ni − T²/n.

The same correction term is used here as appeared in the calculation of SST.

Within Groups SS
Since SST = SSB + SSW, SSW is found by subtracting SSB from SST. Similarly the df for within groups is found by subtracting k − 1 from n − 1.
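The computational formulas (7.7) and (7.8) translate directly into a few lines of R. This is an illustrative sketch only (the function name aov.ss is ours); the built-in aov() used later yields the same sums of squares.

aov.ss <- function(y, group){
  # y: vector of observations; group: factor of group labels
  T  <- sum(y);  n <- length(y)
  Ti <- tapply(y, group, sum);  ni <- tapply(y, group, length)
  SST <- sum(y^2) - T^2/n              # (7.7)
  SSB <- sum(Ti^2/ni) - T^2/n          # (7.8)
  SSW <- SST - SSB
  c(SST=SST, SSB=SSB, SSW=SSW)
}
# Applied to the diet data of Example 7.2 below, this should return 63, 25 and 38.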

Testing the Hypothesis


We have so far considered the distributions of the various sums of squares assuming the
hypothesis of equal means to be true. The expected values of these sums of squares are
now found when H is not true. Recall that
  X̄i. = (1/ni) Σ_{j} Xij


  X̄.. = (1/n) Σ_{i,j} Xij.

The latter can also be written as

  X̄.. = (1/n) Σ_{i} ni X̄i.      (7.9)

Note that E(X̄i.) = μi, i = 1, 2, ..., k, and

  E(X̄..) = (1/n) Σ_{i} ni μi = μ̄, say.      (7.10)

Theorem 7.2
With SSW and SSB as defined earlier,
(a) E(SSW) = (n − k) σ².
(b) E(SSB) = (k − 1) σ² + Σ_{i} ni (μi − μ̄)².

Proof of (a)

  E(SSW) = E[ Σ_{i} Σ_{j} (Xij − X̄i.)² ]
         = Σ_{i=1}^{k} E[(ni − 1) Si²],   where Si² is the sample variance of the ith group, since Xij ~ N(μi, σ²)
         = Σ_{i=1}^{k} (ni − 1) σ².

Thus

  E[ Σ_{i,j} (Xij − X̄i.)² ] = (n − k) σ².

Proof of (b)

  E(SSB) = E[ Σ_{i} ni (X̄i. − X̄..)² ]
         = E[ Σ_{i} ni X̄i.² − 2 X̄.. Σ_{i} ni X̄i. + n X̄..² ]      (note Σ_{i} ni X̄i. = n X̄..)
         = E[ Σ_{i} ni X̄i.² − n X̄..² ]
         = Σ_{i} ni [ Var(X̄i.) + (E(X̄i.))² ] − n [ Var(X̄..) + (E(X̄..))² ].      (7.11)


Now E(X̄i.) = μi and E(X̄..) = μ̄ from (7.10).
Also, Var(X̄i.) = σ²/ni and Var(X̄..) = σ²/n. So

  E(SSB) = Σ_{i} ni (σ²/ni + μi²) − n (σ²/n + μ̄²) = k σ² + Σ_{i} ni μi² − σ² − n μ̄².

That is,

  E[ Σ_{i} ni (X̄i. − X̄..)² ] = (k − 1) σ² + Σ_{i} ni (μi − μ̄)².      (7.12)

Note that sometimes Theorem 7.2 is stated in terms of the expected mean squares instead of expected sums of squares.
These results are summarized in the table below.

  Source of Variation   Sum of Squares (SS)   df      Mean Square (MS)   E(Mean Square)
  Between gps           SSB                   k − 1   SSB/(k − 1)        σ² + Σ_{i} ni (μi − μ̄)²/(k − 1)
  Within gps            SSW                   n − k   SSW/(n − k)        σ²
  Total                 SST                   n − 1

Now if H is true, that is, if μ1 = μ2 = ... = μk, then μ̄ (as defined in (7.10)) = μ, so that

  Σ_{i} ni (μi − μ̄)² = 0,   and   [SSB/(k − 1)] / [SSW/(n − k)] ~ F_{k−1, n−k}.

However, if H is not true and the μi are not all equal, then

  Σ_{i} ni (μi − μ̄)²/(k − 1) > 0,

and the observed value of the F ratio will tend to be large, so that large values of F will tend to cast doubt on the hypothesis of equal means. That is, H is rejected at significance level α if

  [SSB/(k − 1)] / [SSW/(n − k)] > F_α(k − 1, n − k),

where F_α(k − 1, n − k) is obtained from tables. The significance level α is usually taken as 5%, 1% or 0.1%.
Note: The modern approach is to find the probability that the observed value of F (or one larger) would have been obtained by chance under the assumption that the hypothesis is true, and use this probability to make inferences. That is, find

  P(F ≥ F_observed)

and if it is small (usually less than 5%) claim that it provides evidence that the hypothesis is false. The smaller the probability, the stronger the claim we are able to make. To use this approach ready access to a computer with suitable software is required. With tables we can only approximate this procedure since exact probabilities cannot in general be obtained.
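In R this tail probability comes straight from pf(); the sketch below is generic, with F.obs, k and n standing for the observed F ratio, the number of groups and the total number of observations (the values shown are those of Example 7.2 further on).

F.obs <- 3.51; k <- 4; n <- 20                      # from Example 7.2 below
pf(q=F.obs, df1=k-1, df2=n-k, lower.tail=FALSE)     # about 0.04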
Comments
1. SSW/(n − k), the within groups mean square, provides an unbiased estimate of σ² whether or not H is true.
2. When finding the F ratio in an AOV, the between groups mean square always forms the numerator. This is because its expected value is always greater than or equal to the expected value of the within groups mean square (see (7.12)). This is one case where one doesn't automatically put the larger estimate of variance in the numerator. If H is true, both mean squares are estimates of σ² and in practice either one may be the larger. However, small values of F always support the hypothesis, so if F < 1 it is always non-significant.
Example 7.2
Suppose we have 4 kinds of feed (diets) and it is desired to test whether there is any significant difference in the average weight gains by certain animals fed on these diets. Twenty (20) animals were selected for the experiment and allocated randomly to the diets, 5 animals to each. The weight increases after a period of time were as follows.

  Diet   Observations     Ti = Σ_j xij   x̄i    ni
  A      7  8  8 10 12        45         9.0    5
  B      5  5  6  6  8        30         6.0    5
  C      7  6  8  9 10        40         8.0    5
  D      5  7  7  8  8        35         7.0    5
                           T = 150              20

Solution: Let the random variable Xij be the weight gain of the jth animal receiving the ith diet, where Xij, j = 1, ..., 5, are N(μi, σ²).
Test the hypothesis that all diets were equally effective, that is

  H: μ1 = μ2 = μ3 = μ4 (= μ, say).


Calculations

  Total SS = SST = Σ_{i,j} xij² − T²/n = 7² + ... + 8² − 150²/20 = 63

  Between diets SS = SSB = Σ_{i} (Ti²/ni) − T²/n      from (7.8)
                   = 45²/5 + 30²/5 + 40²/5 + 35²/5 − 150²/20 = 25

  Within diets SS = SSW = 63 − 25 = 38.

The Analysis of Variance Table is as below.

  Source of Variation   SS   df   MS      F
  Between diets         25    3   8.333   3.51*
  Within diets          38   16   2.375
  Total                 63   19

The 5% critical value of F_{3,16} is 3.24, so the observed value of 3.51 is significant at the 5% level. Thus there is some reason to doubt the hypothesis that all the diets are equally effective, and we conclude that there is a significant difference in weight gain produced by at least one of the diets when compared to the other diets.
Computer Solution:
#___________ Diets.R ________
Feed <- expand.grid(unit=1:5,Diet=LETTERS[1:4])
Feed$wtgain <- c(7,8,8,10,12,5,5,6,6,8,7,6,8,9,10,5,7,7,8,8)
weight.aov <- aov(wtgain ~ Diet,data=Feed)
print(summary(weight.aov) )

            Df Sum Sq Mean Sq F value Pr(>F)
Diet         3   25.0     8.3    3.51   0.04
Residuals   16   38.0     2.4
---

R gives a P value of 0.04 which indicates significance at the 4% level confirming the result
(P < 5%) obtained above.

7.4 Estimation of Means and Confidence Intervals

Having found a difference between the means our job is not finished. We want to try and find
exactly where the differences are. First we want to estimate the means and their standard errors.


It is useful to find confidence intervals for these means. The best estimate for μi, the mean of the ith group, is given by

  x̄i = Σ_{j} xij / ni,   for i = 1, 2, ..., k,

where ni = number of observations in the ith group. A 100(1 − α)% confidence interval for μi is then

  x̄i ± t_{ν,α/2} s/√ni,

where s² is the estimate of σ² given by the within groups (residual) mean square (in the AOV table) and is thus on ν = n − k degrees of freedom.
For straightforward data such as these, the means and their standard errors are calculated
with model.tables(),
print(model.tables(weight.aov,type="means",se=T))
Tables of means
Grand mean
7.5
Diet
A B C D
9 6 8 7
Standard errors for differences of means
Diet
0.9747
replic.
5
Note that the output specifies that this is the standard error of a difference of means, where

  s.e.m. = √2 × s.e.

Therefore s.e. = s/√ni = s.e.m./√2. The standard error in the above case is 0.9747/√2, and it is this number which is multiplied by t_{ν,α/2} to derive confidence limits.
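As a worked illustration of the interval x̄i ± t s/√ni, the sketch below computes a 95% confidence interval for the diet A mean using the quantities in the AOV table of Example 7.2 (object names are ours).

s2 <- 2.375; n.i <- 5; nu <- 16          # within-groups MS, group size, residual df
xbar.A <- 9                              # mean weight gain on diet A
se <- sqrt(s2/n.i)                       # = 0.9747/sqrt(2), about 0.689
xbar.A + c(-1,1) * qt(0.975, df=nu) * se # roughly 9 +/- 1.46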

7.5 Assumptions Underlying the Analysis of Variance

The assumptions required for the validity of the AOV procedure are that:
(i) each of the k samples is from a normal population;
(ii) each sample can be considered as being drawn randomly from one of the populations;
(iii) samples are independent of each other;
(iv) all k populations have a common variance (homogeneity of variance).
If these assumptions are violated then conclusions made from the AOV procedure may be incorrect. We need to be able to verify whether the assumptions are valid.
Assumption (i) may be tested using a chi-square goodness-of-fit test (Chapter 6), while careful planning of the experiment should ensure that (ii) and (iii) hold.
There are several tests for (iv), three of which follow.

7.5.1 Tests for Equality of Variance

The F-max Test

For a quick assessment of heterogeneity in a set of sample variances we may compute the ratio of the largest sample variance to the smallest. This ratio is referred to as Fmax. That is, the Fmax statistic is defined by

  Fmax = S²max / S²min.

Its distribution depends on k and νi, where k is the number of sample variances being compared and νi is the df of the ith si². It is not the same as the ordinary F-distribution (except when k = 2), which was the ratio of 2 independent sample variances. Clearly we'd expect a greater difference between the largest and smallest of k (= 6, say) estimates of σ² than between 2 chosen at random.
While tables are available we will not use them. If Fmax is small then it would seem there is no problem with the equality of variance assumption and no further action is required. If Fmax is large enough to cause doubt then use either Levene's test or Bartlett's test.

Bartlett's Test
Let S1², S2², ..., Sk² be sample variances based on samples of sizes n1, n2, ..., nk. The samples are assumed to be drawn from normal distributions with variances σ1², σ2², ..., σk² respectively. With νi = ni − 1, define

  Q = [ (Σ_i νi) log_e S² − Σ_i νi log_e Si² ] / [ 1 + (1/(3(k − 1))) ( Σ_i (1/νi) − 1/Σ_i νi ) ]      (7.13)

where S² = Σ_i νi Si² / Σ_i νi.
Then under the hypothesis

  H: σ1² = σ2² = ... = σk²,

Q is distributed approximately as χ²_{k−1}. The approximation is not very good for small ni.
The hypothesis is tested by calculating Q from (7.13) and comparing it with χ²_{k−1,α} found from tables of the chi-square distribution.

Example 7.3
Suppose we have 5 sample variances, 15.9, 6.1, 21.0, 3.8, 30.4, derived from samples of sizes 7, 8, 7, 6, 7 respectively. Test for the equality of the population variances.
Solution: Fmax = s²max/s²min = 30.4/3.8 = 8.0.
This is probably large enough to require further checking.
For Bartlett's test, first pool the sample variances to get S².

  S² = Σ_i νi si² / Σ_i νi = 15.5167.

Then from (7.13) we obtain Q = 7.0571, which is distributed approximately as a chi-square on 4 df. This is non-significant (P = 0.13 using R). Hence we conclude that the sample variances are


compatible and we can regard them as 5 estimates of the one population variance, σ².
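Formula (7.13) needs only the sample variances and their degrees of freedom, so the hand calculation can be checked with a short sketch like the one below (bartlett.test() itself requires the raw data, which are not given in this example).

s2 <- c(15.9, 6.1, 21.0, 3.8, 30.4)      # sample variances
nu <- c(7, 8, 7, 6, 7) - 1               # degrees of freedom
S2 <- sum(nu*s2)/sum(nu)                 # pooled variance, 15.5167
Q  <- (sum(nu)*log(S2) - sum(nu*log(s2))) /
      (1 + (sum(1/nu) - 1/sum(nu))/(3*(length(s2)-1)))
c(Q=Q, p=pchisq(Q, df=length(s2)-1, lower.tail=FALSE))
# Q is close to the 7.06 obtained above (small rounding differences), with P about 0.13.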
It should be stressed that in both these tests the theory is based on the assumption that the k
random samples are from normal populations. If this is not true, a significant value of Fmax or Q
may indicate departure from normality rather than heterogeneity of variance. Tests of this kind
are more sensitive to departures from normality than the ordinary AOV. Levene's test (which follows) does appear to be robust to departures from normality, particularly when medians (instead of means, as were used when the test was first proposed) are used in its definition.

Levene's Test
Let Vij = |Xij − m̃i|, where m̃i is the median of the ith group, i = 1, 2, ..., k, j = 1, 2, ..., ni. That is, Vij is the absolute deviation of the jth observation in the ith group from the median of the ith group. To test the hypothesis σ1² = σ2² = ... = σk² against the alternative that they are not all equal, we carry out a one-way AOV using the Vij as the data.
This procedure has proven to be quite robust even for small samples and performs well even if the original data are not normally distributed.
Computer Exercise 7.1
Use R to test for homogeneity of variance for the data in Example 7.2.
Solution:
Bartlett's Test

print(bartlett.test(wtgain ~ Diet,data=Feed) )

        Bartlett test of homogeneity of variances
data:  wtgain by Diet
Bartlett's K-squared = 1.3, df = 3, p-value = 0.7398


Levene's Test  The first solution derives Levene's test from first principles.

# Calculate the medians for each group.
attach(Feed,pos=3)
med <- tapply(wtgain,INDEX=Diet,FUN=median)
med <- rep(med, rep(5,4) )
> med
A A A A A B B B B B C C C C C D D D D D
8 8 8 8 8 6 6 6 6 6 8 8 8 8 8 7 7 7 7 7
# Find v, the absolute deviations of each observation from the group median.
v <- abs(wtgain-med)
# Analysis of variance using v (Levene's test).
levene <- aov(v ~ Diet)
summary(levene)

            Df Sum Sq Mean Sq F value Pr(>F)
Diet         3  1.350   0.450  0.3673 0.7776
Residuals   16 19.600   1.225

There is a function levene.test() which is part of the car library of R. It would be necessary
to download this library from CRAN to use the function.
library(car)
print(levene.test(Feed$wtgain,Feed$Diet))
Levene's Test for Homogeneity of Variance
      Df F value Pr(>F)
group  3    0.37   0.78
      16
Bartlett's test (P = 0.744) and Levene's test (P = 0.7776) both give non-significant results, so there appears no reason to doubt the hypothesis σ²_A = σ²_B = σ²_C = σ²_D.

7.6 Estimating the Common Mean

If an AOV results in a non-significant F-ratio, we can regard the k samples as coming from populations with the same mean μ (or coming from the same population). Then it is desirable to find both point and interval estimates of μ. Clearly the best estimate of μ is x̄ = Σ_{i,j} xij/n, where n = Σ_i ni. A 100(1 − α)% confidence interval for μ is

  x̄ ± t_{ν,α/2} s/√n,

where s² is the estimate of σ² given by the within groups mean square (in the AOV table) and is thus on ν = n − k degrees of freedom.

Chapter 8  Simple Linear Regression

8.1 Introduction

Frequently an investigator observes two variables X and Y and is interested in the relationship
between them. For example, Y may be the concentration of an antibiotic in a person's blood and
X may be the time since the antibiotic was administered. Since the effectiveness of an antibiotic
depends on its concentration in the blood, the objective may be to predict how quickly the
concentration decreases and/or to predict the concentration at a certain time after administration.
In problems of this type the value of Y will depend on the value of X and so we will observe
the random variable Y for each of n different values of X, say x1 , x2 , . . . , xn , which have been
determined in advance and are assumed known. Thus the data will consist of n pairs, (x1 , y1 ),
(x2 , y2 ), . . . , (xn , yn ). The random variable Y is called the dependent variable while X is called
the independent variable or the predictor variable. (Note this usage of the word independent has
no relationship to the probability concept of independent random variables.) It is important to
note that in simple linear regression problems the values of X are assumed to be known and so
X is not a random variable.
The aim is to find the relationship between Y and X. Since Y is a random variable its value
at any one X value cannot be predicted with certainty. Different determinations of the value of
Y for a particular X will almost surely lead to different values of Y being obtained. Our initial
aim is thus to predict the E(Y ) for a given X. In general there are many types of relationship
that might be considered but in this course we will restrict our attention to the case of a straight
line. That is we will assume that the mean value of Y is related to the value of X by
  μ_Y = E(Y) = α + βX      (8.1)

where the parameters α and β are constants which will in general be unknown and will need to be determined from the data. Equivalently,

  Y = α + βX + ε      (8.2)

where ε is a random variable and is the difference between the observed value of Y and its expected value. ε is called the error or the residual and is assumed to have mean 0 and variance σ² for all values of X.
Corresponding to xi, the observed value of Y will be denoted by yi, and they are then related by

  yi = α + βxi + εi,   i = 1, 2, ..., n.      (8.3)
A graphical representation is given in figure 8.1.



Figure 8.1: Simple Linear Regression (a scatter of points (xi, yi) about the line α + βx; the vertical distance εi = yi − α − βxi is the error)

Now α and β are unknown parameters and the problem is to estimate them from the sample of observed values (x1, y1), (x2, y2), ..., (xn, yn). Two methods of estimation (least squares and maximum likelihood) will be considered.

8.2 Estimation of α and β

It is easy to recognize that α is the intercept of the line with the y-axis and β is the slope.
A diagram showing the n points {(xi, yi), i = 1, 2, ..., n} is called a scatter plot, or scatter diagram. One simple method to obtain approximate estimates of the parameters is to plot the observed values and draw in roughly the line that best seems to fit the data, from which the intercept and slope can be obtained. This method, while it may sometimes be useful, has obvious deficiencies. A better method is required.
One such method is the method of least squares.

Method of Least Squares

One approach to the problem of estimating α and β is to minimise the sum of squares of the errors from the fitted line. That is, the value of

  Σ_{i=1}^{n} εi² = Σ_{i=1}^{n} (yi − α − βxi)²      (8.4)

is minimised by differentiating with respect to α and β and putting the resulting expressions equal to zero. The results are stated in Theorem 8.1.


Theorem 8.1
The least squares estimates of α and β are

  β̂ = Σ_i (xi − x̄) yi / Σ_i (xi − x̄)²      (8.5)

  α̂ = ȳ − β̂ x̄.      (8.6)

Proof
For convenience we will rewrite (8.3) as

  yi = α0 + β(xi − x̄) + εi,      (8.7)

where

  α = α0 − β x̄,      (8.8)

and the minimization will be with respect to α0 and β. From (8.7),

  Σ_{i=1}^{n} εi² = Σ_{i=1}^{n} [ yi − α0 − β(xi − x̄) ]².

Thus, taking partial derivatives,

  ∂(Σ εi²)/∂α0 = −2 Σ_{i=1}^{n} [ yi − α0 − β(xi − x̄) ]
  ∂(Σ εi²)/∂β  = −2 Σ_{i=1}^{n} [ yi − α0 − β(xi − x̄) ](xi − x̄).

Equating the last two expressions to zero we obtain

  Σ_{i=1}^{n} [ yi − α̂0 − β̂(xi − x̄) ] = 0      (8.9)
  Σ_{i=1}^{n} [ yi − α̂0 − β̂(xi − x̄) ](xi − x̄) = 0      (8.10)

where α̂0 and β̂ are the solutions of the equations. Equations (8.9) and (8.10) are referred to as the normal equations.
From (8.9) we have

  Σ_{i=1}^{n} yi = n α̂0 + β̂ Σ_{i=1}^{n} (xi − x̄) = n α̂0,   since Σ_i (xi − x̄) = 0.

So α̂0 = Σ_i yi / n = ȳ.


Then from (8.10) we have

  Σ_{i=1}^{n} yi (xi − x̄) = α̂0 Σ_i (xi − x̄) + β̂ Σ_i (xi − x̄)² = β̂ Σ_i (xi − x̄)²,

since Σ_i (xi − x̄) = 0, giving

  β̂ = Σ_i (xi − x̄) yi / Σ_i (xi − x̄)².

Then using (8.8) the estimate of α is

  α̂ = ȳ − β̂ x̄.

Comments
1. No assumptions about the distribution of the errors, εi, were made (or needed) in the proof of Theorem 8.1. The assumptions will be required to derive the properties of the estimators and for statistical inference.
2. For convenience, the least squares estimators α̂, α̂0 and β̂ will sometimes be denoted by a, a0 and b. This should not cause confusion.
3. The estimators of α and β derived here are the Best Linear Unbiased Estimators (known as the BLUEs) in that they are
   (i) linear combinations of y1, y2, ..., yn;
   (ii) unbiased;
   (iii) of all possible linear estimators they are best in the sense of having minimum variance.

Method of Maximum Likelihood

Assume that εi ~ N(0, σ²), i = 1, 2, ..., n, and that they are mutually independent. Then the Yi are independent and are normally distributed with means α + βXi and common variance σ². The likelihood for the errors is given by

  L = Π_{i=1}^{n} (1/(σ√(2π))) exp[ −½ (εi/σ)² ].

Since εi = yi − α0 − β(xi − x̄), the likelihood can be written as

  L = Π_{i=1}^{n} (1/(σ√(2π))) exp[ −(1/(2σ²)) ( yi − α0 − β(xi − x̄) )² ].


Logging the likelihood gives

  log L = −n log(σ√(2π)) − ½ Σ_{i=1}^{n} [ ( yi − α0 − β(xi − x̄) ) / σ ]².

Differentiating log L with respect to α0, β and σ² and setting the resulting equations equal to zero gives

  Σ_{i=1}^{n} [ yi − α̂0 − β̂(xi − x̄) ] = 0
  Σ_{i=1}^{n} (xi − x̄)[ yi − α̂0 − β̂(xi − x̄) ] = 0
  −n/(2σ̂²) + (1/(2σ̂⁴)) Σ_{i=1}^{n} ( yi − α̂0 − β̂(xi − x̄) )² = 0.

The first two of these equations are just the normal equations, (8.9) and (8.10), obtained previously by the method of least squares. Thus the maximum likelihood estimates of α and β are identical to the estimates (8.5) and (8.6) obtained by the method of least squares. The maximum likelihood estimate of σ² is

  σ̂² = Σ_{i=1}^{n} ( yi − α̂0 − β̂(xi − x̄) )² / n.
This estimate is biased.

Comments
1. The fitted line, Ê(Y) = α̂0 + β̂(x − x̄), is called the regression line of Y on X.
2. The regression line passes through the point (x̄, ȳ).
3. In our notation we will not distinguish between the random variables α̂ and β̂ and their observed values.
4. The estimate of σ² can be written as

  σ̂² = Σ_{i=1}^{n} ei² / n,   where   ei = yi − α̂ − β̂xi      (8.11)

is an estimate of the true error (residual), εi.
5. For the purpose of calculation, it is often convenient to use an alternate form of (8.5),

  β̂ = [ Σ xi yi − (Σ xi)(Σ yi)/n ] / [ Σ xi² − (Σ xi)²/n ],      (8.12)


where all sums are over i = 1, 2, ..., n. To verify that these are equivalent, note that

  Σ_i (xi − x̄)(yi − ȳ) = Σ_i (xi − x̄) yi − ȳ Σ_i (xi − x̄)
                        = Σ_i xi yi − x̄ Σ_i yi
                        = Σ xi yi − (Σ xi)(Σ yi)/n,

since Σ_i (xi − x̄) = 0, and similarly Σ_i (xi − x̄)² = Σ xi² − (Σ xi)²/n.
Mean, Variance and Covariance of Regression Coefficients

Theorem 8.2
If α̂0 and β̂ are the least squares estimates of the regression coefficients α0 and β, and if ε1, ε2, ..., εn are independently distributed with mean zero and variance σ², then

  E(α̂0) = α0  and  E(β̂) = β,      (8.13)

  Var(α̂0) = σ²/n,  and  Var(β̂) = σ² / Σ (xi − x̄)²,      (8.14)

  cov(α̂0, β̂) = 0.      (8.15)

Proof  First write α̂0 and β̂ as linear functions of the random variables Yi. First,

  α̂0 = (1/n)Y1 + (1/n)Y2 + ... + (1/n)Yn,

so that

  E(α̂0) = (1/n) Σ_i E(Yi) = (1/n) Σ_i [ α0 + β(xi − x̄) ] = α0 + (β/n) Σ_i (xi − x̄) = α0,

since Σ_i (xi − x̄) = 0.
Secondly,

  β̂ = [ (x1 − x̄) Y1 + (x2 − x̄) Y2 + ... + (xn − x̄) Yn ] / Σ (xi − x̄)²,

giving

  E(β̂) = [ (x1 − x̄)(α0 + β(x1 − x̄)) + ... + (xn − x̄)(α0 + β(xn − x̄)) ] / Σ (xi − x̄)²
        = [ α0 Σ_i (xi − x̄) + β Σ_i (xi − x̄)² ] / Σ (xi − x̄)²
        = β.

Next,

  Var(α̂0) = (1/n²) · n σ² = σ²/n,   and
  Var(β̂) = [ (x1 − x̄)² σ² + ... + (xn − x̄)² σ² ] / [ Σ_i (xi − x̄)² ]²
          = σ² Σ_i (xi − x̄)² / [ Σ_i (xi − x̄)² ]² = σ² / Σ (xi − x̄)².

To show that a0 and b are uncorrelated, first write εi = Yi − E(Yi); then

  cov(Yi, Yj) = E[ (Yi − E(Yi))(Yj − E(Yj)) ] = E(εi εj)
              = 0 if i ≠ j (since the εi are independent),  and  = σ² if i = j.

Since α̂0 and β̂ are linear forms in the Yi, we then have

  cov(α̂0, β̂) = σ² · (1/n) · [ (x1 − x̄) + ... + (xn − x̄) ] / Σ_i (xi − x̄)² = 0.

We then have the following corollary to Theorem 8.2.

Corollary 8.2.1

  E(α̂) = α  and  Var(α̂) = σ² [ 1/n + x̄² / Σ (xi − x̄)² ].

Proof: Since α̂ = α̂0 − β̂ x̄, using the results of Theorem 8.2 we have

  E(α̂) = α0 − β x̄ = α,

and

  Var(α̂) = Var(α̂0) + x̄² Var(β̂) = σ² [ 1/n + x̄² / Σ (xi − x̄)² ].

8.3 Estimation of σ²

Theorem 8.3
Assuming that E(Y) = α + βX and Var(Y) = σ², then

  σ̂² = (1/(n − 2)) Σ_i ( yi − ȳ − b(xi − x̄) )²      (8.16)

is an unbiased estimate of σ².

Proof  We will need the following:

(i)
  Var(Yi − Ȳ) = Var( −Y1/n − Y2/n − ... + (1 − 1/n)Yi − ... − Yn/n )
              = (n − 1) σ²/n² + (1 − 1/n)² σ²
              = (1 − 1/n) σ².      (8.17)

(ii) Yi = α + βxi + εi, and Ȳ = Σ_i Yi / n = α + βx̄ + Σ_j εj / n, so that

  Yi − Ȳ = β(xi − x̄) + εi − Σ_j εj / n.   Then E(Yi − Ȳ) = β(xi − x̄).

To prove the theorem, write

  Σ ( yi − ȳ − β̂(xi − x̄) )² = Σ (yi − ȳ)² − 2β̂ Σ (xi − x̄)(yi − ȳ) + β̂² Σ (xi − x̄)²
                              = Σ (yi − ȳ)² − 2β̂² Σ (xi − x̄)² + β̂² Σ (xi − x̄)²
                              = Σ (yi − ȳ)² − β̂² Σ (xi − x̄)²,      (8.18)

using Σ (xi − x̄)(yi − ȳ) = β̂ Σ (xi − x̄)².

Consider (8.18) in terms of random variables. For the RHS,

  E[ Σ_i (Yi − Ȳ)² ] = Σ_i [ Var(Yi − Ȳ) + ( E(Yi − Ȳ) )² ]
                      = Σ_i [ (1 − 1/n) σ² + β² (xi − x̄)² ]
                      = (n − 1) σ² + β² Σ_i (xi − x̄)².

Also

  E[ β̂² Σ_i (xi − x̄)² ] = Σ_i (xi − x̄)² [ Var(β̂) + ( E(β̂) )² ] = σ² + β² Σ_i (xi − x̄)².

So, from (8.18),

  E[ Σ ( Yi − Ȳ − β̂(xi − x̄) )² ] = (n − 1) σ² + β² Σ (xi − x̄)² − σ² − β² Σ (xi − x̄)² = (n − 2) σ².

Thus σ̂² given by (8.16) is an unbiased estimate of σ².
8.4 Inference about α, β and Ŷ

So far we have not assumed any particular probability distribution for the $\epsilon_i$ or, equivalently, the $Y_i$. To find confidence intervals for $\alpha_0$, $\beta$ and $Y$, let us now assume that the $\epsilon_i$ are normally and independently distributed, with means 0 and variances $\sigma^2$. Since
\[ Y_i = \alpha_0 + \beta(x_i - \bar{x}) + \epsilon_i = \alpha + \beta x_i + \epsilon_i, \]
it follows that the $Y_i$ are independently distributed $N(\alpha + \beta x_i, \sigma^2)$. Since $\hat{\alpha}_0$ and $\hat{\beta}$ are linear combinations of the $Y_i$, and both $\hat{\alpha}$ and $\hat{Y}_i$ are linear combinations of $\hat{\alpha}_0$ and $\hat{\beta}$, each of $\hat{\alpha}_0$, $\hat{\beta}$, $\hat{\alpha}$ and $\hat{Y}_i$ is normally distributed. The means and variances of $\hat{\alpha}_0$ and $\hat{\beta}$ are given in Theorem 8.2; the mean and variance of $\hat{\alpha}$ are given in Corollary 8.2.1.

Now it can be shown that $\hat{\alpha}_0$ and $\hat{\beta}$ are independent of $\hat{\sigma}^2$ given in (8.16), so hypotheses about these parameters may be tested, and confidence intervals for them found, in the usual way using the $t$-distribution. Thus to test the hypothesis H: $\beta = \beta_0$, we use the fact that
\[ \frac{\hat{\beta} - \beta_0}{\sqrt{\widehat{\mathrm{Var}}(\hat{\beta})}} \sim t_{n-2}. \]
A $100(1-\alpha)\%$ CI for $\beta$ is given by
\[ \hat{\beta} \pm t_{n-2,\alpha/2}\sqrt{\widehat{\mathrm{Var}}(\hat{\beta})}. \]
Similarly, to test H: $\alpha = \alpha_0$ we use
\[ \frac{\hat{\alpha} - \alpha_0}{\sqrt{\widehat{\mathrm{Var}}(\hat{\alpha})}} \sim t_{n-2}, \]
and a $100(1-\alpha)\%$ CI for $\alpha$ can be found using
\[ \hat{\alpha} \pm t_{n-2,\alpha/2}\sqrt{\widehat{\mathrm{Var}}(\hat{\alpha})}. \]

For $\hat{Y}_i$,
\[ E(\hat{Y}_i) = \alpha_0 + \beta(x_i - \bar{x}), \tag{8.19} \]
and, since $\mathrm{cov}(\hat{\alpha}_0, \hat{\beta}) = 0$,
\[ \mathrm{Var}(\hat{Y}_i) = \mathrm{Var}(\hat{\alpha}_0) + (x_i - \bar{x})^2\mathrm{Var}(\hat{\beta})
   = \frac{\sigma^2}{n} + \frac{(x_i - \bar{x})^2\sigma^2}{\sum_i (x_i - \bar{x})^2}
   = \sigma^2\left[\frac{1}{n} + \frac{(x_i - \bar{x})^2}{\sum_i (x_i - \bar{x})^2}\right], \tag{8.20} \]
so that a $100(1-\alpha)\%$ confidence interval for it is given by $\hat{Y}_i \pm t_{n-2,\alpha/2}\sqrt{\widehat{\mathrm{Var}}(\hat{Y}_i)}$. That is,
\[ \hat{\alpha}_0 + \hat{\beta}(x_i - \bar{x}) \pm t_{n-2,\alpha/2}\,\hat{\sigma}\sqrt{\frac{1}{n} + \frac{(x_i - \bar{x})^2}{\sum_i (x_i - \bar{x})^2}}. \tag{8.21} \]
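As an illustration of these formulae, the sketch below computes the t statistic and 95% CI for $\beta$ from first principles and checks them against R's built-in confint(). The data are simulated purely for illustration; they are not the blood pressure data used later in this chapter.

#__________ ci_beta_sketch.R __________
# Sketch: t statistic and 95% CI for beta computed from (8.16) and the
# formulae above, then compared with confint() applied to lm().
set.seed(3)
x <- runif(20, 0, 10)
y <- 2 + 1.5 * x + rnorm(20, sd = 2)
n <- length(x); xc <- x - mean(x)
bhat  <- sum(xc * (y - mean(y))) / sum(xc^2)
s2    <- sum((y - mean(y) - bhat * xc)^2) / (n - 2)
se.b  <- sqrt(s2 / sum(xc^2))
tstat <- bhat / se.b                          # test of H: beta = 0
c(bhat - qt(0.975, n - 2) * se.b,
  bhat + qt(0.975, n - 2) * se.b)             # 95% CI for beta
confint(lm(y ~ x))["x", ]                     # agrees with the above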


Comment
Notice that in (8.18), $y_i - \bar{y} - \hat{\beta}(x_i - \bar{x}) = e_i$ is an estimate of the (true) error, so the term on the left-hand side is called the Error Sum of Squares (Error SS). The first term on the right-hand side, $\sum (y_i - \bar{y})^2$, is the total variation in $y$ (Total SS), and $\hat{\beta}^2\sum (x_i - \bar{x})^2$ is the Sum of Squares accounted for by the regression line, which we will call the Regression SS. Thus in words, (8.18) can be expressed in the form
\[ \text{Error SS} = \text{Total SS} - \text{Regression SS}. \]
This information can be summarised in an Analysis of Variance table, similar in form to that used in the single-factor analysis of variance.
Source       df      SS                                                              MS
Regression   1       $\hat{\beta}^2\sum (x_i - \bar{x})^2$                            $\hat{\beta}^2\sum (x_i - \bar{x})^2$
Error        n - 2   $\sum (y_i - \bar{y})^2 - \hat{\beta}^2\sum (x_i - \bar{x})^2$   $\frac{1}{n-2}\bigl[\sum (y_i - \bar{y})^2 - \hat{\beta}^2\sum (x_i - \bar{x})^2\bigr]$
Total        n - 1   $\sum (y_i - \bar{y})^2$

It can be shown that the ratio of the Regression MS to the Error MS has an F distribution on 1 and n - 2 df and provides a test of the hypothesis H: $\beta = 0$, which is equivalent to the t test above.
Question: Why should you expect these two tests to be equivalent?
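(One way to convince yourself numerically, although it does not answer the "why": the sketch below, with simulated data chosen only for illustration, shows that the F statistic from anova() equals the square of the slope's t statistic from summary().)

#__________ f_equals_tsq.R __________
# Sketch with simulated data: the regression F statistic equals the
# square of the t statistic for the slope.
set.seed(4)
x <- 1:15
y <- 3 + 0.5 * x + rnorm(15)
fit <- lm(y ~ x)
anova(fit)$"F value"[1]                        # F on 1 and n - 2 df
summary(fit)$coefficients["x", "t value"]^2    # t^2: the same number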
Example 8.1
The following data refer to age (x) and blood pressure (y) of 12 women.
x    56   42   72   36   63   47   55   49   38   42   68   60
y   147  125  160  118  149  128  150  145  115  140  152  155

Assuming that Y has a normal distribution with mean $\alpha + \beta x$ and variance $\sigma^2$,


(i) find the least squares estimates in the regression equation;
(ii) test the hypothesis that the slope of the regression line is zero;
(iii) find 95% confidence limits for the mean blood pressure of women aged 45.
Solution: We have
\[ n = 12, \quad \sum x = 628, \quad \sum y = 1684, \quad \sum xy = 89894, \quad \sum x^2 = 34416, \quad \sum y^2 = 238822. \]

(i) Regression coefficients in the equation $\hat{Y} = \hat{\alpha}_0 + \hat{\beta}(x - \bar{x})$ are given by
\[ \hat{\alpha}_0 = \bar{y} = 1684/12 = 140.33, \]
\[ \hat{\beta} = \frac{89894 - \frac{1057552}{12}}{34416 - \frac{394384}{12}} = \frac{1764.667}{1550.667} = 1.138. \]
The regression equation is
\[ \hat{Y} = 140.33 + 1.138(x - 52.333) = 80.778 + 1.138x. \]


(ii) To test the hypothesis $\beta = 0$ we need to calculate
\[ \sum_i (y_i - \bar{y})^2 = \sum y^2 - \frac{(\sum y)^2}{n} = 238822 - 236321.33 = 2500.67, \]
\[ \hat{\beta}^2 \sum_i (x_i - \bar{x})^2 = 1.295 \times 1550.667 = 2008.18. \]
Hence
\[ \hat{\sigma}^2 = \frac{2500.67 - 2008.18}{10} = 49.25, \]
and
\[ \frac{\hat{\beta} - 0}{\text{estimated sd of } \hat{\beta}} = \frac{1.138}{\sqrt{49.25/1550.667}} = 6.4. \]
Comparing this with the critical value from the $t$-distribution on 10 degrees of freedom, we see that our result is significant at the 0.1% level.

(iii) For $x = 45$, $\hat{Y} = 80.778 + 1.138 \times 45 = 132.00$. The 95% confidence limits for the mean blood pressure of women aged 45 years are
\[ 132.00 \pm 2.228\sqrt{49.25\left(\frac{1}{12} + \frac{(45 - 52.33)^2}{1550.67}\right)} = 132.00 \pm 5.37. \]
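These hand calculations can be reproduced in a few lines of R working directly from the summary statistics quoted above (a sketch; the object names are arbitrary).

#__________ ex8.1_by_hand.R __________
# Sketch: Example 8.1 calculations from the summary statistics.
n <- 12; Sx <- 628; Sy <- 1684; Sxy <- 89894; Sxx <- 34416; Syy <- 238822
Sxxc <- Sxx - Sx^2 / n                       # sum (x - xbar)^2 = 1550.667
Syyc <- Syy - Sy^2 / n                       # sum (y - ybar)^2 = 2500.667
Sxyc <- Sxy - Sx * Sy / n                    # sum (x - xbar)(y - ybar) = 1764.667
bhat  <- Sxyc / Sxxc                         # 1.138
s2    <- (Syyc - bhat^2 * Sxxc) / (n - 2)    # 49.25
tstat <- bhat / sqrt(s2 / Sxxc)              # 6.4
c(bhat, s2, tstat)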
Computer Solution: Assuming the data for age (x) and blood pressure (y) is in the text file,
bp.txt,

age  bloodpressure
 56  147
 42  125
 72  160
 36  118
 63  149
 47  128
 55  150
 49  145
 38  115
 42  140
 68  152
 60  155

#__________ bp.R __________


options(digits=2)
bp <- read.table("bp.txt",header=T)
bp.lm <- lm(bloodpressure ~ age,data=bp)
print(anova(bp.lm))
bp.summ <- summary(bp.lm)
print(bp.summ)
print(confint(bp.lm))
VB <- bp.summ$sigma^2 * bp.summ$cov.unscaled
print(VB)
print(sqrt(diag(VB)) )
newdata <- data.frame(age=c(45,60) )
preds <- predict(bp.lm,new=newdata,interval="confidence")
newdata <- cbind(newdata,preds)
print(newdata)

The outputs, in the order they appear, are interpreted as follows.


1. AOV
print(anova(bp.lm))
Analysis of Variance Table

Response: bloodpressure
          Df Sum Sq Mean Sq F value    Pr(>F)
age        1 2008.20 2008.20  40.778 7.976e-05 ***
Residuals 10  492.47   49.25
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

2. We extract the coefficients for the regression line by using the summary() command.
> summary(bp.lm)

Call:
lm(formula = bloodpressure ~ age, data = bp)

Residuals:
   Min     1Q Median     3Q    Max
 -9.02  -4.35  -3.09   6.11  11.43

Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept)   80.778      9.544    8.46  7.2e-06 ***
age            1.138      0.178    6.39  8.0e-05 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 7 on 10 degrees of freedom
Multiple R-Squared: 0.803,     Adjusted R-squared: 0.783
F-statistic: 40.8 on 1 and 10 DF,  p-value: 7.98e-05

You will notice that the summary command provides us with t-tests of the hypotheses $\alpha = 0$ and $\beta = 0$, as well as the residual standard error (s) and $R^2$. The F-test of the hypothesis $\beta = 0$ is also reported and of course is identical to the F-test in the AOV table.
3. Confidence intervals for the regression coefficients
print(confint(bp.lm))
            2.5 % 97.5 %
(Intercept) 59.51  102.0
age          0.74    1.5
4. The variance-covariance matrix of the regression coefficients may be needed for further work.
VB <- bp.summ$sigma^2 * bp.summ$cov.unscaled
print(VB)
            (Intercept)    age
(Intercept)        91.1 -1.662
age                -1.7  0.032
print(sqrt(diag(VB)))
A quick check shows the connection between the variance-covariance matrix of the regression coefficients and their standard errors: the diagonal of the matrix is (91.1, 0.032), and the square roots of these numbers are (9.54, 0.18), which are the standard errors of the regression coefficients.
5. We can now use our model to predict the blood pressure of 45 and 60 year old subjects.
When the model is fitted, there are estimates of the regression coefficients, i.e. $\hat{\alpha} = 80.8$ and $\hat{\beta} = 1.14$. Given a new value of x (say x = 45), the predicted value is $\hat{y} = 80.8 + 1.14 \times 45$. The standard error and CI for this predicted value can also be calculated.
This is achieved by supplying a new data frame of explanatory variables and calculating predictions with predict() and appropriate arguments.
newdata <- data.frame(age=c(45,60) )
preds <- predict(bp.lm,new=newdata,interval="confidence")
newdata <- cbind(newdata,preds)
print(newdata)
  age fit lwr upr
1  45 132 127 137
2  60 149 144 155
(You should make sure you can match this output up with the calculations made in Example 8.1. For example, $\hat{\sigma}^2 = 49.2$ = Error MS. Also, from the AOV table, the F value is $40.78 = 6.39^2$, where 6.39 is the value of the t statistic for testing the hypothesis $\beta = 0$.)
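As a further check, the interval for age 45 can be computed directly from (8.21). This sketch assumes the bp data frame read in earlier is still in the workspace.

#__________ ci_age45_by_hand.R __________
# Sketch: the confidence interval for the mean response at age 45,
# computed from (8.21), matches the predict() output above.
x <- bp$age; y <- bp$bloodpressure
n <- length(x); xc <- x - mean(x)
bhat <- sum(xc * (y - mean(y))) / sum(xc^2)
s2   <- sum((y - mean(y) - bhat * xc)^2) / (n - 2)
fit  <- mean(y) + bhat * (45 - mean(x))
half <- qt(0.975, n - 2) * sqrt(s2 * (1/n + (45 - mean(x))^2 / sum(xc^2)))
c(fit - half, fit, fit + half)      # approximately 127, 132, 137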
The fitted line and the code that produces the plot look like this:

plot(bloodpressure ~ age,data=bp,las=1)
abline(lsfit(bp$age,bp$bloodpressure))

[Figure: scatterplot of bloodpressure against age for the 12 women, with the fitted least squares line.]

8.5 Correlation

Recall that in a bivariate normal distribution the correlation coefficient $\rho$ is defined by
\[ \rho = \frac{\mathrm{cov}(X, Y)}{\sigma_X \sigma_Y}, \quad \text{and} \quad -1 \le \rho \le 1. \]
In practice, $\rho$ is an unknown parameter and has to be estimated from a sample. Consider $(x_1, y_1), \ldots, (x_n, y_n)$ as a random sample of $n$ pairs of observations and define
\[ r = \frac{\sum_i (x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum_i (x_i - \bar{x})^2 \sum_i (y_i - \bar{y})^2}}. \tag{8.22} \]
This is called Pearson's correlation coefficient, and it is the maximum likelihood estimate of $\rho$ in the bivariate normal distribution.

We will consider testing the hypothesis H: $\rho = 0$ against 1-sided or 2-sided alternatives, using $r$. It can be shown that, if H is true, and a sample of $n$ pairs is taken from a bivariate normal distribution, then
\[ \frac{r\sqrt{n-2}}{\sqrt{1-r^2}} \sim t_{n-2}. \tag{8.23} \]
Alternatively, a table of percentage points of the distribution of Pearson's correlation coefficient $r$ when $\rho = 0$ may be used.
Example 8.2
Suppose a sample of 18 pairs from a bivariate normal distribution yields r = .32; test H0: $\rho = 0$ against H1: $\rho > 0$.
Solution: Now
\[ \frac{r\sqrt{n-2}}{\sqrt{1-r^2}} = \frac{.32\sqrt{16}}{\sqrt{1 - .1024}} = 1.35. \]
The probability of getting a value at least as large as 1.35 is determined from the t-distribution on 16 degrees of freedom,
> pt(df=16,q=1.35,lower.tail=F)
[1] 0.098
so there is no reason to reject H.


Example 8.3
Suppose a sample of 10 pairs from a bivariate normal distribution yields r = .51. Test H: $\rho = 0$ against A: $\rho > 0$.

Solution: The critical value of t is

> qt(df=8,p=0.05,lower.tail=F)
[1] 1.9

A value of r which leads to t < 1.9 will be interpreted as insufficient evidence that $\rho > 0$. The critical value of r is found by inverting equation (8.23), giving $r = t/\sqrt{d + t^2}$ where $d = n - 2$. The 5% (for a 1-tailed test) critical value of r is .55. Our observed value is not as large as this, so the hypothesis of zero correlation cannot be rejected at the 5% significance level.
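This inversion is easy to reproduce in R (a sketch, with $d = n - 2 = 8$ here):

tcrit <- qt(0.95, df = 8)       # about 1.86
tcrit / sqrt(8 + tcrit^2)       # critical value of r, approximately 0.55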
Computer Exercise 8.1
Use R to find the correlation between the age and blood pressure of the 12 women in Example 8.1 and test the hypothesis H0: $\rho = 0$ against the alternative H1: $\rho \neq 0$.
Solution: To find the correlation:

> cor(bp$age, bp$bloodpressure)
[1] 0.8961394

Notice that in the regression output, $R^2 = .803$, which is just $0.896^2$. Now calculate
\[ \frac{r\sqrt{n-2}}{\sqrt{1-r^2}} = \frac{.896\sqrt{10}}{\sqrt{1 - 0.896^2}} = 6.39. \]
Notice this is exactly the same t-value as obtained in the R output (and in Example 8.1) for testing the hypothesis H: $\beta = 0$. These tests are equivalent; that is, if the test of $\beta = 0$ is significant (non-significant), the test of $\rho = 0$ will also be significant (non-significant), with exactly the same P value.
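In practice the whole correlation test can be carried out in one step with cor.test(). The sketch below assumes the bp data frame from the computer solution of Example 8.1 is still loaded; it reports the same t statistic (6.39 on 10 df), together with a two-sided P value and a confidence interval for $\rho$.

#__________ cor_test_sketch.R __________
# Sketch: test of rho = 0 using the built-in cor.test().
cor.test(bp$age, bp$bloodpressure)
# gives t = 6.39, df = 10, a two-sided p-value of about 8e-05,
# and the sample estimate cor = 0.896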
