MLE and Bayesian Estimation From Walpole Book
which simplifies to
$$3.425 < \frac{\sigma_1^2}{\sigma_2^2} < 56.991.$$
Taking square roots of the confidence limits, we find that a 98% confidence interval for $\sigma_1/\sigma_2$ is
$$1.851 < \frac{\sigma_1}{\sigma_2} < 7.549.$$
Since this interval does not allow for the possibility of $\sigma_1/\sigma_2$ being equal to 1, we were correct in assuming that $\sigma_1 \neq \sigma_2$ (that is, $\sigma_1^2 \neq \sigma_2^2$) in Example 9.12.
Exercises
9.71 A manufacturer of car batteries claims that the batteries will last, on average, 3 years with a variance of 1 year. If 5 of these batteries have lifetimes of 1.9, 2.4, 3.0, 3.5, and 4.2 years, construct a 95% confidence interval for σ² and decide if the manufacturer's claim that σ² = 1 is valid. Assume the population of battery lives to be approximately normally distributed. (A computational sketch of this interval follows the exercise list.)

9.72 A random sample of 20 students yielded a mean of x̄ = 72 and a variance of s² = 16 for scores on a college placement test in mathematics. Assuming the scores to be normally distributed, construct a 98% confidence interval for σ².

9.73 Construct a 95% confidence interval for σ² in Exercise 9.9 on page 283.

9.74 Construct a 99% confidence interval for σ² in Exercise 9.11 on page 283.

9.75 Construct a 99% confidence interval for σ in Exercise 9.12 on page 283.

9.76 Construct a 90% confidence interval for σ in Exercise 9.13 on page 283.

9.77 Construct a 98% confidence interval for σ₁/σ₂ in Exercise 9.42 on page 295, where σ₁ and σ₂ are, respectively, the standard deviations for the distances traveled per liter of fuel by the Volkswagen and Toyota mini-trucks.

9.78 Construct a 90% confidence interval for σ₁²/σ₂² in Exercise 9.43 on page 295. Were we justified in assuming that σ₁² = σ₂² when we constructed the confidence interval for μ₁ − μ₂?

9.79 Construct a 90% confidence interval for σ₁²/σ₂² in Exercise 9.46 on page 295. Should we have assumed σ₁² = σ₂² in constructing our confidence interval for μ_I − μ_II?

9.80 Construct a 95% confidence interval for σ_A²/σ_B² in Exercise 9.49 on page 295. Should the equal-variance assumption be used?
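As a quick numerical illustration of the chi-squared interval used in Exercise 9.71, the following sketch (purely illustrative; it relies on numpy and scipy, which the text does not use) computes (n − 1)s²/χ²_{α/2} < σ² < (n − 1)s²/χ²_{1−α/2} for the battery data.

```python
import numpy as np
from scipy.stats import chi2

# Battery lifetimes (years) from Exercise 9.71.
lifetimes = np.array([1.9, 2.4, 3.0, 3.5, 4.2])
n = len(lifetimes)
s2 = lifetimes.var(ddof=1)          # sample variance s^2 (divides by n - 1)

alpha = 0.05                        # for a 95% confidence interval
lower = (n - 1) * s2 / chi2.ppf(1 - alpha / 2, df=n - 1)
upper = (n - 1) * s2 / chi2.ppf(alpha / 2, df=n - 1)
print(f"s^2 = {s2:.3f}, 95% CI for sigma^2: ({lower:.3f}, {upper:.3f})")
# The interval (about 0.293 to 6.736) contains sigma^2 = 1, so the claim is not contradicted.
```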
Maximum Likelihood Estimation

For independent observations x₁, x₂, . . . , xₙ from a distribution with density or mass function f(x; θ), the quantity
$$L(x_1, x_2, \dots, x_n; \theta) = f(x_1, x_2, \dots, x_n; \theta) = f(x_1; \theta) f(x_2; \theta) \cdots f(x_n; \theta)$$
is the joint distribution of the random variables, often referred to as the likelihood function. Note that the variable of the likelihood function is θ, not x. Denote by x₁, x₂, . . . , xₙ the observed values in a sample. In the case of a discrete random variable, the interpretation is very clear: the quantity L(x₁, x₂, . . . , xₙ; θ), the likelihood of the sample, is the joint probability
$$P(X_1 = x_1, X_2 = x_2, \dots, X_n = x_n \mid \theta),$$
which is the probability of obtaining the sample values x₁, x₂, . . . , xₙ. The maximum likelihood estimate of θ is the value of θ that maximizes this likelihood.
For instance, if independent Bernoulli trials with success probability p produce two successes followed by a failure, the likelihood of the sample is
$$p \cdot p \cdot q = p^2 q = p^2 - p^3,$$
which is maximized at p̂ = 2/3.
Quite often it is convenient to work with the natural log of the likelihood
function in finding the maximum of that function. Consider the following example
dealing with the parameter μ of a Poisson distribution.
Example 9.20: Consider a Poisson distribution with probability mass function
$$f(x \mid \mu) = \frac{e^{-\mu}\mu^{x}}{x!}, \qquad x = 0, 1, 2, \dots.$$
Suppose that a random sample x₁, x₂, . . . , xₙ is taken from the distribution. What is the maximum likelihood estimate of μ?
Solution: The likelihood function is
$$L(x_1, x_2, \dots, x_n; \mu) = \prod_{i=1}^{n} f(x_i \mid \mu) = \frac{e^{-n\mu}\,\mu^{\sum_{i=1}^{n} x_i}}{\prod_{i=1}^{n} x_i!}.$$
Now consider
$$\ln L(x_1, x_2, \dots, x_n; \mu) = -n\mu + \left(\sum_{i=1}^{n} x_i\right)\ln\mu - \ln\prod_{i=1}^{n} x_i!$$
and
$$\frac{\partial \ln L(x_1, x_2, \dots, x_n; \mu)}{\partial \mu} = -n + \frac{1}{\mu}\sum_{i=1}^{n} x_i.$$
Solving for μ̂, the maximum likelihood estimator, involves setting the derivative to
zero and solving for the parameter. Thus,
$$\hat\mu = \frac{1}{n}\sum_{i=1}^{n} x_i = \bar{x}.$$
The second derivative of the log-likelihood function is negative, which implies that
the solution above indeed is a maximum. Since μ is the mean of the Poisson
distribution (Chapter 5), the sample average would certainly seem like a reasonable
estimator.
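The closed-form result μ̂ = x̄ can also be checked numerically. The sketch below is illustrative only (the simulated data and the use of scipy are my own additions); it maximizes the Poisson log-likelihood directly and compares the answer with the sample mean.

```python
import numpy as np
from scipy.optimize import minimize_scalar
from scipy.special import gammaln

rng = np.random.default_rng(0)
x = rng.poisson(lam=4.0, size=200)          # a hypothetical Poisson sample

def neg_log_likelihood(mu):
    # -ln L(x; mu) = n*mu - (sum x_i) * ln(mu) + sum ln(x_i!)
    return len(x) * mu - x.sum() * np.log(mu) + gammaln(x + 1).sum()

result = minimize_scalar(neg_log_likelihood, bounds=(1e-6, 20), method="bounded")
print(result.x, x.mean())                   # the numerical maximizer agrees with x-bar
```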
The following example shows the use of the method of maximum likelihood for
finding estimates of two parameters. We simply find the values of the parameters
that maximize (jointly) the likelihood function.
Example 9.21: Consider a random sample x1 , x2 , . . . , xn from a normal distribution N (μ, σ). Find
the maximum likelihood estimators for μ and σ 2 .
Solution: The likelihood function for a random sample of size n from a normal distribution is
$$L(x_1, \dots, x_n; \mu, \sigma^2) = \frac{1}{(2\pi\sigma^2)^{n/2}} \exp\!\left[-\frac{1}{2\sigma^2}\sum_{i=1}^{n}(x_i-\mu)^2\right],$$
so that
$$\ln L = -\frac{n}{2}\ln(2\pi\sigma^2) - \frac{1}{2\sigma^2}\sum_{i=1}^{n}(x_i-\mu)^2.$$
Hence,
$$\frac{\partial \ln L}{\partial \mu} = \frac{1}{\sigma^2}\sum_{i=1}^{n}(x_i-\mu)$$
and
$$\frac{\partial \ln L}{\partial \sigma^2} = -\frac{n}{2\sigma^2} + \frac{1}{2(\sigma^2)^2}\sum_{i=1}^{n}(x_i-\mu)^2.$$
Setting both partial derivatives equal to zero and solving, we obtain
$$\hat\mu = \frac{1}{n}\sum_{i=1}^{n} x_i = \bar{x},$$
which is a pleasing result since x̄ has played such an important role in this chapter
as a point estimate of μ. On the other hand, the maximum likelihood estimator of
σ 2 is
$$\hat\sigma^2 = \frac{1}{n}\sum_{i=1}^{n}(x_i - \bar{x})^2.$$
Checking the second-order partial derivative matrix confirms that the solution
results in a maximum of the likelihood function.
It is interesting to note the distinction between the maximum likelihood esti-
mator of σ 2 and the unbiased estimator S 2 developed earlier in this chapter. The
numerators are identical, of course, and the denominator is the degrees of freedom
n−1 for the unbiased estimator and n for the maximum likelihood estimator. Max-
imum likelihood estimators do not necessarily enjoy the property of unbiasedness.
However, they do have very important asymptotic properties.
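The n versus n − 1 distinction is easy to see numerically. The snippet below is an illustration of my own (not from the text); it computes both estimators for a simulated normal sample.

```python
import numpy as np

rng = np.random.default_rng(1)
x = rng.normal(loc=10.0, scale=2.0, size=25)   # a hypothetical normal sample

sigma2_mle = np.var(x, ddof=0)    # maximum likelihood estimate: divides by n
s2_unbiased = np.var(x, ddof=1)   # unbiased estimator S^2: divides by n - 1
print(sigma2_mle, s2_unbiased)    # S^2 exceeds the MLE by the factor n/(n - 1)
```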
Example 9.22: Suppose 10 rats are used in a biomedical study where they are injected with cancer
cells and then given a cancer drug that is designed to increase their survival rate.
The survival times, in months, are 14, 17, 27, 18, 12, 8, 22, 13, 19, and 12. Assume that the survival times are exponentially distributed with mean β, and find the maximum likelihood estimate of β.
Solution: From Section 6.6, the exponential density with mean β is f(x; β) = (1/β)e^{−x/β} for x > 0, so the log-likelihood of the sample is
$$\ln L(x_1, x_2, \dots, x_{10}; \beta) = -10\ln\beta - \frac{1}{\beta}\sum_{i=1}^{10} x_i.$$
Setting
$$\frac{\partial \ln L}{\partial \beta} = -\frac{10}{\beta} + \frac{1}{\beta^2}\sum_{i=1}^{10} x_i = 0$$
implies that
$$\hat\beta = \frac{1}{10}\sum_{i=1}^{10} x_i = \bar{x} = 16.2.$$
Evaluating the second derivative of the log-likelihood function at the value β̂ above
yields a negative value. As a result, the estimator of the parameter β, the popula-
tion mean, is the sample average x̄.
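As a quick check of Example 9.22 (illustrative only, not part of the text), the MLE is simply the sample mean of the ten survival times:

```python
import numpy as np

survival_months = np.array([14, 17, 27, 18, 12, 8, 22, 13, 19, 12])
print(survival_months.mean())   # 16.2, matching beta-hat in Example 9.22
```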
The following example shows the maximum likelihood estimator for a distribu-
tion that does not appear in previous chapters.
Example 9.23: It is known that a sample consisting of the values 12, 11.2, 13.5, 12.3, 13.8, and
11.9 comes from a population with the density function
$$f(x; \theta) = \begin{cases} \dfrac{\theta}{x^{\theta+1}}, & x > 1, \\[4pt] 0, & \text{elsewhere.} \end{cases}$$
Evaluate the maximum likelihood estimate of θ.
Solution: The log-likelihood of the sample is
$$\ln L(x_1, \dots, x_n; \theta) = n\ln\theta - (\theta+1)\sum_{i=1}^{n}\ln x_i.$$
Setting
$$0 = \frac{\partial \ln L}{\partial \theta} = \frac{n}{\theta} - \sum_{i=1}^{n}\ln x_i$$
results in
$$\hat\theta = \frac{n}{\sum_{i=1}^{n}\ln x_i} = \frac{6}{\ln 12 + \ln 11.2 + \ln 13.5 + \ln 12.3 + \ln 13.8 + \ln 11.9} = 0.3970.$$
Since the second derivative of ln L is −n/θ², which is always negative, the likelihood function does achieve its maximum value at θ̂.
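The closed-form estimate in Example 9.23 can be verified directly (an illustrative check, not part of the text):

```python
import numpy as np

x = np.array([12, 11.2, 13.5, 12.3, 13.8, 11.9])
theta_hat = len(x) / np.log(x).sum()   # theta-hat = n / sum(ln x_i)
print(round(theta_hat, 4))             # 0.3970
```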
Exercises
9.81 Suppose that there are n trials x₁, x₂, . . . , xₙ from a Bernoulli process with parameter p, the probability of a success. That is, the probability of r successes is given by $\binom{n}{r}p^r(1-p)^{n-r}$. Work out the maximum likelihood estimator for the parameter p.

9.82 Consider a random sample of observations x₁, x₂, . . . , xₙ from a Weibull distribution with parameters α and β and density function
Conditional Perspective
Recall that in Chapters 9 through 17, all statistical inferences were based on the
fact that the parameters are unknown but fixed quantities, apart from those in
Section 9.14, in which the parameters were treated as variables and the maximum
likelihood estimates (MLEs) were calculated conditioning on the observed sample
data. In Bayesian statistics, not only are the parameters treated as variables as in
MLE calculation, but also they are treated as random.
Because the observed data are the only experimental results for the practitioner,
statistical inference is based on the actual observed data from a given experiment.
Such a view is called a conditional perspective. Furthermore, in Bayesian concepts,
since the parameters are treated as random, a probability distribution can be
specified, generally by using the subjective probability for the parameter. Such a
distribution is called a prior distribution and it usually reflects the experimenter’s
prior belief about the parameter. In the Bayesian perspective, once an experiment
is conducted and data are observed, all knowledge about the parameter is contained
in the actual observed data and in the prior information.
Bayesian Applications
Although Bayes’ rule is credited to Thomas Bayes, Bayesian applications were
first introduced by French scientist Pierre Simon Laplace, who published a paper
on using Bayesian inference on the unknown binomial proportions (for binomial
distribution, see Section 5.2).
Since the introduction of the Markov chain Monte Carlo (MCMC) computa-
tional tools for Bayesian analysis in the early 1990s, Bayesian statistics has become
more and more popular in statistical modeling and data analysis. Meanwhile,
methodology developments using Bayesian concepts have progressed dramatically,
and they are applied in fields such as bioinformatics, biology, business, engineer-
ing, environmental and ecology science, life science and health, medicine, and many
others.
Definition 18.1: The distribution of θ, given x, which is called the posterior distribution, is given by
$$\pi(\theta \mid x) = \frac{f(x \mid \theta)\pi(\theta)}{g(x)},$$
where g(x) is the marginal distribution of x.
Example 18.1: Assume that the prior distribution for the proportion of defectives produced by a
machine is
p 0.1 0.2
π(p) 0.6 0.4
Denote by x the number of defectives among a random sample of size 2. Find the
posterior probability distribution of p, given that x is observed.
Solution: The random variable X follows a binomial distribution
$$f(x \mid p) = b(x; 2, p) = \binom{2}{x} p^x q^{2-x}, \qquad x = 0, 1, 2.$$
The marginal distribution of x can be calculated as
$$g(x) = f(x \mid 0.1)\pi(0.1) + f(x \mid 0.2)\pi(0.2) = \binom{2}{x}\left[(0.1)^x(0.9)^{2-x}(0.6) + (0.2)^x(0.8)^{2-x}(0.4)\right].$$
Hence, for x = 0, 1, 2, we obtain the marginal probabilities as
x 0 1 2
g(x) 0.742 0.236 0.022
The posterior probability of p = 0.1, given x, is
$$\pi(0.1 \mid x) = \frac{f(x \mid 0.1)\pi(0.1)}{g(x)} = \frac{(0.1)^x(0.9)^{2-x}(0.6)}{(0.1)^x(0.9)^{2-x}(0.6) + (0.2)^x(0.8)^{2-x}(0.4)},$$
and π(0.2|x) = 1 − π(0.1|x).
Suppose that x = 0 is observed.
$$\pi(0.1 \mid 0) = \frac{f(0 \mid 0.1)\pi(0.1)}{g(0)} = \frac{(0.1)^0(0.9)^{2}(0.6)}{0.742} = 0.6550,$$
and π(0.2|0) = 0.3450. If x = 1 is observed, π(0.1|1) = 0.4576, and π(0.2|1) =
0.5424. Finally, π(0.1|2) = 0.2727, and π(0.2|2) = 0.7273.
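The whole posterior calculation of Example 18.1 can be reproduced in a few lines. The sketch below is purely illustrative (it uses scipy's binomial pmf, which the text does not).

```python
import numpy as np
from scipy.stats import binom

p_values = np.array([0.1, 0.2])   # the two candidate values of p
prior = np.array([0.6, 0.4])      # prior probabilities pi(p)

for x in range(3):                              # x = number of defectives in a sample of 2
    likelihood = binom.pmf(x, n=2, p=p_values)  # f(x | p) for each candidate p
    marginal = (likelihood * prior).sum()       # g(x)
    posterior = likelihood * prior / marginal   # pi(p | x)
    print(x, round(marginal, 3), np.round(posterior, 4))
# Reproduces g(x) = 0.742, 0.236, 0.022 and pi(0.1 | x) = 0.6550, 0.4576, 0.2727.
```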
The prior distribution for Example 18.1 is discrete, although the natural range
of p is from 0 to 1. Consider the following example, where we have a prior distri-
bution covering the whole space for p.
Example 18.2: Suppose that the prior distribution of p is uniform (i.e., π(p) = 1, for 0 < p <
1). Use the same random variable X as in Example 18.1 to find the posterior
distribution of p.
Solution : As in Example 18.1, we have
$$f(x \mid p) = b(x; 2, p) = \binom{2}{x} p^x q^{2-x}, \qquad x = 0, 1, 2.$$
The marginal distribution of x can be calculated as
$$g(x) = \int_0^1 f(x \mid p)\pi(p)\, dp = \binom{2}{x}\int_0^1 p^x (1-p)^{2-x}\, dp.$$
The integral above can be evaluated at each x directly as g(0) = 1/3, g(1) = 1/3,
and g(2) = 1/3. Therefore, the posterior distribution of p, given x, is
$$\pi(p \mid x) = \frac{\binom{2}{x} p^x (1-p)^{2-x}}{1/3} = 3\binom{2}{x} p^x (1-p)^{2-x}, \qquad 0 < p < 1.$$
The posterior distribution above is actually a beta distribution (see Section 6.8)
with parameters α = x + 1 and β = 3 − x. So, if x = 0 is observed, the posterior
distribution of p is a beta distribution with parameters (1, 3). The posterior mean is
$$\mu = \frac{1}{1+3} = \frac{1}{4},$$
and the posterior variance is
$$\sigma^2 = \frac{(1)(3)}{(1+3)^2(1+3+1)} = \frac{3}{80}.$$
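The beta posterior of Example 18.2 can also be examined with scipy (an illustrative check, not from the text): for x = 0 the posterior is Beta(1, 3).

```python
from scipy.stats import beta

posterior = beta(a=1, b=3)                 # posterior of p when x = 0 is observed
print(posterior.mean(), posterior.var())   # 0.25 and 0.0375 = 3/80, as in the text
```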
The posterior distribution can also be written, up to a normalizing constant, as
$$\pi(\theta \mid x) \propto f(x \mid \theta)\pi(\theta),$$
where the symbol "∝" stands for "is proportional to." In calculating a posterior distribution this way, we can leave out all factors that do not depend on θ, since they are absorbed into the normalizing constant, namely the marginal density g(x).
Example 18.3: Suppose that random variables X1 , . . . , Xn are independent and from a Poisson
distribution with mean λ. Assume that the prior distribution of λ is exponential
with mean 1. Find the posterior distribution of λ when x̄ = 3 with n = 10.
Solution: The density function of X = (X₁, . . . , Xₙ) is
$$f(\mathbf{x} \mid \lambda) = \prod_{i=1}^{n} e^{-\lambda}\frac{\lambda^{x_i}}{x_i!} = e^{-n\lambda}\,\frac{\lambda^{\sum_{i=1}^{n} x_i}}{\prod_{i=1}^{n} x_i!},$$
and the prior distribution is π(λ) = e^{−λ} for λ > 0. Hence, the posterior distribution satisfies
$$\pi(\lambda \mid \mathbf{x}) \propto f(\mathbf{x} \mid \lambda)\pi(\lambda) \propto \lambda^{\sum_{i=1}^{n} x_i}\, e^{-(n+1)\lambda}.$$
Referring to the gamma distribution in Section 6.6, we conclude that the posterior distribution of λ follows a gamma distribution with parameters $1 + \sum_{i=1}^{n} x_i$ and $\frac{1}{n+1}$. Hence, the posterior mean and variance of λ are
$$\frac{\sum_{i=1}^{n} x_i + 1}{n+1} \quad\text{and}\quad \frac{\sum_{i=1}^{n} x_i + 1}{(n+1)^2}.$$
So, when x̄ = 3 with n = 10, we have $\sum_{i=1}^{10} x_i = 30$. Hence, the posterior distribution of λ is a gamma distribution with parameters 31 and 1/11.
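For Example 18.3, the gamma posterior can likewise be handled numerically (illustrative only; scipy parameterizes the gamma by shape a and scale):

```python
from scipy.stats import gamma

n, xbar = 10, 3
shape = 1 + n * xbar      # 1 + sum of the observations = 31
scale = 1 / (n + 1)       # 1/11
posterior = gamma(a=shape, scale=scale)
print(posterior.mean(), posterior.var())   # 31/11 ≈ 2.818 and 31/121 ≈ 0.256
```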
From Example 18.3 we see that it is often quite convenient to use the "proportional to" technique when calculating a posterior distribution, especially when the result can be recognized as one of the commonly used distributions described in Chapters 5 and 6.
Example 18.4: Suppose that x = 1 is observed for Example 18.2. Find the posterior mean and
the posterior mode.
Solution: When x = 1, the posterior distribution of p is a beta distribution with parameters (2, 2); that is, π(p|1) = 6p(1 − p), for 0 < p < 1. The posterior mean is therefore 2/(2 + 2) = 1/2. To find the posterior mode, we need to obtain the value of p at which the posterior distribution is maximized. Taking the derivative of π(p|1) with respect to p, we obtain 6 − 12p. Solving 6 − 12p = 0 gives p = 1/2. The second derivative is −12, which is negative, so the posterior mode is indeed achieved at p = 1/2.
Bayesian methods of estimation concerning the mean μ of a normal population
are based on the following example.
Example 18.5: If x̄ is the mean of a random sample of size n from a normal population with
known variance σ 2 , and the prior distribution of the population mean is a normal
distribution with known mean μ0 and known variance σ02 , then show that the
posterior distribution of the population mean is also a normal distribution, with mean μ∗ and standard deviation σ∗ as given below.
Solution: Multiplying the likelihood of the sample by the normal prior density, using the fact that the sample mean X̄ has a normal distribution with mean μ and standard deviation σ/√n from Section 8.5, and completing the squares for μ yields the posterior distribution
$$\pi(\mu \mid x) \propto \exp\!\left[-\frac{1}{2}\left(\frac{\mu-\mu^*}{\sigma^*}\right)^{2}\right],$$
where
$$\mu^* = \frac{n\bar{x}\sigma_0^2 + \mu_0\sigma^2}{n\sigma_0^2 + \sigma^2}, \qquad \sigma^* = \sqrt{\frac{\sigma_0^2\sigma^2}{n\sigma_0^2 + \sigma^2}}.$$
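The update formulas for μ∗ and σ∗ are easy to wrap in a small helper. The function below is a sketch of my own (names and structure are not from the text); plugging in the numbers of Example 18.7 reproduces the values used there.

```python
import math

def normal_posterior(n, xbar, sigma, mu0, sigma0):
    """Posterior mean and standard deviation for a normal mean with a normal prior."""
    mu_star = (n * xbar * sigma0**2 + mu0 * sigma**2) / (n * sigma0**2 + sigma**2)
    sigma_star = math.sqrt(sigma0**2 * sigma**2 / (n * sigma0**2 + sigma**2))
    return mu_star, sigma_star

print(normal_posterior(n=25, xbar=780, sigma=100, mu0=800, sigma0=10))
# (796.0, 8.944...) -- i.e., mu* = 796 and sigma* = sqrt(80), as in Example 18.7
```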
Definition 18.2: The interval a < θ < b will be called a 100(1 − α)% Bayesian interval for θ if
$$\int_{-\infty}^{a} \pi(\theta \mid x)\, d\theta = \int_{b}^{\infty} \pi(\theta \mid x)\, d\theta = \frac{\alpha}{2}.$$
Example 18.6: Supposing that X ∼ b(x; n, p), with known n = 2, and the prior distribution of p
is uniform π(p) = 1, for 0 < p < 1, find a 95% Bayesian interval for p.
Solution: When x = 0 is observed, the posterior distribution of p is, as in Example 18.2, a beta distribution with parameters (1, 3); that is, π(p|0) = 3(1 − p)², for 0 < p < 1. We therefore need constants a and b such that
$$0.025 = \int_{0}^{a} 3(1-p)^2\, dp = 1 - (1-a)^3$$
and
$$0.025 = \int_{b}^{1} 3(1-p)^2\, dp = (1-b)^3.$$
The solutions to the above equations result in a = 0.0084 and b = 0.7076. There-
fore, the probability that p falls into (0.0084, 0.7076) is 95%.
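Equivalently, a and b are the 2.5th and 97.5th percentiles of the Beta(1, 3) posterior, which can be obtained directly (an illustrative check using scipy, not part of the text):

```python
from scipy.stats import beta

posterior = beta(a=1, b=3)
a, b = posterior.ppf(0.025), posterior.ppf(0.975)
print(round(a, 4), round(b, 4))   # 0.0084 and 0.7076
```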
For the normal population and normal prior case described in Example 18.5,
the posterior mean μ∗ is the Bayes estimate of the population mean μ, and a
100(1−α)% Bayesian interval for μ can be constructed by computing the interval
μ∗ − zα/2 σ ∗ < μ < μ∗ + zα/2 σ ∗ ,
which is centered at the posterior mean and contains 100(1 − α)% of the posterior
probability.
Example 18.7: An electrical firm manufactures light bulbs that have a length of life that is ap-
proximately normally distributed with a standard deviation of 100 hours. Prior
experience leads us to believe that μ is a value of a normal random variable with a
mean μ0 = 800 hours and a standard deviation σ0 = 10 hours. If a random sample
of 25 bulbs has an average life of 780 hours, find a 95% Bayesian interval for μ.
Solution : According to Example 18.5, the posterior distribution of the mean is also a normal
distribution with mean
$$\mu^* = \frac{(25)(780)(10)^2 + (800)(100)^2}{(25)(10)^2 + (100)^2} = 796$$
and standard deviation
$$\sigma^* = \sqrt{\frac{(10)^2(100)^2}{(25)(10)^2 + (100)^2}} = \sqrt{80}.$$
The 95% Bayesian interval for μ is then given by
$$796 - 1.96\sqrt{80} < \mu < 796 + 1.96\sqrt{80},$$
or
778.5 < μ < 813.5.
Hence, we are 95% sure that μ will be between 778.5 and 813.5.
On the other hand, ignoring the prior information about μ, we could proceed
as in Section 9.4 and construct the classical 95% confidence interval
$$780 - (1.96)\frac{100}{\sqrt{25}} < \mu < 780 + (1.96)\frac{100}{\sqrt{25}},$$
or 740.8 < μ < 819.2, which is seen to be wider than the corresponding Bayesian
interval.