MLE and Bayesian Estimation from the Walpole Book



which simplifies to 3.425 < σ₁²/σ₂² < 56.991. Taking square roots of the confidence
limits, we find that a 98% confidence interval for σ₁/σ₂ is

1.851 < σ₁/σ₂ < 7.549.

Since this interval does not allow for the possibility of σ₁/σ₂ being equal to 1, we
were correct in assuming that σ₁ ≠ σ₂, or σ₁² ≠ σ₂², in Example 9.12.

Exercises

9.71 A manufacturer of car batteries claims that the batteries will last, on average, 3 years with a variance of 1 year. If 5 of these batteries have lifetimes of 1.9, 2.4, 3.0, 3.5, and 4.2 years, construct a 95% confidence interval for σ² and decide if the manufacturer's claim that σ² = 1 is valid. Assume the population of battery lives to be approximately normally distributed.

9.72 A random sample of 20 students yielded a mean of x̄ = 72 and a variance of s² = 16 for scores on a college placement test in mathematics. Assuming the scores to be normally distributed, construct a 98% confidence interval for σ².

9.73 Construct a 95% confidence interval for σ² in Exercise 9.9 on page 283.

9.74 Construct a 99% confidence interval for σ² in Exercise 9.11 on page 283.

9.75 Construct a 99% confidence interval for σ in Exercise 9.12 on page 283.

9.76 Construct a 90% confidence interval for σ in Exercise 9.13 on page 283.

9.77 Construct a 98% confidence interval for σ₁/σ₂ in Exercise 9.42 on page 295, where σ₁ and σ₂ are, respectively, the standard deviations for the distances traveled per liter of fuel by the Volkswagen and Toyota mini-trucks.

9.78 Construct a 90% confidence interval for σ₁²/σ₂² in Exercise 9.43 on page 295. Were we justified in assuming that σ₁² = σ₂² when we constructed the confidence interval for μ₁ − μ₂?

9.79 Construct a 90% confidence interval for σ₁²/σ₂² in Exercise 9.46 on page 295. Should we have assumed σ₁² = σ₂² in constructing our confidence interval for μI − μII?

9.80 Construct a 95% confidence interval for σA²/σB² in Exercise 9.49 on page 295. Should the equal-variance assumption be used?

9.14 Maximum Likelihood Estimation (Optional)


Often the estimators of parameters have been those that appeal to intuition. The
estimator X̄ certainly seems reasonable as an estimator of a population mean μ.
The virtue of S 2 as an estimator of σ 2 is underscored through the discussion of
unbiasedness in Section 9.3. The estimator for a binomial parameter p is merely a
sample proportion, which, of course, is an average and appeals to common sense.
But there are many situations in which it is not at all obvious what the proper
estimator should be. As a result, there is much to be learned by the student
of statistics concerning different philosophies that produce different methods of
estimation. In this section, we deal with the method of maximum likelihood.
Maximum likelihood estimation is one of the most important approaches to
estimation in all of statistical inference. We will not give a thorough development of
the method. Rather, we will attempt to communicate the philosophy of maximum
likelihood and illustrate with examples that relate to other estimation problems
discussed in this chapter.

The Likelihood Function


As the name implies, the method of maximum likelihood is that for which the like-
lihood function is maximized. The likelihood function is best illustrated through
the use of an example with a discrete distribution and a single parameter. Denote
by X1 , X2 , . . . , Xn the independent random variables taken from a discrete prob-
ability distribution represented by f (x, θ), where θ is a single parameter of the
distribution. Now

L(x1 , x2 , . . . , xn ; θ) = f (x1 , x2 , . . . , xn ; θ)
= f (x1 , θ)f (x2 , θ) · · · f (xn , θ)

is the joint distribution of the random variables, often referred to as the likelihood
function. Note that the variable of the likelihood function is θ, not x. Denote by
x1 , x2 , . . . , xn the observed values in a sample. In the case of a discrete random
variable, the interpretation is very clear. The quantity L(x1 , x2 , . . . , xn ; θ), the
likelihood of the sample, is the following joint probability:

P (X1 = x1 , X2 = x2 , . . . , Xn = xn | θ),

which is the probability of obtaining the sample values x1, x2, . . . , xn. For the
discrete case, the maximum likelihood estimator is one that results in a maximum
value for this joint probability or maximizes the likelihood of the sample.
Consider a fictitious example where three items from an assembly line are
inspected. The items are ruled either defective or nondefective, and thus the
Bernoulli process applies. Testing the three items results in two nondefective items
followed by a defective item. It is of interest to estimate p, the proportion non-
defective in the process. The likelihood of the sample for this illustration is given
by

p · p · q = p2 q = p2 − p3 ,

where q = 1 − p. Maximum likelihood estimation would give an estimate of p for
which the likelihood is maximized. It is clear that if we differentiate the likelihood
with respect to p, set the derivative to zero, and solve, we obtain the value

p̂ = 2/3.
Now, of course, in this situation p̂ = 2/3 is the sample proportion nondefective
and is thus a reasonable estimator of the probability that an item is nondefective. The reader
should attempt to understand that the philosophy of maximum likelihood estima-
tion evolves from the notion that the reasonable estimator of a parameter based
on sample information is that parameter value that produces the largest probability
of obtaining the sample. This is, indeed, the interpretation for the discrete case,
since the likelihood is the probability of jointly observing the values in the sample.
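As a quick numerical illustration (a sketch added here, not part of the original text), the calculus result above can be checked by maximizing the likelihood directly. The snippet assumes SciPy is available; the function and variable names are purely illustrative.

```python
# Numerically maximize the likelihood L(p) = p^2 (1 - p) of the
# assembly-line example (two nondefectives followed by one defective).
from scipy.optimize import minimize_scalar

def neg_likelihood(p):
    # SciPy minimizes, so return the negative of the likelihood.
    return -(p**2 * (1 - p))

result = minimize_scalar(neg_likelihood, bounds=(0.0, 1.0), method="bounded")
print(result.x)  # approximately 0.6667, i.e., p-hat = 2/3
```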
Now, while the interpretation of the likelihood function as a joint probability
is confined to the discrete case, the notion of maximum likelihood extends to the
estimation of parameters of a continuous distribution. We now present a formal
definition of maximum likelihood estimation.

Definition 9.3: Given independent observations x1, x2, . . . , xn from a probability density function (continuous case) or probability mass function (discrete case) f(x; θ), the maximum likelihood estimator θ̂ is that which maximizes the likelihood function

L(x1, x2, . . . , xn; θ) = f(x1, x2, . . . , xn; θ) = f(x1; θ) f(x2; θ) · · · f(xn; θ).

Quite often it is convenient to work with the natural log of the likelihood
function in finding the maximum of that function. Consider the following example
dealing with the parameter μ of a Poisson distribution.

Example 9.20: Consider a Poisson distribution with probability mass function

f(x | μ) = e^{−μ} μ^x / x!,   x = 0, 1, 2, . . . .
Suppose that a random sample x1 , x2 , . . . , xn is taken from the distribution. What
is the maximum likelihood estimate of μ?
Solution : The likelihood function is

L(x1, x2, . . . , xn; μ) = ∏_{i=1}^{n} f(x_i | μ) = e^{−nμ} μ^{∑_{i=1}^{n} x_i} / ∏_{i=1}^{n} x_i!.

Now consider

ln L(x1, x2, . . . , xn; μ) = −nμ + (∑_{i=1}^{n} x_i) ln μ − ln ∏_{i=1}^{n} x_i!,

∂ ln L(x1, x2, . . . , xn; μ) / ∂μ = −n + (∑_{i=1}^{n} x_i) / μ.

Solving for μ̂, the maximum likelihood estimator, involves setting the derivative to
zero and solving for the parameter. Thus,

μ̂ = (1/n) ∑_{i=1}^{n} x_i = x̄.

The second derivative of the log-likelihood function is negative, which implies that
the solution above indeed is a maximum. Since μ is the mean of the Poisson
distribution (Chapter 5), the sample average would certainly seem like a reasonable
estimator.
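The closed-form result μ̂ = x̄ can also be verified numerically. The following sketch (not part of the text) assumes NumPy and SciPy are available and uses simulated Poisson data purely for illustration; it maximizes the log-likelihood derived above and compares the maximizer with the sample mean.

```python
# Maximize the Poisson log-likelihood
#   ln L(mu) = -n*mu + (sum x_i) * ln(mu) - sum ln(x_i!)
# numerically and compare the result with x-bar (simulated data, for illustration).
import numpy as np
from scipy.special import gammaln          # gammaln(x + 1) = ln(x!)
from scipy.optimize import minimize_scalar

rng = np.random.default_rng(0)
x = rng.poisson(lam=4.0, size=50)

def neg_log_likelihood(mu):
    return -(-len(x) * mu + x.sum() * np.log(mu) - gammaln(x + 1).sum())

fit = minimize_scalar(neg_log_likelihood, bounds=(1e-6, 20.0), method="bounded")
print(fit.x, x.mean())  # the two values agree up to numerical tolerance
```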
The following example shows the use of the method of maximum likelihood for
finding estimates of two parameters. We simply find the values of the parameters
that maximize (jointly) the likelihood function.

Example 9.21: Consider a random sample x1 , x2 , . . . , xn from a normal distribution N (μ, σ). Find
the maximum likelihood estimators for μ and σ 2 .

Solution : The likelihood function for the normal distribution is

L(x1, x2, . . . , xn; μ, σ²) = 1 / [(2π)^{n/2} (σ²)^{n/2}] · exp[ −(1/2) ∑_{i=1}^{n} ((x_i − μ)/σ)² ].

Taking logarithms gives us

ln L(x1, x2, . . . , xn; μ, σ²) = −(n/2) ln(2π) − (n/2) ln σ² − (1/2) ∑_{i=1}^{n} ((x_i − μ)/σ)².

Hence,

∂ ln L / ∂μ = ∑_{i=1}^{n} (x_i − μ)/σ²

and

∂ ln L / ∂σ² = −n/(2σ²) + (1/(2(σ²)²)) ∑_{i=1}^{n} (x_i − μ)².

Setting both derivatives equal to 0, we obtain

∑_{i=1}^{n} x_i − nμ = 0   and   nσ² = ∑_{i=1}^{n} (x_i − μ)².

Thus, the maximum likelihood estimator of μ is given by

μ̂ = (1/n) ∑_{i=1}^{n} x_i = x̄,

which is a pleasing result since x̄ has played such an important role in this chapter
as a point estimate of μ. On the other hand, the maximum likelihood estimator of
σ² is

σ̂² = (1/n) ∑_{i=1}^{n} (x_i − x̄)².

Checking the second-order partial derivative matrix confirms that the solution
results in a maximum of the likelihood function.
It is interesting to note the distinction between the maximum likelihood esti-
mator of σ 2 and the unbiased estimator S 2 developed earlier in this chapter. The
numerators are identical, of course, and the denominator is the degrees of freedom
n−1 for the unbiased estimator and n for the maximum likelihood estimator. Max-
imum likelihood estimators do not necessarily enjoy the property of unbiasedness.
However, they do have very important asymptotic properties.
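The distinction between σ̂² and S² is easy to see numerically. The sketch below (not part of the text) uses NumPy's ddof argument; the data values are illustrative only.

```python
# For the same data, the MLE of sigma^2 divides by n (ddof=0), while the
# unbiased estimator S^2 divides by n - 1 (ddof=1).
import numpy as np

x = np.array([2.1, 3.4, 2.9, 3.8, 3.1, 2.5])   # illustrative data
n = len(x)

sigma2_mle = np.var(x, ddof=0)    # (1/n) * sum (x_i - x_bar)^2
s2_unbiased = np.var(x, ddof=1)   # (1/(n-1)) * sum (x_i - x_bar)^2
print(sigma2_mle, s2_unbiased)
print(np.isclose(sigma2_mle * n / (n - 1), s2_unbiased))  # True
```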

Example 9.22: Suppose 10 rats are used in a biomedical study where they are injected with cancer
cells and then given a cancer drug that is designed to increase their survival rate.
The survival times, in months, are 14, 17, 27, 18, 12, 8, 22, 13, 19, and 12. Assume
that the exponential distribution applies. Give a maximum likelihood estimate of the mean survival time.
Solution : From Chapter 6, we know that the probability density function for the exponential
random variable X is

f(x; β) = (1/β) e^{−x/β} for x > 0, and f(x; β) = 0 elsewhere.

Thus, the log-likelihood function for the data, given n = 10, is

ln L(x1, x2, . . . , x10; β) = −10 ln β − (1/β) ∑_{i=1}^{10} x_i.

Setting

∂ ln L / ∂β = −10/β + (1/β²) ∑_{i=1}^{10} x_i = 0

implies that

β̂ = (1/10) ∑_{i=1}^{10} x_i = x̄ = 16.2.

Evaluating the second derivative of the log-likelihood function at the value β̂ above
yields a negative value. As a result, the estimator of the parameter β, the popula-
tion mean, is the sample average x̄.
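A short check of this example's arithmetic (a sketch, assuming NumPy is available):

```python
# MLE of the exponential mean beta is the sample average of the survival times.
import numpy as np

survival = np.array([14, 17, 27, 18, 12, 8, 22, 13, 19, 12], dtype=float)
print(survival.mean())  # 16.2, matching beta-hat in Example 9.22
```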
The following example shows the maximum likelihood estimator for a distribu-
tion that does not appear in previous chapters.

Example 9.23: It is known that a sample consisting of the values 12, 11.2, 13.5, 12.3, 13.8, and
11.9 comes from a population with the density function

f(x; θ) = θ / x^{θ+1} for x > 1, and f(x; θ) = 0 elsewhere,

where θ > 0. Find the maximum likelihood estimate of θ.


Solution : The likelihood function of n observations from this population can be written as

L(x1, x2, . . . , xn; θ) = ∏_{i=1}^{n} θ / x_i^{θ+1} = θ^n / (∏_{i=1}^{n} x_i)^{θ+1},

which implies that

ln L(x1, x2, . . . , xn; θ) = n ln θ − (θ + 1) ∑_{i=1}^{n} ln x_i.


Setting

0 = ∂ ln L / ∂θ = n/θ − ∑_{i=1}^{n} ln x_i

results in

θ̂ = n / ∑_{i=1}^{n} ln x_i = 6 / [ln 12 + ln 11.2 + ln 13.5 + ln 12.3 + ln 13.8 + ln 11.9] = 0.3970.

Since the second derivative of the log-likelihood function is −n/θ², which is always negative, the likelihood
function does achieve its maximum value at θ̂.
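The arithmetic of Example 9.23 can be reproduced in a few lines (a sketch, assuming NumPy is available):

```python
# theta-hat = n / sum(ln x_i) for the density f(x; theta) = theta / x^(theta+1), x > 1.
import numpy as np

x = np.array([12.0, 11.2, 13.5, 12.3, 13.8, 11.9])
theta_hat = len(x) / np.log(x).sum()
print(theta_hat)  # approximately 0.3970
```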

Additional Comments Concerning Maximum Likelihood Estimation


A thorough discussion of the properties of maximum likelihood estimation is be-
yond the scope of this book and is usually a major topic of a course in the theory
of statistical inference. The method of maximum likelihood allows the analyst to
make use of knowledge of the distribution in determining an appropriate estima-
tor. The method of maximum likelihood cannot be applied without knowledge of the
underlying distribution. We learned in Example 9.21 that the maximum likelihood
estimator is not necessarily unbiased. The maximum likelihood estimator is unbi-
ased asymptotically or in the limit; that is, the amount of bias approaches zero as
the sample size becomes large. Earlier in this chapter the notion of efficiency was
discussed, efficiency being linked to the variance property of an estimator. Maxi-
mum likelihood estimators possess desirable variance properties in the limit. The
reader should consult Lehmann and D’Abrera (1998) for details.

Exercises

9.81 Suppose that there are n trials x1, x2, . . . , xn from a Bernoulli process with parameter p, the probability of a success. That is, the probability of r successes is given by (n choose r) p^r (1 − p)^{n−r}. Work out the maximum likelihood estimator for the parameter p.

9.82 Consider the lognormal distribution with the density function given in Section 6.9. Suppose we have a random sample x1, x2, . . . , xn from a lognormal distribution.
(a) Write out the likelihood function.
(b) Develop the maximum likelihood estimators of μ and σ².

9.83 Consider a random sample of x1, . . . , xn coming from the gamma distribution discussed in Section 6.6. Suppose the parameter α is known, say 5, and determine the maximum likelihood estimator for the parameter β.

9.84 Consider a random sample of x1, x2, . . . , xn observations from a Weibull distribution with parameters α and β and density function

f(x) = αβ x^{β−1} e^{−αx^β} for x > 0, and f(x) = 0 elsewhere,

for α, β > 0.
(a) Write out the likelihood function.
(b) Write out the equations that, when solved, give the maximum likelihood estimators of α and β.

9.85 Consider a random sample of x1, . . . , xn from a uniform distribution U(0, θ) with unknown parameter θ, where θ > 0. Determine the maximum likelihood estimator of θ.

9.86 Consider the independent observations x1, x2, . . . , xn from the gamma distribution discussed in Section 6.6.
(a) Write out the likelihood function.
Chapter 18 Bayesian Statistics

However, in many situations, the preceding probability interpretations cannot
be applied. For instance, consider the questions “What is the probability that
it will rain tomorrow?” “How likely is it that this stock will go up by the end
of the month?” and “What is the likelihood that two companies will be merged
together?” They can hardly be interpreted by the aforementioned approaches, and
the answers to these questions may be different for different people. Yet these
questions are constantly asked in daily life, and the approach used to explain these
probabilities is called subjective probability, which reflects one’s subjective opinion.

Conditional Perspective
Recall that in Chapters 9 through 17, all statistical inferences were based on the
fact that the parameters are unknown but fixed quantities, apart from those in
Section 9.14, in which the parameters were treated as variables and the maximum
likelihood estimates (MLEs) were calculated conditioning on the observed sample
data. In Bayesian statistics, not only are the parameters treated as variables as in
MLE calculation, but also they are treated as random.
Because the observed data are the only experimental results for the practitioner,
statistical inference is based on the actual observed data from a given experiment.
Such a view is called a conditional perspective. Furthermore, in Bayesian concepts,
since the parameters are treated as random, a probability distribution can be
specified, generally by using the subjective probability for the parameter. Such a
distribution is called a prior distribution and it usually reflects the experimenter’s
prior belief about the parameter. In the Bayesian perspective, once an experiment
is conducted and data are observed, all knowledge about the parameter is contained
in the actual observed data and in the prior information.

Bayesian Applications
Although Bayes’ rule is credited to Thomas Bayes, Bayesian applications were
first introduced by French scientist Pierre Simon Laplace, who published a paper
on using Bayesian inference on the unknown binomial proportions (for binomial
distribution, see Section 5.2).
Since the introduction of the Markov chain Monte Carlo (MCMC) computa-
tional tools for Bayesian analysis in the early 1990s, Bayesian statistics has become
more and more popular in statistical modeling and data analysis. Meanwhile,
methodology developments using Bayesian concepts have progressed dramatically,
and they are applied in fields such as bioinformatics, biology, business, engineer-
ing, environmental and ecology science, life science and health, medicine, and many
others.

18.2 Bayesian Inferences


Consider the problem of finding a point estimate of the parameter θ for the pop-
ulation with distribution f (x| θ), given θ. Denote by π(θ) the prior distribution
of θ. Suppose that a random sample of size n, denoted by x = (x1 , x2 , . . . , xn ), is
observed.

Definition 18.1: The distribution of θ, given x, which is called the posterior distribution, is given by

π(θ|x) = f(x|θ) π(θ) / g(x),

where g(x) is the marginal distribution of x.

The marginal distribution of x in the above definition can be calculated using the following formula:

g(x) = ∑_θ f(x|θ) π(θ) if θ is discrete, and g(x) = ∫_{−∞}^{∞} f(x|θ) π(θ) dθ if θ is continuous.

Example 18.1: Assume that the prior distribution for the proportion of defectives produced by a
machine is
p 0.1 0.2
π(p) 0.6 0.4
Denote by x the number of defectives among a random sample of size 2. Find the
posterior probability distribution of p, given that x is observed.
Solution : The random variable X follows a binomial distribution

f(x|p) = b(x; 2, p) = (2 choose x) p^x q^{2−x},   x = 0, 1, 2.
The marginal distribution of x can be calculated as

g(x) = f(x|0.1) π(0.1) + f(x|0.2) π(0.2) = (2 choose x) [(0.1)^x (0.9)^{2−x} (0.6) + (0.2)^x (0.8)^{2−x} (0.4)].
Hence, for x = 0, 1, 2, we obtain the marginal probabilities as
x 0 1 2
g(x) 0.742 0.236 0.022
The posterior probability of p = 0.1, given x, is

π(0.1|x) = f(x|0.1) π(0.1) / g(x) = (0.1)^x (0.9)^{2−x} (0.6) / [(0.1)^x (0.9)^{2−x} (0.6) + (0.2)^x (0.8)^{2−x} (0.4)],

and π(0.2|x) = 1 − π(0.1|x).
Suppose that x = 0 is observed. Then

π(0.1|0) = f(0|0.1) π(0.1) / g(0) = (0.1)^0 (0.9)^2 (0.6) / 0.742 = 0.6550,
and π(0.2|0) = 0.3450. If x = 1 is observed, π(0.1|1) = 0.4576, and π(0.2|1) =
0.5424. Finally, π(0.1|2) = 0.2727, and π(0.2|2) = 0.7273.
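The posterior table of Example 18.1 can be reproduced directly from Definition 18.1. The sketch below (not from the text) assumes SciPy is available; the dictionary-based prior is just an illustrative device.

```python
# Discrete posterior for Example 18.1: two-point prior on p, binomial(n=2) likelihood.
from scipy.stats import binom

prior = {0.1: 0.6, 0.2: 0.4}

def posterior(x):
    g = sum(binom.pmf(x, 2, p) * w for p, w in prior.items())   # marginal g(x)
    return {p: binom.pmf(x, 2, p) * w / g for p, w in prior.items()}

for x in (0, 1, 2):
    print(x, posterior(x))   # e.g., x = 0 gives pi(0.1|0) = 0.6550
```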
The prior distribution for Example 18.1 is discrete, although the natural range
of p is from 0 to 1. Consider the following example, where we have a prior distri-
bution covering the whole space for p.

Example 18.2: Suppose that the prior distribution of p is uniform (i.e., π(p) = 1, for 0 < p <
1). Use the same random variable X as in Example 18.1 to find the posterior
distribution of p.
Solution : As in Example 18.1, we have
 
f(x|p) = b(x; 2, p) = (2 choose x) p^x q^{2−x},   x = 0, 1, 2.
The marginal distribution of x can be calculated as

g(x) = ∫_0^1 f(x|p) π(p) dp = (2 choose x) ∫_0^1 p^x (1 − p)^{2−x} dp.

The integral above can be evaluated at each x directly as g(0) = 1/3, g(1) = 1/3,
and g(2) = 1/3. Therefore, the posterior distribution of p, given x, is
π(p|x) = (2 choose x) p^x (1 − p)^{2−x} / (1/3) = 3 (2 choose x) p^x (1 − p)^{2−x},   0 < p < 1.

The posterior distribution above is actually a beta distribution (see Section 6.8)
with parameters α = x + 1 and β = 3 − x. So, if x = 0 is observed, the posterior
distribution of p is a beta distribution with parameters (1, 3). The posterior mean
is μ = 1/(1+3) = 1/4 and the posterior variance is σ² = (1)(3) / [(1+3)²(1+3+1)] = 3/80.
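Because the posterior is a standard beta distribution, its summaries can be read off directly (a sketch, assuming SciPy is available):

```python
# Posterior for Example 18.2 with x = 0 observed: Beta(x + 1, 3 - x) = Beta(1, 3).
from scipy.stats import beta

x = 0
post = beta(x + 1, 3 - x)
print(post.mean(), post.var())   # 0.25 and 0.0375 (= 3/80)
```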

Using the posterior distribution, we can estimate the parameter(s) in a population in a straightforward fashion. In computing posterior distributions, it is very
helpful if one is familiar with the distributions in Chapters 5 and 6. Note that
in Definition 18.1, the variable in the posterior distribution is θ, while x is given.
Thus, we can treat g(x) as a constant as we calculate the posterior distribution of
θ. Then the posterior distribution can be expressed as

π(θ|x) ∝ f (x|θ)π(θ),

where the symbol “∝” stands for is proportional to. In the calculation of the
posterior distribution above, we can leave the factors that do not depend on θ out
of the normalization constant, i.e., the marginal density g(x).

Example 18.3: Suppose that random variables X1 , . . . , Xn are independent and from a Poisson
distribution with mean λ. Assume that the prior distribution of λ is exponential
with mean 1. Find the posterior distribution of λ when x̄ = 3 with n = 10.
Solution : The density function of X = (X1, . . . , Xn) is

f(x|λ) = ∏_{i=1}^{n} e^{−λ} λ^{x_i} / x_i! = e^{−nλ} λ^{∑_{i=1}^{n} x_i} / ∏_{i=1}^{n} x_i!,

and the prior distribution is

π(λ) = e^{−λ},   for λ > 0.



Hence, using Definition 18.1, we obtain the posterior distribution of λ as

π(λ|x) ∝ f(x|λ) π(λ) = e^{−nλ} [λ^{∑_{i=1}^{n} x_i} / ∏_{i=1}^{n} x_i!] e^{−λ} ∝ e^{−(n+1)λ} λ^{∑_{i=1}^{n} x_i}.

Referring to the gamma distribution in Section 6.6, we conclude that the posterior
distribution of λ follows a gamma distribution with parameters 1 + ∑_{i=1}^{n} x_i and 1/(n + 1).
Hence, the posterior mean and variance of λ are (∑_{i=1}^{n} x_i + 1)/(n + 1) and (∑_{i=1}^{n} x_i + 1)/(n + 1)².
So, when x̄ = 3 with n = 10, we have ∑_{i=1}^{10} x_i = 30. Hence, the posterior
distribution of λ is a gamma distribution with parameters 31 and 1/11.
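The gamma posterior of Example 18.3 can likewise be summarized numerically (a sketch, assuming SciPy is available; the shape/scale parameterization matches the text's parameters 31 and 1/11):

```python
# Poisson likelihood with an exponential(mean 1) prior: gamma posterior with
# shape 1 + sum(x_i) and scale 1/(n + 1).
from scipy.stats import gamma

n, x_sum = 10, 30                          # x-bar = 3 with n = 10
post = gamma(a=1 + x_sum, scale=1.0 / (n + 1))
print(post.mean(), post.var())             # 31/11 (about 2.818) and 31/121 (about 0.256)
```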
From Example 18.3 we observe that it is sometimes quite convenient to use
the “proportional to” technique in calculating the posterior distribution, especially
when the result can be recognized as a commonly used distribution, as described in
Chapters 5 and 6.

Point Estimation Using the Posterior Distribution


Once the posterior distribution is derived, we can easily use the summary of the
posterior distribution to make inferences on the population parameters. For in-
stance, the posterior mean, median, and mode can all be used to estimate the
parameter.

Example 18.4: Suppose that x = 1 is observed for Example 18.2. Find the posterior mean and
the posterior mode.
Solution : When x = 1, the posterior distribution of p can be expressed as

π(p|1) = 6p(1 − p), for 0 < p < 1.

To calculate the mean of this distribution, we need to find

∫_0^1 6p²(1 − p) dp = 6 (1/3 − 1/4) = 1/2.

To find the posterior mode, we need to obtain the value of p such that the posterior
distribution is maximized. Taking the derivative of π(p|1) with respect to p, we obtain
6 − 12p. Solving 6 − 12p = 0 for p, we obtain p = 1/2. The second derivative is
−12, which implies that the posterior mode is achieved at p = 1/2.
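Since π(p|1) = 6p(1 − p) is a Beta(2, 2) density, both summaries can be confirmed quickly (a sketch, assuming SciPy is available):

```python
# The posterior 6p(1 - p) on (0, 1) is Beta(2, 2); its mean and mode are both 1/2.
from scipy.stats import beta

post = beta(2, 2)
mode = (2 - 1) / (2 + 2 - 2)     # beta mode: (alpha - 1)/(alpha + beta - 2)
print(post.mean(), mode)         # 0.5 and 0.5
```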
Bayesian methods of estimation concerning the mean μ of a normal population
are based on the following example.

Example 18.5: If x̄ is the mean of a random sample of size n from a normal population with
known variance σ 2 , and the prior distribution of the population mean is a normal
distribution with known mean μ0 and known variance σ02 , then show that the
posterior distribution of the population mean is also a normal distribution with
mean μ* and standard deviation σ*, where

μ* = [σ₀² / (σ₀² + σ²/n)] x̄ + [(σ²/n) / (σ₀² + σ²/n)] μ₀   and   σ* = √[ σ₀²σ² / (nσ₀² + σ²) ].
Solution : The density function of our sample is

f(x1, x2, . . . , xn | μ) = 1 / [(2π)^{n/2} σ^n] · exp[ −(1/2) ∑_{i=1}^{n} ((x_i − μ)/σ)² ],

for −∞ < x_i < ∞ and i = 1, 2, . . . , n, and the prior is

π(μ) = [1/(√(2π) σ₀)] exp[ −(1/2) ((μ − μ₀)/σ₀)² ],   −∞ < μ < ∞.

Then the posterior distribution of μ is

π(μ|x) ∝ exp{ −(1/2) [ ∑_{i=1}^{n} ((x_i − μ)/σ)² + ((μ − μ₀)/σ₀)² ] }
       ∝ exp{ −(1/2) [ n(x̄ − μ)²/σ² + (μ − μ₀)²/σ₀² ] },
due to the identity

∑_{i=1}^{n} (x_i − μ)² = ∑_{i=1}^{n} (x_i − x̄)² + n(x̄ − μ)²

from Section 8.5. Completing the square in μ yields the posterior distribution

π(μ|x) ∝ exp[ −(1/2) ((μ − μ*)/σ*)² ],

where

μ* = (nx̄σ₀² + μ₀σ²) / (nσ₀² + σ²),   σ* = √[ σ₀²σ² / (nσ₀² + σ²) ].

This is a normal distribution with mean μ∗ and standard deviation σ ∗ .


The Central Limit Theorem allows us to use Example 18.5 also when we select
sufficiently large random samples (n ≥ 30 for many engineering experimental cases)
from nonnormal populations whose distributions are not very far from symmetric,
provided that the prior distribution of the mean is approximately normal.
Several comments need to be made about Example 18.5. The posterior mean
μ* can also be written as

μ* = [σ₀² / (σ₀² + σ²/n)] x̄ + [(σ²/n) / (σ₀² + σ²/n)] μ₀,

which is a weighted average of the sample mean x̄ and the prior mean μ₀. Since both
coefficients are between 0 and 1 and they sum to 1, the posterior mean μ* is always
between x̄ and μ₀. This means that the posterior estimation of μ is influenced by
both x̄ and μ₀. Furthermore, the weight of x̄ depends on the prior variance as
well as the variance of the sample mean. For a large sample problem (n → ∞),
the posterior mean μ∗ → x̄. This means that the prior mean does not play any
role in estimating the population mean μ using the posterior distribution. This
is very reasonable since it indicates that when the amount of data is substantial,
information from the data will dominate the information on μ provided by the prior.
On the other hand, when the prior variance is large (σ02 → ∞), the posterior mean
μ∗ also goes to x̄. Note that for a normal distribution, the larger the variance,
the flatter the density function. The flatness of the normal distribution in this
case means that there is almost no subjective prior information available on the
parameter μ before the data are collected. Thus, it is reasonable that the posterior
estimation μ∗ only depends on the data value x̄.
Now consider the posterior standard deviation σ*. This value can also be written as

σ* = √[ (σ₀² σ²/n) / (σ₀² + σ²/n) ].

It is obvious that the value σ* is smaller than both σ₀ and σ/√n, the prior standard
deviation and the standard deviation of x̄, respectively. This suggests that the
posterior estimate is more accurate than either the prior or the sample data alone.
Hence, incorporating both the data and the prior information results in better posterior
information than using either the data or the prior alone. This is a common
phenomenon in Bayesian inference. Furthermore, to compute μ* and σ* by the formulas
in Example 18.5, we have assumed that σ² is known. Since this is generally
not the case, we shall replace σ² by the sample variance s² whenever n ≥ 30.

Bayesian Interval Estimation


Similar to the classical confidence interval, in Bayesian analysis we can calculate a
100(1 − α)% Bayesian interval using the posterior distribution.

Definition 18.2: The interval a < θ < b will be called a 100(1 − α)% Bayesian interval for θ if

∫_{−∞}^{a} π(θ|x) dθ = ∫_{b}^{∞} π(θ|x) dθ = α/2.

Recall that under the frequentist approach, the probability of a confidence
interval, say 95%, is interpreted as a coverage probability, which means that if an
experiment is repeated again and again (with considerable unobserved data), the
probability that the intervals calculated according to the rule will cover the true
parameter is 95%. However, in Bayesian interval interpretation, say for a 95%
interval, we can state that the probability of the unknown parameter falling into
the calculated interval (which only depends on the observed data) is 95%.

Example 18.6: Supposing that X ∼ b(x; n, p), with known n = 2, and the prior distribution of p
is uniform π(p) = 1, for 0 < p < 1, find a 95% Bayesian interval for p.

Solution : As in Example 18.2, when x = 0, the posterior distribution is a beta distribution
with parameters 1 and 3, i.e., π(p|0) = 3(1 − p)², for 0 < p < 1. Thus, we need to
solve for a and b using Definition 18.2, which yields the following:

0.025 = ∫_0^a 3(1 − p)² dp = 1 − (1 − a)³

and

0.025 = ∫_b^1 3(1 − p)² dp = (1 − b)³.

The solutions to the above equations result in a = 0.0084 and b = 0.7076. There-
fore, the probability that p falls into (0.0084, 0.7076) is 95%.
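Equivalently, the equal-tail interval is given by the 2.5th and 97.5th percentiles of the Beta(1, 3) posterior, which a quantile function recovers directly (a sketch, assuming SciPy is available):

```python
# Equal-tail 95% Bayesian interval for Example 18.6 from the Beta(1, 3) posterior.
from scipy.stats import beta

a, b = beta(1, 3).ppf([0.025, 0.975])
print(a, b)   # approximately 0.0084 and 0.7076
```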
For the normal population and normal prior case described in Example 18.5,
the posterior mean μ∗ is the Bayes estimate of the population mean μ, and a
100(1−α)% Bayesian interval for μ can be constructed by computing the interval
μ∗ − zα/2 σ ∗ < μ < μ∗ + zα/2 σ ∗ ,
which is centered at the posterior mean and contains 100(1 − α)% of the posterior
probability.

Example 18.7: An electrical firm manufactures light bulbs that have a length of life that is ap-
proximately normally distributed with a standard deviation of 100 hours. Prior
experience leads us to believe that μ is a value of a normal random variable with a
mean μ0 = 800 hours and a standard deviation σ0 = 10 hours. If a random sample
of 25 bulbs has an average life of 780 hours, find a 95% Bayesian interval for μ.
Solution : According to Example 18.5, the posterior distribution of the mean is also a normal
distribution with mean
μ* = [(25)(780)(10)² + (800)(100)²] / [(25)(10)² + (100)²] = 796

and standard deviation

σ* = √[ (10)²(100)² / ((25)(10)² + (100)²) ] = √80.
The 95% Bayesian interval for μ is then given by
796 − 1.96√80 < μ < 796 + 1.96√80,
or
778.5 < μ < 813.5.
Hence, we are 95% sure that μ will be between 778.5 and 813.5.
On the other hand, ignoring the prior information about μ, we could proceed
as in Section 9.4 and construct the classical 95% confidence interval
   
780 − 1.96 (100/√25) < μ < 780 + 1.96 (100/√25),
or 740.8 < μ < 819.2, which is seen to be wider than the corresponding Bayesian
interval.
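The arithmetic of Example 18.7, together with the classical interval it is compared against, can be reproduced as follows (a sketch using only the Python standard library; the numbers come from the example):

```python
# Normal data with known sigma and a normal prior on mu (Example 18.7).
import math

n, xbar, sigma = 25, 780.0, 100.0     # sample size, sample mean, known sigma
mu0, sigma0 = 800.0, 10.0             # prior mean and prior standard deviation
z = 1.96                              # z_{0.025}

mu_star = (n * xbar * sigma0**2 + mu0 * sigma**2) / (n * sigma0**2 + sigma**2)
sd_star = math.sqrt(sigma0**2 * sigma**2 / (n * sigma0**2 + sigma**2))

print(mu_star - z * sd_star, mu_star + z * sd_star)   # Bayesian: about (778.5, 813.5)
print(xbar - z * sigma / math.sqrt(n),
      xbar + z * sigma / math.sqrt(n))                # classical: about (740.8, 819.2)
```

The Bayesian interval is narrower because the informative prior (σ₀ = 10) contributes additional information beyond the sample.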
