
Categorical Data Analysis

CHAPTER ONE
1. INTRODUCTION TO CATEGORICAL DATA
Learning objectives:
After the completion of this chapter, the students will be able to:
 Define categorical variables,
 Distinguish the probability distributions most often used for categorical data,
 Understand the likelihood function and maximum likelihood estimation as used in
statistical inference for discrete data.

Introduction

Virtually every research project categorizes some of its observations into distinct bins:
gender of an individual (male, female), marital status (broken, not broken), attitude of an individual
towards something (agree, neutral, disagree), diagnosis regarding breast cancer based on
mammogram categories (normal, benign, probably benign, suspicious, malignant), race of a
patient (black, white), presence of heart disease (yes, no), and so on.
Statisticians have devised a number of ways to analyze and explain categorical data. This course
presents explanations of each of the different methods.
1.1. Categorical Response Data
Let us first define categorical data. A categorical variable has a measurement scale consisting of a
set of categories. Categorical variables are often referred to as qualitative, in which distinct
categories differ in quality, not in quantity. A qualitative explanatory variable is called a factor and
its categories are called the levels for the factor. A quantitative explanatory variable is sometimes
called a covariate.

 Categorical scales are pervasive in the social sciences for measuring attitudes and opinions.
 Categorical scales also occur frequently in the health sciences, for measuring responses
such as whether a patient survives an operation (yes, no), severity of an injury (none, mild,
moderate, severe), and stage of a disease (initial, advanced).
 They frequently occur in the behavioral sciences (e.g., categories “schizophrenia,”
“depression,” “neurosis” for diagnosis of type of mental illness)
We can classify categorical variables as nominal and ordinal variables.


1.1.1. Nominal-Ordinal Scale Distinction


Categorical variables are usually classified as being of two basic types: nominal and ordinal.
Nominal variables involve categories that have no particular order, such as hair color, race,
clinic site, religious affiliation (Orthodox, Catholic, Jewish, Protestant, Muslim, other), or
transport to work (car, bus, bicycle, walk, . . . ).
The categories associated with ordinal variables, by contrast, have some inherent ordering (i.e.,
the variables do have ordered categories).
Examples of ordinal variables: response to a medical treatment (excellent, good, fair, poor);
temperature of the day rated as very cold, cold, hot and very hot.
In conclusion, categorical data analysis studies the statistical relationship between an
explanatory variable (X) and a categorical response variable (Y), a variable consisting of a set of
categories. One example of such a categorical response variable is disease status, which has
two categories: presence or absence. Explanatory variables may be categorical, continuous, or
both.

1.1.2. Response–Explanatory Variable Distinction


 Explanatory/predictor/independent/control variable (also called a regressor)
- a variable whose values are used to estimate or explain the values of the dependent
variable.
 Response/outcome/dependent variable
- a variable whose values are free to vary in response to, and are estimated from, the
explanatory (predictor, independent) variables.
For example, does belief in life after death depend on explanatory variables such as gender,
age, educational status, religion, etc.?

1.2. Probability Distributions for Categorical Data


Inferential statistical analyses require assumptions about the probability distribution of the
response variable. For example, in regression and Analysis of Variance (ANOVA) models for
continuous data, the normal distribution plays a central role. For categorical data, the binomial
distribution (and its multinomial generalization) plays the role that the normal distribution does
for a continuous response. In this section we review the three key distributions for categorical
responses: binomial, multinomial, and Poisson.

1.2.1 Binomial Distribution


When a random process or experiment called a trial can result in only one of two mutually
exclusive outcomes, such as success or failure, dead or alive, sick or well, the trial is called a
Bernoulli trial.
Let the random variable Y be the number of ‘successes’ in n independent Bernoulli trials in which
the probability of success, π, is the same in all trials.
Let y1, y2 , . . . , yn denote responses for n independent and identical trials such that P(Yi=1)= π
and P(Yi=0)= 1- π. We use the generic labels ‘‘success’’ and ‘‘failure’’ for outcomes 1 and 0. An
identical trials means that the probability of success π is the same for each trial. Independent trials
mean that the {Yi} are independent random variables. These are often called Bernoulli trials. The
total number of successes, Y = Σ_{i=1}^{n} Yi, has the binomial distribution with index n and
parameter π, denoted bin(n, π).
The probability mass function for the possible outcomes y of Y is

P(Y = y) = P(y) = C(n, y) π^y (1 − π)^(n−y),   y = 0, 1, 2, ..., n,

where the binomial coefficient C(n, y) = n!/(y!(n − y)!) (it appears in the statement of the familiar
binomial theorem for expanding (x + y)^n).

Since E(Yi) = E(Yi^2) = 1·π + 0·(1 − π) = π and var(Yi) = π(1 − π),

the binomial distribution of Y = Σ_{i=1}^{n} Yi has mean nπ and variance nπ(1 − π).

Properties of binomial distribution


1. The sum of the probabilities of the binomial distribution is unity.
Proof:

For a binomial distribution the probability function is given by

P(Y = y) = C(n, y) π^y (1 − π)^(n−y),   y = 0, 1, 2, ..., n.

Now,

Σ_{y=0}^{n} P(Y = y) = Σ_{y=0}^{n} C(n, y) π^y (1 − π)^(n−y)
= C(n, 0) π^0 (1 − π)^n + C(n, 1) π^1 (1 − π)^(n−1) + ... + C(n, n) π^n (1 − π)^0
= (π + 1 − π)^n = 1.

Mean of the binomial distribution

For a binomial distribution the probability function is given by

P(Y = y) = C(n, y) π^y (1 − π)^(n−y),   y = 0, 1, 2, ..., n.

Now, the mean of the binomial distribution is

E(Y) = Σ_{y=0}^{n} y P(Y = y)
= Σ_{y=0}^{n} y C(n, y) π^y (1 − π)^(n−y)
= Σ_{y=1}^{n} y [n!/(y!(n − y)!)] π^y (1 − π)^(n−y)
= Σ_{y=1}^{n} [n(n − 1)!/((y − 1)!(n − y)!)] π^y (1 − π)^(n−y)
= nπ Σ_{y=1}^{n} C(n − 1, y − 1) π^(y−1) (1 − π)^(n−y)
= nπ.


Variance of the Binomial distribution:

The variance of the binomial distribution is

V(Y) = E(Y^2) − [E(Y)]^2
     = E(Y^2) − (nπ)^2 .................. (1)      [since E(Y) = nπ]

Writing y^2 = y(y − 1) + y and summing as for the mean gives E(Y^2) = n(n − 1)π^2 + nπ, so

V(Y) = n(n − 1)π^2 + nπ − (nπ)^2 = nπ(1 − π).

Since mean = nπ and variance = nπ(1 − π), mean > variance.

2. Measure of skewness of the binomial distribution

β1 = μ3^2/μ2^3 = (1 − 2π)^2 / (nπ(1 − π))

Note: when π = 1/2 the distribution is called the symmetric binomial distribution.

3. The sign of the skewness is determined by γ1 = (1 − 2π)/√(nπ(1 − π)):
if π = 1/2, the distribution is symmetric; if π < 1/2, the distribution is positively skewed; and
if π > 1/2, the distribution is negatively skewed.

Under certain conditions the binomial distribution approaches the Poisson and normal distributions.

Hence, the binomial distribution is always symmetric when π = 0.50. For fixed n, it
becomes more skewed as π moves toward 0 or 1. For fixed π, it becomes more bell-shaped
as n increases.

4. When n is large, the binomial distribution can be approximated by a normal distribution with
μ = nπ and σ^2 = nπ(1 − π). A guideline is that the expected numbers of outcomes of the two
types, nπ and n(1 − π), should both be at least about 5. For π = 0.50 this requires only n ≥ 10,
whereas π = 0.10 (or π = 0.90) requires n ≥ 50. When π gets nearer to 0 or 1, larger samples are
needed before a symmetric, bell shape occurs.
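To see this guideline in action, the short sketch below (an illustration added here, assuming Python with scipy available) compares exact binomial probabilities with the normal approximation N(nπ, nπ(1 − π)) for two of the cases mentioned above, π = 0.50 with n = 10 and π = 0.10 with n = 50.

```python
# Sketch: how close is the normal approximation to the binomial?
from scipy.stats import binom, norm

for n, pi in [(10, 0.50), (50, 0.10)]:
    mu, sigma = n * pi, (n * pi * (1 - pi)) ** 0.5
    # P(Y <= mu) exactly vs. with a continuity-corrected normal approximation
    exact = binom.cdf(int(mu), n, pi)
    approx = norm.cdf(int(mu) + 0.5, loc=mu, scale=sigma)
    print(f"n={n}, pi={pi}: exact={exact:.4f}, normal approx={approx:.4f}")
```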

1.2.2 The Multinomial distribution


When each trial has more than two possible outcomes, the joint distribution of the counts of
outcomes in the various categories is a multinomial distribution.
The multinomial distribution is a multivariate generalization of the binomial distribution
introduced earlier from only 2 outcomes to c outcomes where c>2.

For example, political party affiliation (Conservative, Democratic, Liberal), cereal shelf-

placement in a grocery store (Bottom, middle, or top). When the trials are independent with the
same category probabilities for each trial, the distribution of counts in the various categories is the
multinomial.
Let c denote the number of outcome categories. We denote their probabilities by
{π1, π2, . . . , πc}, where Σ_{i=1}^{c} πi = 1. For n independent observations, the multinomial
probability that n1 fall in category 1, n2 fall in category 2, . . . , nc fall in category c,
where n = Σ_{i=1}^{c} ni, equals

P(Y1 = n1, Y2 = n2, . . . , Yc = nc) = [n!/(n1! n2! · · · nc!)] Π_{i=1}^{c} πi^(ni).

The binomial distribution is the special case with c = 2 categories.


The multinomial is a multivariate distribution. The marginal distribution of the count in any
particular category is binomial. For category j, the count nj has mean nπj and standard deviation
√(nπj(1 − πj)). Most methods for categorical data assume the binomial
distribution for a count in a single category and the multinomial distribution for a set of counts in
several categories.
Example: Suppose that we have a fair die and we roll it twenty times. Let us define Yi = the
number of times the die lands up with the face having i dots on it, i = 1, ..., 6. It is not hard to see
that individually Yi has the Binomial(20, 1/6) distribution, for every fixed i = 1, ..., 6. It is clear,
however, that Σ_{i=1}^{6} Yi must be exactly twenty in this example and hence we conclude that Y1, ..., Y6
are not all free-standing random variables. Suppose that one has observed Y1 = 2, Y2 = 4, Y3 = 4,
Y5 = 2 and Y6 = 3, then Y4 must be 5. But, when we think of Y4 alone, its marginal distribution
is Binomial (20, 1/6) and so by itself it can take any one of the possible values 0, 1,2, ..., 20. On
the other hand, if we assume that Y1 = 2, Y2 = 4, Y3 = 4, Y5 = 2 and Y6 = 3, then Y4 has to be 5.
That is to say that there is certainly some kind of dependence among these random variables Y1,
..., Y6. What is the exact nature of this joint distribution?
Example 2: Roll of a die. Taking the dice example explained at the beginning, we can say that the
number of results for each face follows the multinomial distribution with parameters ((1/6,
1/6, 1/6, 1/6, 1/6, 1/6), n).
If we say that face 6 comes out 1 time, face 5 comes out 2 times, face 4 comes out 4 times, face 3
comes out 0 times, face 2 comes out 2 times and face 1 comes out 3 times, we have n = 12.
Solution:
Suppose you roll a fair die 12 times (12 trials). First, assume (n1, n2, n3, n4, n5, n6) is a
multinomial random variable with parameters p1 = p2 = . . . = p6 = 1/6 and n = 12. Then

p(3, 2, 0, 4, 2, 1) = [12!/(3! 2! 0! 4! 2! 1!)] (1/6)^3 (1/6)^2 (1/6)^0 (1/6)^4 (1/6)^2 (1/6)^1
                    = 831,600 (1/6)^12 ≈ 0.000382.
e.g. 2
Suppose you roll a fair die 6 times (6 trials). First, assume (n1, n2, n3, n4, n5, n6) is a
multinomial random variable with parameters p1 = p2 = . . . = p6 = 1/6 and n = 6.
What is the probability that each face is seen exactly once? This is simply

f(1, 1, 1, 1, 1, 1 | 6, 1/6, 1/6, 1/6, 1/6, 1/6, 1/6) = [6!/(1!)^6] (1/6)^6 = 5/324.

What is the probability that exactly four 1's and two 2's occur? Then

f(4, 2, 0, 0, 0, 0 | 6, 1/6, 1/6, 1/6, 1/6, 1/6, 1/6) = [6!/(4! 2! 0! 0! 0! 0!)] (1/6)^4 (1/6)^2 = 5/15552,
hardly a high probability.
What is the probability of getting exactly two 3's, two 4's and two 5's? Try this and get familiar
with the notation and use of the probability function. You can see why such a tool might be useful
if you were a gambler and wanted to know something quantitative about “the odds" of various
outcomes. Hopefully, your answer will be about 5/2592.
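The worked dice probabilities above are easy to check numerically. The sketch below (assuming Python with scipy available) evaluates the same multinomial probabilities; the count vectors are exactly those used in the examples.

```python
# Sketch: verifying the dice probabilities with the multinomial pmf
from scipy.stats import multinomial

fair_die = [1/6] * 6

print(multinomial.pmf([3, 2, 0, 4, 2, 1], n=12, p=fair_die))  # ~0.000382
print(multinomial.pmf([1, 1, 1, 1, 1, 1], n=6, p=fair_die))   # 5/324   ~ 0.0154
print(multinomial.pmf([4, 2, 0, 0, 0, 0], n=6, p=fair_die))   # 5/15552 ~ 0.00032
print(multinomial.pmf([0, 0, 2, 2, 2, 0], n=6, p=fair_die))   # 5/2592  ~ 0.00193
```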

1.2.3 The Poisson distribution


In the binomial distribution the random variable is the number of successes in a set number of trials,
whereas a Poisson random variable is the number of successes in an interval of time or specific
region of space. Therefore, experiments yielding numerical values of a random variable Y, the
number of successes (observations) occurring during a given time interval (or in a specified region),
are often called Poisson experiments. For example, the number of cars arriving at a service station
in 1 hour (the interval of time is 1 hour), the number of typing errors per page, etc.

A Poisson experiment has the following properties:


i. The number of successes in any interval is independent of the number of successes in any
other interval.
ii. The probability of a single success occurring during a short interval is proportional to the
length of the time interval and does not depend on the number of successes occurring
outside this time interval.
iii. The probability of more than one success in a very small interval is negligible.
iv. The probability of a success in an interval is the same for all equal-size intervals.

The probability that a Poisson random variable Y assumes the value y in a specific interval is

P(Y = y) = μ^y e^(−μ)/y!,   y = 0, 1, 2, …

where: P(Y = y) = the probability of y occurrences in an interval,
μ = the expected value or mean number of successes (occurrences),
y = the number of successes in the interval.
 The variance and the mean of a Poisson random variable are both equal to μ, the expected
number of successes (occurrences).
Example: Assume that billing clerks rarely make errors in data entry on billing statements.
Many statements have no mistakes; some have one; very few have two mistakes; rarely will a
statement have three mistakes; and so on. A random sample of 1000 statements revealed 300
errors. What is the probability of no mistakes appearing in a statement? Here μ = 300/1000 = 0.3.

Solution: P(Y = 0) = (0.3)^0 e^(−0.3)/0! = 0.7408.
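The same calculation can be done with a Poisson probability mass function, as in the brief sketch below (assuming Python with scipy available); the rate of 0.3 errors per statement comes from the example above.

```python
# Sketch: Poisson probabilities for the billing-error example (mu = 0.3 errors/statement)
from scipy.stats import poisson

mu = 0.3
print(poisson.pmf(0, mu))                  # P(no mistakes) ~ 0.7408
print(poisson.pmf(1, mu))                  # P(one mistake) ~ 0.2222
print(poisson.mean(mu), poisson.var(mu))   # mean and variance both equal mu
```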

1.3 Statistical inference for a proportion


As quantitative variables are summarized with sums and averages, categorical variables are
summarized with counts and proportions. In practice, the parameter values for the binomial and
multinomial distributions, the population proportions, are unknown. Using sample data, we estimate
the parameters. When we take a sample to estimate a population proportion, we follow the same
process as when taking a sample to estimate a population mean. We use the maximum
likelihood method for estimating the binomial parameter.

Consider the distribution of sample proportions of the presence of heart disease in a hospital. Assume
that the population proportion is π = 0.75, so that the standard deviation of the number of Y responses
in a sample of 32 is √(32 ∗ 0.75 ∗ 0.25) = √6. Suppose you randomly select the following sample of
32 responses: Y Y N Y Y Y Y N Y Y Y Y Y Y N Y Y N Y Y Y N Y Y N Y Y N Y N Y Y

Compute the sample proportion, p, for the number of Y’s in this sample. How far does it lie from
the population proportion? What is the probability of selecting another sample with a proportion
greater than the one you selected?

The proportion of Y responses in this sample is p = 24/32 = 0.75.

1.3.1 Likelihood Function and Maximum Likelihood Estimation


The parametric approach to statistical modeling assumes a family of probability distributions, such
as the binomial, for the response variable.
Definition: The likelihood function is the probability of the observed data, expressed as a function
of the parameter value. For a particular family, we can substitute the observed data into the formula
for the probability function and then view how that probability depends on the unknown parameter
value. For example, in n = 10 trials, suppose a binomial count equals y = 0. From the binomial
formula with parameter π, the probability of this outcome equals:
 10 
P(Y  0)  P(0)     0 1    ,
10 0
y0
0
10! 0
 1     1   
10 10
=
0!10!
The probability of the observed data, expressed as a function of the parameter, is called the
likelihood function. With y = 0 successes in n = 10 trials, the binomial likelihood function is
L(π) = (1 − π)^10. It is defined for π between 0 and 1. From the likelihood function, when y = 0 we
get the following values.


π      0.1       0.2       0.3       0.4       0.5       0.6       0.8     0.9     1
L(π)   0.348678  0.107374  0.028248  0.006047  0.000977  0.000105  1E-07   1E-10   0
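These likelihood values, and the maximum likelihood estimate discussed next, can be reproduced with a few lines of code. The sketch below (assuming Python with numpy available) evaluates L(π) = (1 − π)^10 on a grid and locates its maximum.

```python
# Sketch: the binomial likelihood for y = 0 successes in n = 10 trials
import numpy as np

pi_grid = np.linspace(0, 1, 1001)
likelihood = (1 - pi_grid) ** 10          # L(pi) = (1 - pi)^10

for p in [0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.8, 0.9, 1.0]:
    print(p, (1 - p) ** 10)               # reproduces the table above

print(pi_grid[np.argmax(likelihood)])     # ML estimate: 0.0 = y/n
```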

Definition: the maximum likelihood estimate of a parameter is the parameter value for which the
probability of the observed data takes its greatest value. It is the parameter value at which the
likelihood function takes its maximum. Let β be a parameter and β̂ its ML estimate. The
likelihood function is ℓ(β) and the log-likelihood function is L(β) = log(ℓ(β)). For many models,
L(β) has a concave shape and β̂ is the point at which the derivative equals zero. The ML estimate
is then the solution of the likelihood equation, ∂L(β)/∂β = 0.

Figure 1.1 shows that the likelihood function L(π) = (1 − π)^10 has its maximum at π = 0.0. Thus,
when n = 10 trials have y = 0 successes, the maximum likelihood estimate of π equals 0.0. This
means that the result y = 0 in n = 10 trials is more likely to occur when π = 0.00 than when π equals
any other value. In general, for the binomial outcome of y successes in n trials, the maximum
likelihood estimate of π equals p = y/n. This is the sample proportion of successes for the n trials.
If we observe y = 6 successes in n = 10 trials, then the maximum likelihood estimate of π equals p
= 6/10 = 0.60.
The following are plots of the likelihood function when n = 10 with y=0,y=1,…y=10 number of
successes


N.B. In this frequentist parametric inference framework, we call the “best” estimates the maximum
likelihood estimates of the parameters because they are the parameter values that make the
observed data the most likely to have happened.
Likelihood:
For any of the known probability distributions, the probability of observing data
Yi, given a parameter value π, is P(Yi | π).
The symbol π̂ (“pi-hat”) is used to represent the ML estimator of π, the sample proportion. Before
we observe the data, the value of the ML estimate is unknown. The estimate is then a variate having
some sampling distribution. We refer to this variate as an estimator and its value for observed data
as an estimate.
Estimators based on the method of maximum likelihood are popular because they have good large-
sample behavior. Most importantly, it is not possible to find good estimators that are more precise,
in terms of having smaller large-sample standard errors. Also, large-sample distributions of ML
estimators are usually approximately normal. The estimators reported in this text use this method.
Normal Approximation to the Binomial Distribution

Suppose that Y1, ..., Yn are iid Bernoulli(π), 0 < π < 1, n ≥ 1. We know that Un = Σ_{i=1}^{n} Yi is
distributed as Binomial(n, π), n ≥ 1. Applying the CLT,

(Un − nπ)/√(nπ(1 − π)) → N(0, 1)   as n → ∞.

In other words, for practical problems, the Binomial(n, π) distribution can be approximated by the
N(nπ, nπ(1 − π)) distribution, for large n and fixed 0 < π < 1.
If both nπ and n(1 − π) are greater than 5, the z distribution is appropriate.

1.3.2 Significance Test about a Binomial Proportion


For the binomial distribution, we now use the ML estimator in statistical inference for the
parameter π. The ML estimator is the sample proportion, p=y/n. The sampling distribution of the
sample proportion p has mean and standard error, respectively,

E(p) = π   and   se(p) = √(π(1 − π)/n).
As the number of trials n increases, the standard error of p decreases toward zero; that is, the
sample proportion tends to be closer to the parameter value π. The sampling distribution of p is
approximately normal for large n. This suggests large-sample inferential methods for π.
Consider the null hypothesis H0: π = π0 that the parameter equals some fixed value π0. The test
statistic

z = (p − π0)/√(π0(1 − π0)/n)
has a large-sample standard normal (N(0, 1)) null distribution (this is the reference distribution for
the test). Note that we used the null standard error for the test.
If H0 is not rejected, we conclude that π0 is a plausible value for the population proportion.
For the two-sided alternative hypothesis above, we use the two-tailed probability P(|Z| > |z| | H0).
Definition: p-value = the probability of results at least as extreme as those observed (if the null
hypothesis were true).

1.3.3 Confidence Intervals for a Binomial Proportion


A significance test merely indicates whether a particular value for a parameter (such as 0.50) is
plausible. We learn more by constructing a confidence interval to determine the range of plausible
values. Let se(p) denote the estimated standard error of p. A large sample 100(1 − α)% confidence
interval for π has the formula

p ± z_(α/2) · se(p),   with se(p) = √(p(1 − p)/n) ....................(1.3)

where z_(α/2) denotes the standard normal percentile having right-tail probability equal to α/2.

Example:
We can be 95% confident that the population proportion of Americans in 2002 who favored
legalized abortion for married pregnant women who do not want more children is between 0.415
and 0.481.
Formula (1.3) is simple. Unless π is close to 0.50, however, it does not work well unless n is very
large. Consider its actual coverage probability, that is, the probability that the method produces an
interval that captures the true parameter value. This may be quite a bit less than the nominal value
(such as 95%). It is especially poor when π is near 0 or 1.
There is a duality between confidence intervals and two-sided tests: a (1 − α)100% confidence
interval consists of all values π0 of the null hypothesis parameter that are judged plausible in a
two-sided test at significance level α.
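A quick way to see formula (1.3) in practice is the sketch below (assuming Python with scipy available); the counts y = 24 successes out of n = 32, taken from the heart-disease sample earlier in this section, are used purely for illustration.

```python
# Sketch: large-sample (Wald) confidence interval for a binomial proportion
from scipy.stats import norm

y, n, alpha = 24, 32, 0.05            # illustrative counts from Section 1.3
p = y / n
se = (p * (1 - p) / n) ** 0.5
z = norm.ppf(1 - alpha / 2)           # standard normal percentile, ~1.96

print(p - z * se, p + z * se)         # Wald interval p +/- z*se(p)
```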

1.4 More on Statistical Inference for Discrete Data: Wald, Likelihood-Ratio and Score
Inference

1.4.1 Wald, Likelihood-Ratio and Score Inference


Let β denote an arbitrary parameter. Consider a significance test of H0: β = β0 (such as H0: β = 0,
for which β0 = 0).

Let SE denote the standard error of β̂, evaluated by substituting the ML estimate for the unknown
parameter in the expression for the true standard error. (For example, for the binomial parameter
π, se(π̂) = √(p(1 − p)/n).) When H0 is true, the test statistic

z = (β̂ − β0)/se(β̂)

has approximately a standard normal distribution. Equivalently, z^2 has approximately a chi-
squared distribution with df = 1. This type of statistic, which uses the standard error evaluated at
the ML estimate, is called a Wald statistic. The z or chi-squared test using this test statistic is called
a Wald test.
An alternative test uses the likelihood function through the ratio of two maximizations of it: (1)
the maximum over the possible parameter values that assume the null hypothesis, (2) the maximum
over the larger set of possible parameter values, permitting the null or the alternative hypothesis
to be true. Let ℒ0 denote the maximized value of the likelihood function under the null hypothesis,
and let ℒ1 denote the maximized value more generally. For instance, when there is a single
parameter β, ℒ0 is the likelihood function calculated at β0, and ℒ1 is the likelihood function

calculated at the ML estimate β̂. Then ℒ1 is always at least as large as ℒ0, because ℒ1 refers to
maximizing over a larger set of possible parameter values.

The likelihood-ratio test statistic equals

−2 log(ℒ0/ℒ1).

If the maximized likelihood is much larger when the parameters are not forced to satisfy H0, then
the ratio ℒ0/ℒ1 is far below 1. The test statistic −2 log(ℒ0/ℒ1) must be nonnegative, and relatively
small values of ℒ0/ℒ1 yield large values of −2 log(ℒ0/ℒ1) and strong evidence against H0. The reason
for taking the log transform and doubling is that it yields an approximate chi-squared sampling
distribution. Under H0: β = β0, the likelihood-ratio test statistic has a large-sample chi-squared
distribution with df = 1.

The third possible test is called the score test. It finds standard errors under the assumption that
the null hypothesis holds. For example, the z test for a binomial parameter that uses the standard
error

se(π̂) = √(π0(1 − π0)/n)

is a score test.

Example: Wald, Score, and Likelihood-Ratio Inference for Binomial Parameter


Based on the following information, illustrate the Wald, likelihood-ratio, and score tests by testing
H0: π = 0.50 against Ha: π ≠ 0.50 for a clinical trial that has nine successes in the first 10 trials. The
sample proportion is p = 0.9 and n = 10.
Solution

Wald test:

For H0: π = 0.50, the estimated standard error is se(p) = √(p(1 − p)/n) = √(0.9(1 − 0.9)/10) = 0.095, so

z = (0.9 − 0.5)/0.095 = 4.22.

The corresponding chi-squared statistic is (4.22)^2 = 17.8 (df = 1). The P-value < 0.001.

For the score test of H0: π = 0.50, the null standard error is

se(p) = √(π0(1 − π0)/n) = √(0.5(1 − 0.5)/10) = 0.158,

and the z statistic is z = (0.90 − 0.50)/0.158 = 2.53.

The corresponding chi-squared statistic is (2.53)^2 = 6.4 (df = 1). The P-value = 0.011.

The likelihood-ratio test:

When H0: π = 0.50 is true, the binomial probability of the observed result of nine successes is

ℒ0 = [10!/(9! 1!)] (0.50)^9 (0.50)^1 = 0.00977.

The likelihood-ratio test compares this to the value of the likelihood function at the ML estimate
p = 0.90, which is

ℒ1 = [10!/(9! 1!)] (0.90)^9 (0.10)^1 = 0.3874.

The likelihood-ratio test statistic equals

−2 log(ℒ0/ℒ1) = −2 log(0.00977/0.3874) = 7.36.

From the chi-squared distribution with df = 1, this statistic has P-value = 0.007.
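The three test statistics in this example are simple enough to reproduce directly. The sketch below (assuming Python with scipy available) computes the Wald, score, and likelihood-ratio statistics for y = 9 successes in n = 10 trials with H0: π = 0.50.

```python
# Sketch: Wald, score, and likelihood-ratio tests for a binomial proportion
from math import sqrt, log
from scipy.stats import chi2, binom

y, n, pi0 = 9, 10, 0.50
p = y / n

z_wald = (p - pi0) / sqrt(p * (1 - p) / n)        # ~4.22
z_score = (p - pi0) / sqrt(pi0 * (1 - pi0) / n)   # ~2.53
lr = -2 * log(binom.pmf(y, n, pi0) / binom.pmf(y, n, p))  # ~7.36

for name, stat in [("Wald", z_wald**2), ("Score", z_score**2), ("LR", lr)]:
    print(name, round(stat, 2), "P-value:", round(chi2.sf(stat, df=1), 4))
```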
When the sample size is small to moderate, the Wald test is the least reliable of the three tests. We
should not trust it for such a small n as in this example (n = 10). Likelihood-ratio inference and
score-test based inference are better in terms of actual error probabilities coming close to matching
nominal levels. A marked divergence in the values of the three statistics indicates that the
distribution of the ML estimator may be far from normality. In that case, small-sample methods
are more appropriate than large-sample methods.

1.4.2 Small-sample binomial inference


For inference about a proportion, the large-sample two-sided z score test and the confidence
interval based on that test (using the null hypothesis standard error) perform reasonably well when
nπ ≥ 5 and n(1 − π) ≥ 5.
When π0 is not near 0.50 the normal P-value approximation is better for the test with a two-sided
alternative than for a one-sided alternative; a probability that is “too small” in one tail tends to be
approximately counter-balanced by a probability that is “too large” in the other tail.

For small sample sizes, it is safer to use the binomial distribution directly (rather than a normal
approximation) to calculate P-values.

To illustrate, consider testing H0: π = 0.50 against Ha: π > 0.50 for the example of a clinical trial
to evaluate a new treatment, when the number of successes y = 9 in n = 10 trials. The exact P-
value, based on the right tail of the null binomial distribution with π = 0.50, is

P(Y ≥ 9) = P(Y = 9) + P(Y = 10) = [10!/(9! 1!)](0.50)^9(0.50)^1 + [10!/(10! 0!)](0.50)^10(0.50)^0 = 0.011.
For the two sided alternative Ha: π≠0.50,
the P-value is P(Y ≥ 9 or Y ≤ 1) = 2 × P(Y ≥ 9) = 0.021
Unfortunately, with discrete probability distributions, small-sample inference using the ordinary
P-value is conservative. This means that when H0 is true, the P-value is ≤0.05 (thus leading to
rejection of H0 at the 0.05 significance level) not exactly 5% of the time, but typically less than 5%
of the time. Because of the discreteness, it is usually not possible for a P-value to achieve the
desired significance level exactly.
Then, the actual probability of type I error is less than 0.05.
The mid P-value, which adds only half the probability of the observed result to the probability of
the more extreme results, is, on average, less conservative than the ordinary P-value.
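A short calculation confirms the exact and mid P-values quoted here. The sketch below (assuming Python with scipy available) works directly with the null binomial distribution for y = 9 successes in n = 10 trials.

```python
# Sketch: exact binomial P-values and the mid P-value for y = 9, n = 10, pi0 = 0.5
from scipy.stats import binom

y, n, pi0 = 9, 10, 0.50

p_right = binom.sf(y - 1, n, pi0)          # P(Y >= 9) = 0.011 (one-sided)
p_two = 2 * p_right                        # two-sided by doubling = 0.021
p_mid = binom.sf(y, n, pi0) + 0.5 * binom.pmf(y, n, pi0)  # one-sided mid P-value

print(p_right, p_two, p_mid)
```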

Exercise

1. Let us assume that Y is a student taking a statistics course. Unfortunately, Y is not a good
student. Y does not read the textbook before class, does not do homework, and regularly
misses class. Y intends to rely on luck to pass the next quiz. The quiz consists of 10 multiple
choice questions. Each question has five possible answers, only one of which is correct. Y
plans to guess the answer to each question.
a. What is the probability that Y gets no answers correct?
b. What is the probability that Y gets two answers correct?

CHAPTER TWO
2 CONTINGENCY TABLES
Learning objectives:
After the completion of this chapter, the students will be able to:

 Construct and interpret probability structure for contingency tables,


 Recognize independence or describe an association between two qualitative variables.

2.1 Introduction
There are many situations in quantitative linguistic analysis where you will be interested in the
possibility of association between two categorical variables. In this case, you will often want to
represent your data as a contingency table. Contingency tables show frequencies produced by
cross-classifying observations.


In presenting data using contingency tables, it is usual to put the independent variable as row and
dependent variable as column. Suppose there are two categorical variables, denoted by X and Y.
Let I denote the number of categories of X and J the number of categories of Y. A rectangular table
having I rows for the categories of X and J columns for the categories of Y has cells that display
the IJ possible combinations of outcomes.
A table of this form that displays counts of outcomes in the cells is called a contingency table. A
table that cross-classifies two variables is called a two-way contingency table; one that cross-
classifies three variables is called a three-way contingency table, and so forth. A two-way table
with I rows and J columns is called an I × J table.
Data layout for an I × J contingency table (the cells contain counts or frequencies):

            Y
X        1     2     . . .   J     Total
1        n11   n12   . . .   n1J   n1+
.        .     .             .     .
.        .     .             .     .
I        nI1   nI2   . . .   nIJ   nI+
Total    n+1   n+2   . . .   n+J   n

2.2 Probability Structure for Contingency Tables: Joint, Marginal, and Conditional
Probabilities
For a single categorical variable, we can summarize the data by counting the number of
observations in each category. The sample proportions in the categories estimate the category
probabilities.

Probabilities for contingency tables can be of three types – joint, marginal, or conditional.

Suppose first that a randomly chosen subject from the population of interest is classified on X and
Y. Let πij = P(X = i, Y = j) denote the probability that (X, Y) falls in the cell in row i and column
j. The probabilities {πij} form the joint distribution of X and Y; they satisfy Σ_{i=1}^{I} Σ_{j=1}^{J} πij = 1.


The marginal distributions are the row and column totals of the joint probabilities. We denote these
by {𝜋𝑖+ } for the row variable and {𝜋+𝑗 } for the column variable, where the subscript “+” denotes
the sum over the index it replaces.

Pr(X = i) = 𝜋𝑖+ =∑𝑗 𝜋𝑖𝑗 = 𝜋𝑖1 + 𝜋𝑖2 + …+ 𝜋𝑖𝐽


Pr(Y = j) = 𝜋+𝑗 = ∑𝑖 𝜋𝑖𝑗 = 𝜋1𝑗 + 𝜋2𝑗 + …+ 𝜋𝐼𝑗

{𝜋𝑖+ } form the marginal distribution of X.


{𝜋+𝑗 } form the marginal distribution of Y.

For a 2X2 contingency table 𝜋1+ = 𝜋11 + 𝜋12 and 𝜋+1 = 𝜋11 + 𝜋21

These satisfy Σi πi+ = Σj π+j = Σi Σj πij = 1.

Joint distribution of X and Y consists of the set of the 𝜋𝑖𝑗 ′𝑠:

Each marginal distribution refers to a single variable.

We use similar notation for samples, with roman P in place of Greek π. For example, {Pij} are the
cell proportions in a sample joint distribution. We denote the cell counts by {nij}. The marginal
frequencies are the row totals {ni+} and the column totals {n+j}, and n = Σi,j nij denotes the total
sample size. The sample cell proportions relate to the cell counts by Pij = nij/n.

Conditional Distribution is the distribution of one variable at given levels of the other.
When one variable is a response and the other is an explanatory variable, we focus on the
distribution of the response variable conditional on the explanatory variable.
A conditional distribution of Y given X refers to the probability distribution of Y when we restrict
attention to a fixed level of X. The conditional probability of response j given level i of X is

π(j | i) = πij/πi+.

2.2.1 Independence
Situation: One response variable and the other is an explanatory variable.
Two variables are said to be statistically independent if the population conditional distributions of
Y are identical at each level of X. When two variables are independent, the probability of any
particular column outcome j is the same in each row. I.e. the conditional probabilities of responses
given levels of the explanatory variable should be equal, and they should equal the marginal
probabilities over levels of the explanatory variable
When both variables are response variables, we can describe their relationship using their joint
distribution, or the conditional distribution of Y given X, or the conditional distribution of X given
Y. Statistical independence is, equivalently, the property that all joint probabilities equal the
product of their marginal probabilities,
𝜋𝑖𝑗 = 𝜋𝑖+ 𝜋+𝑗 for i = 1, . . . , I and j = 1, . . . , J
The following table refers to a study that investigated the relationship between smoking and lung
cancer.

                      Cancer cases
Smoking status        Yes     No
Smoker                180     172
Non-smoker             90     346
In the table above lung cancer is a response variable and smoking status is an explanatory variable.
We therefore study the conditional distributions of cancer cases, given smoking status. For
smokers, the proportion of “yes” responses was 180/352 = 0.511 and the proportion of “no”
responses was 172/352 = 0.489. The proportions (0.511, 0.489) form the sample conditional
distribution of cancer cases for smokers. For non-smokers, the sample conditional distribution is
(90/436, 346/436) = (0.206, 0.794).
Had response and explanatory variables been independent, then the conditional probabilities of
responses given levels of the explanatory variable should have been equal.
But the conditional probability of the cancer cases is not the same at each level of smoking status
indicating that there is association between smoking status and cancer cases
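The conditional distributions and the check for independence can be computed directly from the table of counts. The sketch below (assuming Python with numpy available) uses the smoking and lung cancer counts given above.

```python
# Sketch: joint, marginal, and conditional sample distributions for a 2x2 table
import numpy as np

counts = np.array([[180, 172],    # smokers:      cancer yes, no
                   [ 90, 346]])   # non-smokers:  cancer yes, no
n = counts.sum()

joint = counts / n                          # sample joint distribution {P_ij}
row_margins = counts.sum(axis=1) / n        # marginal distribution of smoking status
conditional = counts / counts.sum(axis=1, keepdims=True)  # rows of P(cancer | smoking)

print(conditional)     # the two rows differ, suggesting an association
```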


2.2.2 Binomial and Multinomial Sampling


In multinomial sampling, only the total number of observations, n, is fixed by design. The margins
are free to vary.
In independent binomial sampling one margin is fixed by design while the other(s) is free to vary.
Example: The study on the effectiveness of vitamin C preventing colds.

If the response variable has more than two categories, then we have independent multinomial
sampling.

2.3 Comparing Proportions in Two-By-Two Tables


Ways to study and analyze the relationship between two variables include:
1. Differences of Proportions
2. Relative risk
3. Odds Ratios

2.3.1 Differences of Proportions (Risk Difference)


Consider a 2 × 2 table:

              Y
         S        F
X   1    π1       1 − π1
    2    π2       1 − π2
In the population,
 1 = probability of “success” given row 1 and 1  1 = probability of “failure” given row 1.
 2 = probability of “success” given row 2 and 1   2 = probability of “failure” given row 2.
These are conditional probabilities.
The difference between the two probabilities of success, π1 − π2, is the difference of
proportions. A comparison on failures is equivalent to a comparison on successes.
Some properties of the difference of proportions:

1. The difference of proportions falls between -1 and +1.


2. If variables are independent, then 𝜋1 − 𝜋2 =0
The sample difference in proportions, (p1 − p2), estimates the population difference in
proportions, π1 − π2.
2.3.1.1 Confidence Interval for Differences of Proportion
For simplicity, we denote the sample sizes for the two groups (that is, the row totals n1+ and n2+)
by n1 and n2. When the counts in the two rows are independent binomial samples, the estimated
standard error of p1 − p2 is

SE(p1 − p2) = √(p1(1 − p1)/n1 + p2(1 − p2)/n2). ....................(2.1)

A large-sample (when the conditions ni πi ≥ 5 and ni(1 − πi) ≥ 5 are satisfied) 100(1 − α)% (Wald)
confidence interval for π1 − π2 is

(p1 − p2) ± z_(α/2) · SE(p1 − p2).

For small samples the actual coverage probability is closer to the nominal confidence level if you
add 1.0 to every cell of the 2 × 2 table before applying this formula.
For a significance test of H0: π1 = π2, a z test statistic divides (p1 − p2) by a pooled standard error,
SE0, that applies under H0:

z = (p1 − p2)/SE0,   with SE0 = √(p̂(1 − p̂)(1/n1 + 1/n2)),   where p̂ = (n1 p1 + n2 p2)/(n1 + n2).

Example 2.1
Aspirin and Heart Attacks
The following table is a report on the relationship between aspirin use and myocardial infarction
(heart attacks) which was a five-year randomized study testing whether regular intake of aspirin
reduces mortality from cardiovascular disease. Every other day, the male physicians participating
in the study took either one aspirin tablet or a placebo. The study was “blind” – the physicians in
the study did not know which type of pill they were taking.
Table: Cross Classification of Aspirin Use and Myocardial Infarction

              Myocardial Infarction
Group         Yes      No        Total
Placebo       189      10,845    11,034
Aspirin       104      10,933    11,037

We treat the two rows in Table 1 as independent binomial samples. Of the 𝑛1 = 11,034 physicians
taking placebo, 189 suffered myocardial infarction (MI) during the study, a proportion of 𝑃1 =
189/11,034 = .0171. Of the 𝑛2 = 11,037 physicians taking aspirin, 104 suffered MI, a proportion
of 𝑃2 = 0.0094. The sample difference of proportions is 0.0171 − 0.0094 = 0.0077.
From equation (2.1), this difference has an estimated standard error of

SE(p1 − p2) = √(0.0171(0.9829)/11,034 + 0.0094(0.9906)/11,037) = 0.0015.
A 95% confidence interval for the true difference 𝜋1 -𝜋2 is 0.0077±1.96(0.0015) which is (0.005,
0.011). Since this interval contains only positive values, we conclude that 𝜋1 -𝜋2 >0. That is, 𝜋1 >𝜋2 .
For males, taking Aspirin appears to result in diminished risk of heart attack.
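The point estimate, standard error, and Wald interval for this example are easy to verify. The sketch below (assuming Python with scipy available) reuses the counts from the aspirin table above.

```python
# Sketch: difference of proportions and 95% Wald CI for the aspirin example
from math import sqrt
from scipy.stats import norm

y1, n1 = 189, 11034     # placebo: MI cases, sample size
y2, n2 = 104, 11037     # aspirin: MI cases, sample size

p1, p2 = y1 / n1, y2 / n2
diff = p1 - p2                                            # ~0.0077
se = sqrt(p1 * (1 - p1) / n1 + p2 * (1 - p2) / n2)        # ~0.0015
z = norm.ppf(0.975)

print(diff - z * se, diff + z * se)                       # ~(0.005, 0.011)
```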

Example 2: The following table cross-classifies cold status versus treatment type (vitamin C and
placebo).

Then a 95% CI for the difference of proportions, π1 − π2:


Point estimate: p1 − p2 = (0.12 − 0.22) = −0.10

An estimate of the standard error of (p1 − p2) is SE = √(p1(1 − p1)/n1 + p2(1 − p2)/n2).

So, a large-sample (1 − α)100% confidence interval for π1 − π2 is

(p1 − p2) ± Z_(α/2) · SE.

Then a 95% CI for the difference of proportions is −0.10 ± 1.96(0.045) = (−0.19, −0.01).
Since the 95% confidence interval for the true difference contains only negative values, we
conclude that π1 − π2 < 0, that is, π1 < π2. Taking vitamin C appears to result in a diminished
risk of developing a cold.

2.3.2 Relative Risk


For 2 × 2 tables, the relative risk (also called the risk ratio) of a “success” is the ratio of the
success probabilities,

RR = π1/π2 (a nonnegative number).

Note: For observed data, we estimate RR using the observed proportions, p1/p2.

The distribution of p1/p2 is highly skewed unless n1 and n2 are large, and the formulas for
computing standard errors, etc., of the sampling distribution of p1/p2 are complex.

A relative risk of 1.00 occurs when π1 = π2, that is, when the response is independent of the
group. Two groups with sample proportions p1 and p2 have a sample relative risk of p1/p2.
For the aspirin example above, the sample relative risk is p1/p2 = 0.0171/0.0094 = 1.82.
The sample proportion of MI cases was 82% higher for the group taking placebo. The sample
difference of proportions of 0.008 makes it seem as if the two groups differ by a trivial amount.
By contrast, the relative risk shows that the difference may have important public health
implications. Using the difference of proportions alone to compare two groups can be misleading
when the proportions are both close to zero.
As the sampling distribution of the sample relative risk is highly skewed unless the sample sizes
are quite large, we work on the logarithmic scale: log(R̂R) = log(p1/p2) has a more nearly bell-shaped
sampling distribution. The standard error of log(R̂R), as an estimate of log RR, is

SE(log R̂R) = √(1/(n1 p1) − 1/n1 + 1/(n2 p2) − 1/n2) = √((1 − p1)/(n1 p1) + (1 − p2)/(n2 p2)).


A large-sample confidence interval for the log of the relative risk is log(p1/p2) ± Z_(α/2) · SE(log R̂R),
that is,

(log(R̂R) − Z_(α/2) SE(log R̂R), log(R̂R) + Z_(α/2) SE(log R̂R)) = (L, U).

A confidence interval for RR can be obtained by exponentiating the end points of the above CI:

CI(RR) = (e^L, e^U).
Example: for the aspirin and myocardial infarction data, the 95% confidence interval for the true
relative risk is (1.43, 2.30). We can be 95% confident that, after 5 years, the proportion of MI
cases for male physicians taking placebo is between 1.43 and 2.30 times the proportion of MI
cases for male physicians taking aspirin. This indicates that the risk of MI is at least 43% higher
for the placebo group.
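The interval just quoted can be reproduced with the log-scale formula above. The sketch below (assuming Python with scipy available) again uses the aspirin study counts.

```python
# Sketch: relative risk and 95% CI on the log scale for the aspirin example
from math import sqrt, log, exp
from scipy.stats import norm

y1, n1 = 189, 11034     # placebo
y2, n2 = 104, 11037     # aspirin
p1, p2 = y1 / n1, y2 / n2

rr = p1 / p2                                               # ~1.82
se_log = sqrt((1 - p1) / (n1 * p1) + (1 - p2) / (n2 * p2)) # SE of log(RR)
z = norm.ppf(0.975)

lo, hi = log(rr) - z * se_log, log(rr) + z * se_log
print(rr, exp(lo), exp(hi))                                # ~1.82, (1.43, 2.30)
```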

2.3.3 The odds ratio


The odds ratio is another measure of association for 2 × 2 contingency tables. It occurs as a
parameter in the most important type of model for categorical data.
The odds ratio is the ratio of two odds.
For a probability of success π, the odds of success are defined as
odds = π/(1 − π).
Consider a 2X2 table.
The odds of success in the first row: odds1= 𝜋1 /(1 − 𝜋1 ),
The odds of success in the second row: odds2= 𝜋2 /(1 − 𝜋2 ).
For instance, if 𝜋=0.75, then the odds of success equal 0.75/0.25=3.
The value of the odds is non-negative:
odds > 1 if a success is more likely than a failure;
odds < 1 if a success is less likely than a failure;
odds = 1 if success and failure are equally likely.
For example, when odds=4, a success is four times as likely as failure. For every four
successes there will be one failure. When odds=1/4, a failure is four times as likely as
success. We then expect to observe one success for every four failures.
The probability of success itself is a function of the odds,
i.e., π = odds/(1 + odds).


For instance, if odds = 3, then π = 3/(3 + 1) = 0.75 and 1 − π = 0.25.


Definition of Odds Ratio
In 2 × 2 tables, within row 1 the odds of success are odds1 = π1/(1 − π1), while within row 2 the
odds of success are odds2 = π2/(1 − π2). The ratio of the two odds is the odds ratio. Let θ denote
the odds ratio, defined by

θ = odds1/odds2 = [π1/(1 − π1)] / [π2/(1 − π2)].

For the sample data, the estimate is

θ̂ = [p1/(1 − p1)] / [p2/(1 − p2)] = (n11 n22)/(n12 n21).

The odds ratio is also known as the cross-product ratio, because it equals the ratio of the
products π11 π22 and π12 π21 of cell probabilities from diagonally opposite cells.

If the conditional distributions of the column variable in the two rows are the same, then the two
variables are statistically independent:

π1 = π2  ⟹  1 − π1 = 1 − π2  ⟹  odds1 = π1/(1 − π1) = π2/(1 − π2) = odds2,

so the odds ratio θ = odds1/odds2 = 1.

E.g., suppose θ = 3 ⟹ the odds of success in row 1 are three times the odds of success in row 2.

When θ = 0.3 ⟹ individuals in row 2 are more likely to have a success than those in row 1: the
odds of success in row 1 are 0.3 times the odds in row 2, and the odds of success in row 2 are
(1/0.3) = 3.33 times the odds in row 1.

Example: Revisit the table that shows the cross classification of aspirin use and myocardial infarction once more.


By considering the two rows as independent binomial samples, for the physicians taking
placebo, the estimated odds of MI equal n11/n12 = 189/10,845 = 0.0174.

Since 0.0174=1.74/100, the value 0.0174 means there were 1.74 “MI cases” for every 100 “Non-
MI cases”. The estimated odds equal 104/10,933 = 0.0095 for those taking aspirin, or 0.95 “MI
cases” for every 100 “Non-MI cases”.

The estimated odds ratio is θ̂ = 0.0174/0.0095 = 1.83,

or θ̂ = (189 × 10,933)/(104 × 10,845) = 1.83.

Interpretation:

 The estimated odds of MI for male physicians taking placebo equal 1.83 times the
estimated odds for male physicians taking aspirin.
 Or, loosely, individuals who used placebo had 1.83 times the odds of developing MI compared
with those who took aspirin.
 Or the estimated odds of MI were 83% higher for placebo group.

Properties of the odds ratio


1. Odds ratios on each side of 1 reflect certain types of association. The further θ falls from 1,
the stronger the association. An odds ratio of 4 is farther from independence than an odds
ratio of 2, and an odds ratio of 0.25 is farther from independence than an odds ratio of
0.50.
2. Odds ratios are multiplicative symmetric around 1. An association with θ=4 is of the same
strength as one with odds ratio equal to (1/4)=0.25
3. The odds ratio θ̂ = 0 or ∞ if any nij = 0, in which case the relative risk is a better measure
of association. The slightly amended estimator

θ̂ = [(n11 + 0.5)(n22 + 0.5)] / [(n12 + 0.5)(n21 + 0.5)],

corresponding to adding ½ to each cell count, is preferred when any cell counts are very
small. The SE formula then replaces {nij} by {nij + 0.5}.
4. When the order of the rows is reversed or the order of the columns is reversed, the new
value of θ is the inverse of the original value. This ordering is usually arbitrary, so whether
we get 4.0 or 0.25 for the odds ratio is merely a matter of how we label the rows and
columns.
5. The odds ratio does not change value when the table orientation reverses so that the rows
become the columns and the columns become the rows.
6. The same value occurs when we treat the columns as the response variable and the rows as
the explanatory variable, or the rows as the response variable and the columns as the
explanatory variable. Thus, it is unnecessary to identify one classification as a response
variable in order to estimate θ. By contrast, the relative risk requires this, and its value also
depends on whether it is applied to the first or to the second outcome category.
7. Odds ratios do not depend on the marginal distributions of either variable. Odds ratios only
depend on cell probabilities (proportions or counts) and not on marginal values.
8. The sampling distribution of θ̂ is skewed to the right. The normal approximation is good only
if n is very large. However, the sampling distribution of the log odds ratio (log θ̂) is much less
skewed and closer to normal.

2.3.3.1 Inference for Odds Ratios and Log Odds Ratios


Unless the sample size is extremely large, the sampling distribution of the odds ratio is highly
skewed. When θ = 1, for example, 𝜃̂ cannot be much smaller than θ (since 𝜃̂ ≥ 0), but it could be
much larger with non-negligible probability.
Because of this skewness, statistical inference for the odds ratio uses an alternative but equivalent
measure – its natural logarithm, log(θ). Independence corresponds to log(θ) = 0. That is, an odds
ratio of 1.0 is equivalent to a log odds ratio of 0.0. An odds ratio of 2.0 has a logarithm of odds
ratio of 0.7. The log odds ratio is symmetric about zero, in the sense that reversing rows or
reversing columns changes its sign.
Two values for log(θ) that are the same except for sign, such as log(2.0) = 0.7 and log(0.5) = −0.7,
represent the same strength of association. Doubling a log odds ratio corresponds to squaring an
odds ratio. For instance, log odds ratios of 2(0.7) = 1.4 and 2(−0.7) = −1.4 correspond to odds
ratios of 2^2 = 4 and 0.5^2 = 0.25.

The sample log odds ratio, log θ̂, has a less skewed sampling distribution that is bell-shaped. Its
approximating normal distribution has a mean of log θ and a standard error of


SE(log θ̂) = √(1/n11 + 1/n12 + 1/n21 + 1/n22).

The SE decreases as the cell counts increase.
The SE decreases as the cell counts increase.
Because the sampling distribution is closer to normality for log𝜃̂ than 𝜃̂, it is better to construct
confidence intervals for logθ. Transform back (that is, take antilogs, using the exponential
function, discussed below) to form a confidence interval for θ.
A large sample confidence interval for logθ is log𝜃̂ ± 𝑍𝛼⁄2 (SE)
Exponentiating end points of this confidence interval yields one for θ.

Example
Reconsidering the table on aspirin vs MI, the natural logarithm of θ̂ is log(1.832) = 0.605. The SE
of log θ̂ equals

SE(log θ̂) = √(1/189 + 1/10,845 + 1/104 + 1/10,933) = 0.123.

For the population, a 95% confidence interval for log θ equals 0.605 ± 1.96(0.123) = (0.365, 0.846).
The corresponding confidence interval for θ is exp(0.365, 0.846) = [exp(0.365), exp(0.846)] = (1.44,
2.33).
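The sketch below (assuming Python with scipy available) reproduces the odds ratio and its log-scale confidence interval for the same aspirin table.

```python
# Sketch: odds ratio and 95% CI via the log odds ratio, aspirin example
from math import sqrt, log, exp
from scipy.stats import norm

n11, n12 = 189, 10845    # placebo: MI, no MI
n21, n22 = 104, 10933    # aspirin: MI, no MI

theta = (n11 * n22) / (n12 * n21)                     # ~1.83
se = sqrt(1/n11 + 1/n12 + 1/n21 + 1/n22)              # ~0.123
z = norm.ppf(0.975)

lo, hi = log(theta) - z * se, log(theta) + z * se
print(theta, exp(lo), exp(hi))                        # ~1.83, (1.44, 2.33)
```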

2.3.3.2.Relationship between Odds Ratios & Relative Risk


Odds ratios and relative risk are related by

odds ratio = relative risk × (1 − π2)/(1 − π1).

When the success probabilities π1 and π2 are both close to zero, the factor (1 − π2)/(1 − π1) is close
to 1, so the odds ratio and the relative risk take similar values.

2.4 Chi-square tests of independence


Consider the null hypothesis (H0) that cell probabilities equal certain fixed values {πij}. For a
sample of size n with cell counts {nij}, the values {μij = nπij } are expected frequencies, {E(nij )}
when H0 is true. To test a null hypothesis, we compare the observed frequencies nij and the expected
frequencies μij that is {nij − μij}.
The test statistics are functions of the observed and expected frequencies. To judge whether the data
contradict H0, we compare {nij} to {μij}. If H0 is true, nij should be close to μij in each cell. The
larger the differences {nij − μij}, the stronger the evidence against H0. If the null hypothesis is true,
the test statistics used to make such comparisons have large-sample chi-squared distributions. The
two chi-squared test statistics we use are the Pearson chi-squared statistic and the likelihood-ratio
statistic.

2.4.1 Pearson statistics and the chi-square distribution


Suppose the null hypothesis is H0: πij = πij0 for fixed values {πij0}, with expected frequencies
μij = nπij0. The Pearson chi-squared statistic for testing H0 is

X² = Σi Σj (nij − μij)²/μij.
The Chi–Squared Distribution


The “Degrees of Freedom”, df , completely specifies a chi-squared distribution.


2.4.2 Likelihood-Ratio Statistic


This method of testing a hypothesis needs the maximum likelihood estimates of the parameters
assuming
a. the null hypothesis is true (simpler, with restrictions on the parameters);
b. the alternative hypothesis is true (more general, with no, or fewer, restrictions on the parameters).
The test statistic is based on the ratio of the two maximized likelihoods, −2 log(ℒ0/ℒ1); for a
two-way contingency table this takes the form

G² = 2 Σi Σj nij log(nij/μij).

Independence
Situation: Two response variables (either Poisson sampling or multinomial sampling)
Null Hypothesis: Two variables are statistically independent and alternative hypothesis: Two
variables are dependent.
Definition of statistical independence:
H0: πij = πi+ π+j for all i = 1, . . . , I and j = 1, . . . , J.

Statistical dependence means the variables are not statistically independent:

H1: πij ≠ πi+ π+j for at least one i = 1, . . . , I and j = 1, . . . , J.

To test this hypothesis, we assume HO is true.


Expected Frequencies Under Independence.
Given the data, the observed marginal proportions pi+ and p+j are the maximum likelihood estimates
of πi+ and π+j, respectively; that is, the estimated expected frequencies under independence are

μ̂ij = n pi+ p+j = ni+ n+j / n.

2.4.3. Tests of independence


Consider data from the General Social Survey that cross-classify gender and political party
identification. Subjects indicated whether they identified more strongly with the Democratic or
Republican party or as Independents. The table also contains estimated expected frequencies for
H0: independence. For instance, the first cell has

μ̂11 = n1+ n+1 / n = (1557 × 1246)/2757 = 703.7.

Values of X² and G² this large would be rather unusual if the variables were truly independent.
Both test statistics suggest that political party identification and gender are associated.
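Because the full gender by party table is not reproduced here, the sketch below (assuming Python with scipy available) instead runs the Pearson X² and likelihood-ratio G² tests of independence on the smoking and lung cancer table from Section 2.2.1.

```python
# Sketch: Pearson X^2 and likelihood-ratio G^2 tests of independence for a 2x2 table
import numpy as np
from scipy.stats import chi2_contingency

table = np.array([[180, 172],    # smokers:      cancer yes, no
                  [ 90, 346]])   # non-smokers:  cancer yes, no

x2, p_x2, df, expected = chi2_contingency(table, correction=False)
g2, p_g2, _, _ = chi2_contingency(table, correction=False, lambda_="log-likelihood")

print("expected counts:\n", expected)
print("X2 =", round(x2, 2), "P =", p_x2)
print("G2 =", round(g2, 2), "P =", p_g2)
```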

2.5. Testing for independence for ordinal data


Testing Null Hypothesis of Independence


The test is called “Mantel–Haenszel” or “Cochran–Mantel–Haenszel” statistic.

Consider the following example where we had 2 items both with ordinal response options:
Item 1: A working mother can establish just as warm and secure a relationship with her children
as a mother who does not work.
Item 2: Working women should have paid maternity leave.


Extra Power with Ordinal Test

Choice of Scores
The choice of scores often does not make much difference with respect to the value of r and thus
the test results.
In the above example, an alternative scoring that changed the relative spacing between the scores
leads to an increase of r from .203 (from equal spacing) to .207 (from one possible choice of
unequal spacing).


The “best” scores for the above example table that lead to the largest possible correlation, yields r
= .210. (Score from correspondence analysis).
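The ordinal test statistic referred to above appears only in the omitted figures, so the sketch below is an assumption: it uses the standard linear-trend form M² = (n − 1)r², where r is the sample correlation between assigned row and column scores, with a small hypothetical table of counts purely for illustration.

```python
# Sketch (assumed form): linear-trend test M^2 = (n - 1) * r^2 with assigned scores
import numpy as np
from scipy.stats import chi2

counts = np.array([[30, 20, 10],      # hypothetical I x J table of counts
                   [15, 25, 20]])
row_scores = np.array([1, 2])          # assigned (equally spaced) scores
col_scores = np.array([1, 2, 3])

# expand the table into one (row score, column score) pair per observation
x = np.repeat(row_scores, counts.sum(axis=1))
y = np.concatenate([np.repeat(col_scores, counts[i]) for i in range(len(row_scores))])

r = np.corrcoef(x, y)[0, 1]
m2 = (counts.sum() - 1) * r**2
print(r, m2, chi2.sf(m2, df=1))        # correlation, M^2, and its P-value (df = 1)
```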

2.6. Exact inference for small samples


The confidence intervals and tests presented so far in this chapter are large-sample methods. As
the sample size n grows, “chi-squared” statistics such as X2, G2, and M2 have distributions that are
more nearly chi-squared. When samples are small, the distributions of X2, G2, and M2 are not well
approximated by the chi–squared distribution (so p–values for hypothesis tests are not good).
Hence when n is small, one can perform inference using exact distributions rather than large-
sample approximations.

2.6.1 Fisher’s Exact Test for 2 × 2 Tables


For 2 × 2 tables, independence corresponds to an odds ratio of θ = 1. Suppose the cell counts {nij
} result from two independent binomial samples or from a single multinomial sample over the four
cells. A small-sample null probability distribution for the cell counts that does not depend on any
unknown parameters results from considering the set of tables having the same row and column
totals as the observed data. Once we condition on this restricted set of tables, the cell counts have
the hypergeometric distribution.
Fisher's Exact Test is a useful tool for grouped count data, particularly when the samples are
small. The test is generally used on 2 × 2 tables, although it can be extended to an r × c table.
Pearson's chi-squared test is often used for this type of analysis, but when the assumptions about
sample size and cell counts are not met that approach is not acceptable. Note that Fisher's
Exact Test makes the assumption that the margins are fixed values and are not random.
For given row and column marginal totals, n11 determines the other three cell counts. When θ =
1, the probability of a particular value n11 is the hypergeometric probability

P(n11) = [C(n1+, n11) C(n2+, n+1 − n11)] / C(n, n+1).
To illustrate, take the classic example in which the data were collected by Fisher himself, shown in
the table below. The goal of Fisher's experiment was to determine whether someone could tell if
milk or tea was poured first into a cup. To test this claim, Fisher presented eight randomized cups
of tea. Four cups had milk poured first and four cups had tea poured first. The data are known as
the Lady Tasting Tea.

We need the probability of the observed table and of any table more extreme than it; the only more
extreme table has n11 = 4.

Observed table                          Most extreme table

Poured    Guess poured first            Poured    Guess poured first
first     Milk   Tea   Total            first     Milk   Tea   Total
Milk      3      1     4                Milk      4      0     4
Tea       1      3     4                Tea       0      4     4
Total     4      4     8                Total     4      4     8

P(n11 = 3) = [C(4, 3) C(4, 1)] / C(8, 4) = 16/70 = 0.229

P(n11 = 4) = [C(4, 4) C(4, 0)] / C(8, 4) = 1/70 = 0.014
Therefore, the p-value for this test is 0.229 + 0.014 = 0.243. This result does not establish any
association between the guess on what was poured first and what actually was poured first. It is
difficult to determine an association with a sample this small.
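The sketch below (assuming Python with scipy available) reproduces this one-sided exact P-value for the Lady Tasting Tea table.

```python
# Sketch: Fisher's exact test for the Lady Tasting Tea data
from scipy.stats import fisher_exact, hypergeom

table = [[3, 1],
         [1, 3]]

odds_ratio, p_value = fisher_exact(table, alternative="greater")
print(odds_ratio, p_value)                    # one-sided P ~ 0.243

# the same P-value from the hypergeometric distribution directly
print(hypergeom.sf(2, 8, 4, 4))               # P(n11 >= 3) = 0.229 + 0.014
```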
Exercise:
Students were assigned randomly to one of two groups. The first group was the control group, in
which professors wore ordinary shoes, and the second group was the treatment group, in which
professors wore Nikes, to see whether students then purchased Nikes. The data were summarized
as follows.

Then compute the exact p-value for the following tests:

1. H0: θ = 1 vs H1: θ ≠ 1 (“two-sided” test)
2. H0: θ = 1 vs H1: θ < 1 (“left-tail” test)
3. H0: θ = 1 vs H1: θ > 1 (“right-tail” test)

(Hint: for the left-tail test, find the odds ratio of the observed table, θ̂ = n11 n22/(n12 n21), and
compute the probabilities of the tables whose odds ratios are less than the odds ratio from the
observed table.)

2.7 Association in Three-Way Tables


An important part of most research studies is the choice of control variables. In studying the effect of an explanatory variable X on a response variable Y, we should adjust for confounding variables that can influence that relationship because they are associated both with X and with Y. A third variable (not the exposure or outcome variable of interest) that distorts the observed relationship between the independent and dependent variables is called a confounding variable. Confounding is a confusion of effects; it is a nuisance and should be controlled for if possible.
Example
Result of some study: smoking is protective against cancer.
Most of the smokers are young and most of the non-smokers are old. Further, most of the persons with the disease are old. Age is a confounder.
Reason for controlling confounding:
• To obtain a more precise (accurate) estimate of the true association between the exposure and disease under study.


Methods to control confounding:


i. Design:
1. Randomization
2. Restriction
3. Matching (Analysis also)
ii. Analysis:
Stratification
This is the evaluation of the association between the independent and response variables within homogeneous categories or strata of the confounding variable.
 Intuitively appealing, straightforward, and enhances understanding of intricacies of the
data
iii. Multivariate Analysis
 A technique that takes into account a number of variables simultaneously
 Involves construction of a mathematical model that efficiently describes the
association between exposure and disease, as well as other variables that may
confound or modify the effect of exposure
 Can simultaneously control for multiple confounders when stratified analysis is
impractical
e.g., a multiple logistic regression model
Otherwise, an observed XY association may merely reflect effects of those variables on X and Y. This is especially vital for observational studies, where one cannot remove effects of such variables by randomly assigning subjects to different treatments.
Consider a study of the effects of smoking on lung cancer, using a cross-sectional design that compares lung cancer rates between smokers and nonsmokers. In doing so, the study should attempt to control for age, sex, socioeconomic status, or other factors that might relate both to smoking status and to whether one has lung cancer. A statistical control would hold such variables constant while studying the association. Without such controls, results will have limited usefulness.
Suppose that smokers tend to be younger than nonsmokers and that younger people are less likely to have lung cancer. Then, a lower proportion of lung cancer cases among smokers may merely reflect their lower average age and not an effect of smoking.


Alternatively, a finding that smoking is protective against a disease may be due to the fact that most of the smokers are male and the non-smokers are female, while most of the persons with the disease are female.
Including control variables in an analysis requires a multivariate rather than a bivariate analysis. We illustrate basic concepts for a single control variable Z, which is categorical. A three-way contingency table displays counts for the three variables.
Examples of 3-way tables:
 Smoking × cancer × age
 Smoking × cancer × gender
 Group × response × Z (hypothetical), and so on

Slices of this table are “Partial Tables”.


There are three ways to slice this table: (1) K frontal planes, i.e. XY tables for each level of Z; (2) J vertical planes, i.e. XZ tables for each level of Y; and (3) I horizontal planes, i.e. YZ tables for each level of X.
The XY tables for each level of Z (the frontal planes of the box) are the partial tables.

Conditional or “Partial” Odds Ratios

Conditional odds ratios are odds ratios between two variables for fixed levels of the third variable. For a fixed level of Z, the conditional XY association given the kth level of Z is

    θXY(k) = ( μ11k μ22k ) / ( μ12k μ21k ).

Conditional odds ratios are computed using the partial tables, and are sometimes referred to as
measures of “partial association”.

Marginal odds ratios are the odds ratios between two variables in the marginal table. For example, the XY marginal odds ratio is given by

    θXY = ( μ11+ μ22+ ) / ( μ12+ μ21+ ),

where μij+ = Σk μijk is obtained by summing over the levels of Z.
Marginal association can be very different from conditional association.


Marginal and Conditional Associations
Independence means "no association" and dependence means "association".
Marginal independence means that θXY = 1, and marginal dependence means that θXY ≠ 1.

Conditional independence means that θXY(k) = 1 for all k = 1, . . . , K.

Conditional dependence means that θXY(k) ≠ 1 for at least one k = 1, . . . , K.

Marginal independence does not imply conditional independence, and conditional independence does not imply marginal independence.
For both the goodness of fit test and the test of independence, the chi-square statistic is the same.


Conditional versus Marginal Associations: Death Penalty Example


A 2 × 2 × 2 contingency table – two rows, two columns, and two layers – from an article that
studied effects of racial characteristics on whether subjects convicted of homicide receive the death
penalty. The 674 subjects were the defendants in indictments involving cases with multiple
murders. The variables are Y = death penalty verdict, having categories (yes, no), and X = race of
defendant and Z = race of victims, each having categories (white, black). We study the effect of
defendant’s race on the death penalty verdict, treating victims’ race as a control variable. The table
has a 2 × 2 partial table relating defendant’s race and the death penalty verdict at each level of
victims’ race.

We use these to describe the conditional associations between defendant’s race and the death
penalty verdict, controlling for victims’ race. When the victims were white, the death penalty was
imposed 22.9 − 11.3% = 11.6% more often for black defendants than for white defendants. When
the victim was black, the death penalty was imposed 2.8 − 0.0% = 2.8% more often for black
defendants than for white defendants. Thus, controlling for victims’ race by keeping it fixed, the
percentage of “yes” death penalty verdicts was higher for black defendants than for white
defendants.
The bottom portion of the above table displays the marginal table for defendant’s race and the
death penalty verdict. We obtain it by summing the cell counts in table over the two levels of
victims’ race, thus combining the two partial tables (e.g., 11 + 4 = 15). We see that, overall, 11.0%
of white defendants and 7.9% of black defendants received the death penalty. Ignoring victims’
race, the percentage of “yes” death penalty verdicts was lower for black defendants than for white
defendants. The association reverses direction compared with the partial tables.
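To make the reversal concrete, the sketch below computes the conditional (partial) and marginal odds ratios with NumPy, assuming the usual cell counts for this death penalty table (which are consistent with the percentages quoted above).

    import numpy as np

    # counts[z, x, y]: z = victims' race, x = defendant's race, y = death penalty verdict,
    # with order (white, black) for the races and (yes, no) for the verdict
    counts = np.array([[[53, 414],    # white victims, white defendant
                        [11,  37]],   # white victims, black defendant
                       [[ 0,  16],    # black victims, white defendant
                        [ 4, 139]]])  # black victims, black defendant

    def odds_ratio(t):
        # odds ratio (n11*n22)/(n12*n21) of a 2x2 table
        return (t[0, 0] * t[1, 1]) / (t[0, 1] * t[1, 0])

    for z, label in enumerate(["white victims", "black victims"]):
        print(label, odds_ratio(counts[z]))      # conditional odds ratios (both below 1)

    marginal = counts.sum(axis=0)                # collapse over victims' race
    print("marginal", odds_ratio(marginal))      # marginal odds ratio (above 1): the direction reverses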


2.7.2. Simpson’s Paradox


Simpson's paradox refers to the fact that the apparent marginal relationship between two variables can be different from (and even reversed in direction compared with) the conditional relationship given a third variable. We reviewed an example where the probability of receiving the death penalty was higher for white defendants than for black defendants marginally. However, when the race of the victim was considered, the probability of receiving the death penalty was higher for black defendants than for white defendants both when the victims were white and when they were black.
Simpson's paradox illustrates that variables correlated with both the explanatory and outcome variables of interest can distort the estimated effect. One strategy to address such confounding is to stratify by the confounding variable and combine the strata-specific estimates. Stratifying does not come for free, however, as unnecessary stratification can reduce the precision of estimates.

2.7.3 Conditional Independence versus Marginal Independence


If X and Y are independent in each partial table, then X and Y are said to be conditionally
independent, given Z. All conditional odds ratios between X and Y then equal 1. Conditional independence of X and Y, given Z, does not imply marginal independence of X and Y. That is, when
odds ratios between X and Y equal 1 at each level of Z, the marginal odds ratio may differ from 1.

2.8. Chi-square test of homogeneity


Definition: The association between variables X, Y, and Z is "homogeneous" if the following three conditions hold:

    θXY(1) = θXY(2) = · · · = θXY(K),
    θXZ(1) = θXZ(2) = · · · = θXZ(J),
    θYZ(1) = θYZ(2) = · · · = θYZ(I).

There is then "no interaction between any two variables in their effects on the third variable", i.e. there is "no three-way interaction" among the variables.

When these three conditions (equations) do not hold, the conditional odds ratios for a pair of variables are not all equal; they differ depending on the level of the third variable.

If one of the three conditions holds, then the other two also hold.

Conditional independence is a special case of homogeneous association: for example, θXY(k) = 1 for all k = 1, . . . , K.

Testing Homogeneity of Odds Ratios


To test for homogeneous association we only need to test one of these conditions, for example

    H0: θXY(1) = θXY(2) = · · · = θXY(K).

Given estimated expected frequencies μ̂ijk computed assuming that H0 is true, the test statistic we use is the "Breslow–Day" statistic, which has the same form as Pearson's X²:

    X² = Σ (nijk − μ̂ijk)² / μ̂ijk,

where, under conditional independence, the estimated expected frequencies are μ̂ijk = (ni+k n+jk) / n++k.
If H0 is true, then the Breslow–Day statistic has an approximate chi-squared distribution with df = K − 1.


If the null hypothesis of homogeneous association is true, then θ̂MH (the Mantel–Haenszel estimator) is a good estimate of the common odds ratio. When computing estimated expected frequencies, we want them such that the odds ratio computed on each of the K partial tables equals the Mantel–Haenszel estimate of the common odds ratio.

    θ̂MH = [ Σk ( n11k n22k / n++k ) ] / [ Σk ( n12k n21k / n++k ) ],

and the expected frequencies are chosen so that ( μ̂11k μ̂22k ) / ( μ̂12k μ̂21k ) = θ̂MH for each k = 1, . . . , K.
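In practice these quantities are rarely computed by hand. The sketch below uses statsmodels (assuming it is installed) on two hypothetical 2 × 2 partial tables, since the data for the worked example are not reproduced here.

    import numpy as np
    from statsmodels.stats.contingency_tables import StratifiedTable

    # hypothetical 2x2 partial tables, one for each level k of the control variable Z
    tables = [np.array([[30, 10], [15, 25]]),
              np.array([[20, 20], [10, 30]])]

    st = StratifiedTable(tables)
    print(st.oddsratio_pooled)       # Mantel-Haenszel estimate of the common odds ratio
    print(st.test_equal_odds())      # Breslow-Day test of homogeneous association
    print(st.test_null_odds())       # CMH test that the common odds ratio equals 1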

Examples: Testing Homogeneity of Association

Management × Supervisor × Worker


2.9. Chi-square Goodness of Fit Test


The chi square goodness of fit test begins by hypothesizing that the distribution of a variable
behaves in a particular manner. For example, a student may observe the set of grades for a class,
and suspect that the professor allocated the grades on the basis of a normal distribution. Another
possibility is that a researcher claims that the sample selected has a distribution which is very close
to the distribution of the population. While no general statement can be provided to cover all these
possibilities, what is common to all is that a claim has been made concerning the nature of the
whole distribution.

Goodness of Fit to a Theoretical Distribution.


Suppose that a researcher wishes to test the suitability of a model such as the normal distribution
or the binomial probability distribution, as an explanation for an observed phenomenon. For
example, a grade distribution for grades in a class may be observed, and someone claims that the
instructor has allocated the grades on the basis of a normal distribution. The test of the goodness
of fit then looks at the observed number of grades in each category, determines the number that
would be expected if the grades had been exactly normally distributed, and compares the two using the chi-square goodness of fit statistic.
In general this test is the same as the earlier goodness of fit tests except for one alteration: the degrees of freedom must take into account the number of parameters estimated for the distribution being fitted.
Example 10.3.4 Grade Distribution in Social Studies 201

Grade               All Arts (per cent)   Winter 1990 (number)
Less than 50        8.3                    2
50s                 15.4                   7
60s                 24.7                  10
70s                 30.8                  15
80s                 17.8                   8
90s                 3.0                    1
Total               100.0                 43
Mean                68.2                  68.8
Standard deviation                        12.6

Use the data to conduct two chi square goodness of fit tests
1. Test whether the model of a normal distribution of grades adequately explains the grade
distribution of Social Studies 201.
2. Test whether the grade distribution for Social Studies 201 differs from the grade
distribution for the Faculty of Arts as a whole. For each test, use the 0.20 level of
significance.
Solution. For the first test, it is necessary to determine the grade distribution that would exist if the grades had been distributed exactly as the normal distribution. That is, the normal curve with mean and standard deviation the same as the actual Social Studies 201 distribution will be used. This means that the grade distribution for the normal curve with mean μ = 68.8 and standard deviation σ = 12.7 is used to determine the grade distribution if the grades were normally distributed. The Z values are −1.48, −0.69, 0.09, 0.88, and 1.67 for the X values 50, 60, 70, 80, and 90, respectively. Using the normal table, the proportions falling in each grade category were obtained (the pi column of the table below). These proportions were then multiplied by 43, the total of the observed values, to obtain the expected number of grades in each of the categories into which the grades have been grouped. In order to apply the X² test properly, each of the expected values should exceed 5.
The less than 50 category and the 90s category both have fewer than 5 expected cases. In this test, the 90s have only 2 expected cases, so this category is merged with the grades in the 80s. For the grades less than 50, even though there are only 3 expected cases, these have been left in a category of their own. While this may bias the results a little, the effect should not be too great. The calculation of the X² statistic is given in the following table.
Category   Grade          Observed ni   Proportion pi   μi = 43·pi   ni − μi   (ni − μi)²/μi
1          Less than 50   2             0.0694          3.0          −1.0      0.333
2          50s            7             0.1757          7.6          −0.6      0.047
3          60s            10            0.2908          12.5         −2.5      0.500
4          70s            15            0.2747          11.8          3.2      0.868
5          80s and 90s    9             0.1894          8.1           0.9      0.100
Total                     43            1.000           43.0                   1.848

(The 90s, with an expected count of 2.0, have been merged with the 80s as described above.)

    X² = Σ (ni − μi)²/μi = 0.333 + 0.047 + 0.500 + 0.868 + 0.100 = 1.848
Suppose that a variable has a frequency distribution with k categories into which the data have been grouped. The frequencies of occurrence of the variable, for each category of the variable, are called the observed values.
The chi-square goodness of fit test works by determining how many cases there would be in each category if the sample data were distributed exactly according to the claim. These are termed the expected number of cases for each category. The total of the expected number of cases is always made equal to the total of the observed number of cases.
The null hypothesis is that the observed number of cases in each category is exactly equal to the expected number of cases in each category. The alternative hypothesis is that the observed and expected numbers of cases differ sufficiently to reject the null hypothesis.
Let ni be the observed number of cases in category i and μi be the expected number of cases in category i, for each of the k categories i = 1, 2, 3, . . . , k, into which the data have been grouped. The hypotheses are

    H0: ni = μi   versus   H1: ni ≠ μi for at least one i

and the test statistic is
    X² = Σ_{i=1}^{k} (ni − μi)² / μi


There are k – 1 degrees of freedom. Large values of this statistic lead the researcher to reject the
null hypothesis; smaller values mean that the null hypothesis cannot be rejected.
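The arithmetic of the grade example can be reproduced in a few lines, assuming SciPy is available; the grouped counts and normal-curve proportions are those from the table above.

    import numpy as np
    from scipy.stats import chisquare, chi2

    observed = np.array([2, 7, 10, 15, 9])                           # 80s and 90s merged
    proportions = np.array([0.0694, 0.1757, 0.2908, 0.2747, 0.1894])
    expected = 43 * proportions

    stat = ((observed - expected) ** 2 / expected).sum()
    print(stat)                                                      # approx. 1.85

    # scipy.stats.chisquare returns the same statistic; since the normal mean and
    # standard deviation were estimated, the degrees of freedom here are 5 - 1 - 2 = 2
    stat2, _ = chisquare(observed, expected)
    print(stat2, chi2.sf(stat2, df=2))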


CHAPTER 3
3 LOGISTIC REGRESSION
Learning objectives:
After the completion of this chapter, the students will be able to:

Logistic regression is a technique used when the dependent variable is dichotomous or binary, so that Yi = 1 or 0. The distribution of Yi is binomial and we are interested in modeling the conditional probability that Yi = 1, for a given X = x, as a function of the independent variables. As with the other techniques, independent variables may be either continuous or categorical.

Why not just use linear regression? Because in the case of a binary response variable, the
assumptions of linear regression are not valid:

 The relationship between X and Y is nonlinear


 Error terms are heteroscedastic
 Error terms are not normally distributed

If you proceeded in spite of these violations, the results would be:

 Predicted values that are not possible (greater than 1 or smaller than 0)
 Magnitudes of the effects of independent variables that may be greatly underestimated
Relationships between π(x) and x are usually nonlinear rather than linear. A fixed change in x may
have less impact when π is near 0 or 1 than when π is near the middle of its range.

In practice, π(x) often either increases continuously or decreases continuously as x increases. The S-shaped curves displayed in Figure 3.1 are often realistic shapes for the relationship. The most important mathematical function with this shape has the formula

    π(x) = e^(α + βx) / ( 1 + e^(α + βx) )

2.4. Interpreting the logistic regression model


To begin, suppose there is a single explanatory variable X, which is quantitative. Remember that a logistic regression model with a single predictor is called simple logistic regression. For a binary response variable Y, recall that π(x) denotes the "success" probability at value x. This probability is the parameter for the binomial distribution.

    π(x) = e^(α + βx) / ( 1 + e^(α + βx) )    ....(3.1)
This is called the logistic regression function. The corresponding logistic regression model form is given in equation (3.2); the logistic regression model has a linear form for the logit of this probability:

    logit[π(x)] = log( π(x) / (1 − π(x)) ) = α + βx    ....(3.2)

In logistic regression, a logistic transformation of the odds (referred to as the logit) serves as the dependent variable. One can rearrange equation (3.2) to obtain the expression

    π(x) / (1 − π(x)) = e^(α + βx)    ....(3.3)

where e is a constant equal to 2.718....

2.4.3. Linear Approximation Interpretations


The logistic regression formula (3.2) indicates that the logit increases by β for every 1-unit increase in x. Most of us do not think naturally on a logit (logarithm of the odds) scale, so we need to consider alternative interpretations.
The parameter β in equations (3.1) and (3.2) determines the rate of increase or decrease of the S-
shaped curve for π(x). The sign of β indicates whether the curve ascends (β > 0) or descends (β <
0), and the rate of change increases as |β| increases.
When β = 0, the right-hand side of equation (3.1) simplifies to a constant. Then, π(x) is identical
at all x, so the curve becomes a horizontal straight line. The binary response Y is then independent
of X.
Figure 3.1 shows the S-shaped appearance of the model for π(x), as fitted for the example in the
following subsection. Since it is curved rather than a straight line, the rate of change in π(x) per 1-
unit increase in x depends on the value of x. A straight line drawn tangent to the curve at a particular
x value, such as shown in Figure 3.1 describes the rate of change at that point. For logistic
regression parameter β, that line has slope equal to βπ(x)[1 − π(x)]. For instance, the line tangent to the curve at x for which π(x) = 0.50 has slope β(0.50)(0.50) = 0.25β; by contrast, when π(x) =
0.90 or 0.10, it has slope 0.09β. The slope approaches 0 as the probability approaches 1.0 or 0.

The x value at which π(x) = 0.5 is called the median effective level; it represents the level at which each outcome has a 50% chance. That x value relates to the logistic regression parameters by x = −α/β.

Figure 3.1. Linear approximation to logistic curve

 There are many uses of logistic regression such as:

1. To model the probabilities of certain conditions or states as a function of some


explanatory variables such as to identify “risk” factors for certain conditions (i.e. divorce,
disease, adjustment, etc.). For example one might want to model whether or not one has
diabetes as a function of weight, plasma insulin, fasting plasma glucose, and test plasma
glucose intolerance.

2. To describe differences between individuals from separate groups as a function of some


explanatory variables, also known as descriptive discriminant analysis. For example, one might want to test whether a student's attendance in an academic program in high school is a function of achievement test scores, desired occupation, and SES.


3. To predict the probabilities that individuals fall into one of two categories as a function
of some explanatory variables. For example, one might want to predict whether or not an
examinee correctly answers a test item as a function of race and gender.

4. To classify individuals into one of two categories on the basis of the explanatory
variables.

Example

Dependent or response variable: Y = 1 if the individual has coronary heart disease, and Y = 0 if the individual has no heart disease.

Explanatory variable: x = the individual's age.

Model:

    log( π̂(x) / (1 − π̂(x)) ) = α̂ + β̂x = −5.3 + 0.11x,   so that   π̂(x) = exp{−5.3 + 0.11x} / ( 1 + exp{−5.3 + 0.11x} )

 The rate of change in π(x) for a unit change in x is not constant but varies with x.

 At any given value of x, the rate of change corresponds to the slope of the curve at that value of x, namely βπ(x)(1 − π(x)). For example, let x = 20; then

    π̂(20) = exp{−5.3 + 0.11(20)} / ( 1 + exp{−5.3 + 0.11(20)} ) = 0.045 / 1.045 = 0.043

    1 − π̂(20) = 1 − 0.043 = 0.957

Therefore, the slope of the line at x = 20 is βπ̂(20)(1 − π̂(20)) = 0.11(0.043)(0.957) = 0.0045.

 The slope is greatest where π(x) = 1 − π(x) = 0.5, which occurs when x = −α/β. For our example the slope is greatest when x = 5.3/0.11 = 48.18, because π̂(48.18) = 1 − π̂(48.18) = 0.5, and the slope at x = 48.18 is 0.11(0.5)(0.5) = 0.0275.


The median effective level for our example is 48.18 and represents the point at which having
coronary heart disease is equally likely as not having coronary heart disease. We can use our
model to predict other probabilities as well.
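A minimal sketch, assuming the fitted values α̂ = −5.3 and β̂ = 0.11 used above, reproduces these calculations:

    import numpy as np

    alpha, beta = -5.3, 0.11

    def pi_hat(x):
        # fitted probability of coronary heart disease at age x
        return np.exp(alpha + beta * x) / (1 + np.exp(alpha + beta * x))

    p20 = pi_hat(20)
    print(p20, beta * p20 * (1 - p20))    # approx. 0.043 and 0.0045 (slope at x = 20)

    x_median = -alpha / beta              # median effective level
    print(x_median, beta * 0.5 * 0.5)     # approx. 48.18 and 0.0275 (slope at the median)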

2.4.4. Odds Ratio Interpretation


An important interpretation of the logistic regression model uses the odds and the odds ratio.

Another way to interpret our model, which may be a bit easier (at least in terms of computation, and perhaps also conceptually), is to utilize the relationship between the model and the odds ratio. The model logit(π(x)) = log( π(x) / (1 − π(x)) ) = α + βx can be written as a multiplicative model

    π(x) / (1 − π(x)) = e^(α + βx) = e^α (e^β)^x

 This model directly models the odds ratio: odds increase multiplicatively with x, such that a 1-unit increase in x multiplies the odds by e^β. If β = 0, then e^0 = 1, so the odds do not change.

 The log of the odds changes linearly with x; however the log of the odds is not an intuitively
easy or natural scale to interpret.

 Here β̂ = 0.111, so the odds ratio for a 1-unit change in x is e^0.111 ≈ 1.12: the odds of developing heart disease at age x + 1 are about 1.12 times the odds at age x.

 A 10-year change in age leads to an odds ratio increase of e^(10 × 0.111) ≈ 3.0.

 Although the rate of change in π(x) is not constant for equal changes in x, the odds ratio interpretation leads to a constant multiplicative rate of change.

2.5. Inference for Logistic Regression


Any unknown parameters in the logistic model are estimated by the maximum likelihood method of estimation.

Likelihood Function for Logistic Regression


Because logistic regression predicts probabilities, rather than just classes, we can fit it using the likelihood. For each training data-point, we have a vector of features, xi, and an observed class, yi. The probability of that class was either π(xi), if yi = 1, or 1 − π(xi), if yi = 0. The likelihood is then

    L(α, β) = Π_{i=1}^{n} π(xi)^{yi} ( 1 − π(xi) )^{1 − yi}

(We could substitute in the actual equation for π, but things will be clearer in a moment if we don't.) The log-likelihood turns products into sums:

    ℓ(α, β) = Σ_{i=1}^{n} [ yi log π(xi) + (1 − yi) log(1 − π(xi)) ]
            = Σ_{i=1}^{n} log(1 − π(xi)) + Σ_{i=1}^{n} yi log( π(xi) / (1 − π(xi)) )
            = Σ_{i=1}^{n} log(1 − π(xi)) + Σ_{i=1}^{n} yi (α + βxi)
            = − Σ_{i=1}^{n} log( 1 + e^(α + βxi) ) + Σ_{i=1}^{n} yi (α + βxi)

Typically, to find the maximum likelihood estimates we would differentiate the log-likelihood with respect to the parameters, set the derivatives equal to zero, and solve. To start that, take the derivative with respect to one component of β, say βj.

We are not going to be able to set this to zero and solve exactly. (That is a transcendental equation, and there is no closed-form solution.) We can, however, solve it approximately using numerical methods.
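The sketch below illustrates one such numerical approach on simulated data: Newton–Raphson iteration of the score equations for simple logistic regression. It is a minimal sketch (in practice a library routine such as statsmodels' Logit would be used), and the data are simulated rather than taken from the example.

    import numpy as np

    rng = np.random.default_rng(0)
    x = rng.uniform(20, 70, size=200)                 # simulated ages
    p = 1 / (1 + np.exp(-(-5.3 + 0.11 * x)))          # true success probabilities
    y = rng.binomial(1, p)                            # simulated binary responses

    X = np.column_stack([np.ones_like(x), x])         # design matrix with an intercept
    b = np.zeros(2)                                   # starting values for (alpha, beta)
    for _ in range(25):
        pi = 1 / (1 + np.exp(-X @ b))
        score = X.T @ (y - pi)                        # gradient of the log-likelihood
        info = X.T @ (X * (pi * (1 - pi))[:, None])   # information matrix
        b = b + np.linalg.solve(info, score)          # Newton-Raphson update

    print(b)                                          # maximum likelihood estimates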

2.5.3. Confidence Intervals for Effects


A large-sample (1 − α)100% Wald confidence interval for the parameter β in the logistic regression model, logit[π(x)] = α + βx, is

    β̂ ± z_(α/2) (SE),

where SE is the standard error of β̂ (here α/2 refers to the normal-tail probability, and NOT the intercept in the model).

 To more readily interpret this confidence interval we can exponentiate the endpoints to
determine the effect on the odds for a 1-unit increase in x.

    0.111 ± 1.96(0.024) = (0.064, 0.158)

To get a confidence interval for the effect of age on the odds, that is for e^β, simply take the exponential of the endpoints:

    ( e^0.064, e^0.158 ) = (1.066, 1.17)

 We can also get an interval for the linear approximation to the curve, whose slope at a given x is βπ(x)(1 − π(x)). We can multiply the endpoints of the interval for β by π(x)(1 − π(x)). For example, suppose we want to determine the increase in the probability of developing heart disease, given a 1-unit change in x, at a value of x for which π(x) = 0.25; then we multiply the endpoints of the interval for β by (0.25)(0.75) = 0.1875.

Using our example we would obtain:

    (0.0211 × 0.1875, 0.0313 × 0.1875) = (0.00396, 0.00587)

So this is the rate of increase in the probability for values of x near the point where π(x) = 0.25. It should be noted that, due to the large range of the explanatory variable, a 1-unit increase is not very noticeable, nor is it likely given the scale.
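A quick check of the interval arithmetic, assuming the estimate β̂ = 0.111 and standard error 0.024 quoted above:

    import numpy as np

    beta_hat, se = 0.111, 0.024
    lo, hi = beta_hat - 1.96 * se, beta_hat + 1.96 * se
    print(lo, hi)                                # approx. (0.064, 0.158)
    print(np.exp(lo), np.exp(hi))                # interval for the odds ratio, approx. (1.07, 1.17)

    # interval for the linear-approximation slope at an x where pi(x) = 0.25
    print(lo * 0.25 * 0.75, hi * 0.25 * 0.75)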

2.5.4. Significance Testing


To test hypotheses one can use one of the following methods:
1. Wald test
2. Likelihood ratio test
In the case of a simple logistic regression (i.e., only a single predictor), the tests of overall fit and
the tests of the predictor test the same hypothesis: is the predictor useful in predicting the outcome?

(Note that β = 0 corresponds to an odds ratio of e^β = 1.) Thus, for simple logistic regression both the likelihood ratio test for the full model and the Wald test for the significance of the predictor test the same hypothesis.

The Likelihood Ratio and Wald test of the significance of a single predictor are said to be
“asymptotically” equivalent, which means that their significance values will converge with larger
N. With small samples, however, they are not likely to be equal and may sometimes lead to
different statistical conclusions (i.e., significance). The likelihood ratio test for a single predictor
is usually recommended by logistic texts as the most powerful (although some authors have stated
that neither the Wald nor the LR test is superior).

Then, to test the hypothesis that the probability of success is independent of X (H0: β = 0) against the two-sided alternative (H1: β ≠ 0), we use

    z = ( β̂ − 0 ) / SE,

which has an approximate N(0, 1) distribution under H0.

Using our example, we would obtain:

    z = (0.111 − 0) / 0.024 = 4.625

We can also use the Wald statistic, which is simply the squared z-statistic with 1 d.f.:

    X² = ( β̂ / SE )² ~ χ²(1)

Likelihood ratio test statistic


Test statistic: LR = −2(L0 − L1), where L0 is the log of the maximum likelihood for the model logit π(x) = α and L1 is the log of the maximum likelihood for the model logit π(x) = α + βx.
If the null is true, then the likelihood ratio test statistic is approximately chi-square distributed with
df = 1.


Although the Wald test is adequate for large samples, the likelihood-ratio test is a more powerful alternative to the Wald statistic. The test statistic is −2(L0 − L1), where L0 is the log of the maximum likelihood for the more parsimonious (less complex) model, logit π(x) = α, and L1 is the log of the maximum likelihood for the more complex model, logit π(x) = α + βx. Once again, to conduct this test you must fit both models to obtain both likelihoods and calculate the test statistic by hand. Suppose that we fit both of these models to our data and we obtain:

L1 = −351.9207 and L0 = −415.6749

Therefore our log-likelihood ratio test statistic is

−2(L0 − L1) = −2(−415.6749 − (−351.9207)) = 127.5084 with 1 df

 We can also look at the confidence intervals for the true probabilities, under the model.

These make use of the covariance matrix of the model parameter estimates.
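A quick check of the likelihood-ratio computation, using the two maximized log-likelihoods quoted above and assuming SciPy is available:

    from scipy.stats import chi2

    L0, L1 = -415.6749, -351.9207      # intercept-only model and model with the predictor
    lr = -2 * (L0 - L1)
    print(lr, chi2.sf(lr, df=1))       # 127.5084 and a p-value that is essentially zero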

2.6. Logistic regression with categorical predictors


Logistic regression, like ordinary regression, can have multiple explanatory variables. Some or all
of those predictors can be categorical, rather than quantitative. We look at how to include
categorical predictors, often called factors, into the model.


2.6.3. Indicator Variables Represent Categories of Predictors


Suppose a binary response Y has two binary predictors, X and Z. The data are then
displayed in a 2 × 2 × 2 contingency table, such as we’ll see in the example in the
next subsection.
Let x and z each take values 0 and 1 to represent the two categories of each
explanatory variable. The model for P(Y = 1),
logit[P(Y = 1)] = α + β1x + β2z has main effects for x and z. The variables x and z are
called indicator variables.
They indicate categories for the predictors. Indicator variables are also called dummy
variables. For this coding, the following table shows the logit values at the four combinations of values of the two predictors:

    x = 0, z = 0:  logit = α
    x = 1, z = 0:  logit = α + β1
    x = 0, z = 1:  logit = α + β2
    x = 1, z = 1:  logit = α + β1 + β2
The difference between the logits at x = 1 and at x = 0, for a fixed category of z, equals β1. This difference between two logits equals a difference of log odds; equivalently, it equals the log of the odds ratio between X and Y at that category of Z. Thus, exp(β1) equals the conditional odds ratio between X and Y. Controlling for Z, the odds of "success" at x = 1 equal exp(β1) times the odds of success at x = 0. This conditional odds ratio is the same at each category of Z. The lack of an interaction term implies a common value of the odds ratio for the partial tables at the two categories of Z, so the model satisfies homogeneous association.
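The sketch below simply evaluates the four logits for assumed (hypothetical) parameter values, showing that the change in the logit between x = 1 and x = 0 is β1 at either level of z:

    import numpy as np

    alpha, b1, b2 = -1.0, 0.8, 0.5                  # hypothetical parameter values
    for z in (0, 1):
        for x in (0, 1):
            print(x, z, alpha + b1 * x + b2 * z)    # logit at each (x, z) combination

    # the conditional XY odds ratio, the same at z = 0 and at z = 1
    print(np.exp(b1))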


Conditional independence exists between X and Y, controlling for Z, if β1 = 0. In that case the common odds ratio equals 1. The simpler model is then logit[P(Y = 1)] = α + β2z.
2.7. Multiple Logistic Regression


Next we will consider the general logistic regression model with multiple explanatory variables. Denote the k predictors for a binary response Y by x1, x2, . . . , xk. The model for the log odds is

    logit[P(Y = 1)] = α + β1x1 + β2x2 + · · · + βkxk.

The parameter βi refers to the effect of xi on the log odds that Y = 1, controlling for the other xs. For example, exp(βi) is the multiplicative effect on the odds of a 1-unit increase in xi, at fixed levels of the other xs. It measures the association between Y and xi adjusted for the other predictors in the model.
Summary


CHAPTER 4
4. BUILDING AND APPLYING LOGISTIC REGRESSION MODELS

Learning objectives:
After the completion of this chapter, the students will be able to:

Having learned the basics of logistic regression, we now study issues relating to building a model
with multiple predictors and checking its fit. After choosing a preliminary model, model checking
explores possible lack of fit.

3.3. Strategies in model selection


In this section we look at model selection strategies for identifying the most relevant covariates in logistic regression models.

Exploratory versus Confirmatory studies


In confirmatory studies an existing or proposed model is to be confirmed. We do this by fitting the model to a random sample, measuring model fit, checking the interpretation of the parameter estimates, etc.
In exploratory studies the intent is to determine a model that explains the observed data the best.
Once such a model has been chosen a confirmatory study should be conducted to validate the
model.
In the exploratory process many different models are considered and should be compared with
each other. The final model should be most parsimonious model, which describes the data well.
The selection process becomes more challenging as the number of explanatory variables increases,
because of the rapid increase in possible effects and interactions. There are two competing goals:
1. The model should be complex enough to fit the data well, and
2. The model should be simple enough to be easy to interpret.
How many predictors?


Guideline: The possible number of predictors to be included with the final model is limited by the
sample size. For each predictor in the final model the sample should include at least 10 records of
each outcome of the response.
For example, consider a large sample with n = 500, including 250 successes and 250 failures, then
the final model can include up to 250/10=25 predictors.
But if the sample includes only 50 successes and 450 failures, the final model should not include
more than 50/10=5 predictors.
Another problem caused by the inclusion of many predictors is multicollinearity.
This means that some predictors could be linearly dependent (or close to linearly dependent) and
therefore the estimates become imprecise.
For example, if we have two predictors x1 and x2, where x1 = a + b·x2 (e.g. x1 is the temperature in centigrade and x2 is the temperature in Fahrenheit), then because of this linear relationship one of the two variables is redundant. Now assume that the relationship is not perfect, but very close. Standard errors are very large in the presence of collinearity, resulting in tests for the effects of explanatory variables on the response not being significant even when the model utility test is significant.
Principles on model selection:
When choosing predictors for the final model among the variables observed, the following
principles apply.
1. Predictors deemed essential for the model should always be included, regardless of them
being "significant" or not.
2. When including interaction between factors with the model the lower order terms for those
factors should be included with the model as well.
3.3.3. Stepwise variable selection algorithms
Two common strategies for adding or removing variables in a logistic regression are called
backward-selection and forward-selection. These techniques are often referred to
as stepwise model selection strategies, because they add or delete one variable at a time as they
"step" through the candidate predictors.
These algorithms can be helpful in building a model but have to be used with caution and the final
model has always to be reviewed by the researcher.


In the backward variable selection algorithm the model is improved step by step by dropping a
variable from the model at each step. The variable which shows the "least significant" effect when
correcting for the other variables in the model is dropped. The process stops if another step does
not show a further improvement of the model.
In the forward variable selection algorithm the model is improved step by step by adding an extra
variable to the model at each step. The variable which shows the "most significant" effect when
correcting for the other variables in the model is added. The process stops if another step does not
show a further improvement of the model.
When using these algorithms one should make sure that either all or none of the dummy variables for a categorical factor are included in a model, and that if an interaction term is included in a model, all of its main effect terms are included too.
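A minimal sketch of forward selection by AIC, assuming a pandas DataFrame df with a binary response column and a list of candidate predictor columns (the column names here are placeholders). As stressed above, the resulting model should still be reviewed by the researcher.

    import numpy as np
    import statsmodels.api as sm

    def forward_select(df, response, candidates):
        # Greedy forward selection of predictors for a logistic model, by AIC (illustrative sketch)
        y = df[response]
        selected = []
        best_aic = sm.Logit(y, np.ones(len(df))).fit(disp=0).aic   # intercept-only model
        while candidates:
            aics = {}
            for var in candidates:
                X = sm.add_constant(df[selected + [var]])
                aics[var] = sm.Logit(y, X).fit(disp=0).aic
            best_var = min(aics, key=aics.get)
            if aics[best_var] >= best_aic:
                break                                              # no further improvement
            best_aic = aics[best_var]
            selected.append(best_var)
            candidates = [v for v in candidates if v != best_var]
        return selected, best_aic

    # hypothetical usage: forward_select(df, "y", ["x1", "x2", "x3"])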

3.3.4. AIC, Model Selection, and the “Correct” Model


“All Models are wrong, but some are useful.”
Be aware that there is no correct model and every model is a simplification of reality. But models
can explain reality well to different degrees and they can provide insight into relationships between
predictors and response.
Criteria such as AIC (Akaike's Information Criterion) can help in choosing good models. The AIC measures how close the fitted values based on the model are to the true probabilities, but applies a penalty for including many variables in the model.
The best model is the one with the smallest AIC = −2(log likelihood − number of parameters).

3.4. Model Checking


Goodness of Fit Measures for Logistic Regression. Although logistic regression can be used for
description and inference about the effects of predictors on binary responses there is no
guarantee that a particular model fits the data well. Rather we must check model fit.
We next consider ways of checking the model fit for our model and want to assess how well it fits
the data.
The following measures of fit are available, sometimes divided into “global” and “local”
measures:
 Classification tables


 ROC curves
 Logistic regression R2
 Hosmer-Lemeshow tests
 Chi-square goodness of fit tests and deviance

3.4.3. Classification Tables


In order to check the predictive power of a model one can use classification tables.
In order to create classification tables we use the model to estimate the probability of success, π̂(x), for each individual in the sample. The predicted outcome is ŷ = 1 if π̂(x) > π0, usually with π0 = 0.5. The choice of cut-off π0 is rather arbitrary.


If this probability is greater than 0.5, then it is predicted that a success for this individual is more likely than a failure. Likewise, if π̂(x) is smaller than 0.5, then one would predict a failure for this individual.
Therefore each individual is classified as a success if π̂(x) > 0.5 and as a failure if π̂(x) ≤ 0.5.

A table showing the actual measurement and the classification for the data is called a classification
table.
If the model fits well, most individuals from the sample fall into the correctly predicted categories.
To measure the predictive power of the model, find the sensitivity, specificity, and overall proportion of correct classifications.

In summary, letting π0 be an arbitrary cutoff point, sensitivity = P(ŷ = 1 | y = 1) and specificity = P(ŷ = 0 | y = 0).
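A minimal sketch of a classification table, assuming arrays y (observed 0/1 outcomes) and p_hat (estimated success probabilities from the fitted model) are available:

    import numpy as np

    def classification_table(y, p_hat, cutoff=0.5):
        y_pred = (p_hat > cutoff).astype(int)
        tp = np.sum((y == 1) & (y_pred == 1))
        tn = np.sum((y == 0) & (y_pred == 0))
        fp = np.sum((y == 0) & (y_pred == 1))
        fn = np.sum((y == 1) & (y_pred == 0))
        sensitivity = tp / (tp + fn)            # P(predicted success | actual success)
        specificity = tn / (tn + fp)            # P(predicted failure | actual failure)
        accuracy = (tp + tn) / len(y)           # overall proportion correctly classified
        return sensitivity, specificity, accuracy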

3.4.4. Summarizing Predictive Power: Receiver Operating Characteristic (ROC) Curves

Receiver operating characteristic (ROC) curves are useful for assessing the accuracy of predictions.

The curve is created by plotting the true positive rate against the false positive rate at various
threshold settings. (The true-positive rate is also known as sensitivity. A false-positive rate can be
calculated as 1 - specificity). The ROC curve is thus the sensitivity as a function of false-positive
rate.

When π0 gets near 0, almost all predictions are ŷ = 1; then, sensitivity is near 1, specificity is near 0, and the point for (1 − specificity, sensitivity) has coordinates near (1, 1). When π0 gets near 1, almost all predictions are ŷ = 0; then, sensitivity is near 0, specificity is near 1, and the point for (1 − specificity, sensitivity) has coordinates near (0, 0). The ROC curve usually has a concave shape connecting the points (0, 0) and (1, 1).
For a given specificity, better predictive power corresponds to higher sensitivity. So, the better the
predictive power, the higher the ROC curve.


The area under the curve is identical to the concordance index (c). It estimates the probability that the predictions and the outcomes are concordant, i.e. that observations with larger y also have larger π̂. A value of c = 0.5 means that the prediction is no better than a random guess and corresponds to a ROC curve that is a straight line.
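Assuming the same y and p_hat arrays as in the classification-table sketch, scikit-learn (if available) gives the ROC curve and the area under it, which equals the concordance index c:

    from sklearn.metrics import roc_curve, roc_auc_score

    def roc_summary(y, p_hat):
        fpr, tpr, thresholds = roc_curve(y, p_hat)   # (1 - specificity, sensitivity) pairs
        return fpr, tpr, roc_auc_score(y, p_hat)     # the AUC equals the concordance index c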

3.4.5. Correlation
R, the correlation between the observed values yi and the estimated values ŷi, is also a measure of model fit.
For the logistic regression model this is the correlation between the 0/1 response values (1 = success, 0 = failure) and the estimated probabilities of success π̂(x). The closer the value is to 1, the better the fit between data and model. Because the response variable is discrete, the usefulness of this measure is limited.
R2 is again the coefficient of determination, which gives the proportion of variation in the response
variable explained by the model.

3.4.6. Cox & Snell R², Nagelkerke R², and Hosmer–Lemeshow Statistics


Cox & Snell R² and Nagelkerke R² are alternative measures of fit. The Cox & Snell measure can never reach 1, which the Nagelkerke version adjusts for.
In logistic regression, there is no true R2 value as there is in OLS regression. However, because
deviance can be thought of as a measure of how poorly the model fits (i.e., lack of fit between
observed and predicted values), an analogy can be made to sum of squares residual in ordinary
least squares. The proportion of unaccounted for variance that is reduced by adding variables to
the model is the same as the proportion of variance accounted for, or R2.

    R²_logistic = ( −2LL_null − (−2LL_k) ) / ( −2LL_null )    and    R²_OLS = ( SS_total − SS_residual ) / SS_total = SS_regression / SS_total

Where the null model is the logistic model with just the constant and the k model contains all the
predictors in the model.

Cox & Snell Pseudo-R2

The Cox and Snell R-square is computed as follows:

    R² = 1 − ( L_null / L_k )^(2/n),

where L_null and L_k are the maximized likelihoods (not the log-likelihoods) of the intercept-only model and the model with k predictors.

Because this R-squared value cannot reach 1, Nagelkerke modified it. The correction increases
the Cox and Snell version to make 1.0 a possible value for R-squared.

Nagelkerke Pseudo-R²

    R² = [ 1 − ( L_null / L_k )^(2/n) ] / [ 1 − ( L_null )^(2/n) ]
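A small sketch, assuming the maximized log-likelihoods of the null and fitted models and the sample size are available (statsmodels' Logit results expose them as llnull and llf); the numbers in the example call are the log-likelihoods quoted earlier with a hypothetical sample size.

    import numpy as np

    def pseudo_r2(ll_null, ll_model, n):
        cox_snell = 1 - np.exp((2 / n) * (ll_null - ll_model))   # 1 - (L_null/L_k)^(2/n)
        max_cs = 1 - np.exp((2 / n) * ll_null)                   # maximum attainable Cox & Snell value
        return cox_snell, cox_snell / max_cs                     # (Cox & Snell, Nagelkerke)

    print(pseudo_r2(ll_null=-415.6749, ll_model=-351.9207, n=600))   # n = 600 is hypothetical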

3.4.7. Hosmer–Lemeshow Statistic


Hosmer and Lemeshow (1980) proposed grouping cases together according to their predicted values from the logistic regression model. This is a Pearson-like χ² that is computed after the data are grouped by similar predicted probabilities. Specifically, the predicted values are arrayed from lowest to highest and then separated into several groups of approximately equal size; ten groups is the standard recommendation. For each group, we calculate the observed number of events and non-events, as well as the expected number of events and non-events. The expected number of events is just the sum of the predicted probabilities for all the individuals in the group, and the expected number of non-events is the group size minus the expected number of events. Pearson's chi-square is then applied to compare observed counts with expected counts. The degrees of freedom are the number of groups minus 2. As with the classic goodness of fit tests, low p-values suggest rejection of the model.


Steps in Hosmer-Lemeshow

1. State the null and alternative hypotheses


H0 : the current model fits well
HA : the current model does not fit well
2. Calculate the statistic using the following procedure:
 Group the observations according to the model-predicted probabilities (π̂).
 The number of groups is typically chosen so that there are roughly equal numbers of observations per group.
 The Hosmer–Lemeshow (HL) statistic, a Pearson-like chi-square statistic, is computed on the grouped data, but it does NOT have a limiting chi-square distribution because the observations in the groups are not from identical trials. Simulations have shown that the statistic can be approximated by a chi-squared distribution with df = g − 2, where g is the number of groups. A minimal computational sketch is given below.
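The sketch below, again assuming arrays y and p_hat, groups observations into g = 10 groups of roughly equal size by predicted probability and computes the HL statistic:

    import numpy as np
    from scipy.stats import chi2

    def hosmer_lemeshow(y, p_hat, g=10):
        order = np.argsort(p_hat)
        stat = 0.0
        for idx in np.array_split(order, g):          # g groups of roughly equal size
            obs_events = y[idx].sum()
            exp_events = p_hat[idx].sum()             # expected events = sum of predicted probabilities
            obs_non = len(idx) - obs_events
            exp_non = len(idx) - exp_events
            stat += (obs_events - exp_events) ** 2 / exp_events
            stat += (obs_non - exp_non) ** 2 / exp_non
        return stat, chi2.sf(stat, df=g - 2)          # low p-values suggest lack of fit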

3.4.8. The Deviance

The deviance of the model is the likelihood ratio test between the most complex model that could
be fit, known as the saturated model (with n parameters that fits the n observations perfectly) and
the model that is being tested. The saturated model has a separate parameter for each logit and
provides a perfect fit to the data. This statistic is large when the model provides a poor fit to the
data.
Let LM denote the maximized log-likelihood value for a model M of interest. Let LS denote the
maximized log-likelihood value for the most complex model possible that has a separate parameter
for each observation, and it provides a perfect fit to the data.
Because the saturated model has additional parameters, its maximized log likelihood LS is at least
as large as the maximized log likelihood LM for a simpler model M. The deviance of a model is
defined as
Deviance = −2[LM − LS]
The deviance is the likelihood-ratio statistic for comparing model M to the saturated model. It is a
test statistic for the hypothesis that all parameters that are in the saturated model but not in model
M equal zero. Statistical software provides the deviance, so it is not necessary to calculate LM or
LS.

Hence, it can be shown that the likelihood of this saturated model is equal to 1, yielding a log-likelihood equal to 0. Therefore, the deviance for the logistic regression model is simply −2LM. If the p-value is small, then we have evidence that the model does not fit the data.

When the predictors are solely categorical, the data are summarized by counts in a contingency
table. For the ni subjects at setting i of the predictors, multiplying the estimated probabilities of
the two outcomes by ni yields estimated expected frequencies for y = 0 and y = 1. These are the
fitted values for that setting. The deviance statistic then has the G² form introduced in equation (2.7), namely

    G² = 2 Σ observed × log( observed / fitted ),

where the sum is taken over all cells.

3.4.8.2.Model Comparison Using the Deviance


A measure of discrepancy between observed and fitted values is the deviance statistic. Let us now
consider two models, denoted by M0 and M1, such that M0 is a special case of M1. For normal-
response models, the F-test comparison of the models decomposes a sum of squares representing
the variability in the data. This analysis of variance for decomposing variability generalizes to an
analysis of deviance for models. Given that the more complex model holds, the likelihood-ratio
statistic for testing that the simpler model holds is −2[L0 − L1].

Hence, we can compare the models by comparing their deviances: the likelihood-ratio statistic −2[L0 − L1] equals the difference between the deviances of M0 and M1.


This test statistic is large when M0 fits poorly compared with M1. For large samples, the statistic
has an approximate chi-squared distribution, with df equal to the difference between the residual
df values for the separate models. This df value equals the number of additional parameters that
are in M1 but not in M0. Large test statistics and small P-values suggest that model M0 fits more
poorly than M1.
The null hypothesis for testing the fit of the model is that the current (reduced) model is appropriate; the alternative is that the more complex model fits the data better.


The test statistic has a chi-square distribution with degrees of freedom equal to the number of parameters in the more complex model minus the number of parameters in the reduced model.

4.2.3. Likelihood-Ratio Model Comparison Tests


Another way to detect lack of fit uses a likelihood-ratio test to compare the model with more
complex ones. A more complex model might contain a nonlinear effect, such as a quadratic term
to allow the effect of a predictor to change directions as its value increases. Models with multiple
predictors would consider interaction terms. If more complex models do not fit better, this provides
some assurance that a chosen model is adequate.
Note: to use likelihood ratio test, models must be nested in order to be compared. Nested means
that all components of the smaller model must be in the larger model.

Residuals

 Finally, residuals can be studied to determine where the lack of fit lies. In our example, none
of the standardized residuals are that large.

 It is also helpful to plot the observed vs. the fitted proportions. If they are close they should
lie on the 45 degree line. One can also plot both of fitted and observed values against
explanatory variables.

[Figure: fitted versus observed proportions, which should lie near the 45-degree line]


[Figure: observed and fitted probabilities (Pi) plotted against achievement score]

 There is also the possibility that particular observations have too much “influence” in with
respect to

1. Their effect on parameter estimates - if the observation(s) were deleted the values of
parameter estimates may be considerable different.

2. Their effect on model fit - if the observation(s) were deleted there may be a large
improvement to model fit.

3. Their effect on misclassification error - if the observation(s) were deleted there may be a
large difference in the predicted number of “successes”.

 There are several measures that can describe the influence of observations. Each of these
measures are computed for each observation and the larger the value of the measure, the
greater the influence that observation has on model fit.

 These measures are particularly important with quantitative continuous explanatory variables
because looking at residuals (which is one indicator of how influential an observation is) is
particularly daunting.

 All of these measures are related to the leverage an observation has because the greater the
leverage the greater the influence on the model. Mathematically, these measures are related
to the diagonal of the hat-matrix that is used to obtain the predicted logit values for a model
from the sample logits. Large values in the hat matrix represent observations that are
greatly influencing model fit.


 Another measure that describes the influence of individual observations is called Dfbeta.
This assesses the effect that an individual observation has on the parameter estimates. The
larger Dfbeta, the larger the change in the estimated parameter when the observation is
removed. Large values indicate that certain observations are leading to instability in the
parameter estimates.

 Another measure that describes the the influence of individual observations is the confidence
interval displacement measures, c and c-bar. These measure the change in the joint
confidence interval for the parameters produced by deleting an observation.
Finally, Delta Deviance and Delta Chi-Square measure the change in the G2 and X2 goodness
of fit statistics, respectively for individual observations. They are diagnostics for detecting
observations that contribute heavily to the lack of fit to the model.

In using these statistics it is useful to plot them versus some index, such as the observation
number.

Although the predictors act additively on the log-odds scale, they are not additive on the odds or
risk (probability) scales

CHAPTER 5
5. MULTICATEGORY LOGIT MODELS
Learning objectives:
After the completion of this chapter, the students will be able to:


In the case of binary logistic or simply logistic regression model, we have restricted the response
or dependent variable in logit models to be dichotomous or binary. Now we will consider a
response variable, Y, with J levels. The explanatory or independent variables may be quantitative,
qualitative, or both.
These are generalizations of logistic regression that model categorical responses with more than two categories.
There are two main ways in which logistic regression models for response variables with more than two outcomes differ from logistic regression for dichotomous data.

1. How the logits are formed


When J = 2 there is only one logit we can form; however, when J > 2 there are J(J − 1)/2 logits that we can form, but only J − 1 of them are non-redundant. There are different ways to form the non-redundant logits, each of which results in a different way of "dichotomizing" the response variable. The way we choose to form the logits will partly depend on whether Y is ordinal or nominal. Therefore, for a multicategory response (J > 2), it is important to note whether the response is ordinal (consisting of ordered categories) or nominal (consisting of unordered categories). For the binary logistic model this question does not arise. We will now study models for nominal response variables and for ordinal response variables.

2. The sampling distribution


When Y is dichotomous, at each combination of the explanatory variable we assume that data
come from a binomial distribution. When J > 2, at each combination of the explanatory variable
we assume that data come from a multinomial model. The binomial distribution is a special case
of the multinomial distribution.
How to “dichotomize” the response Y?
Some types of models are appropriate only for nominal responses and some for ordinal responses
or for both.
1. Models for nominal (or ordinal) responses
a. "Baseline" logit models or "multinomial" logistic regression
b. "Conditional" or "multinomial" logit models
2. Models for ordinal responses
a. Cumulative logits
b. Adjacent categories


c. Continuation ratios
But in this course we will look only at baseline logit models for nominal response data and cumulative logit models for ordinal response data.

5.1. Baseline category logit model for nominal response variables

This model is basically just an extension of the binary logistic regression model. It gives a
simultaneous representation of the odds of being in one category relative to being in another
category for all pairs of categories.
Let J denote the number of categories for Y. Let {π1, . . . , πJ} denote the response probabilities, satisfying Σ_{j=1}^{J} πj = 1. With n independent observations, the probability distribution for the number

of outcomes of the J types is the multinomial. It specifies the probability for each possible way the
n observations can fall in the J categories. Here, we will not need to calculate such probabilities.
For models of this section, the order of listing the categories is irrelevant, because the model treats
the response scale as nominal (unordered categories).
These models pair each category with a baseline category. When the last category (J) is the baseline, the baseline-category logits are log(πj / πJ), j = 1, . . . , J − 1.

The baseline-category logit model with a predictor x is

    log( πj / πJ ) = αj + βj x,   j = 1, 2, . . . , J − 1.    (5.1)
The model has J − 1 equations, with separate parameters for each. The effects vary according to
the category paired with the baseline. When J = 2, this model simplifies to a single equation for
log(π1/π2) = logit(π1), resulting in ordinary logistic regression for binary responses


With the set of J − 1 non-redundant logits in equation (5.1), we can determine the odds for any pair of categories. For example, for an arbitrary pair of categories a and b,

    log( πa / πb ) = log( πa / πJ ) − log( πb / πJ ) = (αa − αb) + (βa − βb) x.

With the baseline category logit model we choose one of the categories as the “baseline”. This
choice may be arbitrary or there may be a logical choice depending on the data
For convenience, we’ll use the last level (i.e. the Jth level) of the response variable as the
baseline.
The baseline category logit model with one explanatory variable, x, is:

    log( πij / πiJ ) = αj + βj xi   for j = 1, 2, … , J − 1.
For J = 2 this is just the regular binary logistic regression model. The parameters differ depending on which two categories are being compared; the odds for any pair of categories of Y are a function of the parameters of the model.

Example
Suppose we have data that identify respondents' political affiliation as either democrat, republican, or independent, and we want to know whether political affiliation can be predicted by SES (x1), which is a quantitative (i.e. continuous) variable, and gender (x2), which is qualitative.

For this data the response variable is party identification. We could fit a binary logit model to
each pair of party identifications, Y:


    Y ∈ {democrat, republican, independent},   x1 = SES,   x2 = 1 if female, 0 if male.

Using sex as an independent variable we have 3 − 1 = 2 non-redundant logits:

    log( π_democrat / π_independent ) = log( π1 / π3 ) = α1 + β1 x2

    log( π_republican / π_independent ) = log( π2 / π3 ) = α2 + β2 x2

The logit for democrat versus republican is:

    log( π_democrat / π_republican ) = log( π1 / π2 ) = log( (π1/π3) / (π2/π3) )
                                     = (α1 + β1 x2) − (α2 + β2 x2)
                                     = (α1 − α2) + (β1 − β2) x2

The difference β1 − β2 is called a contrast.

Using SES and sex as independent variables we have 3 − 1 = 2 non-redundant logits:

    log( π_republican / π_independent ) = α1 + β11 x1 + β12 x2

    log( π_democrat / π_independent ) = α2 + β21 x1 + β22 x2


We can write one of the logits in terms of the other two:

    log( π_democrat / π_republican ) = α3 + β31 x1 + β32 x2,

which means that in the population:

    α3 = α2 − α1,   β31 = β21 − β11,   β32 = β22 − β12.

CAUTION: You MUST be certain which constraint the computer program you use imposes when estimating the parameters: some programs set the parameters for the baseline category J to 0, while others use a different constraint.

Using the parameters obtained from fitting the model with SES predicting party affiliation, we obtain:

    log( π̂_democrat / π̂_independent ) = log( π̂1 / π̂3 ) = 0.1502 − 0.00013 x

    log( π̂_republican / π̂_independent ) = log( π̂2 / π̂3 ) = −0.9987 + 0.0191 x

    log( π̂_democrat / π̂_republican ) = log( π̂1 / π̂2 ) = (0.1502 − 0.00013x) − (−0.9987 + 0.0191x)
                                       = 1.1489 − 0.01923 x

We can interpret the parameters of the model in terms of odds ratios, given an increase in SES.
For a 10 point increase in SES index we obtain the following odds ratios:

Democrat to Independent = exp(10(-0.00013)) = 0.998

Republican to Independent = exp(10(0.0191)) = 1.21


Democrat to Republican = exp(10(-.01923)) = 0.825

Republican to Democrat = 1/.825 = 1.212
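These odds ratios are just the exponentiated slope times the change in SES; a quick check in Python, using the fitted slopes from the equations above:

import numpy as np

# Fitted slopes (per one-unit change in SES) from the worked example
beta_dem_vs_ind = -0.00013
beta_rep_vs_ind = 0.0191
delta = 10  # a 10-point increase in SES

print(np.exp(delta * beta_dem_vs_ind))                      # ~0.9987  Democrat vs Independent
print(np.exp(delta * beta_rep_vs_ind))                      # ~1.21    Republican vs Independent
print(np.exp(delta * (beta_dem_vs_ind - beta_rep_vs_ind)))  # ~0.825   Democrat vs Republican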

Just as in binary logistic regression, we can also interpret the parameters of the model in
terms of probabilities.

5.2. Estimating Response Probabilities


The multicategory logit model has an alternative expression in terms of the response
probabilities.

The probability of a response being in category j is

π_j = exp(α_j + β_j x) / Σ_{k=1}^{J} exp(α_k + β_k x),   j = 1, 2, ..., J,

where the baseline-category parameters are set to zero (α_J = β_J = 0). This is an identification constraint. Furthermore, the denominator Σ_{k=1}^{J} exp(α_k + β_k x) ensures that the probabilities sum to 1.
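A small numerical sketch of this expression, reusing the fitted values from the example above and fixing the baseline category's parameters at zero:

import numpy as np

# alpha and beta for democrat, republican, and the baseline (independent) category
alphas = np.array([0.1502, -0.9987, 0.0])
betas  = np.array([-0.00013, 0.0191, 0.0])

def response_probs(x):
    # pi_j = exp(alpha_j + beta_j * x) / sum_k exp(alpha_k + beta_k * x)
    num = np.exp(alphas + betas * x)
    return num / num.sum()

p = response_probs(50)
print(p, p.sum())   # three probabilities that sum to 1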

Proof

Now let us consider the simple baseline logit model

log(π_j / π_J) = α_j + β_j x

Exponentiating both sides, we get

π_j / π_J = exp(α_j + β_j x),  which is the odds for category j versus category J.


π_j = π_J exp(α_j + β_j x)

Now, suppose we sum both sides over j = 1, ..., J − 1:

Σ_{j=1}^{J−1} π_j = π_J Σ_{j=1}^{J−1} exp(α_j + β_j x)

Note, though, that

Σ_{j=1}^{J} π_j = Σ_{j=1}^{J−1} π_j + π_J = 1,   i.e.   Σ_{j=1}^{J−1} π_j = 1 − π_J

So

1 − π_J = π_J Σ_{j=1}^{J−1} exp(α_j + β_j x),   which gives   π_J = 1 / (1 + Σ_{j=1}^{J−1} exp(α_j + β_j x))

and, since π_j = π_J exp(α_j + β_j x), substituting in π_J we obtain

π_j = exp(α_j + β_j x) / (1 + Σ_{k=1}^{J−1} exp(α_k + β_k x))

Then, using the estimated parameters, we obtain the estimated probabilities:

π̂_democrat = exp(0.1502 − 0.00013x) / [1 + exp(0.1502 − 0.00013x) + exp(−0.9987 + 0.0191x)]

π̂_republican = exp(−0.9987 + 0.0191x) / [1 + exp(0.1502 − 0.00013x) + exp(−0.9987 + 0.0191x)]

π̂_independent = 1 / [1 + exp(0.1502 − 0.00013x) + exp(−0.9987 + 0.0191x)]


We can use these functions to plot the probabilities versus SES.

[Figure: estimated probabilities of Democrat, Republican, and Independent plotted against SES (0–100).]
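A short sketch of how such a plot can be produced from the fitted equations (matplotlib is assumed to be available; the SES range 0–100 is taken from the figure):

import numpy as np
import matplotlib.pyplot as plt

ses = np.linspace(0, 100, 200)
num_dem = np.exp(0.1502 - 0.00013 * ses)     # odds of democrat vs independent
num_rep = np.exp(-0.9987 + 0.0191 * ses)     # odds of republican vs independent
denom = 1 + num_dem + num_rep                # independent contributes exp(0) = 1

plt.plot(ses, num_dem / denom, label="Democrat")
plt.plot(ses, num_rep / denom, label="Republican")
plt.plot(ses, 1 / denom, label="Independent")
plt.xlabel("SES")
plt.ylabel("Probability")
plt.legend()
plt.show()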

We can easily add more explanatory variables to our model, and these variables can be either categorical or numeric. (In SAS PROC CATMOD, numeric variables are identified with the DIRECT statement.) Furthermore, the model comparison methods that we have used before will work with this model as well.

The baseline category logit model can be used when the categories of the response variable are
ordered, but it may not be the best model for ordinal responses.

5.3. Cumulative Logit Models for Ordinal Responses


Cumulative logit models are used when the response variable is ordinal; they take the ordering of the categories into account. Using this ordering yields a more powerful model than the baseline logit model, as well as a simpler model with simpler interpretations.

For this model the effect of the explanatory variable(s) is the same regardless of how we collapse
Y into dichotomous categories. Therefore, a single parameter describes the effect of x on Y,
versus the J-1 parameters that are needed in the baseline model. However, the intercepts can
differ.


For this model we use cumulative probabilities, which are the probabilities that Y falls in category j or below. In other words, P(Y ≤ j) = π_1 + π_2 + ... + π_j, where j = 1, 2, ..., J.

It follows that P(Y ≤ 1) ≤ P(Y ≤ 2) ≤ · · · ≤ P(Y ≤ J) = 1.


To model an ordinal response variable, one models the cumulative response probabilities or cumulative odds. The cumulative odds for the last category do not have to be modeled, since the cumulative probability for the highest category is always one (no category falls above it). Cumulative probabilities reflect the ordering of the categories and are used to form cumulative logits. Models that use cumulative probabilities therefore do not model the final one, P(Y ≤ J), since it must equal 1.

5.3.1. Cumulative Logit Models with Proportional Odds Property


A model for the jth cumulative logit looks like an ordinary logit model for a dichotomous response
variable in which categories 1 to j combine to form a single category. In other words, the response
variable collapses into two categories, one up to j and one for j + 1 to J.

That is, a model for cumulative logit j is equivalent to a binary logistic regression model for the combined categories 1 to j (I) versus the combined categories j + 1 to J (II).

For one predictor variable x, the proportional odds model, for j = 1, 2, ..., J − 1, is built on cumulative logits of the form:

logit[P(Y ≤ j)] = log[ P(Y ≤ j) / P(Y > j) ] = log[ P(Y ≤ j) / (1 − P(Y ≤ j)) ] = log[ (π_1 + π_2 + ... + π_j) / (π_{j+1} + ... + π_J) ]

and the model states

log[ P(Y ≤ j) / (1 − P(Y ≤ j)) ] = α_j + βx

The cumulative logit measures how likely the response is to be in category j or below versus in a category higher than j.


The slope β is the same for all cumulative logits, so this model has a single slope parameter instead of the J − 1 slopes of the multicategory (baseline) logistic regression model. The parameter β describes the effect of x on the odds of falling into categories 1 through j. Because this effect is assumed to be the same for all cumulative odds, the model is called the proportional odds model.
Cumulative probabilities are given by:

P(Y ≤ j) = exp(α_j + βx) / (1 + exp(α_j + βx))

When β > 0, each cumulative probability P(Y ≤ j) increases with the predictor variable x.

What is the category probability π_j by itself? Consider the case of one explanatory variable x again:


We can compute the probability of being in category j by taking differences between cumulative probabilities. In other words,

P(Y = j) = π_j = P(Y ≤ j) − P(Y ≤ j − 1),   for j = 2, ..., J

= exp(α_j + βx) / (1 + exp(α_j + βx)) − exp(α_{j−1} + βx) / (1 + exp(α_{j−1} + βx))

For j = 1,   π_1 = P(Y = 1) = P(Y ≤ 1) = exp(α_1 + βx) / (1 + exp(α_1 + βx)).


For j = J,   π_J = P(Y = J) = P(Y ≤ J) − P(Y ≤ J − 1) = 1 − P(Y ≤ J − 1) = 1 − exp(α_{J−1} + βx) / (1 + exp(α_{J−1} + βx)).

Therefore, this model is sometimes referred to as a difference model.

To interpret this model in terms of odds ratios for a given split of Y, say Y ≤ j versus Y > j, compare the cumulative odds at two values x_1 and x_2 of the predictor:

[ P(Y ≤ j | X = x_2) / P(Y > j | X = x_2) ] / [ P(Y ≤ j | X = x_1) / P(Y > j | X = x_1) ]

= exp[(α_j + βx_2) − (α_j + βx_1)] = exp[β(x_2 − x_1)]

The log odds ratio is proportional to the difference between x_2 and x_1, and the constant of proportionality β is the same at every j; this is the proportional odds property, i.e. each independent variable has an identical effect at each cumulative split of the ordinal dependent variable.

For large samples with categorical explanatory variables the results are almost the same. In
general, maximum likelihood estimation is preferred with quantitative explanatory variables.

 Given the model   log[ P(Y ≤ j) / (1 − P(Y ≤ j)) ] = α_j + β_1 x_1 + β_2 x_2 + ... + β_k x_k
 In this model, the intercept α_j is the log-odds of falling into or below category j when X_1 = X_2 = · · · = X_k = 0.
 A single parameter β_k describes the effect of X_k on Y: β_k is the increase in the log-odds of falling into or below any category associated with a one-unit increase in X_k, holding all the other X-variables constant; compare this to the baseline logit model, where there are J − 1 parameters for a single explanatory variable. Therefore, a positive slope indicates a tendency for the response level to decrease as the variable increases.
 Constant slope β_k: the effect of X_k is the same for all J − 1 ways to collapse Y into dichotomous outcomes.


 For simplicity, let's consider only one predictor: logit[P(Y≤j)]=αj+βx


 Then the cumulative probabilities are given by: P(Y ≤ j) = exp(α_j + βx) / (1 + exp(α_j + βx)), and since β is constant, the curves of cumulative probabilities plotted against x are parallel (a fitting sketch in Python follows below).
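A minimal fitting sketch in Python, using statsmodels' OrderedModel with simulated placeholder data; note that OrderedModel parameterizes the cumulative logits as threshold_j − βx, so the sign of its slope is reversed relative to the α_j + βx form used in these notes.

import numpy as np
import pandas as pd
from statsmodels.miscmodels.ordinal_model import OrderedModel

# Simulated placeholder data: an ordered response with four levels and one predictor (age)
rng = np.random.default_rng(0)
n = 500
age = rng.uniform(10, 90, n)
latent = 0.04 * age + rng.logistic(size=n)
y = pd.Series(pd.cut(latent, bins=[-np.inf, 1.0, 2.0, 3.0, np.inf],
                     labels=["very much", "like", "mixed", "dislike"]))

# distr="logit" gives the cumulative logit (proportional odds) model:
# a single slope for age plus J - 1 threshold (intercept) parameters.
model = OrderedModel(y, pd.DataFrame({"age": age}), distr="logit")
result = model.fit(method="bfgs", disp=False)
print(result.summary())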
Example

Given preference for watching football games, recorded on five ordered levels ranging from “like it very much” down to strong dislike, and the age of the respondents:

A cumulative logit (proportional odds) model predicting preference for watching football games from age was fitted, with the following estimates:

α̂_1 (Like it very much) = −3.2566,   α̂_2 (Like it) = −1.2391,   α̂_3 (Mixed feelings) = −0.1981,   α̂_4 (Dislike it) = 1.667,   β̂ (age) = 0.0361

Interpreting this in terms of odds ratios: exp(0.0361 × (30 − 50)) = exp(−0.722) = 0.486, so a 30-year-old's cumulative odds of being at or below any given preference category are about half those of a 50-year-old.

Interpreting this in terms of cumulative probabilities:

P(Y ≤ 1) = exp(−3.2566 + 0.0361x) / (1 + exp(−3.2566 + 0.0361x))

P(Y ≤ 2) = exp(−1.2391 + 0.0361x) / (1 + exp(−1.2391 + 0.0361x))

Etc.

Graphing this we get:


[Figure: cumulative probabilities P(Y ≤ 1) through P(Y ≤ 4) plotted against age (10–90).]

Calculating probabilities we get:

P(Y = 1) = exp(−3.2566 + 0.0361x) / (1 + exp(−3.2566 + 0.0361x))

P(Y = 2) = P(Y ≤ 2) − P(Y ≤ 1) = exp(−1.2391 + 0.0361x) / (1 + exp(−1.2391 + 0.0361x)) − exp(−3.2566 + 0.0361x) / (1 + exp(−3.2566 + 0.0361x))

Etc.

Graphing this we get:

[Figure: category probabilities P(Y = 1) through P(Y = 5) plotted against age (10–90).]
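A short sketch that reproduces these curves from the reported estimates (numpy and matplotlib assumed):

import numpy as np
import matplotlib.pyplot as plt

# Estimated intercepts for P(Y <= 1), ..., P(Y <= 4) and the common age slope
alphas = np.array([-3.2566, -1.2391, -0.1981, 1.667])
beta = 0.0361

age = np.linspace(10, 90, 200)

# Cumulative probabilities P(Y <= j), one column per j
cum = 1 / (1 + np.exp(-(alphas[None, :] + beta * age[:, None])))

# Category probabilities P(Y = j) as differences of adjacent cumulative probabilities
full = np.column_stack([np.zeros_like(age), cum, np.ones_like(age)])
cat = np.diff(full, axis=1)   # columns: P(Y = 1), ..., P(Y = 5)

for j in range(cat.shape[1]):
    plt.plot(age, cat[:, j], label=f"P(Y = {j + 1})")
plt.xlabel("age")
plt.ylabel("probability")
plt.legend()
plt.show()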


CHAPTER 6
6. POISSON REGRESSION MODEL
Learning objectives:
After the completion of this chapter, the students will be able to:
 Define the Poisson distribution,
 Fit Poisson regression models for counts and rates,
 Interpret the estimated parameters and assess model fit, including overdispersion.

Before discussing the Poisson regression model, let us revisit the definition of the Poisson distribution.
6.1. The Poisson Distribution
A random variable Y is said to have a Poisson distribution with parameter μ if it takes integer values y = 0, 1, 2, ... with probability

P(Y = y) = e^{−μ} μ^y / y!

Recall that the Poisson distribution is a limiting case of the binomial distribution: it arises when the number of trials becomes large while the expected number of successes remains stable, i.e., when the probability of success is very small.
6.2. Introduction to Poisson regression
Poisson regression deals with situations in which the dependent variable is a count. But we can
also have Y/t, the rate (or incidence) as the response variable, where t is an interval representing
time, space or some other grouping.
Explanatory Variable(s):
 Explanatory variables, X1, X2, … Xk , can be continuous or a combination of continuous and
categorical variables. Convention is to call such a model “Poisson Regression”.
 Explanatory variables, X1, X2, … Xk, can be ALL categorical. Then the counts to be modeled are the counts in a contingency table, and the convention is to call such a model a log-linear model.
 If Y/t is the variable of interest then even with all categorical predictors, the regression model will
be known as Poisson regression, not a log-linear model.
Why do we need special models? What is wrong with OLS?


As in probit and logit models, the dependent variable has restricted support. OLS regression can, and will, predict values that are negative as well as non-integer values, which are nonsensical results for counts.
Even though these kinds of response variables are numeric, they create some problems if we try to
analyze the data within the context of regular linear regression because of the limited range of most
of the values (although a large range of values is still possible) and because only nonnegative
integer values can occur. Thus, count data can potentially result in a highly skewed distribution,
cut off at zero.
Since the mean is equal to the variance, any factor that affects one will also affect the other.
Thus, the usual assumption of homoscedasticity would not be appropriate for Poisson data.
The model: Y is a random variable that has a Poisson distribution, with

log(E[Y]) = log(μ) = β_0 + β_1 x_1 + · · · + β_k x_k,   where Y | x_1, x_2, ..., x_k ~ Poisson(μ)

Since the log of the expected value of Y is a linear function of the explanatory variables, the expected value of Y is a multiplicative function of the x's, i.e. E[Y] = μ = exp(β_0 + β_1 x_1 + ... + β_k x_k).
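A minimal sketch of fitting such a model in Python with statsmodels' GLM and the Poisson family (the data are simulated placeholders):

import numpy as np
import statsmodels.api as sm

# Simulated placeholder data: counts whose log-mean depends on one continuous predictor
rng = np.random.default_rng(0)
n = 200
x = rng.uniform(0, 10, n)
y = rng.poisson(np.exp(0.3 + 0.15 * x))   # true log(mu) = 0.3 + 0.15 x

X = sm.add_constant(x)
result = sm.GLM(y, X, family=sm.families.Poisson()).fit()   # log link is the default
print(result.params)          # estimates of (beta_0, beta_1)
print(np.exp(result.params))  # multiplicative effects on the mean count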

Parameter Estimation

Similar to the case of logistic regression, the maximum likelihood estimates (MLEs) of (β_0, β_1, ..., β_k) are obtained by finding the values that maximize the log-likelihood. In general, there are no closed-form solutions, so the ML estimates are obtained by using iterative algorithms such as Newton–Raphson (NR), iteratively re-weighted least squares (IRWLS), etc.
Interpretation of Parameter Estimates:

To interpret the results of the analysis, we need to exponentiate the estimates of interest, exp(β̂_j), as well as the ends of the confidence intervals, and talk about multiplicative changes in the response variable for each one-unit change in the explanatory variable. That is:

 exp(β_0) is the mean of Y (i.e. μ) when x_1 = x_2 = ... = x_k = 0; it is a baseline value, i.e. the value for an observation with all X's equal to zero.
 exp(β_1) is the multiplicative effect on the mean of Y of every one-unit increase in x_1, keeping the effects of the rest of the predictors constant.


Consider two values of x_1 such that the difference between them equals 1, for example x_1 = 10 and x_1 = 11.

The expected value of Y when x_1 = 10 is
E[Y] = μ = exp(β_0 + β_1·10 + β_2 x_2 + ... + β_k x_k) = exp(β_0) exp(10 β_1) exp(β_2 x_2) ... exp(β_k x_k)

The expected value of Y when x_1 = 11 is
E[Y] = μ = exp(β_0 + β_1·11 + β_2 x_2 + ... + β_k x_k) = exp(β_0) exp(11 β_1) exp(β_2 x_2) ... exp(β_k x_k)

Dividing the second by the first, the ratio of the two expected counts is exp(β_1): the mean of Y is multiplied by exp(β_1) for each one-unit increase in x_1, holding the other predictors fixed.

 If β_i = 0 (for some i = 1, 2, ..., k), then exp(β_i) = 1, the expected count μ = E(Y) does not change with x_i, and Y and x_i are not related.
 If β_i > 0, then exp(β_i) > 1, and the expected count μ = E(Y) is multiplied by exp(β_i) > 1 (it increases) for each one-unit increase in x_i.
 If β_i < 0, then exp(β_i) < 1, and the expected count μ = E(Y) is multiplied by exp(β_i) < 1 (it decreases) for each one-unit increase in x_i. (A quick numerical check follows below.)
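A quick numerical check of this multiplicative interpretation, with hypothetical coefficients:

import numpy as np

# Hypothetical fit: log(mu) = 0.5 + 0.2 * x
beta0, beta1 = 0.5, 0.2

mu_at_10 = np.exp(beta0 + beta1 * 10)
mu_at_11 = np.exp(beta0 + beta1 * 11)

print(mu_at_11 / mu_at_10)   # ratio of expected counts = exp(beta1) ~ 1.221
print(np.exp(beta1))         # the same multiplicative effect, directly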


Example

Sociologists wanted to determine whether sex or education level affected how often people volunteer. They collected data on the number of volunteer activities people were involved with in the previous year, their sex, and their number of years of education (high school and beyond). The numbers of volunteer activities are counts, and a log transformation does not really help to meet the assumptions of linear regression, so we proceed with Poisson regression.

There was little evidence that the relationship between volunteering and education level differed by sex (χ²(1) = 0.75, P = 0.39), so there is no need for an interaction term and the model simplifies to parallel lines.

Males vs. females: females are involved in 1.16 times as many volunteer activities as males (95% CI 1.02–1.32), after accounting for years of education (χ²(1) = 5.42, P = 0.02). Equivalently, females are involved in 16% (95% CI 3–32%) more volunteer activities than males, after accounting for years of education.

Years of education: each additional year of education multiplies the expected number of volunteer activities by 1.14 (95% CI 1.12–1.16), after accounting for sex (χ²(1) = 138.9, P < 0.0001). Equivalently, each additional year of education increases the expected number of volunteer activities by 14% (95% CI 12–16%), after accounting for sex.

Inference

The usual tools of basic statistical inference apply: confidence intervals and hypothesis tests for the parameters, based on

o Wald statistics and asymptotic standard errors (ASE)
o Likelihood ratio tests
o Score tests

 Distribution of probability estimates

Model Fit

 Overall goodness-of-fit statistics of the model are the same as for any GLM:
o Pearson chi-square statistic, X2
o Deviance, G2
o Likelihood ratio test statistic, ΔG2
 Residual analysis: Pearson, deviance, adjusted residuals, etc...
 Overdispersion


o Recall that a Poisson random variable has the same mean and variance, i.e., E(Y) = Var(Y) = μ. Overdispersion means that the observed variance is larger than the assumed variance, i.e., Var(Y) = φμ, where φ is a scale parameter like the one we saw in logistic regression.
o Two typical solutions are:
 Adjust for overdispersion (as in logistic regression): estimate φ = X²/(N − p) and adjust the standard errors and test statistics accordingly.
 Use negative binomial regression instead, where the response Y is assumed to follow a negative binomial distribution with E(Y) = μ and Var(Y) = μ + Dμ². The index D is called a dispersion parameter. Greater heterogeneity in the Poisson means results in a larger value of D; as D approaches 0, Var(Y) approaches μ, and negative binomial and Poisson regression give the same inference. (A fitting sketch follows below.)
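A hedged sketch of checking for overdispersion and refitting with a negative binomial family in statsmodels (simulated placeholder data; note that the GLM NegativeBinomial family keeps its dispersion parameter alpha fixed, whereas the discrete-model sm.NegativeBinomial estimates it by maximum likelihood):

import numpy as np
import statsmodels.api as sm

# Simulated placeholder data with extra-Poisson variation:
# counts drawn from a negative binomial, so Var(Y) exceeds E(Y)
rng = np.random.default_rng(0)
n_obs = 300
x = rng.uniform(0, 5, n_obs)
mu = np.exp(0.5 + 0.3 * x)
y = rng.negative_binomial(n=2, p=2 / (2 + mu))   # mean mu, variance mu + mu^2 / 2

X = sm.add_constant(x)

# Fit a Poisson model and estimate the scale parameter phi = X^2 / (N - p)
pois = sm.GLM(y, X, family=sm.families.Poisson()).fit()
phi = pois.pearson_chi2 / pois.df_resid
print("estimated dispersion:", phi)   # values well above 1 suggest overdispersion

# Alternative: a negative binomial fit (alpha plays the role of the dispersion index D)
nb = sm.GLM(y, X, family=sm.families.NegativeBinomial(alpha=1.0)).fit()
print(nb.params)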

An important additional property of the Poisson distribution is that sums of independent Poisson variates are themselves Poisson variates: if Y_1 and Y_2 are independent, with Y_i having a Poisson(μ_i) distribution, then Y_1 + Y_2 ~ Poisson(μ_1 + μ_2). As we shall see, the key implication of this result is that individual and grouped data can both be analyzed with the Poisson distribution.
The Poisson distribution is a discrete distribution and is appropriate for modeling counts of observations. Counts are observed cases, like the count of measles cases in cities. You can model counts directly if all the data were collected over the same measuring unit (e.g. the same number of days or the same number of square feet). You can use the Poisson distribution for modeling rates (counts per unit of exposure) when the units of collection differ.

Unlike the familiar normal distribution, which is described by two parameters (mean and variance), the Poisson distribution is completely described by just one parameter, lambda (λ), written μ earlier in this chapter. Lambda is the mean of a Poisson distribution as well as its variance, and λ can take on non-integer values. While it is impossible to have 1.5 cases of measles in a city, it is possible for the average number of cases per 1000 person-months to be non-integer (like 3.12 cases/1000 person-months).

Such counts and rates can then be predicted by simply fitting a Poisson regression model.

