Categorical Data Analysis
CHAPTER ONE
1. INTRODUCTION TO CATEGORICAL DATA
Learning objectives:
After the completion of this chapter, the students will be able to:
Define categorical variables,
Distinguish the probability distributions most often used for categorical data,
Explain the likelihood function and maximum likelihood estimation, and use them in conducting
statistical inference for discrete data.
Introduction
Virtually every research project categorizes some of its observations into neat, little distinct bins:
gender of individual (male, female), marital status (broken, not broken), attitude of an individual
towards something (agree, neutral, disagree), diagnosis regarding breast cancer based on a
mammogram categories (normal, benign, probably benign, suspicious, and malignant), race of
patient (black, white) presence of heart disease (yes, no), and so on.
Statisticians have devised a number of ways to analyze and explain categorical data. This course
presents explanations of each of the different methods.
1.1. Categorical Response Data
Let us first define categorical data. A categorical variable has a measurement scale consisting of a
set of categories. Categorical variables are often referred to as qualitative, in which distinct
categories differ in quality, not in quantity. A qualitative explanatory variable is called a factor and
its categories are called the levels for the factor. A quantitative explanatory variable is sometimes
called a covariate.
Categorical scales are pervasive in the social sciences for measuring attitudes and opinions.
Categorical scales also occur frequently in the health sciences, for measuring responses
such as whether a patient survives an operation (yes, no), severity of an injury (none, mild,
moderate, severe), and stage of a disease (initial, advanced).
They frequently occur in the behavioral sciences (e.g., the categories “schizophrenia,”
“depression,” and “neurosis” for diagnosis of type of mental illness).
We can classify categorical variables as nominal and ordinal variables.
In this section we review the three key distributions for categorical responses: the
binomial, multinomial, and Poisson.
A binomial random variable Y, the number of successes in n independent, identical trials with success probability π, is denoted by bin(n, π). The probability mass function for the possible outcomes y of Y is

P(Y = y) = (n choose y) π^y (1 − π)^(n−y),   y = 0, 1, 2, ..., n,

where the binomial coefficient

(n choose y) = n!/(y!(n − y)!)

appears in the statement of the familiar binomial theorem from algebra for expanding (x + y)^n.
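The pmf can be evaluated directly; the following is a minimal sketch using only the Python standard library (the helper name binom_pmf is ours, not from the text):

```python
from math import comb

def binom_pmf(y: int, n: int, pi: float) -> float:
    """P(Y = y) for Y ~ bin(n, pi)."""
    return comb(n, y) * pi**y * (1 - pi)**(n - y)

# Sanity check: the probabilities over y = 0, ..., n sum to one.
total = sum(binom_pmf(y, 10, 0.3) for y in range(11))
```

Summing the pmf over all outcomes is a quick way to confirm the coefficients are right.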
Since the probabilities must sum to one,

Σ_{y=0}^{n} (n choose y) π^y (1 − π)^(n−y) = 1.

1. The mean of the binomial distribution is E(Y) = nπ:

E(Y) = Σ_{y=0}^{n} y P(Y = y)
     = Σ_{y=0}^{n} y (n choose y) π^y (1 − π)^(n−y)
     = Σ_{y=0}^{n} y [n!/(y!(n − y)!)] π^y (1 − π)^(n−y)
     = Σ_{y=1}^{n} [n(n − 1)!/((y − 1)!(n − y)!)] π^y (1 − π)^(n−y)
     = nπ Σ_{y=1}^{n} ((n − 1) choose (y − 1)) π^(y−1) (1 − π)^(n−y)
     = nπ,

since the final sum is the total probability of a bin(n − 1, π) distribution and so equals 1.
2. The variance of the binomial distribution is V(Y) = E(Y²) − [E(Y)]² = nπ(1 − π).

3. The measure of skewness of the binomial distribution is

(1 − 2π)/√(nπ(1 − π)).

If π = 1/2, the distribution is symmetric; if π < 1/2, the distribution is positively skewed; and if π > 1/2, the distribution is negatively skewed. Hence, the binomial distribution is always symmetric when π = 0.50. For fixed n, it becomes more skewed as π moves toward 0 or 1. For fixed π, it becomes more bell-shaped as n increases.

Under certain conditions the binomial distribution approaches the Poisson and normal distributions.
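These moment formulas can be verified numerically against the pmf (a sketch; binom_moments is our own helper name):

```python
from math import comb, sqrt

def binom_pmf(y, n, pi):
    return comb(n, y) * pi**y * (1 - pi)**(n - y)

def binom_moments(n, pi):
    """Mean, variance and skewness computed directly from the pmf."""
    mean = sum(y * binom_pmf(y, n, pi) for y in range(n + 1))
    var = sum((y - mean) ** 2 * binom_pmf(y, n, pi) for y in range(n + 1))
    third = sum((y - mean) ** 3 * binom_pmf(y, n, pi) for y in range(n + 1))
    return mean, var, third / var**1.5

# For n = 10, pi = 0.3 these agree with n*pi, n*pi*(1 - pi)
# and (1 - 2*pi) / sqrt(n*pi*(1 - pi)).
mean, var, skew = binom_moments(10, 0.3)
```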
For example, π = 0.10 (or π = 0.90) requires n ≥ 50. As π gets nearer to 0 or 1, larger samples are needed before a symmetric, bell shape occurs.
Some trials have more than two possible outcomes: for example, political party affiliation (Conservative, Democratic, Liberal), or cereal shelf placement in a grocery store (bottom, middle, or top). When the trials are independent with the same category probabilities for each trial, the distribution of the counts in the various categories is the multinomial.
Let c denote the number of outcome categories. We denote the category probabilities by {π1, π2, ..., πc}, where Σ_{i=1}^{c} πi = 1. For n independent observations, the multinomial count nj in category j has mean nπj and standard deviation √(nπj(1 − πj)). Most methods for categorical data assume the binomial distribution for a count in a single category and the multinomial distribution for a set of counts in several categories.
Example: Suppose that we have a fair die and we roll it twenty times. Let Yi = the number of times the die lands with the face having i dots up, i = 1, ..., 6. It is not hard to see that individually each Yi has the Binomial(20, 1/6) distribution, for every fixed i = 1, ..., 6. It is clear, however, that Σ_{i=1}^{6} Yi must be exactly twenty in this example, and hence we conclude that Y1, ..., Y6 are not all free-standing random variables. Suppose that one has observed Y1 = 2, Y2 = 4, Y3 = 4, Y5 = 2 and Y6 = 3; then Y4 must be 5. But when we think of Y4 alone, its marginal distribution is Binomial(20, 1/6), and so by itself it can take any of the possible values 0, 1, 2, ..., 20. On the other hand, once we know that Y1 = 2, Y2 = 4, Y3 = 4, Y5 = 2 and Y6 = 3, then Y4 has to be 5. That is to say, there is certainly some kind of dependence among the random variables Y1, ..., Y6. What is the exact nature of this joint distribution?
Example 2: Rolling a die. Taking the die example explained at the beginning, we can say that the number of results for each face follows the multinomial distribution with parameters ((1/6, 1/6, 1/6, 1/6, 1/6, 1/6), n).
Suppose face 6 comes up 1 time, face 5 comes up 2 times, face 4 comes up 4 times, face 3 comes up 0 times, face 2 comes up 2 times and face 1 comes up 3 times, so that n = 12.
Solution:
Suppose you roll a fair die 12 times (12 trials). Assume (n1, n2, n3, n4, n5, n6) is a multinomial random vector with parameters p1 = p2 = ... = p6 = 1/6 and n = 12. Then

P(3, 2, 0, 4, 2, 1) = [12!/(3! 2! 0! 4! 2! 1!)] (1/6)^3 (1/6)^2 (1/6)^0 (1/6)^4 (1/6)^2 (1/6)^1 = 831,600 (1/6)^12 ≈ 0.000382.
Example 3:
Suppose you roll a fair die 6 times (6 trials). Assume (n1, ..., n6) is a multinomial random vector with parameters p1 = p2 = ... = p6 = 1/6 and n = 6.
What is the probability that each face is seen exactly once? This is simply

f(1, 1, 1, 1, 1, 1 | 6; 1/6, 1/6, 1/6, 1/6, 1/6, 1/6) = [6!/(1!)^6] (1/6)^6 = 720/46,656 = 5/324.

What is the probability that exactly four 1's and two 2's occur? Then

f(4, 2, 0, 0, 0, 0 | 6; 1/6, 1/6, 1/6, 1/6, 1/6, 1/6) = [6!/(4! 2! 0!)] (1/6)^4 (1/6)^2 = 15/46,656 = 5/15,552,

hardly a high probability.
What is the probability of getting exactly two 3's, two 4's and two 5's? Try this to get familiar with the notation and use of the probability function. You can see why such a tool might be useful if you were a gambler who wanted to know something quantitative about “the odds” of various outcomes. Your answer should be 5/2592.
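All three die probabilities can be checked with a small multinomial pmf helper (a sketch using only the standard library; multinomial_pmf is our own name):

```python
from math import factorial, prod

def multinomial_pmf(counts, probs):
    """Probability of the observed category counts under the multinomial."""
    coef = factorial(sum(counts))
    for c in counts:
        coef //= factorial(c)
    return coef * prod(p**c for p, c in zip(probs, counts))

probs = [1 / 6] * 6
each_face_once = multinomial_pmf([1, 1, 1, 1, 1, 1], probs)   # 5/324
four_1s_two_2s = multinomial_pmf([4, 2, 0, 0, 0, 0], probs)   # 5/15552
two_3s_4s_5s = multinomial_pmf([0, 0, 2, 2, 2, 0], probs)     # 5/2592
```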
The probability that a Poisson random variable Y takes the value y is

P(Y = y) = (μ^y e^(−μ))/y!,   y = 0, 1, 2, ...

where P(Y = y) is the probability of y occurrences in an interval, μ is the expected (mean) number of occurrences, and y is the number of occurrences in the interval.
The mean and the variance of a Poisson random variable are both equal to μ, the expected number of occurrences.
Example: Assume that billing clerks rarely make errors in data entry on billing statements. Many statements have no mistakes; some have one; very few have two mistakes; rarely will a statement have three mistakes; and so on. A random sample of 1,000 statements revealed 300 errors, so the mean number of errors per statement is μ = 300/1000 = 0.3. What is the probability of no mistakes appearing in a statement?

Solution: P(Y = 0) = (0.3)^0 e^(−0.3)/0! = e^(−0.3) = 0.7408.
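The Poisson calculation can be sketched the same way (poisson_pmf is our own helper name):

```python
from math import exp, factorial

def poisson_pmf(y: int, mu: float) -> float:
    """P(Y = y) for a Poisson random variable with mean mu."""
    return mu**y * exp(-mu) / factorial(y)

# Billing example: mu = 0.3 errors per statement.
p_no_mistakes = poisson_pmf(0, 0.3)  # about 0.7408
```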
For the binomial and multinomial distributions, the population proportions are unknown; using sample data, we estimate the parameters. When we take a sample to estimate a population proportion, we follow the same process as when taking a sample to estimate a population mean. We use the maximum likelihood method for estimating the binomial parameter.
Consider the distribution of the sample proportion of heart disease presence in a hospital. Assume that the population proportion is p = 0.75, so that the standard deviation of the number of Y's in a sample of 32 is √(32 × 0.75 × 0.25) = √6. Suppose you randomly select the following sample of 32 responses:

Y Y N Y Y Y Y N Y Y Y Y Y Y N Y Y N Y Y Y N Y Y N Y Y N Y N Y Y

Compute the sample proportion, p, for the number of Y's in this sample. How far does it lie from the population proportion? What is the probability of selecting another sample with a proportion greater than the one you selected?
For example, when y = 0 successes are observed in n = 10 trials, the likelihood function is ℓ(π) = (1 − π)^10. It is defined for π between 0 and 1. From this likelihood function, we see that the observed result y = 0 is most probable when π = 0.
Definition: The maximum likelihood estimate of a parameter is the parameter value for which the probability of the observed data takes its greatest value; that is, the parameter value at which the likelihood function takes its maximum. Let β denote a parameter and β̂ its ML estimate. The likelihood function is ℓ(β) and the log-likelihood function is L(β) = log(ℓ(β)). For many models, L(β) has a concave shape and β̂ is the point at which its derivative equals zero. The ML estimate is then the solution of the likelihood equation, ∂L(β)/∂β = 0.
Figure 1.1 shows that the likelihood function ℓ(π) = (1 − π)^10 has its maximum at π = 0.0. Thus,
when n = 10 trials have y = 0 successes, the maximum likelihood estimate of π equals 0.0. This
means that the result y = 0 in n = 10 trials is more likely to occur when π = 0.00 than when π equals
any other value. In general, for the binomial outcome of y successes in n trials, the maximum
likelihood estimate of π equals p = y/n. This is the sample proportion of successes for the n trials.
If we observe y = 6 successes in n = 10 trials, then the maximum likelihood estimate of π equals p
= 6/10 = 0.60.
The following are plots of the likelihood function when n = 10 with y=0,y=1,…y=10 number of
successes
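The claim that the likelihood is maximized at p = y/n can be illustrated with a simple grid search over π (a sketch; both function names are ours):

```python
from math import comb

def binom_likelihood(pi: float, y: int, n: int) -> float:
    """Likelihood of pi given y successes in n trials."""
    return comb(n, y) * pi**y * (1 - pi)**(n - y)

def ml_estimate(y: int, n: int, step: float = 1e-4) -> float:
    """Grid-search maximizer of the likelihood over [0, 1]."""
    grid = [i * step for i in range(int(1 / step) + 1)]
    return max(grid, key=lambda pi: binom_likelihood(pi, y, n))

# y = 6 successes in n = 10 trials gives a maximizer near 0.6 = y/n;
# y = 0 gives 0.0, matching the discussion of Figure 1.1.
```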
N.B. In this frequentist parametric inference framework, we call the “best” estimates the maximum
likelihood estimates of the parameters because they are the parameter values that make the
observed data the most likely to have happened.
Likelihood:
For any of the known probability distributions, the probability of observing data yi, given a parameter value π, is P(yi | π); viewed as a function of π for fixed data, this is the likelihood function.
The symbol π̂ (“pi-hat”) is used to represent the sample proportion. Before we observe the
data, the value of the ML estimate is unknown. The estimate is then a variate having some sampling
distribution. We refer to this variate as an estimator and its value for observed data as an estimate.
Estimators based on the method of maximum likelihood are popular because they have good large-
sample behavior. Most importantly, it is not possible to find other estimators that are more precise,
in terms of having smaller large-sample standard errors. Also, large-sample distributions of ML
estimators are usually approximately normal. The estimators reported in this text use this method.
Normal Approximation to the Binomial Distribution
Suppose that Y1, ..., Yn are iid Bernoulli(π), 0 < π < 1, n ≥ 1.
We know that U_n = Σ_{i=1}^{n} Y_i is distributed as Binomial(n, π), n ≥ 1. Apply the CLT to show that

(U_n − nπ)/√(nπ(1 − π)) → N(0, 1) as n → ∞.

In other words, for practical problems, the Binomial(n, π) distribution can be approximated by the N(nπ, nπ(1 − π)) distribution, for large n and fixed 0 < π < 1.
If both nπ and n(1 − π) are greater than 5, the z distribution is appropriate.
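The quality of the approximation can be checked by comparing the exact binomial CDF with its normal counterpart (a sketch; the continuity correction of 0.5 and all function names are our own choices):

```python
from math import comb, erf, sqrt

def binom_cdf(k: int, n: int, pi: float) -> float:
    """Exact P(Y <= k) for Y ~ bin(n, pi)."""
    return sum(comb(n, y) * pi**y * (1 - pi)**(n - y) for y in range(k + 1))

def normal_cdf(x: float) -> float:
    """Standard normal CDF via the error function."""
    return 0.5 * (1 + erf(x / sqrt(2)))

def normal_approx_cdf(k: int, n: int, pi: float) -> float:
    """N(n*pi, n*pi*(1 - pi)) approximation with a continuity correction."""
    mu, sd = n * pi, sqrt(n * pi * (1 - pi))
    return normal_cdf((k + 0.5 - mu) / sd)

# With n = 100 and pi = 0.5 (n*pi and n*(1 - pi) both exceed 5),
# the exact and approximate CDFs agree closely.
```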
A 100(1 − α)% (Wald) confidence interval for π is

p ± z_{α/2} · se(p),  with se(p) = √(p(1 − p)/n) ....................(1.3)

where z_{α/2} denotes the standard normal percentile having right-tail probability α/2.
Example:
We can be 95% confident that the population proportion of Americans in 2002 who favored
legalized abortion for married pregnant women who do not want more children is between 0.415
and 0.481.
Formula (1.3) is simple. Unless π is close to 0.50, however, it does not work well unless n is very
large. Consider its actual coverage probability, that is, the probability that the method produces an
interval that captures the true parameter value. This may be quite a bit less than the nominal value
(such as 95%). It is especially poor when π is near 0 or 1.
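Formula (1.3) is easy to code; the following is a sketch with hypothetical counts (y = 448 successes in n = 1000 is our illustrative choice, not data from the text):

```python
from math import sqrt

def wald_ci(y: int, n: int, z: float = 1.96) -> tuple:
    """Wald confidence interval (1.3) for a binomial proportion."""
    p = y / n
    se = sqrt(p * (1 - p) / n)
    return p - z * se, p + z * se

lo, hi = wald_ci(448, 1000)  # hypothetical counts, for illustration only
```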
We can also use the confidence interval to conduct two-sided tests: a (1 − α)100% confidence interval consists of all values π0 of the null hypothesis parameter that are judged plausible in a two-sided α-level significance test.
1.4 More on Statistical Inference for Discrete Data: Wald, Likelihood-Ratio and Score
Inference
Let SE denote the standard error of π̂, evaluated by substituting the ML estimate for the unknown parameter in the expression for the true standard error. (For example, for the binomial parameter π, se(π̂) = √(π̂(1 − π̂)/n).) When H0: π = π0 is true, the test statistic
z = (π̂ − π0)/se(π̂)
has approximately a standard normal distribution. Equivalently, z2 has approximately a chi-
squared distribution with df = 1. This type of statistic, which uses the standard error evaluated at
the ML estimate, is called a Wald statistic. The z or chi-squared test using this test statistic is called
a Wald test.
An alternative test uses the likelihood function through the ratio of two maximizations of it: (1)
the maximum over the possible parameter values that assume the null hypothesis, (2) the maximum
over the larger set of possible parameter values, permitting the null or the alternative hypothesis
to be true. Let ℒ0 denote the maximized value of the likelihood function under the null hypothesis,
and let ℒ1 denote the maximized value more generally. For instance, when there is a single
parameter β, ℒ0 is the likelihood function calculated at β0, and ℒ1 is the likelihood function
calculated at the ML estimate ˆ . Then ℒ1 is always at least as large as ℒ0, because ℒ1 refers to
maximizing over a larger set of possible parameter values.
The likelihood-ratio test statistic equals

−2 log(ℓ0/ℓ1).

If the maximized likelihood is much larger when the parameters are not forced to satisfy H0, then the ratio ℓ0/ℓ1 is far below 1. The test statistic −2 log(ℓ0/ℓ1) must be nonnegative, and relatively small values of ℓ0/ℓ1 yield large values of −2 log(ℓ0/ℓ1) and strong evidence against H0. The reason
for taking the log transform and doubling is that it yields an approximate chi-squared sampling
distribution. UnderH0: β = β0, the likelihood ratio test statistic has a large-sample chi-squared
distribution with df = 1.
The third possible test is called the score test. It finds standard errors under the assumption that
the null hypothesis holds.
For example, the z test for a binomial parameter that uses the null standard error √(π0(1 − π0)/n) is a score test.
Example: Consider testing H0: π = 0.50 against Ha: π ≠ 0.50 when y = 9 successes are observed in n = 10 trials, so p = 0.90.

Score test: using the null standard error,

se0 = √(π0(1 − π0)/n) = √(0.5(1 − 0.5)/10) = 0.158,

the z statistic is z = (0.90 − 0.50)/0.158 = 2.53.
The corresponding chi-squared statistic is (2.53)² = 6.4 (df = 1). The P-value = 0.011.
The likelihood-ratio test statistic equals −2 log(ℓ0/ℓ1) = −2 log(0.00977/0.3874) = 7.36.
From the chi-squared distribution with df=1, this statistic has P-value=0.007.
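The statistics in this example can be reproduced numerically (a sketch; binom_lik is our own helper name):

```python
from math import comb, log, sqrt

def binom_lik(pi: float, y: int, n: int) -> float:
    """Binomial likelihood of pi given y successes in n trials."""
    return comb(n, y) * pi**y * (1 - pi)**(n - y)

y, n, pi0 = 9, 10, 0.5
p = y / n

# z statistic computed with the null standard error.
z = (p - pi0) / sqrt(pi0 * (1 - pi0) / n)

# Likelihood-ratio statistic: -2 log(l0 / l1).
lr = -2 * log(binom_lik(pi0, y, n) / binom_lik(p, y, n))
```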
When the sample size is small to moderate, the Wald test is the least reliable of the three tests. We
should not trust it for such a small n as in this example (n = 10). Likelihood-ratio inference and
score-test based inference are better in terms of actual error probabilities coming close to matching
nominal levels. A marked divergence in the values of the three statistics indicates that the
distribution of the ML estimator may be far from normality. In that case, small-sample methods
are more appropriate than large-sample methods.
For small sample sizes, it is safer to use the binomial distribution directly (rather than a normal
approximation) to calculate P-values.
To illustrate, consider testing H0: π = 0.50 against Ha: π > 0.50 for the example of a clinical trial
to evaluate a new treatment, when the number of successes y = 9 in n = 10 trials. The exact P-
value, based on the right tail of the null binomial distribution with π = 0.50, is
P(Y ≥ 9) = P(Y = 9) + P(Y = 10) = [10!/(9! 1!)](0.50)^9 (0.50)^1 + [10!/(10! 0!)](0.50)^10 (0.50)^0 = 0.011
For the two sided alternative Ha: π≠0.50,
the P-value is P(Y ≥ 9 or Y ≤ 1) = 2 × P(Y ≥ 9) = 0.021
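This exact calculation is easy to reproduce (a sketch; binom_pmf is our own helper name):

```python
from math import comb

def binom_pmf(y: int, n: int, pi: float) -> float:
    return comb(n, y) * pi**y * (1 - pi)**(n - y)

n, pi0 = 10, 0.5

# One-sided exact P-value: P(Y >= 9) under H0.
p_one_sided = sum(binom_pmf(y, n, pi0) for y in range(9, n + 1))

# Two-sided P-value, doubling the one-sided tail.
p_two_sided = 2 * p_one_sided
```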
Unfortunately, with discrete probability distributions, small-sample inference using the ordinary
P-value is conservative. This means that when H0 is true, the P-value is ≤0.05 (thus leading to
rejection of H0 at the 0.05 significance level) not exactly 5% of the time, but typically less than 5%
of the time. Because of the discreteness, it is usually not possible for a P-value to achieve the
desired significance level exactly.
Then, the actual probability of type I error is less than 0.05.
The mid P-value, which adds only half the probability of the observed result to the probability of the more extreme results, is on average less conservative than tests using the ordinary P-value.
Exercise
1. Let us assume that Y is a student taking a statistics course. Unfortunately, Y is not a good
student. Y does not read the textbook before class, does not do homework, and regularly
misses class. Y intends to rely on luck to pass the next quiz. The quiz consists of 10 multiple
choice questions. Each question has five possible answers, only one of which is correct. Y
plans to guess the answer to each question.
a. What is the probability that Y gets no answers correct?
b. What is the probability that Y gets two answers correct?
CHAPTER TWO
2 CONTINGENCY TABLES
Learning objectives:
After the completion of this chapter, the students will be able to:
2.1 Introduction
There are many situations in quantitative analysis where you will be interested in the possibility of an association between two categorical variables. In this case, you will often want to
represent your data as a contingency table. Contingency tables show frequencies produced by
cross-classifying observations.
In presenting data using contingency tables, it is usual to put the independent variable as the row variable and the dependent variable as the column variable. Suppose there are two categorical variables, denoted by X and Y.
Let I denote the number of categories of X and J the number of categories of Y. A rectangular table
having I rows for the categories of X and J columns for the categories of Y has cells that display
the IJ possible combinations of outcomes.
A table of this form that displays counts of outcomes in the cells is called a contingency table. A table that cross-classifies two variables is called a two-way contingency table; one that cross-classifies three variables is called a three-way contingency table, and so forth. A two-way table with I rows and J columns is called an I × J table.
Data layout for an I × J contingency table, where the cells contain counts (frequencies):

                     Y
X        1      2      ...    J      Total
1        n11    n12    ...    n1J    n1+
.        .      .             .      .
.        .      .             .      .
I        nI1    nI2    ...    nIJ    nI+
Total    n+1    n+2    ...    n+J    n
2.2 Probability Structure for Contingency Tables: Joint, Marginal, and Conditional
Probabilities
For a single categorical variable, we can summarize the data by counting the number of
observations in each category. The sample proportions in the categories estimate the category
probabilities.
Probabilities for contingency tables can be of three types – joint, marginal, or conditional.
Suppose first that a randomly chosen subject from the population of interest is classified on X and
Y. Let πij = P(X = i, Y = j) denote the probability that (X, Y) falls in the cell in row i and column j. The probabilities {πij} form the joint distribution of X and Y; they satisfy Σ_{i=1}^{I} Σ_{j=1}^{J} πij = 1.
The marginal distributions are the row and column totals of the joint probabilities. We denote these
by {𝜋𝑖+ } for the row variable and {𝜋+𝑗 } for the column variable, where the subscript “+” denotes
the sum over the index it replaces.
For a 2 × 2 contingency table, π1+ = π11 + π12 and π+1 = π11 + π21.
We use similar notation for samples, with roman P in place of Greek π. For example, {Pij} are
cell proportions in a sample joint distribution. We denote the cell counts by {𝑛𝑖𝑗 }. The marginal
frequencies are the row totals {𝑛𝑖+ } and the column totals {𝑛+𝑗 }, and n=∑𝑖,𝑗 𝑛𝑖𝑗 denotes the total
sample size. The sample cell proportions relate to the cell counts by Pij = nij/n.
Conditional Distribution is the distribution of one variable at given levels of the other.
When one variable is a response and the other is an explanatory variable, we focus on the
distribution of the response variable conditional on the explanatory variable.
A conditional distribution of Y given X is denoted by P(Y | X) and refers to the probability distribution of Y when we restrict attention to a fixed level of X.
2.2.1 Independence
Situation: One response variable and the other is an explanatory variable.
Two variables are said to be statistically independent if the population conditional distributions of
Y are identical at each level of X. When two variables are independent, the probability of any
particular column outcome j is the same in each row. That is, the conditional probabilities of the response given each level of the explanatory variable are equal, and they equal the marginal probabilities of the response (obtained by summing over the levels of the explanatory variable).
When both variables are response variables, we can describe their relationship using their joint
distribution, or the conditional distribution of Y given X, or the conditional distribution of X given
Y. Statistical independence is, equivalently, the property that all joint probabilities equal the
product of their marginal probabilities,
𝜋𝑖𝑗 = 𝜋𝑖+ 𝜋+𝑗 for i = 1, . . . , I and j = 1, . . . , J
The following table refers to a study that investigated the relationship between smoking and lung
cancer.
                    Lung cancer cases
Smoking status      Yes     No
Smoker              180     72
Non-smoker          90      346

In the table above, lung cancer is the response variable and smoking status is the explanatory variable. We therefore study the conditional distributions of lung cancer, given smoking status. For smokers, the proportion of “yes” responses was 180/252 = 0.714 and the proportion of “no” responses was 72/252 = 0.286. The proportions (0.714, 0.286) form the sample conditional distribution of lung cancer. For non-smokers, the sample conditional distribution is (0.206, 0.794).
Had the response and explanatory variables been independent, the conditional probabilities of the response given each level of the explanatory variable would have been equal. But the conditional probability of lung cancer is not the same at each level of smoking status, indicating that there is an association between smoking status and lung cancer.
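Conditional distributions are just row proportions; the following sketch checks the example (conditional_dist is our own helper, and the smoker row uses counts 180 and 72, consistent with the quoted proportions 0.714 and 0.286):

```python
def conditional_dist(row):
    """Sample conditional distribution of the response within one row."""
    total = sum(row)
    return [count / total for count in row]

smokers = conditional_dist([180, 72])      # about (0.714, 0.286)
non_smokers = conditional_dist([90, 346])  # about (0.206, 0.794)
```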
If the response variable has more than two categories, then we have independent multinomial sampling.
For a binary response, we compare two groups through the difference between their population proportions, π1 − π2.
2.3.1.1 Confidence Interval for Differences of Proportion
For simplicity, we denote the sample sizes for the two groups (that is, the row totals n1+ and n2+) by n1 and n2. When the counts in the two rows are independent binomial samples, the estimated standard error of P1 − P2 is

SE(P1 − P2) = √(p1(1 − p1)/n1 + p2(1 − p2)/n2) ....................(2.1)

A large-sample 100(1 − α)% (Wald) confidence interval for π1 − π2, appropriate when every cell count is at least about 5, is

(p1 − p2) ± z_{α/2} SE(p1 − p2).

For small samples the actual coverage probability is closer to the nominal confidence level if you add 1.0 to every cell of the 2 × 2 table before applying this formula.
For a significance test of H0: π1 = π2, a z test statistic divides (p1 − p2) by a pooled standard error that applies under H0; that is,

z = (p1 − p2)/SE0,  where SE0 = √(p̂(1 − p̂)(1/n1 + 1/n2))

and p̂ is the overall proportion of successes in the combined sample.
Example 2.1
Aspirin and Heart Attacks
The following table is a report on the relationship between aspirin use and myocardial infarction
(heart attacks) which was a five-year randomized study testing whether regular intake of aspirin
reduces mortality from cardiovascular disease. Every other day, the male physicians participating
in the study took either one aspirin tablet or a placebo. The study was “blind” – the physicians in
the study did not know which type of pill they were taking.
Table: Cross Classification of Aspirin Use and Myocardial Infarction
We treat the two rows in Table 1 as independent binomial samples. Of the 𝑛1 = 11,034 physicians
taking placebo, 189 suffered myocardial infarction (MI) during the study, a proportion of 𝑃1 =
189/11,034 = .0171. Of the 𝑛2 = 11,037 physicians taking aspirin, 104 suffered MI, a proportion
of 𝑃2 = 0.0094. The sample difference of proportions is 0.0171 − 0.0094 = 0.0077.
From equation (2.1), this difference has an estimated standard error of SE = √((0.0171)(0.9829)/11,034 + (0.0094)(0.9906)/11,037) = 0.0015.
A 95% confidence interval for the true difference 𝜋1 -𝜋2 is 0.0077±1.96(0.0015) which is (0.005,
0.011). Since this interval contains only positive values, we conclude that 𝜋1 -𝜋2 >0. That is, 𝜋1 >𝜋2 .
For males, taking Aspirin appears to result in diminished risk of heart attack.
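The interval can be reproduced from the counts (a sketch; diff_ci is our own helper name):

```python
from math import sqrt

def diff_ci(y1: int, n1: int, y2: int, n2: int, z: float = 1.96):
    """Wald CI for pi1 - pi2 from two independent binomial samples."""
    p1, p2 = y1 / n1, y2 / n2
    se = sqrt(p1 * (1 - p1) / n1 + p2 * (1 - p2) / n2)
    diff = p1 - p2
    return diff, (diff - z * se, diff + z * se)

# Placebo: 189 MI cases out of 11,034; aspirin: 104 out of 11,037.
diff, (lo, hi) = diff_ci(189, 11034, 104, 11037)  # about (0.005, 0.011)
```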
Example 2: The following table cross-classifies cold status versus treatment type (vitamin C and placebo).
An estimate of the standard error of (P1 − P2) is SE = √(p1(1 − p1)/n1 + p2(1 − p2)/n2).
The confidence interval is (p1 − p2) ± z_{α/2} SE.
Then a 95% CI for the difference of proportions is −0.10 ± 1.96(0.045) = (−0.19, −0.01). Since this interval contains only negative values, we conclude that π1 − π2 < 0, that is, π1 < π2. Taking vitamin C therefore appears to result in a diminished risk of developing a cold.
The relative risk is the ratio of the two success probabilities, π1/π2. The formulas for computing standard errors, etc., of the sampling distribution of the sample ratio p1/p2 are complex.
A relative risk of 1.00 occurs when π1 = π2, that is, when the response is independent of the group. Two groups with sample proportions P1 and P2 have a sample relative risk of P1/P2.
For the aspirin example above, the sample relative risk is P1/P2 = 0.0171/0.0094 = 1.82.
The sample proportion of MI cases was 82% higher for the group taking placebo. The sample
difference of proportions of 0.008 makes it seem as if the two groups differ by a trivial amount.
By contrast, the relative risk shows that the difference may have important public health
implications. Using the difference of proportions alone to compare two groups can be misleading
when the proportions are both close to zero.
As the sampling distribution of the sample relative risk is highly skewed unless the sample sizes are quite large, we work on the logarithmic scale: log(RR̂) = log(p1/p2) has an improved, bell-shaped sampling distribution. The standard error of log RR̂, as an estimate of log RR, is

SE(log RR̂) = √((1 − p1)/(n1 p1) + (1 − p2)/(n2 p2)).
A large-sample confidence interval for the log of the relative risk is

log(RR̂) ± z_{α/2} SE(log(RR̂)) = (L, U).

A confidence interval for RR can be obtained by exponentiating the endpoints of the above CI:

CI(RR) = (e^L, e^U).
Example: For the cross-classification of aspirin use and myocardial infarction, the 95% confidence interval for the true relative risk is (1.43, 2.30). We can be 95% confident that, after 5 years, the proportion of MI cases for male physicians taking placebo is between 1.43 and 2.30 times the proportion of MI cases for male physicians taking aspirin. This indicates that the risk of MI is at least 43% higher for the placebo group.
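The relative-risk interval can be reproduced on the log scale (a sketch; rr_ci is our own helper name):

```python
from math import exp, log, sqrt

def rr_ci(y1: int, n1: int, y2: int, n2: int, z: float = 1.96):
    """CI for the relative risk, built on the log scale and exponentiated."""
    p1, p2 = y1 / n1, y2 / n2
    rr = p1 / p2
    se = sqrt((1 - p1) / (n1 * p1) + (1 - p2) / (n2 * p2))
    log_lo, log_hi = log(rr) - z * se, log(rr) + z * se
    return rr, (exp(log_lo), exp(log_hi))

# Aspirin study: placebo 189/11,034, aspirin 104/11,037.
rr, (lo, hi) = rr_ci(189, 11034, 104, 11037)
```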
The odds ratio is also known as the cross-product ratio, because it equals the ratio of the products π11 π22 and π12 π21 of cell probabilities from diagonally opposite cells:

θ = (π11 π22)/(π12 π21), estimated by θ̂ = (n11 n22)/(n12 n21).
If the conditional distributions of the column variable on the rows are the same, then the two
variables are statistically independent.
That is, π1 = π2 and hence 1 − π1 = 1 − π2, so that

odds1 = π1/(1 − π1) = π2/(1 − π2) = odds2,

and the odds ratio θ = odds1/odds2 = 1.
E.g., suppose θ = 3 ⟹ the odds of success in row 1 are three times the odds of success in row 2.
When θ = 0.3, individuals in row 2 are more likely to have a success than those in row 1: the odds of success in row 1 are 0.3 times the odds in row 2, or equivalently the odds of success in row 2 are (1/0.3) = 3.33 times the odds in row 1.
Example: Revisit the table that shows the cross-classification of aspirin use and myocardial infarction once more.
Considering the two rows as independent binomial samples, for the physicians taking placebo the estimated odds of MI equal n11/n12 = 189/10,845 = 0.0174.
Since 0.0174=1.74/100, the value 0.0174 means there were 1.74 “MI cases” for every 100 “Non-
MI cases”. The estimated odds equal 104/10,933 = 0.0095 for those taking aspirin, or 0.95 “MI
cases” for every 100 “Non-MI cases”.
Or 𝜃̂=(189*10933)/(104*10845)=1.83
Interpretation:
The estimated odds of MI for male physicians taking placebo equal 1.83 times the
estimated odds for male physicians taking aspirin.
Or individuals who use placebo were 1.83 times more likely to develop MI as compared
to those who take Aspirin.
Or the estimated odds of MI were 83% higher for placebo group.
The amended estimator, which adds 0.5 to each cell count, is preferred when any cell counts are very small. The SE formula then replaces {nij} by {nij + 0.5}.
4. When the order of the rows is reversed or the order of the columns is reversed, the new
value of θ is the inverse of the original value. This ordering is usually arbitrary, so whether
we get 4.0 or 0.25 for the odds ratio is merely a matter of how we label the rows and
columns.
5. The odds ratio does not change value when the table orientation reverses so that the rows
become the columns and the columns become the rows.
6. The same value occurs when we treat the columns as the response variable and the rows as
the explanatory variable, or the rows as the response variable and the columns as the
explanatory variable. Thus, it is unnecessary to identify one classification as a response
variable in order to estimate θ. By contrast, the relative risk requires this, and its value also
depends on whether it is applied to the first or to the second outcome category.
7. Odds ratios do not depend on the marginal distributions of either variable. Odds ratios only
depend on cell probabilities (proportions or counts) and not on marginal values.
8. Sampling distribution of 𝜃̂ is skewed to the right. Normal approximation is good only if
‘n’ is very large. However, the log odds ratio (log𝜃̂) is symmetric about zero.
The sample log odds ratio log ˆ has a less skewed sampling distribution that is bell-shaped. Its
SE = √(1/n11 + 1/n12 + 1/n21 + 1/n22).
The SE decreases as the cell counts increase.
Because the sampling distribution is closer to normality for log𝜃̂ than 𝜃̂, it is better to construct
confidence intervals for logθ. Transform back (that is, take antilogs, using the exponential
function, discussed below) to form a confidence interval for θ.
A large sample confidence interval for logθ is log𝜃̂ ± 𝑍𝛼⁄2 (SE)
Exponentiating end points of this confidence interval yields one for θ.
Example
Reconsidering the table on aspirin vs MI, the natural logarithm of θ̂ is log(1.832) = 0.605. The SE of log θ̂ equals √(1/189 + 1/10,845 + 1/104 + 1/10,933) = 0.123.
For the population, a 95% confidence interval for log θ equals 0.605 ± 1.96(0.123) = (0.365, 0.846). The corresponding confidence interval for θ is [exp(0.365), exp(0.846)] = (1.44, 2.33).
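The odds-ratio interval follows the same pattern (a sketch; or_ci is our own helper name):

```python
from math import exp, log, sqrt

def or_ci(n11: int, n12: int, n21: int, n22: int, z: float = 1.96):
    """CI for the odds ratio, built on the log scale and exponentiated."""
    theta = (n11 * n22) / (n12 * n21)
    se = sqrt(1 / n11 + 1 / n12 + 1 / n21 + 1 / n22)
    log_lo, log_hi = log(theta) - z * se, log(theta) + z * se
    return theta, (exp(log_lo), exp(log_hi))

# Aspirin vs MI: cells 189, 10845 (placebo row) and 104, 10933 (aspirin row).
theta, (lo, hi) = or_ci(189, 10845, 104, 10933)  # about 1.83, (1.44, 2.33)
```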
Independence
Situation: Two response variables (either Poisson sampling or multinomial sampling)
Null hypothesis: the two variables are statistically independent. Alternative hypothesis: the two variables are dependent.
By the definition of statistical independence,
H0: πij = πi+ π+j for all i = 1, ..., I and j = 1, ..., J.
Example: Consider data from the General Social Survey that cross-classify gender and political party identification. Subjects indicated whether they identified more strongly with the Democratic or Republican party or as Independents. The table also contains estimated expected frequencies for H0: independence. For instance, the first cell has

μ̂11 = (n1+ n+1)/n = (1557 × 1246)/2757 = 703.7.
would be rather unusual if the variables were truly independent. Both test statistics suggest that
political party identification and gender are associated
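The expected-frequency and Pearson X² computations can be sketched as follows. The 2×2 counts below are hypothetical, not the GSS data; the formula 𝜇̂ij = ni+ n+j / n is the one used in the worked cell above.

```python
# Pearson chi-square test of independence for a two-way table
table = [[10, 20],
         [30, 40]]          # hypothetical counts

n = sum(sum(row) for row in table)
row_tot = [sum(row) for row in table]
col_tot = [sum(table[i][j] for i in range(len(table)))
           for j in range(len(table[0]))]

# Expected frequencies under H0: mu_hat_ij = n_i+ * n_+j / n
expected = [[row_tot[i] * col_tot[j] / n for j in range(len(table[0]))]
            for i in range(len(table))]

x2 = sum((table[i][j] - expected[i][j]) ** 2 / expected[i][j]
         for i in range(len(table)) for j in range(len(table[0])))
```

For this hypothetical table the expected count in the first cell is 30 × 40 / 100 = 12, and X² ≈ 0.79 on (I − 1)(J − 1) = 1 degree of freedom.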
Consider the following example where we had two items, both with ordinal response options:
Item 1: A working mother can establish just as warm and secure a relationship with her children
as a mother who does not work.
Item 2: Working women should have paid maternity leave.
Choice of Scores
The choice of scores often does not make much difference with respect to the value of r and thus
test results.
In the above example, an alternative scoring that changed the relative spacing between the scores increases r from .203 (equal spacing) to .207 (one possible choice of unequal spacing).
The “best” scores for the above example table, those that lead to the largest possible correlation, yield r = .210 (scores from correspondence analysis).
To illustrate, take the classic example where the data were collected by Fisher himself, shown in the table below. The goal of Fisher's experiment was to determine if someone could tell whether milk or tea was poured first into a cup. To test this claim, Fisher presented eight randomized cups of tea: four cups had milk poured first and four cups had tea poured first. The data are known as the Lady Tasting Tea.
We need the hypergeometric probability of the observed table and of any table more extreme. The only table more extreme than the observed table has n11 = 4.

Observed table (n11 = 3):
Poured first   Guess: Milk   Guess: Tea   Total
Milk           3             1            4
Tea            1             3            4
Total          4             4            8

More extreme table (n11 = 4):
Poured first   Guess: Milk   Guess: Tea   Total
Milk           4             0            4
Tea            0             4            4
Total          4             4            8

With all margins fixed, the probability of a table with count n11 is hypergeometric:

p(n11) = C(n1+, n11) C(n2+, n+1 − n11) / C(n, n+1)

p(3) = C(4,3)C(4,1)/C(8,4) = 16/70 = 0.229 and p(4) = C(4,4)C(4,0)/C(8,4) = 1/70 = 0.014
Therefore, the p-value for this test is 0.229 + 0.014 = 0.243. This result does not establish any
association between the guess on what was poured first and what actually was poured first. It is
difficult to determine an association with a sample this small.
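The hypergeometric probabilities for this test can be computed directly. A minimal sketch of the one-sided Fisher exact test for the Lady Tasting Tea data:

```python
from math import comb

# With all margins fixed at 4, n11 follows a hypergeometric distribution.
def p_table(n11, row1=4, row2=4, col1=4):
    n = row1 + row2
    return comb(row1, n11) * comb(row2, col1 - n11) / comb(n, col1)

# Observed n11 = 3; the only more extreme table has n11 = 4.
p_value = p_table(3) + p_table(4)
```

The p-value is 16/70 + 1/70 = 0.243, matching the calculation in the text.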
Exercise:
Students were assigned randomly to one of two groups: a control group, in which professors wore ordinary shoes, and a treatment group, in which professors wore Nikes, to see if students purchased Nikes. The data were summarized as follows.
Answer
Similarly, a finding that smoking is protective against disease may be due to the fact that most of the smokers are male and most non-smokers are female, and further that most of the persons with the disease are female.
Including control variables in an analysis requires a multivariate rather than a bivariate analysis. We illustrate basic concepts for a single control variable Z, which is categorical. A three-way contingency table displays counts for the three variables.
Examples of 3–Way Tables
Smoking × Cancer × Age
Smoking × Cancer × Gender
Group × Response × Z (hypothetical), and so on
There are three ways to slice this table up: K frontal planes (XY tables for each level of Z), J vertical planes (XZ tables for each level of Y), and I horizontal planes (YZ tables for each level of X).
The XY tables for each level of Z, the frontal planes of the box, are called partial tables.
Conditional odds ratios are odds ratios between two variables for fixed levels of the third variable. For a fixed level k of Z, the conditional XY odds ratio at the kth level of Z is

θXY(k) = (𝜇11k 𝜇22k) / (𝜇12k 𝜇21k)

Conditional odds ratios are computed using the partial tables, and are sometimes referred to as measures of “partial association”.
Marginal odds ratios are the odds ratios between two variables in the marginal table. For example, for the XY margin it is given by

θXY = (𝜇11+ 𝜇22+) / (𝜇12+ 𝜇21+)
where 𝜇ij+ = Σk 𝜇ijk.
Conditional dependence means that θXY(k) ≠ 1 for at least one k = 1, . . . , K.
Marginal independence does not imply conditional independence and conditional independence
does not imply marginal independence.
For both the goodness-of-fit test and the test of independence, the chi-square statistic is the same.
We use these to describe the conditional associations between defendant’s race and the death
penalty verdict, controlling for victims’ race. When the victims were white, the death penalty was
imposed 22.9% − 11.3% = 11.6% more often for black defendants than for white defendants. When
the victim was black, the death penalty was imposed 2.8% − 0.0% = 2.8% more often for black
defendants than for white defendants. Thus, controlling for victims’ race by keeping it fixed, the
percentage of “yes” death penalty verdicts was higher for black defendants than for white
defendants.
The bottom portion of the above table displays the marginal table for defendant’s race and the
death penalty verdict. We obtain it by summing the cell counts in table over the two levels of
victims’ race, thus combining the two partial tables (e.g., 11 + 4 = 15). We see that, overall, 11.0%
of white defendants and 7.9% of black defendants received the death penalty. Ignoring victims’
race, the percentage of “yes” death penalty verdicts was lower for black defendants than for white
defendants. The association reverses direction compared with the partial tables.
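The reversal can be verified numerically. A sketch using counts inferred from the percentages and the marginal sum (11 + 4 = 15) quoted above; treat these counts as an assumption, since the full table is not reproduced here.

```python
# Death penalty verdicts, keyed by (victim race, defendant race).
yes = {("white", "white"): 53, ("white", "black"): 11,
       ("black", "white"): 0,  ("black", "black"): 4}
no  = {("white", "white"): 414, ("white", "black"): 37,
       ("black", "white"): 16,  ("black", "black"): 139}

def pct(victim, defendant):
    """Percentage of 'yes' verdicts in one partial table."""
    y, n = yes[(victim, defendant)], no[(victim, defendant)]
    return 100 * y / (y + n)

def marginal_pct(defendant):
    """Percentage of 'yes' verdicts, summing over victim race."""
    y = sum(yes[(v, defendant)] for v in ("white", "black"))
    n = sum(no[(v, defendant)] for v in ("white", "black"))
    return 100 * y / (y + n)
```

Conditionally, black defendants fare worse at each level of victim race (22.9% vs 11.3%, and 2.8% vs 0.0%), yet marginally they fare better (7.9% vs 11.0%) — the direction reverses, as the text describes.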
independence of X and Y, given Z, does not imply marginal independence of X and Y. That is, when
odds ratios between X and Y equal 1 at each level of Z, the marginal odds ratio may differ from 1.
There is “no interaction between any 2 variables in their effects on the third variable”.
When these three conditions (equations) do not hold, the conditional odds ratios for a pair of variables are not all equal: they differ depending on the level of the third variable. If one of the conditions holds, then the other two also hold.
For example,
To test for homogeneous association we only need to test one of these, e.g. H0: θXY(1) = θXY(2) = · · · = θXY(K).
Given estimated expected frequencies assuming that H0 is true, the test statistic we use is the “Breslow–Day” statistic, which is like Pearson’s X²,

where 𝜇̂ijk = (ni+k n+jk) / n++k

If H0 is true, then the Breslow–Day statistic has an approximate chi-squared distribution with df = K − 1.
If the null hypothesis of homogeneous association is true, then 𝜃̂MH is a good estimate of the common odds ratio:

𝜃̂MH = [ Σₖ₌₁ᴷ (n11k n22k / n++k) ] / [ Σₖ₌₁ᴷ (n12k n21k / n++k) ]

When computing estimated expected frequencies, we want them such that the odds ratio computed on each of the K partial tables equals the Mantel–Haenszel estimate of the common odds ratio, that is, (𝜇̂11k 𝜇̂22k)/(𝜇̂12k 𝜇̂21k) = 𝜃̂MH for each k.
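The Mantel–Haenszel estimate is straightforward to compute. A minimal sketch with two hypothetical strata:

```python
# Each stratum is a 2x2 table [[n11, n12], [n21, n22]] (hypothetical data).
strata = [
    [[10, 5], [5, 10]],
    [[8, 4], [4, 8]],
]

def stratum_n(t):
    return t[0][0] + t[0][1] + t[1][0] + t[1][1]

# theta_MH = sum_k(n11k*n22k/n++k) / sum_k(n12k*n21k/n++k)
num = sum(t[0][0] * t[1][1] / stratum_n(t) for t in strata)
den = sum(t[0][1] * t[1][0] / stratum_n(t) for t in strata)
theta_mh = num / den
```

Here each stratum happens to have odds ratio 4, so the common-odds-ratio estimate is exactly 4 as well.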
Grade       Faculty of Arts (%)   Social Studies 201 (count)
50s         15.4                  7
60s         24.7                  10
70s         30.8                  15
80s         17.8                  8
90s         3.0                   1
Total                             43
Mean        68.2                  68.8
Standard deviation: 12.6
Use the data to conduct two chi square goodness of fit tests
1. Test whether the model of a normal distribution of grades adequately explains the grade
distribution of Social Studies 201.
2. Test whether the grade distribution for Social Studies 201 differs from the grade
distribution for the Faculty of Arts as a whole. For each test, use the 0.20 level of
significance.
Solution. For the first test, it is necessary to determine the grade distribution that would exist if
the grades had been distributed exactly as the normal distribution. That is, the normal curve with
mean and standard deviation the same as the actual Social Studies 201 distribution will be used.
This means that the grade distribution for the normal curve with mean 𝜇 = 68.8 and standard deviation σ = 12.7 is used to determine the grade distribution if the grades were normally distributed. The Z values were determined as −1.48, −0.69, 0.09, 0.88 and 1.67 for the X values 50, 60, 70, 80 and 90, respectively. By using the normal table, the proportion of cases expected in each grade interval was obtained; these are given as the proportions of Table 10.9. These proportions were then multiplied by 43, the total of the observed values, to obtain the expected number of grades in each of the categories into which the grades have been grouped. These are given in the last column of the table. In order to apply the X² test properly, each of the expected values should exceed 5.
The less-than-50 category and the 90s category both have fewer than 5 expected cases. In this test, the 90s have only 2 expected cases, so this category is merged with the grades in the 80s. For the grades less than 50, even though there are only 3 expected cases, these have been left in a category of their own. While this may bias the results a little, the effect should not be too great. The calculation of the X² statistic is given in the following table.
Category   Grade          Observed ni   Proportion pi   𝜇i = 43·pi   ni − 𝜇i   (ni − 𝜇i)²/𝜇i
1          Less than 50   2             0.0694          3.0          −1.0      0.333
2          50s            7             0.1757          7.6          −0.6      0.047
3          60s            10            0.2908          12.5         −2.5      0.500
4          70s            15            0.2747          11.8         3.2       0.868
5          80s and 90s    9             0.1894          8.1          0.9       0.100
Total                     43            1.0000          43.0                   1.848
X² = Σᵢ₌₁ᵏ (ni − 𝜇i)²/𝜇i = 0.333 + 0.047 + 0.500 + 0.868 + 0.100 = 1.848
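The goodness-of-fit calculation above can be sketched directly from the proportions and observed counts:

```python
# Chi-square goodness-of-fit for the grade distribution (80s and 90s merged).
observed = [2, 7, 10, 15, 9]                       # <50, 50s, 60s, 70s, 80s+90s
props = [0.0694, 0.1757, 0.2908, 0.2747, 0.1894]   # normal-curve proportions

expected = [43 * p for p in props]
chi2 = sum((o - e) ** 2 / e for o, e in zip(observed, expected))
# chi2 comes out near 1.82 here; the text's 1.848 arises from rounding the
# expected counts to one decimal place before computing.
```

The proportions sum to 1, so the expected counts sum to the observed total of 43, as the method requires.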
Suppose that a variable has a frequency distribution with k categories into which the data has been
grouped. The frequencies of occurrence of the variable, for each category of the variable, are called
the observed values.
The manner in which the chi square goodness of fit test works is to determine how many cases
there would be in each category if the sample data were distributed exactly according to the claim.
These are termed the expected number of cases for each category. The total of the expected number
of cases is always made equal to the total of the observed number of cases.
The null hypothesis is that the observed number of cases in each category is exactly equal to the
expected number of cases in each category. The alternative hypothesis is that the observed and
expected numbers of cases differ sufficiently to reject the null hypothesis.
Let ni be the observed number of cases in category i and 𝜇i the expected number of cases in category i, for each of the k categories i = 1, 2, 3, …, k into which the data have been grouped. The hypotheses are
H0 : ni = 𝜇i for every category versus H1 : ni ≠ 𝜇i for at least one category,
and the test statistic is
X² = Σᵢ₌₁ᵏ (ni − 𝜇i)² / 𝜇i
There are k – 1 degrees of freedom. Large values of this statistic lead the researcher to reject the
null hypothesis; smaller values mean that the null hypothesis cannot be rejected.
CHAPTER 3
3. LOGISTIC REGRESSION
Learning objectives:
After the completion of this chapter, the students will be able to:
Why not just use linear regression? Because in the case of a binary response variable, the
assumptions of linear regression are not valid:
In practice, π(x) often either increases continuously or decreases continuously as x increases. The S-shaped curves displayed in Figure 3.1 are often realistic shapes for the relationship. The most important mathematical function with this shape has formula

π(x) = e^(α+βx) / (1 + e^(α+βx))
For a binary response variable Y, recall that π(x) denotes the “success” probability at value x. This probability is the parameter for the binomial distribution.

π(x) = e^(α+βx) / (1 + e^(α+βx))    ….(3.1)

This is called the logistic regression function. The corresponding logistic regression model form is given in equation 3.2. The logistic regression model has linear form for the logit of this probability:

logit π(x) = log[ π(x) / (1 − π(x)) ] = α + βx    ….(3.2)

In logistic regression, a logistic transformation of the odds (referred to as the logit) serves as the dependent variable. One can rearrange equation 3.2 to obtain the expression

π(x) / (1 − π(x)) = e^(α+βx)    ….(3.3)

where e is a constant equal to 2.718…
The line drawn tangent to the curve at the value of x for which π(x) = 0.50 has slope β(0.50)(0.50) = 0.25β; by contrast, when π(x) = 0.90 or 0.10, it has slope β(0.9)(0.1) = 0.09β. The slope approaches 0 as the probability approaches 1.0 or 0.
3. To predict the probabilities that individuals fall into one of two categories as a function of some explanatory variables. For example, one might want to predict whether or not an examinee correctly answers a test item as a function of race and gender.
4. To classify individuals into one of two categories on the basis of the explanatory
variables.
Example
Dependent or response variable: Y = 1 if the individual has coronary heart disease and Y = 0 if the individual has no heart disease.
At any given value of x, the rate of change of the probability corresponds to the slope of the curve at that value of x, βπ(x)(1 − π(x)). For example, at x = 48.18 we have π̂(48.18) = 1 − π̂(48.18) = 0.5, and the slope there is 0.11(0.5)(0.5) = 0.0275.
The median effective level for our example is 48.18 and represents the point at which having
coronary heart disease is equally likely as not having coronary heart disease. We can use our
model to predict other probabilities as well.
Another way to interpret our model, which may be a bit easier (at least in terms of computation, and perhaps also conceptually), is to utilize the relationship between the model and the odds ratio. The model logit(π(x)) = log[π(x)/(1 − π(x))] = α + βx can be written as a multiplicative model

π(x) / (1 − π(x)) = e^α (e^β)^x

This form directly models the odds. Odds increase multiplicatively with x: a 1-unit increase in x multiplies the odds by e^β. If β = 0, then e^0 = 1, so the odds do not change.
The log of the odds changes linearly with x; however, the log of the odds is not an intuitively easy or natural scale to interpret.
Here β̂ = .111, so a 1-unit increase in x corresponds to an odds ratio of e^.111 ≈ 1.12: the odds of developing heart disease at age x + 1 are about 1.12 times the odds at age x.
Although the change in π(x) is not constant for equal changes in x, the odds ratio interpretation gives a constant multiplicative rate of change.
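These interpretations can be checked numerically. A minimal sketch with β̂ = 0.111; the intercept is an assumption backed out from the median effective level (π(48.18) = 0.5 implies α = −0.111 × 48.18):

```python
import math

beta = 0.111
alpha = -beta * 48.18    # assumed: implied by pi(48.18) = 0.5

def pi(x):
    """Fitted probability of heart disease at x."""
    z = alpha + beta * x
    return math.exp(z) / (1 + math.exp(z))

def odds(x):
    return pi(x) / (1 - pi(x))

# A 1-unit increase in x multiplies the odds by e^beta, at any x.
odds_ratio = odds(50) / odds(49)
```

The ratio odds(x + 1)/odds(x) equals e^0.111 ≈ 1.12 regardless of x, whereas the change in π(x) itself depends on where on the curve you are.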
Because logistic regression predicts probabilities, rather than just classes, we can fit it using the likelihood. For each training data point, we have a vector of features, xi, and an observed class, yi. The probability of that class was either π(xi), if yi = 1, or 1 − π(xi), if yi = 0. The likelihood is then

L(α, β) = ∏ᵢ₌₁ⁿ π(xi)^yi (1 − π(xi))^(1−yi)

(We could substitute in the actual equation for π, but things will be clearer in a moment if we don't.) The log-likelihood turns products into sums:

ℓ(α, β) = Σᵢ₌₁ⁿ [ yi log π(xi) + (1 − yi) log(1 − π(xi)) ]
        = Σᵢ₌₁ⁿ log(1 − π(xi)) + Σᵢ₌₁ⁿ yi log[ π(xi) / (1 − π(xi)) ]
        = Σᵢ₌₁ⁿ log(1 − π(xi)) + Σᵢ₌₁ⁿ yi (α + βxi)
        = −Σᵢ₌₁ⁿ log(1 + e^(α+βxi)) + Σᵢ₌₁ⁿ yi (α + βxi)
Typically, to find the maximum likelihood estimates we would differentiate the log-likelihood with respect to the parameters, set the derivatives equal to zero, and solve. To start, take the derivative with respect to one component of β, say βj. We are not going to be able to set this to zero and solve exactly: it is a transcendental equation, and there is no closed-form solution. We can, however, solve it approximately by numerical methods.
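The numerical maximization can be sketched with simple gradient ascent on the log-likelihood, using the score equations Σ(yi − πi) and Σ(yi − πi)xi. The six data points are hypothetical.

```python
import math

xs = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]   # hypothetical predictor values
ys = [0, 0, 1, 0, 1, 1]               # hypothetical binary responses

def pi(a, b, x):
    return 1.0 / (1.0 + math.exp(-(a + b * x)))

def loglik(a, b):
    return sum(y * math.log(pi(a, b, x)) + (1 - y) * math.log(1 - pi(a, b, x))
               for x, y in zip(xs, ys))

a = b = 0.0
for _ in range(20000):
    # Gradient of the log-likelihood: dl/da = sum(y - pi), dl/db = sum((y - pi) x)
    ga = sum(y - pi(a, b, x) for x, y in zip(xs, ys))
    gb = sum((y - pi(a, b, x)) * x for x, y in zip(xs, ys))
    a += 0.01 * ga
    b += 0.01 * gb
```

Because the log-likelihood is concave, small gradient steps climb to the (unique) maximum; real software uses faster Newton-type iterations, but the idea is the same.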
A (1 − α)100% Wald confidence interval for β is

β̂ ± Z𝛼⁄2 (SE)

where SE is the standard error of β̂.
To more readily interpret this confidence interval we can exponentiate the endpoints to determine the effect on the odds of a 1-unit increase in x.
To get a confidence interval for the effect of age on the odds, that is, for e^β, simply take the exponential of the endpoints of the interval for β: (e^0.064, e^0.158) = (1.066, 1.17).
We can also get an interval for the linear approximation to the curve, βπ(x)(1 − π(x)), by multiplying the endpoints of the interval for β by π(x)(1 − π(x)). For example, suppose we want to determine the increase in the probability of developing heart disease at π(x) = .25 for a 1-unit change in x. Then we can multiply the endpoints of the interval for β by (.25)(.75) = .1875 to bound the rate of increase in the probability for values of x near where π(x) = .25. It should be noted that due to the large range in achievement test scores, a 1-unit increase in score is not very noticeable, nor is it likely given the scale.
(equivalently, that the odds ratio e^β = 1). Thus, for simple logistic regression both the likelihood ratio test for the full model and the Wald test for the significance of the predictor test the same hypothesis.
The Likelihood Ratio and Wald test of the significance of a single predictor are said to be
“asymptotically” equivalent, which means that their significance values will converge with larger
N. With small samples, however, they are not likely to be equal and may sometimes lead to
different statistical conclusions (i.e., significance). The likelihood ratio test for a single predictor
is usually recommended by logistic texts as the most powerful (although some authors have stated
that neither the Wald nor the LR test is superior).
z = (β̂ − 0) / SE ∼ N(0, 1)

For our example,

z = (.111 − 0) / .024 = 4.625
We can also use Wald statistics, which are simply squared z-statistics with 1 d.f.:

X²(1) = (β̂ / SE)²
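The Wald z and chi-square statistics for H0: β = 0 follow directly from the estimates quoted above (β̂ = .111, SE = .024):

```python
import math

beta_hat, se = 0.111, 0.024

z = (beta_hat - 0) / se
wald_chi2 = z ** 2                              # squared z-statistic, 1 d.f.
# Two-sided p-value from the standard normal distribution
p_value = math.erfc(abs(z) / math.sqrt(2))
```

Here z = 4.625 and X² ≈ 21.39, with a two-sided p-value far below .001, so the predictor is highly significant by the Wald criterion.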
Although the Wald test is adequate for large samples, the likelihood-ratio test is a more powerful alternative to the Wald statistic. The test statistic is −2(L0 − L1), where L0 is the maximized log-likelihood for the more parsimonious (less complex) model, logit[π(x)] = α, and L1 is the maximized log-likelihood for the more complex model, logit[π(x)] = α + βx. Once again, to conduct this test you must fit both models to obtain both log-likelihoods and calculate the test statistic by hand. Suppose that we fit both of these models to our data and we obtain:
We can also look at the confidence intervals for the true probabilities, under the model.
This difference between two logits equals the difference of log odds. Equivalently that difference
equals the log of the odds ratio between X and Y , at that category of Z. Thus, exp(β1) equals the
conditional odds ratio between X and Y . Controlling for Z, the odds of “success” at x = 1 equal
exp(β1) times the odds of success at x = 0. This conditional odds ratio is the same at each category
of Z. The lack of an interaction term implies a common value of the odds ratio for the partial tables
at the two categories of Z. The model satisfies homogeneous association
Conditional independence exists between X and Y, controlling for Z, if β1 = 0. In that case the common odds ratio equals 1, and we obtain a simpler model.
The parameter βi refers to the effect of xi on the log odds that Y = 1, controlling for the other xs. For example, exp(βi) is the multiplicative effect on the odds of a 1-unit increase in xi, at fixed levels of the other xs. It measures the association between Y and xi adjusted for the other predictors in the model.
Summary
CHAPTER 4
4. BUILDING AND APPLYING LOGISTIC REGRESSION MODELS
Learning objectives:
After the completion of this chapter, the students will be able to:
Having learned the basics of logistic regression, we now study issues relating to building a model
with multiple predictors and checking its fit. After choosing a preliminary model, model checking
explores possible lack of fit.
Guideline: The possible number of predictors to be included with the final model is limited by the
sample size. For each predictor in the final model the sample should include at least 10 records of
each outcome of the response.
For example, consider a large sample with n = 500, including 250 successes and 250 failures, then
the final model can include up to 250/10=25 predictors.
But if the sample includes only 50 successes and 450 failures, the final model should not include
more than 50/10=5 predictors.
Another problem caused by the inclusion of many predictors is multicollinearity. This means that some predictors could be linearly dependent (or close to linearly dependent), and therefore the estimates become imprecise.
For example, if we have two predictors x1, x2, where x1 = a + bx2 (e.g. x1 is the temperature in centigrade and x2 is the temperature in Fahrenheit), then because of this linear relationship one of the two variables is redundant. Now assume that the relationship is not perfect, but very close. Standard errors are very large in the presence of collinearity, resulting in tests for the effects of explanatory variables on the response variable not being significant even when the overall model utility test is.
Principles on model selection:
When choosing predictors for the final model among the variables observed, the following
principles apply.
1. Predictors deemed essential for the model should always be included, regardless of whether they are "significant" or not.
2. When an interaction between factors is included in the model, the lower-order terms for those factors should be included in the model as well.
3.3.3. Stepwise variable selection algorithms
Two common strategies for adding or removing variables in a logistic regression are called
backward-selection and forward-selection. These techniques are often referred to
as stepwise model selection strategies, because they add or delete one variable at a time as they
"step" through the candidate predictors.
These algorithms can be helpful in building a model, but they have to be used with caution, and the final model must always be reviewed by the researcher.
In the backward variable selection algorithm the model is improved step by step by dropping a
variable from the model at each step. The variable which shows the "least significant" effect when
correcting for the other variables in the model is dropped. The process stops if another step does
not show a further improvement of the model.
In the forward variable selection algorithm the model is improved step by step by adding an extra
variable to the model at each step. The variable which shows the "most significant" effect when
correcting for the other variables in the model is added. The process stops if another step does not
show a further improvement of the model.
When using these algorithms one should make sure that either all or none of the dummy variables for a categorical factor are included in the model, and that if an interaction term is included in a model, all of its main effect terms are included too.
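The greedy add-one-variable loop of forward selection can be sketched as follows. The scoring function and candidate variables here are hypothetical stand-ins for a real fit criterion (e.g. a penalized log-likelihood such as AIC):

```python
def forward_select(candidates, score, min_improve=1e-6):
    """Greedy forward selection: at each step add the candidate variable
    that most improves the score; stop when no addition improves it."""
    selected = []
    best = score(selected)
    improved = True
    while improved and len(selected) < len(candidates):
        improved = False
        remaining = [v for v in candidates if v not in selected]
        top_score, top_var = max((score(selected + [v]), v) for v in remaining)
        if top_score > best + min_improve:
            selected.append(top_var)
            best = top_score
            improved = True
    return selected

# Toy score: each variable contributes a fixed amount of "fit", and every
# included variable costs 1.0 (an AIC-like complexity penalty).
useful = {"x1": 3.0, "x2": 1.5, "x3": 0.2}
score = lambda vs: sum(useful[v] for v in vs) - 1.0 * len(vs)
chosen = forward_select(["x1", "x2", "x3"], score)
```

With this toy score, x1 and x2 improve the criterion but x3's small contribution does not outweigh its penalty, so the loop stops with ["x1", "x2"]. Backward selection works the same way in reverse, dropping the least useful variable at each step.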
ROC curves
Logistic regression R2
Hosmer-Lemeshow tests
Chi-square goodness of fit tests and deviance
A cutoff of π0 = 0.5 can be used to classify each individual. Therefore each individual is classified as a success if π̂(x) > 0.5 and as a failure if π̂(x) ≤ 0.5.
A table showing the actual measurement and the classification for the data is called a classification
table.
If the model fits well, most individuals from the sample fall into the correctly predicted categories. To measure the predictive power of the model, find the sensitivity, the specificity, and the overall proportion of correct classifications.
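These summaries can be sketched from a classification table. The fitted probabilities and outcomes below are hypothetical:

```python
probs = [0.9, 0.8, 0.7, 0.4, 0.6, 0.3, 0.2, 0.1]   # fitted pi_hat(x)
ys    = [1,   1,   0,   1,   1,   0,   0,   0]     # observed outcomes

preds = [1 if p > 0.5 else 0 for p in probs]       # classify at cutoff 0.5

tp = sum(1 for y, yh in zip(ys, preds) if y == 1 and yh == 1)
tn = sum(1 for y, yh in zip(ys, preds) if y == 0 and yh == 0)
fp = sum(1 for y, yh in zip(ys, preds) if y == 0 and yh == 1)
fn = sum(1 for y, yh in zip(ys, preds) if y == 1 and yh == 0)

sensitivity = tp / (tp + fn)    # proportion of actual successes predicted correctly
specificity = tn / (tn + fp)    # proportion of actual failures predicted correctly
accuracy = (tp + tn) / len(ys)  # overall proportion of correct classifications
```

For this toy sample the classification table is (TP, FP, FN, TN) = (3, 1, 1, 3), giving sensitivity, specificity, and overall accuracy of 0.75 each.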
The curve is created by plotting the true-positive rate against the false-positive rate at various threshold settings. (The true-positive rate is also known as sensitivity; the false-positive rate equals 1 − specificity.) The ROC curve is thus the sensitivity as a function of the false-positive rate.
When the cutoff π0 gets near 0, almost all predictions are ŷ = 1; then sensitivity is near 1, specificity is near 0, and the point (1 − specificity, sensitivity) has coordinates near (1, 1). When π0 gets near 1, almost all predictions are ŷ = 0; then sensitivity is near 0, specificity is near 1, and the point (1 − specificity, sensitivity) has coordinates near (0, 0). The ROC curve usually has a concave shape connecting the points (0, 0) and (1, 1).
For a given specificity, better predictive power corresponds to higher sensitivity. So, the better the
predictive power, the higher the ROC curve.
The area under the curve is identical to the concordance index (c). It estimates the probability that the predictions and the outcomes are concordant, i.e. that the observations with larger y also have larger π̂. A value c = .5 means that the prediction is no better than a random guess and corresponds to an ROC curve that is a straight line.
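The concordance index can be sketched by comparing every (success, failure) pair of fitted probabilities, counting ties as 1/2. The probabilities below are hypothetical:

```python
succ = [0.9, 0.8, 0.4, 0.6]   # fitted probabilities for observations with y = 1
fail = [0.7, 0.3, 0.2, 0.1]   # fitted probabilities for observations with y = 0

pairs = [(s, f) for s in succ for f in fail]
# c = proportion of pairs where the success has the larger fitted probability
c = sum(1.0 if s > f else 0.5 if s == f else 0.0 for s, f in pairs) / len(pairs)
```

Here 14 of the 16 pairs are concordant, so c = 0.875; a model with no discriminating power would give c near 0.5.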
3.4.5. Correlation
R, the correlation between the observed values yi and the fitted values ŷi, is also a measure of model fit. For the logistic regression model this is the correlation between the 0/1 responses (1 = success, 0 = failure) and the estimated probabilities of success π̂(x). The closer the value is to 1, the better the fit between data and model. Because the response variable is discrete, the usefulness of this measure is limited.
R2 is again the coefficient of determination, which gives the proportion of variation in the response
variable explained by the model.
Cox and Snell Pseudo-R2
Where the null model is the logistic model with just the constant and the k model contains all the predictors in the model:

R² = 1 − [ L(null) / L(k) ]^(2/n)

where L(null) and L(k) are the maximized likelihoods of the two models (reported by software as −2LL(null) and −2LL(k)).
Because this R-squared value cannot reach 1, Nagelkerke modified it. The correction increases
the Cox and Snell version to make 1.0 a possible value for R-squared.
Nagelkerke Pseudo-R2

R² = { 1 − [ L(null) / L(k) ]^(2/n) } / { 1 − L(null)^(2/n) }
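Both pseudo-R² measures can be sketched from the two maximized log-likelihoods. The log-likelihood values below are hypothetical:

```python
import math

ll_null, ll_k, n = -100.0, -85.0, 200   # hypothetical fits

# Cox-Snell: 1 - [L(null)/L(k)]^(2/n), written in terms of log-likelihoods
r2_cs = 1 - math.exp((2 / n) * (ll_null - ll_k))
# Its maximum possible value, 1 - L(null)^(2/n)
r2_max = 1 - math.exp((2 / n) * ll_null)
# Nagelkerke rescales so that 1.0 becomes attainable
r2_nagelkerke = r2_cs / r2_max
```

For these values the Cox–Snell R² is about 0.139 and the Nagelkerke correction raises it to about 0.220; dividing by the maximum attainable value is exactly what makes 1.0 a possible value.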
Steps in Hosmer-Lemeshow
The deviance of the model is the likelihood ratio test between the most complex model that could
be fit, known as the saturated model (with n parameters that fits the n observations perfectly) and
the model that is being tested. The saturated model has a separate parameter for each logit and
provides a perfect fit to the data. This statistic is large when the model provides a poor fit to the
data.
Let LM denote the maximized log-likelihood value for a model M of interest. Let LS denote the
maximized log-likelihood value for the most complex model possible that has a separate parameter
for each observation, and it provides a perfect fit to the data.
Because the saturated model has additional parameters, its maximized log likelihood LS is at least
as large as the maximized log likelihood LM for a simpler model M. The deviance of a model is
defined as
Deviance = −2[LM − LS]
The deviance is the likelihood-ratio statistic for comparing model M to the saturated model. It is a
test statistic for the hypothesis that all parameters that are in the saturated model but not in model
M equal zero. Statistical software provides the deviance, so it is not necessary to calculate LM or
LS.
Hence, it can be shown that the likelihood of this saturated model is equal to 1, yielding a log-likelihood equal to 0. Therefore, the deviance for the logistic regression model is simply −2LM. If the p-value is small, then we have evidence that the model does not fit the data.
When the predictors are solely categorical, the data are summarized by counts in a contingency
table. For the ni subjects at setting i of the predictors, multiplying the estimated probabilities of
the two outcomes by ni yields estimated expected frequencies for y = 0 and y = 1. These are the
fitted values for that setting. The deviance statistic then has the G2 form introduced in equation
(2.7), namely
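The G² computation can be sketched for a single setting of the predictors. The observed and fitted counts below are hypothetical:

```python
import math

observed = [30, 70]      # observed counts for y = 1 and y = 0 at one setting
fitted = [40.0, 60.0]    # fitted (expected) counts from the model

# Deviance / G^2 form: 2 * sum(observed * log(observed / fitted))
g2 = 2 * sum(o * math.log(o / f) for o, f in zip(observed, fitted))
```

Here G² ≈ 4.32; summing such contributions over all settings of the predictors gives the model's deviance, which is then referred to a chi-square distribution.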
The test statistic has a chi-square distribution with degrees of freedom equal to the number of parameters in the complex (saturated) model minus the number of parameters in model M.
Residuals
Finally, residuals can be studied to determine where the lack of fit lies. In our example, none
of the standardized residuals are that large.
It is also helpful to plot the observed vs. the fitted proportions. If they are close they should
lie on the 45 degree line. One can also plot both of fitted and observed values against
explanatory variables.
[Figure: fitted vs. observed proportions; points close to the 45-degree line indicate good fit.]
[Figure: observed and fitted probabilities (Pi) plotted against achievement score (180–330).]
There is also the possibility that particular observations have too much “influence” with respect to:
1. Their effect on parameter estimates - if the observation(s) were deleted, the values of the parameter estimates may be considerably different.
2. Their effect on model fit - if the observation(s) were deleted, there may be a large improvement in model fit.
3. Their effect on misclassification error - if the observation(s) were deleted, there may be a large difference in the predicted number of “successes”.
There are several measures that describe the influence of observations. Each of these measures is computed for each observation, and the larger the value of the measure, the greater the influence that observation has on model fit. These measures are particularly important with quantitative continuous explanatory variables, because in that case looking at residuals (one indicator of how influential an observation is) is particularly daunting.
All of these measures are related to the leverage an observation has because the greater the
leverage the greater the influence on the model. Mathematically, these measures are related
to the diagonal of the hat-matrix that is used to obtain the predicted logit values for a model
from the sample logits. Large values in the hat matrix represent observations that are
greatly influencing model fit.
Another measure that describes the influence of individual observations is called Dfbeta.
This assesses the effect that an individual observation has on the parameter estimates. The
larger Dfbeta, the larger the change in the estimated parameter when the observation is
removed. Large values indicate that certain observations are leading to instability in the
parameter estimates.
Another measure that describes the influence of individual observations is the confidence interval displacement measures, c and c-bar. These measure the change in the joint confidence interval for the parameters produced by deleting an observation.
Finally, Delta Deviance and Delta Chi-Square measure the change in the G2 and X2 goodness
of fit statistics, respectively for individual observations. They are diagnostics for detecting
observations that contribute heavily to the lack of fit to the model.
In using these statistics it is useful to plot them versus some index, such as the observation
number.
Although the predictors act additively on the log-odds scale, they are not additive on the odds or risk (probability) scales.
CHAPTER 5
5. MULTICATEGORY LOGIT MODELS
Learning objectives:
After the completion of this chapter, the students will be able to:
In the case of the binary logistic (or simply logistic) regression model, we restricted the response or dependent variable to be dichotomous or binary. Now we will consider a response variable, Y, with J levels. The explanatory or independent variables may be quantitative, qualitative, or both.
These are generalizations of logistic regression that model categorical responses with more than two categories. There are ways in which logistic regression models for response variables with more than two outcomes differ from logistic regression for dichotomous data.
c. Continuation ratios
In this course, however, we will look only at “baseline-category” logit models for nominal response data and cumulative logit models for ordinal response data.
The baseline-category logit model is basically just an extension of the binary logistic regression model. It gives a simultaneous representation of the odds of being in one category relative to being in another category, for all pairs of categories.
Let J denote the number of categories for Y. Let {π1, . . . , πJ} denote the response probabilities, satisfying Σⱼ₌₁ᴶ πj = 1. With n independent observations, the probability distribution for the number of outcomes of the J types is the multinomial. It specifies the probability for each possible way the n observations can fall in the J categories. Here, we will not need to calculate such probabilities.
For models of this section, the order of listing the categories is irrelevant, because the model treats the response scale as nominal (unordered categories).
These models pair each category with a baseline category. When the last category (J) is the baseline, the baseline-category logits are log(πj / πJ), j = 1, . . . , J − 1.
80
Categorical Data Analysis
With a set of J – 1 non-redundant odd in equation 5.1, we can figure out the odds for any pairs of
categories. To determine equations for all other pairs of categories. For example, for an arbitrary
pair of categories a and b,
With the baseline-category logit model we choose one of the categories as the "baseline". This
choice may be arbitrary, or there may be a logical choice depending on the data.
For convenience, we will use the last level (i.e., the Jth level) of the response variable as the
baseline.
The baseline-category logit model with one explanatory variable, x, is

log(πij / πiJ) = αj + βj xi   for j = 1, 2, ..., J − 1.

For J = 2 this is just the regular binary logistic regression model. The model has J − 1 equations,
with separate intercept and slope parameters depending on which two categories are being
compared. The odds for any pair of categories of Y are a function of the parameters of the model.
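Numerically, the log-odds for any pair of categories is just the difference of the two baseline-category logits. A small sketch with made-up parameter values:

```python
import math

# Hypothetical baseline-category logit parameters (category 3 = baseline):
#   log(pi_j / pi_3) = alpha_j + beta_j * x,  j = 1, 2
alpha = {1: 0.5, 2: -0.2}
beta = {1: 0.03, 2: 0.08}

def baseline_logit(j, x):
    """Log-odds of category j versus the baseline category."""
    return alpha[j] + beta[j] * x

x = 10.0
# Log-odds for the pair (1, 2), obtained by subtracting baseline logits:
log_odds_1_vs_2 = baseline_logit(1, x) - baseline_logit(2, x)
# Equivalent direct form: intercept and slope are the parameter differences.
direct = (alpha[1] - alpha[2]) + (beta[1] - beta[2]) * x
```

Both routes give the same value, illustrating why only J − 1 of the logits are non-redundant.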
Example
Suppose we have data identifying respondents' political affiliation as democrat, republican, or
independent, and we want to know whether political affiliation can be predicted by SES (X1),
which is a quantitative (i.e., continuous) variable, and gender (X2), which is qualitative.
For these data the response variable is party identification. We could fit a binary logit model to
each pair of party identifications, Y:
Y ∈ {democrat, republican, independent},   X1 = SES,   X2 = 1 for female, 0 for male.

With independent as the baseline category and SES as the single predictor:

log(π_democrat / π_independent) = log(π1 / π3) = α1 + β1 x

log(π_republican / π_independent) = log(π2 / π3) = α2 + β2 x

The logit for the remaining pair is the difference of the two baseline logits:

log(π_democrat / π_republican) = (α1 + β1 x) − (α2 + β2 x) = (α1 − α2) + (β1 − β2) x

With both predictors, the three pairwise logits are

log(π_democrat / π_independent) = α1 + β11 x1 + β12 x2

log(π_republican / π_independent) = α2 + β21 x1 + β22 x2

log(π_democrat / π_republican) = α3 + β31 x1 + β32 x2,

where α3 = α1 − α2, β31 = β11 − β21, and β32 = β12 − β22.
CAUTION: You MUST be certain how the computer program that you use parameterizes the
model: typically the baseline category's parameters are set to zero (αJ = βJ = 0), but programs
differ in which category they take as the baseline.
Using the parameters obtained from fitting the model with SES predicting party affiliation, we
obtained:
log(π̂_democrat / π̂_independent) = log(π̂1 / π̂3) = 0.1502 − 0.00013 x

log(π̂_republican / π̂_independent) = log(π̂2 / π̂3) = −0.9987 + 0.0191 x

log(π̂_democrat / π̂_republican) = log(π̂1 / π̂3) − log(π̂2 / π̂3) = 1.1489 − 0.01923 x
We can interpret the parameters of the model in terms of odds ratios, given an increase in SES.
For a 10-point increase in the SES index we obtain the following odds ratios:
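In general, a c-unit increase in x multiplies the estimated odds by exp(c·β̂). A sketch, taking the republican-vs-independent slope 0.0191 as transcribed above (treat the value and its sign as illustrative):

```python
import math

# Estimated slope for one of the fitted logits; value taken from the
# fitted equations above and treated as illustrative.
beta_hat = 0.0191

# Estimated odds ratio for a 10-point increase in SES: exp(10 * beta_hat)
odds_ratio_10 = math.exp(10 * beta_hat)
```

So a 10-point SES increase multiplies those odds by about exp(0.191) ≈ 1.21 under this slope.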
Just as in binary logistic regression, we can also interpret the parameters of the model in
terms of probabilities.
πj = exp(αj + βj x) / Σ_{k=1}^{J} exp(αk + βk x),

where the baseline category's parameters are set to zero: αJ = βJ = 0. This is an identification
constraint. Furthermore, the denominator Σ_{k=1}^{J} exp(αk + βk x) ensures that the
probabilities sum to 1.
Proof
Starting from

log(πj / πJ) = αj + βj x,

exponentiating both sides gives

πj / πJ = e^(αj + βj x),

which is the odds for category j versus category J. Hence πj = πJ e^(αj + βj x). Now, suppose we
sum both sides over j = 1, ..., J − 1:

Σ_{j=1}^{J−1} πj = πJ Σ_{j=1}^{J−1} e^(αj + βj x).

Note, though, that Σ_{j=1}^{J} πj = 1, i.e., Σ_{j=1}^{J−1} πj = 1 − πJ.

So

1 − πJ = πJ Σ_{j=1}^{J−1} e^(αj + βj x),   which gives   πJ = 1 / (1 + Σ_{j=1}^{J−1} e^(αj + βj x)),

and, since πj = πJ e^(αj + βj x), substituting in πJ we obtain

πj = e^(αj + βj x) / (1 + Σ_{k=1}^{J−1} e^(αk + βk x)).

For the party identification example, the fitted probabilities are

π̂_democrat = exp(0.1502 − 0.00013 x) / [1 + exp(0.1502 − 0.00013 x) + exp(−0.9987 + 0.0191 x)]

π̂_independent = 1 / [1 + exp(0.1502 − 0.00013 x) + exp(−0.9987 + 0.0191 x)]
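These fitted probabilities can be evaluated with a small function; the coefficients are transcribed from the fitted equations above, with signs read from context, so treat them as illustrative:

```python
import math

# Fitted baseline-category logit coefficients (independent = baseline),
# transcribed from the fitted equations above (signs assumed from context):
#   log(pi_dem / pi_ind) =  0.1502 - 0.00013 * x
#   log(pi_rep / pi_ind) = -0.9987 + 0.0191  * x
coefs = {
    "democrat":   (0.1502, -0.00013),
    "republican": (-0.9987, 0.0191),
}

def party_probs(x):
    """Estimated response probabilities at SES score x."""
    exps = {p: math.exp(a + b * x) for p, (a, b) in coefs.items()}
    denom = 1.0 + sum(exps.values())   # the baseline contributes exp(0) = 1
    probs = {p: e / denom for p, e in exps.items()}
    probs["independent"] = 1.0 / denom
    return probs

p50 = party_probs(50.0)
```

Whatever value of x is plugged in, the three probabilities sum to 1 by construction.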
[Figure: estimated probabilities of party identification (Democrat, Republican, Independent)
plotted against SES, 0 to 100.]
We can easily add more explanatory variables to our model, and these variables can be either
categorical or numeric. (In SAS PROC CATMOD, numeric variables are identified with the
DIRECT statement.) Furthermore, the model comparison methods that we have used in the past
will work with this model as well.
The baseline-category logit model can be used when the categories of the response variable are
ordered, but it may not be the best model for ordinal responses. For ordinal responses we instead
use the cumulative logit model.
For this model the effect of the explanatory variable(s) is the same regardless of how we collapse
Y into dichotomous categories. Therefore, a single slope parameter describes the effect of x on Y,
versus the J − 1 parameters that are needed in the baseline-category model. However, the
intercepts can differ.
For this model we use cumulative probabilities, which are the probabilities that Y falls in
category j or below. In other words,

P(Y ≤ j) = π1 + π2 + ··· + πj,   where j = 1, 2, ..., J.

A cumulative logit is of the form

logit[P(Y ≤ j)] = log[ P(Y ≤ j) / P(Y > j) ] = log[ (π1 + ··· + πj) / (πj+1 + ··· + πJ) ].

That is, a model for the cumulative logit of category j is equivalent to a binary logistic regression
model for the combined categories 1 to j (group I) versus the combined categories j + 1 to J
(group II).

For one predictor variable x, the proportional odds model becomes, for j = 1, 2, ..., J − 1,

log[ P(Y ≤ j) / (1 − P(Y ≤ j)) ] = αj + β x.

The cumulative logit models measure how likely the response is to be in category j or below versus
in a category higher than j.
The slope is the same for all cumulative logits, and therefore this model has a single slope
parameter instead of the J − 1 slope parameters in the multicategory (baseline-category) logistic
regression model.
The parameter β describes the effect of x on the odds of falling into categories 1 to j. The effect
is assumed to be the same for all cumulative odds, and the model is therefore called the
proportional odds model.
Cumulative probabilities are given by

P(Y ≤ j) = exp(αj + β x) / [1 + exp(αj + β x)].
Individual category probabilities are differences of successive cumulative probabilities,
P(Y = j) = P(Y ≤ j) − P(Y ≤ j − 1); in particular,

P(Y = J) = 1 − P(Y ≤ J − 1) = 1 − e^(αJ−1 + β x) / (1 + e^(αJ−1 + β x)).

Therefore, this model is sometimes referred to as a difference model.
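A short sketch of these differences of cumulative probabilities, using made-up intercepts and slope for J = 3 ordered categories:

```python
import math

# Hypothetical proportional-odds parameters for J = 3 ordered categories:
#   logit P(Y <= j) = alpha_j + beta * x, with alpha_1 < alpha_2.
alphas = [-1.0, 1.0]
beta = 0.5

def cum_prob(j, x):
    """P(Y <= j) for j = 1, ..., J-1."""
    z = alphas[j - 1] + beta * x
    return math.exp(z) / (1.0 + math.exp(z))

x = 0.4
cum = [cum_prob(1, x), cum_prob(2, x), 1.0]   # P(Y <= J) is always 1
# Category probabilities are successive differences of cumulative ones:
cat_probs = [cum[0], cum[1] - cum[0], 1.0 - cum[1]]
```

Because the cumulative probabilities are increasing in j, each difference is positive, and the category probabilities sum to 1.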
To interpret this model in terms of odds ratios, compare the cumulative odds at two values of the
predictor, x1 and x2, for a given level of Y, say Y ≤ j:

[P(Y ≤ j | x1) / P(Y > j | x1)] / [P(Y ≤ j | x2) / P(Y > j | x2)] = exp[(αj + β x1) − (αj + β x2)] = exp[β(x1 − x2)].

The log odds ratio is proportional to the difference between x1 and x2, and since the
proportionality constant β does not depend on j, each independent variable has an identical effect
at each cumulative split of the ordinal dependent variable.
For large samples with categorical explanatory variables the results are almost the same. In
general, maximum likelihood estimation is preferred with quantitative explanatory variables.
Given

log[ P(Y ≤ j) / (1 − P(Y ≤ j)) ] = αj + β1 x1 + β2 x2 + ··· + βk xk,

the intercept αj in this model is the log-odds of falling into or below category j when X1 = X2 = ··· = Xk = 0.
A single parameter βk describes the effect of xk on Y: βk is the increase in the log-odds of falling
into or below any category associated with a one-unit increase in Xk, holding all the other X-variables
constant; compare this to the baseline-category logit model, where there are J − 1 parameters for a
single explanatory variable. With this parameterization, a positive slope indicates a tendency for the
response level to decrease as the variable increases (since P(Y ≤ j) grows with x).
Constant slope βk: the effect of Xk is the same for all J − 1 ways to collapse Y into dichotomous
outcomes.
Example
Consider preference for watching football games, with ordered levels (like it very much, like it, …).
A cumulative logit model was fitted predicting whether or not one liked watching football games
from age, with the following results:
Etc.
[Figures: estimated probabilities plotted against age, 10 to 90.]
CHAPTER 6
6. POISSON REGRESSION MODEL
Learning objectives:
After the completion of this chapter, the students will be able to:
Before discussing the Poisson regression model, let us revisit the definition of the Poisson
distribution.
6.1. The Poisson Distribution
A random variable Y is said to have a Poisson distribution with parameter μ if it takes integer
values y = 0, 1, 2, … with probability

P(Y = y) = e^(−μ) μ^y / y!

The Poisson distribution is a limiting case of the binomial distribution: it arises when the number
of trials becomes large while the expectation remains stable, i.e., when the probability of success
is very small.
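Both the pmf and this binomial limiting behaviour are easy to check numerically (scipy is assumed to be available):

```python
from scipy.stats import binom, poisson

mu = 2.0

# Poisson probability of y = 3 events when the mean is 2:
p_pois = poisson.pmf(3, mu)

# Binomial with large n and small p = mu / n approaches the same value:
n = 10_000
p_binom = binom.pmf(3, n, mu / n)
```

With n = 10,000 trials the binomial probability already agrees with the Poisson probability to about three decimal places.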
6.2. Introduction to Poisson regression
Poisson regression deals with situations in which the dependent variable is a count. But we can
also have Y/t, the rate (or incidence) as the response variable, where t is an interval representing
time, space or some other grouping.
Explanatory Variable(s):
Explanatory variables, X1, X2, … Xk , can be continuous or a combination of continuous and
categorical variables. Convention is to call such a model “Poisson Regression”.
Explanatory variables, X1, X2, … Xk, can be ALL categorical. Then the counts to be modeled are
the counts in a contingency table, and the convention is to call such a model log-linear model.
If Y/t is the variable of interest then even with all categorical predictors, the regression model will
be known as Poisson regression, not a log-linear model.
Why do we need special models? What is wrong with OLS?
As with probit and logit models, the dependent variable has restricted support. OLS regression
can (and will) predict values that are negative, and will also predict non-integer values:
nonsensical results for counts.
Even though these kinds of response variables are numeric, they create some problems if we try to
analyze the data within the context of regular linear regression because of the limited range of most
of the values (although a large range of values is still possible) and because only nonnegative
integer values can occur. Thus, count data can potentially result in a highly skewed distribution,
cut off at zero.
Since the mean is equal to the variance, any factor that affects one will also affect the other.
Thus, the usual assumption of homoscedasticity would not be appropriate for Poisson data.
The model: Y is a random variable that has a Poisson distribution, with

log(E[Y]) = log(μ) = β0 + β1 x1 + ··· + βk xk,   where Y | x1, x2, ..., xk ~ Poisson(μ).

Since the log of the expected value of Y is a linear function of the explanatory variables, the
expected value of Y is a multiplicative function of the x's:

E[Y] = μ = exp(β0 + β1 x1 + ··· + βk xk).
Parameter Estimation
Similar to the case of logistic regression, the maximum likelihood estimates (MLEs) of β0, β1, …,
βk are obtained by finding the values that maximize the log-likelihood. In general, there are no
closed-form solutions, so the ML estimates are obtained by using iterative algorithms such
as Newton-Raphson (NR), iteratively reweighted least squares (IRWLS), etc.
Interpretation of Parameter Estimates:
To interpret the results of the analysis, we exponentiate the estimates of interest, exp(β̂j), as well
as the endpoints of the confidence intervals, and talk about multiplicative changes in
the response variable for each one-unit change in the explanatory variable. That is:
exp(β0) = the mean of Y when x1 = x2 = ··· = xk = 0, i.e., the baseline value for an
observation with all X's equal to zero.
exp(β1) = the multiplicative effect on the mean of Y of every one-unit increase in x1,
keeping the effects of the rest of the predictors constant.
Consider two values of x1 such that the difference between them equals 1, for example x1 = 10
and x1 = 11.
The expected value of Y when x1 = 10 is

E[Y] = μ = exp(β0 + β1·10 + ··· + βk xk) = exp(β0) exp(10 β1) ··· exp(βk xk).

The expected value of Y when x1 = 11 is

E[Y] = μ = exp(β0 + β1·11 + ··· + βk xk) = exp(β0) exp(11 β1) ··· exp(βk xk).

The ratio of the two expected values is exp(11 β1) / exp(10 β1) = exp(β1).
If βi = 0 (i = 1, 2, ..., k), then exp(βi) = 1, and the expected count E(Y) does not change
with xi: Y and xi are not related.
If βi > 0, then exp(βi) > 1, and the expected count E(Y) at x + 1 is exp(βi) times larger
than at x.
If βi < 0, then exp(βi) < 1, and the expected count E(Y) at x + 1 is exp(βi) times the count
at x, i.e., smaller.
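The multiplicative interpretation is easy to verify: under made-up coefficients, the ratio of expected counts at x1 = 11 versus x1 = 10 equals exp(β1):

```python
import math

# Hypothetical coefficients: log(mu) = b0 + b1 * x1
b0, b1 = 0.2, 0.3

mu_10 = math.exp(b0 + b1 * 10)
mu_11 = math.exp(b0 + b1 * 11)

# A one-unit increase in x1 multiplies the expected count by exp(b1):
rate_ratio = mu_11 / mu_10
```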
Example
Sociologists wanted to determine whether sex or education level affected how often people
volunteer. They collected data on the number of volunteer activities people were involved with in
the previous year, their sex, and their number of years of education (high school and beyond). The
numbers of volunteer activities are counts, and a log transformation does not really help us to meet
the assumptions of linear regression, so we proceed with Poisson regression.
There was little evidence to suggest that the relationship between volunteering and education level
differed by sex (χ²₁ = 0.75, P = 0.39), so there is no need for an interaction term and we simplify to
a parallel-lines model.
Males vs. females: Females are involved with 1.16 times (95% CI = 1.02-1.32) more volunteer
activities than males, after accounting for years of education (χ²₁ = 5.42, P = 0.02). Equivalently,
females are involved with 16% (95% CI = 3-32%) more volunteer activities than males, after
accounting for years of education.
Years of education: Each additional year of education multiplies the expected number of volunteer
activities by 1.14 (95% CI = 1.12-1.16), after accounting for sex (χ²₁ = 138.9, P < 0.0001).
Equivalently, each additional year of education increases the expected number of volunteer
activities by 14% (95% CI = 12-16%).
Inference
The usual tools from basic statistical inference apply: confidence intervals and hypothesis tests for
the parameters.
Model Fit
Overall goodness-of-fit statistics of the model are the same as for any GLM:
o Pearson chi-square statistic, X²
o Deviance, G²
o Likelihood-ratio test statistic, ΔG²
Residual analysis: Pearson, deviance, adjusted residuals, etc.
Overdispersion
o Recall that a Poisson random variable has the same mean and variance, i.e., E(Y) = Var(Y) = μ.
Overdispersion means that the observed variance is larger than the assumed variance, i.e.,
Var(Y) = φμ, where φ > 1 is a scale parameter like the one we saw in logistic regression.
o Two typical solutions are:
Adjust for overdispersion (as in logistic regression): estimate φ = X²/(N − p), and adjust
the standard errors and test statistics accordingly.
Use negative binomial regression instead (see notes on ANGEL), where the response Y is
assumed to follow a negative binomial distribution with E(Y) = μ and Var(Y) = μ + Dμ². The
index D is called a dispersion parameter. Greater heterogeneity in the Poisson means results in a
larger value of D. As D approaches 0, Var(Y) approaches μ, and the negative binomial and
Poisson regression will give the same inference.
An important additional property of the Poisson distribution is that sums of independent Poisson variates
are themselves Poisson variates: if Y1 and Y2 are independent, with Yi having a Poisson(μi) distribution, then

Y1 + Y2 ~ Poisson(μ1 + μ2).   (1)

As we shall see, the key implication of this result is that
individual and grouped data can both be analyzed with the Poisson distribution.
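The additivity property can be verified numerically by convolving two Poisson pmfs and comparing with the Poisson(μ1 + μ2) pmf (scipy assumed):

```python
import numpy as np
from scipy.stats import poisson

mu1, mu2 = 2.0, 3.0

k = np.arange(60)
pmf1 = poisson.pmf(k, mu1)
pmf2 = poisson.pmf(k, mu2)

# P(Y1 + Y2 = k) is the convolution of the two pmfs ...
conv = np.convolve(pmf1, pmf2)[: len(k)]
# ... and should equal the Poisson(mu1 + mu2) pmf term by term.
pmf_sum = poisson.pmf(k, mu1 + mu2)
max_err = np.max(np.abs(conv - pmf_sum))
```

The two pmfs agree to machine precision on the truncated support.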
The Poisson distribution is a discrete distribution appropriate for modeling counts of observed
cases, like the count of measles cases in cities. You can simply model counts if all data were collected over the same measuring unit
(e.g., the same number of days or the same number of square feet).
You can use the Poisson distribution for modeling rates (rates are counts per unit) if the units of collection were different.
Unlike the familiar normal distribution, which is described by two parameters (mean and variance), the Poisson distribution is
completely described by just one parameter, lambda (λ). Lambda is the mean of a Poisson distribution as well as its variance,
and λ can take on non-integer values. While it is impossible to have 1.5 cases of measles in a city, it is possible to have the
average number of cases per 1000 person-months be non-integer (like 3.12 cases/1000 person-months).