6 Loglinear Models Beamer-Online PDF

Log-linear Models for Contingency Tables
Edps/Psych/Soc 589
Carolyn J. Anderson
Department of Educational Psychology
© Board of Trustees, University of Illinois
Fall 2018
Outline
In this set of notes:
◮ Loglinear models for 2–way tables.
◮ Loglinear models for 3–way tables.
◮ Statistical inference & model checking.
◮ Statistical versus Practical Significance.
◮ Higher–way tables.
◮ The logit–log-linear model connection. (We’ll discuss further
connections when we cover multicategory logit models).
◮ Model building (graphical models).
◮ Modeling ordinal associations, including linear × linear
association models.
◮ Modeling approach to testing conditional independence.
◮ Sparse data, including
◮ Structural zeros
◮ Sampling zeros
◮ Effect on $G^2$ and $X^2$.
Log-linear models (or Poisson regression)

log(µ) = α + β1 x1 + β2 x2 + . . . + βk xk
where µ = response variable = count (or rate)
◮ A very common use of log-linear models is for modeling counts in

contingency tables; that is, the explanatory variables are all
categorical.
◮ Log-linear models are used to model the association (or interaction
structure) between/among categorical variables.
◮ The categorical variables (which in GLM terminology are
explanatory variables) are the “responses” in the sense that we’re
interested in describing the relationship between the variables.
◮ This use of “response” differs from our use in GLM. For log-linear
models, the response variable is the set of cell frequencies (counts) in the
contingency table.
Log-linear Models for 2-way Tables
Review of notations for 2-way Contingency tables:
◮ I = the number of rows.

◮ J = the number of columns.
◮ I × J contingency table.
◮ N = IJ = the number of cells in the table.
◮ n = the number of subjects (respondents, objects, etc.)
cross-classified by 2 discrete (categorical) variables.
Example: 1989 GSS data

From Demaris: Cross-classification of respondents according to
◮ Choice for president in the 1988 presidential election
(Dukakis or Bush).
◮ Political View with levels liberal, moderate, conservative.
Political            Vote Choice
View            Dukakis    Bush    Total
Liberal            197       65      262
Moderate           148      186      334
Conservative        68      242      310
Total              413      493      906
A More Recent Example

Cross-classification of respondents to the General Social Survey
from 1996.
◮ Choice for president in the 1992 presidential election:

“If you voted in 1992, did you vote for Clinton, Bush or
Perot?”
◮ Political View with levels liberal, moderate, conservative.
“We hear a lot of talk these days about liberals and
conservatives. I’m going to give you a scale . . . Where do you
place yourself on this scale?”
The scale had 7 levels: extremely liberal, liberal, slightly
liberal, moderate, etc., but I collapsed them into 3.
The 1996 Data

View             Bush    Clinton    Perot    Total
Liberal            70       342       56      468
                 (.15)     (.73)    (.12)     1.00
Moderate          195       332      101      628
                 (.31)     (.53)    (.16)     1.00
Conservative      382       199      117      698
                 (.55)     (.29)    (.17)     1.00
Total             647       873      274     1794
(Row proportions in parentheses.)
Statistical Independence
◮ The joint probabilities $\{\pi_{ij}\}$ of observations falling into a cell equal
the product of the marginal probabilities,
$$\pi_{ij} = \pi_{i+}\pi_{+j} \quad \text{for all } i = 1, \ldots, I \text{ and } j = 1, \ldots, J$$
◮ The expected frequencies (cell counts) equal
$$\mu_{ij} = n\pi_{ij} = n\pi_{i+}\pi_{+j} \quad \text{for all } i, j$$
◮ The probabilities $\pi_{ij}$ are the parameters of a binomial or multinomial
distribution.
◮ In log-linear models, the response is the cell counts, and we model the
expected cell counts $\{\mu_{ij}\}$ rather than the cell probabilities $\{\pi_{ij}\}$;
therefore, the random component is Poisson.
◮ Taking logarithms gives us a log-linear model of statistical
independence
$$\log(\mu_{ij}) = \log(n) + \log(\pi_{i+}) + \log(\pi_{+j})$$
Log-Linear Model of Statistical Independence

“Log-linear model of independence” for 2–way contingency tables:
$$\log(\mu_{ij}) = \lambda + \lambda_i^X + \lambda_j^Y$$
◮ This is an “ANOVA” type representation.
◮ $\lambda$ represents an “overall” effect or a constant.
This term ensures that $\sum_i \sum_j \mu_{ij} = n$.
◮ $\lambda_i^X$ represents the “main” or marginal effect of the row variable
$X$. It represents the effect of classification in row $i$.
The $\lambda_i^X$’s ensure that $\sum_j \mu_{ij} = \mu_{i+} = n_{i+}$.
◮ $\lambda_j^Y$ represents the “main” or marginal effect of the column
variable $Y$ & represents the effect of classification in column $j$.
The $\lambda_j^Y$’s ensure that $\sum_i \mu_{ij} = \mu_{+j} = n_{+j}$.
◮ The re-parametrization allows modeling association structure.
Log-linear model of Statistical Independence
$$\log(\mu_{ij}) = \lambda + \lambda_i^X + \lambda_j^Y$$
& hypothesis test of statistical independence in 2–way tables.
◮ The estimated expected cell counts for the chi-squared test of
independence equal (from beginning of course)
$$\hat{\mu}_{ij} = \frac{n_{i+} n_{+j}}{n}$$
which also equal the estimated fitted values for the
independence log-linear model.
◮ The significance of this: The $X^2$ and $G^2$ tests of independence
are goodness–of–fit tests of the independence log-linear model.
◮ The null hypothesis of independence is equivalent to:
the model $\log(\mu_{ij}) = \lambda + \lambda_i^X + \lambda_j^Y$ holds.
◮ The alternative hypothesis of dependence is equivalent to:
the model $\log(\mu_{ij}) = \lambda + \lambda_i^X + \lambda_j^Y$ does not hold.
Example
The observed and estimated fitted values (fitted values in parentheses)

View              Bush      Clinton      Perot     Total
Liberal             70         342          56       468
               (168.78)    (227.74)    (71.478)
Moderate           195         332         101       628
               (226.49)    (305.60)    (95.915)
Conservative       382         199         117       698
               (251.73)    (339.66)    (106.61)
Total              647         873         274      1794

The fitted values satisfy the definition of independence perfectly.
e.g., for the (Liberal, Bush) cell, $\hat{\mu}_{11} = n_{1+}n_{+1}/n = (468)(647)/1794 = 168.78$
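The fitted values under independence can be reproduced directly from the table margins. The following is a minimal Python sketch (the slides themselves use SAS and R); the names `observed`, `independence_fit`, and `odds_ratio` are illustrative and not part of the course materials.

```python
# Observed 1996 GSS counts (rows: liberal, moderate, conservative;
# columns: Bush, Clinton, Perot), copied from the table in these notes.
observed = [
    [70, 342, 56],
    [195, 332, 101],
    [382, 199, 117],
]


def independence_fit(table):
    """Return fitted counts n_{i+} * n_{+j} / n for a two-way table."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    return [[r * c / n for c in col_totals] for r in row_totals]


def odds_ratio(table, i, i2, j, j2):
    """Odds ratio for rows (i, i2) and columns (j, j2) of a table."""
    return (table[i][j] * table[i2][j2]) / (table[i][j2] * table[i2][j])


fitted = independence_fit(observed)

for row in fitted:
    print([round(value, 2) for value in row])

# Every 2 x 2 sub-table of the fitted values has an odds ratio of 1.
print(round(odds_ratio(fitted, 0, 1, 0, 1), 4))
```

Because the fitted table is an outer product of the margins, any 2 × 2 sub-table chosen with `odds_ratio` gives 1 up to floating-point error.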
Example: SAS & R

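Fit Independence model in SAS:
PROC GENMOD ORDER=DATA;
  CLASS pview choice;
  /* OBSTATS prints the observed and estimated fitted values */
  MODEL count = pview choice / LINK=log DIST=Poisson OBSTATS;
RUN;
R:
# Fit the independence model as a Poisson GLM with log link
summary(i.mod <- glm(count ~ view + choice,
                     data=gss.data, family=poisson))
# Pearson chi-squared statistic from the Pearson residuals
(X2 <- sum(residuals(i.mod, type="pearson")^2))
# Estimated fitted values
i.mod$fitted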
Independence also implies
that the odds ratios for every 2 × 2 sub-table must equal 1.

For our example, fitted odds ratio for each of the 3 possible
sub-tables for Bush and Clinton:
(168.78)(305.60)/(227.74)(226.49) = 1.00
(226.49)(339.66)/(305.60)(251.73) = 1.00
(168.78)(339.66)/(227.74)(251.73) = 1.00
The same is true for all possible (2 × 2) sub-tables.

Independence (continued)
Fit statistics for the independence log-linear model:
Statistic         df     Value      p–value
$X^2$              4     252.10     < .0001
$G^2$              4     262.26     < .0001
Log Likelihood           7896.55
These are the same $X^2$ and $G^2$ that we get when testing
independence. You get these in SAS from PROC FREQ or
GENMOD. In R, the “deviance” from glm is $G^2$, and $X^2$ can be computed
using
X2 <- sum(residuals(i.mod, type="pearson")^2)
Any guesses as to what model might fit these data?
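As a cross-check on the fit statistics, $X^2$ and $G^2$ can be computed by hand from the observed and fitted counts. This is a small Python sketch under the assumption that no observed cell is zero (otherwise the $G^2$ term needs special handling); the function names are illustrative.

```python
import math

# Observed 1996 GSS counts (rows: political view; columns: Bush, Clinton, Perot).
observed = [
    [70, 342, 56],
    [195, 332, 101],
    [382, 199, 117],
]


def independence_fit(table):
    """Fitted counts n_{i+} * n_{+j} / n under independence."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    return [[r * c / n for c in col_totals] for r in row_totals]


def pearson_x2(obs, fit):
    """Pearson statistic: sum of (observed - fitted)^2 / fitted."""
    return sum((o - m) ** 2 / m
               for o_row, m_row in zip(obs, fit)
               for o, m in zip(o_row, m_row))


def deviance_g2(obs, fit):
    """Likelihood ratio statistic: 2 * sum of observed * log(observed / fitted)."""
    return 2 * sum(o * math.log(o / m)
                   for o_row, m_row in zip(obs, fit)
                   for o, m in zip(o_row, m_row))


fitted = independence_fit(observed)
x2 = pearson_x2(observed, fitted)
g2 = deviance_g2(observed, fitted)
df = (len(observed) - 1) * (len(observed[0]) - 1)

print(round(x2, 2), round(g2, 2), df)
```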
Log-linear Model as a GLM

For I × J Tables
◮ Random component. The N = IJ counts in the cells of the
contingency tables are assumed to be N independent
observations of a Poisson random variable. Thus, we focus on
expected values of counts:
E (counts) = µij
◮ Link is log (canonical link for the Poisson distribution).

◮ Systematic component is a linear predictor with discrete
variables.
Loglinear model (of independence) is
$$\log(\mu_{ij}) = \lambda + \lambda_i^X + \lambda_j^Y$$
Log-Linear Model Parameters

$$\log(\mu_{ij}) = \lambda + \lambda_i^X + \lambda_j^Y$$
◮ The row and column variables (X and Y , respectively) are both

“response” variables (classification variables) in the sense that this
model represents the relationship between the 2 variables.
◮ Log-linear models do not distinguish between “response” and
explanatory (predictor) variables.
◮ When one variable is a response variable, then this influences
(guides) the interpretation of parameters (as well as choice of
model).
Consider the case of an I × 2 table where the column classification (Y) is the
“response” or outcome variable and the row classification (X) is an
explanatory variable, e.g., the 1992 presidential election.
E.g., 1992 Presidential Election

Just consider Clinton and Bush and re-fit the independence model.
Let $\pi = \text{Prob}(Y = 1) = \text{Prob(Clinton)}$, so
$$\begin{aligned}
\text{logit}(\pi) &= \log(\mu_{i1}/\mu_{i2}) \\
&= \log(\mu_{i1}) - \log(\mu_{i2}) \\
&= (\lambda + \lambda_i^X + \lambda_1^Y) - (\lambda + \lambda_i^X + \lambda_2^Y) \\
&= \lambda_1^Y - \lambda_2^Y
\end{aligned}$$
which does not depend on the row variable.

◮ The log-linear model of independence corresponds to the logit
model with only an intercept term; that is,
$$\text{logit}(\pi) = \alpha$$
where $\alpha = (\lambda_1^Y - \lambda_2^Y)$ is the same for all rows (levels of political
view).
◮ Odds $= \exp(\alpha) = \exp(\lambda_1^Y - \lambda_2^Y)$ is the same for all rows.
Interpretation of Parameters
◮ When there are only 2 levels of the response, logit models are
preferable (fewer terms in the model). This is especially true
when we have 2 or more explanatory variables.
◮ Log-linear models are primarily used when modeling the
relationship among 2 or more categorical responses.
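Odds ratios are functions of model parameters. Under the independence model,
$$\begin{aligned}
\log(\text{odds ratio}) = \log(\theta_{(12,12)}) &= \log\left(\frac{\mu_{11}\mu_{22}}{\mu_{12}\mu_{21}}\right) \\
&= (\lambda + \lambda_1^X + \lambda_1^Y) + (\lambda + \lambda_2^X + \lambda_2^Y) \\
&\quad - (\lambda + \lambda_1^X + \lambda_2^Y) - (\lambda + \lambda_2^X + \lambda_1^Y) \\
&= 0
\end{aligned}$$
So the odds ratio, $\theta = \exp(0) = e^0 = 1$.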

Parameter Identification Constraints

Identification constraints must be placed on the parameters to be able to
estimate them from data.
There is not a unique set of parameters.
There are I terms in the set $\{\lambda_i^X\}$, but 1 is redundant.
There are J terms in the set $\{\lambda_j^Y\}$, but 1 is redundant.
Possible (typical) constraints:
1. Fix 1 value in a set equal to a constant, usually 0.
SAS/GENMOD sets the last one equal to 0, e.g.,
$\lambda_I^X = 0$ . . . dummy coding (i.e., X = 0, 1). R glm sets the first
equal to 0, e.g., $\lambda_1^X = 0$.
2. Fix the sum of the terms equal to a constant, usually 0.
SAS/CATMOD uses zero sum or “ANOVA” type constraints,
e.g., $\sum_{i=1}^{I} \lambda_i^X = 0$ . . . “effect” coding (i.e., X = 1, 0, −1).
What’s Unique about the Parameters?

◮ The differences between them are unique:
$$(\hat{\lambda}_1^Y - \hat{\lambda}_2^Y) = \text{unique value}$$
$$(\hat{\lambda}_1^X - \hat{\lambda}_2^X) = \text{unique value}$$
◮ Since differences are unique,
log(odds) = log(θ) = unique value
and odds = unique value

◮ The goodness-of-fit statistics are unique.
◮ The fitted values are unique, which takes more space to
show. . .
Fitted Values are Unique

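e.g., for 2 × 2
$$\log(\hat{\mu}_{ij}) = \lambda + \lambda^X X_i + \lambda^Y Y_j$$
For Dummy Coding (i.e., $X_1 = 0$, $X_2 = 1$ and $Y_1 = 0$, $Y_2 = 1$),
$$\log(\hat{\mu}_{ij}) = \begin{cases}
\lambda & \text{for } (1, 1) \\
\lambda + \lambda^X & \text{for } (2, 1) \\
\lambda + \lambda^Y & \text{for } (1, 2) \\
\lambda + \lambda^X + \lambda^Y & \text{for } (2, 2)
\end{cases}$$
For Effect Coding (i.e., $X_1 = -1$, $X_2 = 1$ and $Y_1 = -1$, $Y_2 = 1$),
$$\log(\hat{\mu}_{ij}) = \begin{cases}
\lambda^* - \lambda^{*X} - \lambda^{*Y} & \text{for } (1, 1) \\
\lambda^* + \lambda^{*X} - \lambda^{*Y} & \text{for } (2, 1) \\
\lambda^* - \lambda^{*X} + \lambda^{*Y} & \text{for } (1, 2) \\
\lambda^* + \lambda^{*X} + \lambda^{*Y} & \text{for } (2, 2)
\end{cases}$$
What’s the correspondence?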
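One way to explore the correspondence is to equate the two sets of cell expressions, which suggests $\lambda^{*X} = \lambda^X/2$, $\lambda^{*Y} = \lambda^Y/2$, and $\lambda^* = \lambda + \lambda^X/2 + \lambda^Y/2$. The Python sketch below checks this numerically for a 2 × 2 table; the parameter values are hypothetical and chosen only for illustration.

```python
def dummy_to_effect(lam, lam_x, lam_y):
    """Map dummy-coded (0/1) parameters to effect-coded (-1/+1) parameters."""
    return lam + lam_x / 2 + lam_y / 2, lam_x / 2, lam_y / 2


def log_mu_dummy(lam, lam_x, lam_y, i, j):
    """log(mu_ij) with X, Y coded 0 for level 1 and 1 for level 2."""
    x = 0 if i == 1 else 1
    y = 0 if j == 1 else 1
    return lam + lam_x * x + lam_y * y


def log_mu_effect(lam_s, lam_xs, lam_ys, i, j):
    """log(mu_ij) with X, Y coded -1 for level 1 and +1 for level 2."""
    x = -1 if i == 1 else 1
    y = -1 if j == 1 else 1
    return lam_s + lam_xs * x + lam_ys * y


# Hypothetical dummy-coded estimates (not taken from any data set).
dummy = (3.0, 0.4, -1.2)
effect = dummy_to_effect(*dummy)

cells = [(1, 1), (2, 1), (1, 2), (2, 2)]
max_gap = max(abs(log_mu_dummy(*dummy, i, j) - log_mu_effect(*effect, i, j))
              for i, j in cells)

print(effect, max_gap)
```

The two codings give identical fitted log counts in all four cells, which is the sense in which the fitted values are unique even though the parameters are not.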

Saturated Log-linear Model for 2–way Tables

If rows and columns are dependent, then
$$\log(\mu_{ij}) = \lambda + \lambda_i^X + \lambda_j^Y + \lambda_{ij}^{XY}$$
◮ $\lambda$, $\lambda_i^X$, and $\lambda_j^Y$ are the overall and marginal effect terms (as
defined before).
◮ The $\lambda_{ij}^{XY}$’s
  ◮ Represent the association between X and Y.
  ◮ Reflect the departure or deviations from independence.
  ◮ Ensure that $\hat{\mu}_{ij} = n_{ij}$.
◮ Fits the data perfectly; the fitted values are exactly equal to
the observed values.
◮ Has as many unique parameters as there are cells in the table
(i.e., N = IJ), so df = 0.
◮ Called the “Saturated Model”.
◮ Is the most complex model possible for a 2–way table.
◮ Has independence as a special case (i.e., the model with
$\lambda_{ij}^{XY} = 0$ for all i and j).
Parameters and Odds Ratios

There is a functional relationship between the model parameters
and odds ratios, which is how we are defining and measuring
interactions.
$$\begin{aligned}
\log(\theta_{ii',jj'}) &= \log(\mu_{ij}\mu_{i'j'}/\mu_{i'j}\mu_{ij'}) \\
&= \log(\mu_{ij}) + \log(\mu_{i'j'}) - \log(\mu_{i'j}) - \log(\mu_{ij'}) \\
&= (\lambda + \lambda_i^X + \lambda_j^Y + \lambda_{ij}^{XY}) + (\lambda + \lambda_{i'}^X + \lambda_{j'}^Y + \lambda_{i'j'}^{XY}) \\
&\quad - (\lambda + \lambda_{i'}^X + \lambda_j^Y + \lambda_{i'j}^{XY}) - (\lambda + \lambda_i^X + \lambda_{j'}^Y + \lambda_{ij'}^{XY}) \\
&= \lambda_{ij}^{XY} + \lambda_{i'j'}^{XY} - \lambda_{i'j}^{XY} - \lambda_{ij'}^{XY}
\end{aligned}$$
The odds ratio $\theta$ measures the strength of the association and
depends only on the interaction terms $\{\lambda_{ij}^{XY}\}$.
How many numbers do we need to completely characterize the
association in an I × J table?
Parameters Needed to Describe Association

$(I - 1)(J - 1)$ = the number of unique $\lambda_{ij}^{XY}$.
Identification constraints:
◮ Fix 1 value equal to a constant, e.g.,
$$\lambda_{i1}^{XY} = \lambda_{1j}^{XY} = 0$$
◮ Fix the sum equal to a constant, i.e.,
$$\sum_i \lambda_{ij}^{XY} = \sum_j \lambda_{ij}^{XY} = 0$$
Count of unique parameters:
                      Number of    Number of       Number
Terms                 Terms        Constraints     Unique
$\lambda$                 1            0               1
$\{\lambda_i^X\}$         I            1               I − 1
$\{\lambda_j^Y\}$         J            1               J − 1
$\{\lambda_{ij}^{XY}\}$   IJ           I + J − 1       (I − 1)(J − 1)
Total                                                  IJ = N cells of table
Parameters Needed to Describe Association
We generally hope to find models that are simpler than the data
itself (simpler than the saturated model). Simpler models
“smooth” the sample data and provide more parsimonious
descriptions.
When we have 3 or more variables, we can include 2-way

interactions and the model will not be saturated.
Hierarchical Models
We’ll restrict attention to hierarchical models.
◮ Hierarchical models include all lower-order terms that
comprise the higher-order terms in the model.
◮ Is this a hierarchical model?
$$\log(\mu_{ij}) = \lambda + \lambda_i^X + \lambda_{ij}^{XY}$$
◮ Restrict attention to hierarchical models because
◮ We want interaction terms to represent just the association
(dependency).
◮ Without lower order terms, the statistical significance and
(substantive) interpretation of interaction terms would depend
on how variables were coded.
◮ With hierarchical models, coding doesn’t matter.
Hierarchical∗∗ Models (continued)
If there is an interaction in the data, we do not look at the

lower-order terms, but interpret the higher-order (interaction)
terms. It can be misleading to look at the lower-order terms,
because the values will depend on the coding scheme.
Log-linear Models for 3–way Tables

Example: We can add a third variable to our GSS presidential election
data — gender.
                              Choice for President
Gender     Political View    Bush    Clinton    Perot    Total
Males      Liberal             26       121       24      171
           Moderate            82       128       52      262
Females    Liberal             44       221       32      297
           Moderate           113       204       49      366
The saturated log-linear model for this table is
$$\log(\mu_{ijk}) = \lambda + \lambda_i^P + \lambda_j^C + \lambda_k^G + \lambda_{ij}^{PC} + \lambda_{ik}^{PG} + \lambda_{jk}^{CG} + \lambda_{ijk}^{PCG}$$
where G = gender, P = political view, C = choice.
Saturated Model for 3–way Table
More generally, the most complex log-linear model (the saturated
model) for a 3–way table is
$$\log(\mu_{ijk}) = \lambda + \lambda_i^X + \lambda_j^Y + \lambda_k^Z + \lambda_{ij}^{XY} + \lambda_{ik}^{XZ} + \lambda_{jk}^{YZ} + \lambda_{ijk}^{XYZ}$$
Simpler models set higher order interaction terms equal to 0.
Overview of hierarchy of models for 3–way contingency tables

(from complex to simple)
Overview of Models for 3–way Table

3–way Association:           (XYZ)
Homogeneous Association:     (XY, XZ, YZ)
Conditional Independence:    (XY, XZ)   (XY, YZ)   (XZ, YZ)
Joint Independence:          (XY, Z)    (XZ, Y)    (X, YZ)
Complete Independence:       (X, Y, Z)
Blue Collar worker data
(Andersen, 1985)
                      Bad Management               Good Management
                      Worker’s satisfaction        Worker’s satisfaction
Supervisor’s
satisfaction          Low    High    Total         Low    High    Total
Low                   103      87      190          59     109      168
High                   32      42       74          78     205      283
Total                 135     129      264         137     314      451
$\hat{\theta}_{bad} = 1.55$ and 95% CI for $\theta_{bad}$: (.90, 2.67)
$\hat{\theta}_{good} = 1.42$ and 95% CI for $\theta_{good}$: (.94, 2.14)
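The sample odds ratios and their large-sample (Wald) intervals can be recomputed from the two partial tables. The sketch below assumes the usual standard error for a log odds ratio, $\sqrt{1/n_{11} + 1/n_{12} + 1/n_{21} + 1/n_{22}}$, and $z = 1.96$; the function name is illustrative.

```python
import math


def odds_ratio_ci(n11, n12, n21, n22, z=1.96):
    """Sample odds ratio and Wald confidence interval for a 2 x 2 table."""
    theta = (n11 * n22) / (n12 * n21)
    se_log = math.sqrt(1 / n11 + 1 / n12 + 1 / n21 + 1 / n22)
    lower = math.exp(math.log(theta) - z * se_log)
    upper = math.exp(math.log(theta) + z * se_log)
    return theta, lower, upper


# Rows: supervisor low, high; columns: worker low, high.
bad = odds_ratio_ci(103, 87, 32, 42)
good = odds_ratio_ci(59, 109, 78, 205)

print([round(v, 2) for v in bad])
print([round(v, 2) for v in good])
```

Both intervals contain 1, which matches the non-significant tests within each partial table.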
Blue Collar worker data Analyses

◮ CMH statistic for testing whether worker and supervisor satisfaction are
conditionally independent given management quality = 5.42,
p–value = .02.
◮ The combined $G^2$’s from the separate partial tables
= 2.57 + 2.82 = 5.39, df = 1 + 1 = 2, p–value = .07.
                  Bad Management         Good Management
Statistic   df    Value    p–value       Value    p–value
$X^2$        1     2.56      .11          2.85      .09
$G^2$        1     2.57      .11          2.82      .09
◮ Mantel-Haenszel estimator of common odds ratio (for
Worker–Supervisor or “W-S” odds ratio) = 1.47.
◮ Breslow-Day statistic = .065, df = 1, p–value = .80.
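The Mantel-Haenszel estimate of the common odds ratio is a ratio of two weighted sums over the partial tables, $\sum_k (n_{11k}n_{22k}/n_{++k}) \,/\, \sum_k (n_{12k}n_{21k}/n_{++k})$. A short Python sketch of that formula, with illustrative names:

```python
def mantel_haenszel_or(tables):
    """Mantel-Haenszel common odds ratio across a list of 2 x 2 tables."""
    numerator = 0.0
    denominator = 0.0
    for (n11, n12), (n21, n22) in tables:
        total = n11 + n12 + n21 + n22
        numerator += n11 * n22 / total
        denominator += n12 * n21 / total
    return numerator / denominator


# Supervisor (rows) by worker (columns) satisfaction at each management level.
partial_tables = [
    ((103, 87), (32, 42)),    # bad management
    ((59, 109), (78, 205)),   # good management
]

mh_estimate = mantel_haenszel_or(partial_tables)
print(round(mh_estimate, 2))
```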
Complete Independence
There are no interactions; everything is independent of everything
else.
$$\log(\mu_{ijk}) = \lambda + \lambda_i^X + \lambda_j^Y + \lambda_k^Z$$
Depending on author, this model is denoted by

◮ (X , Y , Z ) (e.g., Agresti) or [X , Y , Z ]
◮ (X )(Y )(Z ) or [X ] [Y ] [Z ] (e.g., Fienberg).
Degrees of freedom are computed in the usual way:
df = # cells − # unique parameters
= # cells − (# parameters − # constraints)
= IJK − 1 − (I − 1) − (J − 1) − (K − 1)
Complete Independence (continued)
$$\log(\mu_{ijk}) = \lambda + \lambda_i^X + \lambda_j^Y + \lambda_k^Z$$
In terms of associations, all partial odds ratios equal 1,
$$\theta_{XY(k)} = \theta_{ii',jj',(k)} = \frac{\mu_{ijk}\mu_{i'j'k}}{\mu_{i'jk}\mu_{ij'k}} = 1$$
$$\theta_{YZ(i)} = \theta_{(i),jj',kk'} = \frac{\mu_{ijk}\mu_{ij'k'}}{\mu_{ij'k}\mu_{ijk'}} = 1$$
$$\theta_{XZ(j)} = \theta_{ii',(j),kk'} = \frac{\mu_{ijk}\mu_{i'jk'}}{\mu_{i'jk}\mu_{ijk'}} = 1$$
Joint Independence
Two variables are “jointly” independent of the third variable. For
example, X and Y are jointly independent of Z,
$$\log(\mu_{ijk}) = \lambda + \lambda_i^X + \lambda_j^Y + \lambda_k^Z + \lambda_{ij}^{XY}$$
This model may be denoted as

◮ (XY , Z ) or [XY , Z ].
◮ (XY )(Z ) or [XY ] [Z ].
Degrees of Freedom:
df = IJK − 1 − (I − 1) − (J − 1) − (K − 1) − (I − 1)(J − 1)
= (IJ − 1)(K − 1)
Joint Independence Continued
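$$\log(\mu_{ijk}) = \lambda + \lambda_i^X + \lambda_j^Y + \lambda_k^Z + \lambda_{ij}^{XY}$$
The partial or conditional odds ratios for XZ given Y and the odds
ratios for YZ given X equal 1.
$$\theta_{XZ(j)} = \theta_{ii',(j),kk'} = \frac{\mu_{ijk}\mu_{i'jk'}}{\mu_{i'jk}\mu_{ijk'}} = 1$$
$$\theta_{YZ(i)} = \theta_{(i),jj',kk'} = \frac{\mu_{ijk}\mu_{ij'k'}}{\mu_{ij'k}\mu_{ijk'}} = 1$$
And what does $\theta_{XY(k)}$ equal?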
Joint Independence (continued)
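$$\begin{aligned}
\log(\theta_{XY(k)}) &= \log(\theta_{ii',jj'(k)}) \\
&= \log(\mu_{ijk}\mu_{i'j'k}/\mu_{i'jk}\mu_{ij'k}) \\
&= (\lambda + \lambda_i^X + \lambda_j^Y + \lambda_k^Z + \lambda_{ij}^{XY}) \\
&\quad + (\lambda + \lambda_{i'}^X + \lambda_{j'}^Y + \lambda_k^Z + \lambda_{i'j'}^{XY}) \\
&\quad - (\lambda + \lambda_{i'}^X + \lambda_j^Y + \lambda_k^Z + \lambda_{i'j}^{XY}) \\
&\quad - (\lambda + \lambda_i^X + \lambda_{j'}^Y + \lambda_k^Z + \lambda_{ij'}^{XY}) \\
&= \lambda_{ij}^{XY} + \lambda_{i'j'}^{XY} - \lambda_{i'j}^{XY} - \lambda_{ij'}^{XY}
\end{aligned}$$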
Joint Independence (continued)
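$$\theta_{XY(k)} = \theta_{ii',jj'(k)} = \exp(\lambda_{ij}^{XY} + \lambda_{i'j'}^{XY} - \lambda_{i'j}^{XY} - \lambda_{ij'}^{XY}) = \frac{e^{\lambda_{ij}^{XY}}\, e^{\lambda_{i'j'}^{XY}}}{e^{\lambda_{i'j}^{XY}}\, e^{\lambda_{ij'}^{XY}}}$$
◮ The interaction terms represent the association between X
and Y.
◮ Since $\theta_{XY(k)}$ does not depend on k (the level of variable Z), we
know that $\theta_{XY(1)} = \theta_{XY(2)} = \ldots = \theta_{XY(K)}$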
Conditional Independence
Two variables are conditionally independent given the third
variable. e.g., the model in which Y and Z are conditionally
independent given X is
$$\log(\mu_{ijk}) = \lambda + \lambda_i^X + \lambda_j^Y + \lambda_k^Z + \lambda_{ij}^{XY} + \lambda_{ik}^{XZ}$$
This model may be denoted by

◮ (XY , XZ ) or [XY , XZ ].
◮ (XY )(XZ ) or [XY ] [XZ ].
df = IJK − 1 − (I − 1) − (J − 1) − (K − 1)
−(I − 1)(J − 1) − (I − 1)(K − 1)
= I (J − 1)(K − 1)
Conditional Independence (continued)
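$$\log(\mu_{ijk}) = \lambda + \lambda_i^X + \lambda_j^Y + \lambda_k^Z + \lambda_{ij}^{XY} + \lambda_{ik}^{XZ}$$
The partial odds ratios of YZ given X equal 1:
$$\begin{aligned}
\log(\theta_{YZ(i)}) &= \log(\theta_{(i),jj',kk'}) \\
&= \log(\mu_{ijk}\mu_{ij'k'}/\mu_{ij'k}\mu_{ijk'}) \\
&= \log(\mu_{ijk}) + \log(\mu_{ij'k'}) - \log(\mu_{ij'k}) - \log(\mu_{ijk'}) = 0
\end{aligned}$$
$$\theta_{YZ(i)} = \theta_{(i),jj',kk'} = \exp(0) = e^0 = 1$$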
Conditional Independence: θXY (k) & θXZ (j)
$$\theta_{XY(k)} = \theta_{ii',jj'(k)} = \exp(\lambda_{ij}^{XY} + \lambda_{i'j'}^{XY} - \lambda_{i'j}^{XY} - \lambda_{ij'}^{XY})$$
$$\theta_{XZ(j)} = \theta_{ii',(j),kk'} = \exp(\lambda_{ik}^{XZ} + \lambda_{i'k'}^{XZ} - \lambda_{i'k}^{XZ} - \lambda_{ik'}^{XZ})$$
◮ The partial odds ratios are completely characterized by the
corresponding 2–way interaction terms from the model (and
no other parameters).
◮ Neither of these depends on the level of the third variable.
◮ Since the partial odds ratios are equal across levels of the
third variable,
$$\theta_{XY(1)} = \theta_{XY(2)} = \ldots = \theta_{XY(K)}$$
and
$$\theta_{XZ(1)} = \theta_{XZ(2)} = \ldots = \theta_{XZ(J)}$$
Homogeneous Association
or the “no 3–factor interaction model”.
This is a model of association; it is not an “independence” model,
but it is also not the most complex model possible.
$$\log(\mu_{ijk}) = \lambda + \lambda_i^X + \lambda_j^Y + \lambda_k^Z + \lambda_{ij}^{XY} + \lambda_{ik}^{XZ} + \lambda_{jk}^{YZ}$$
This model may be denoted by

◮ (XY , XZ , YZ ) or [XY , XZ , YZ ].
◮ (XY )(XZ )(YZ ) or [XY ] [XZ ] [YZ ].
df = IJK − 1 − (I − 1) − (J − 1) − (K − 1) − (I − 1)(J − 1)
−(I − 1)(K − 1) − (J − 1)(K − 1) = (I − 1)(J − 1)(K − 1)
df = the number of odds ratios to completely represent a 3–way

association?
None of the partial odds ratios (necessarily) equal 1.
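The degrees-of-freedom formulas for the hierarchy of 3–way models all follow from "number of cells minus number of unique parameters". The sketch below encodes that rule in Python; the function name and the convention of naming two-way terms by strings such as "XY" are assumptions made for illustration.

```python
def loglinear_df(I, J, K, two_way=(), three_way=False):
    """Residual df for a hierarchical log-linear model of an I x J x K table.

    two_way lists the two-factor terms in the model, e.g. ("XY", "XZ").
    three_way=True gives the saturated model (all lower-order terms included).
    """
    levels = {"X": I, "Y": J, "Z": K}
    if three_way:
        two_way = ("XY", "XZ", "YZ")
    # Constant plus the three sets of main effects.
    n_params = 1 + (I - 1) + (J - 1) + (K - 1)
    for first, second in two_way:
        n_params += (levels[first] - 1) * (levels[second] - 1)
    if three_way:
        n_params += (I - 1) * (J - 1) * (K - 1)
    return I * J * K - n_params


# A 2 x 2 x 2 table, from complete independence up to the saturated model.
print(loglinear_df(2, 2, 2))                                 # (X, Y, Z)
print(loglinear_df(2, 2, 2, two_way=("XY",)))                # (XY, Z)
print(loglinear_df(2, 2, 2, two_way=("XY", "XZ")))           # (XY, XZ)
print(loglinear_df(2, 2, 2, two_way=("XY", "XZ", "YZ")))     # (XY, XZ, YZ)
print(loglinear_df(2, 2, 2, three_way=True))                 # (XYZ)
```

For general I, J, K the results agree with the closed forms on the slides, e.g. $(IJ-1)(K-1)$ for (XY, Z) and $I(J-1)(K-1)$ for (XY, XZ).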
Homogeneous Association (continued)
The partial odds ratios are a direct function of the model parameters
$$\theta_{XY(k)} = \theta_{ii',jj',(k)} = \exp(\lambda_{ij}^{XY} + \lambda_{i'j'}^{XY} - \lambda_{i'j}^{XY} - \lambda_{ij'}^{XY})$$
$$\theta_{XZ(j)} = \theta_{ii',(j),kk'} = \exp(\lambda_{ik}^{XZ} + \lambda_{i'k'}^{XZ} - \lambda_{i'k}^{XZ} - \lambda_{ik'}^{XZ})$$
$$\theta_{YZ(i)} = \theta_{(i),jj',kk'} = \exp(\lambda_{jk}^{YZ} + \lambda_{j'k'}^{YZ} - \lambda_{j'k}^{YZ} - \lambda_{jk'}^{YZ})$$
Each of the partial odds ratios for 2 variables given levels of the third
variable
◮ depends only on the corresponding 2–way interaction terms.
◮ does not depend on levels of the third variable.
◮ is equal across levels of the third variable.
3–way Association (the saturated model)
This model has a three factor association
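$$\log(\mu_{ijk}) = \lambda + \lambda_i^X + \lambda_j^Y + \lambda_k^Z + \lambda_{ij}^{XY} + \lambda_{ik}^{XZ} + \lambda_{jk}^{YZ} + \lambda_{ijk}^{XYZ}$$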

This model may be denoted by (XYZ ) or [XYZ ].
df = 0
3–way Association (the saturated model)

The partial odds ratios for two variables given levels of the third
variable equal
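$$\begin{aligned}
\log(\theta_{XY(k)}) &= \log(\theta_{ii',jj'(k)}) \\
&= \log(\mu_{ijk}\mu_{i'j'k}/\mu_{i'jk}\mu_{ij'k}) \\
&= (\lambda + \lambda_i^X + \lambda_j^Y + \lambda_k^Z + \lambda_{ij}^{XY} + \lambda_{ik}^{XZ} + \lambda_{jk}^{YZ} + \lambda_{ijk}^{XYZ}) \\
&\quad + (\lambda + \lambda_{i'}^X + \lambda_{j'}^Y + \lambda_k^Z + \lambda_{i'j'}^{XY} + \lambda_{i'k}^{XZ} + \lambda_{j'k}^{YZ} + \lambda_{i'j'k}^{XYZ}) \\
&\quad - (\lambda + \lambda_{i'}^X + \lambda_j^Y + \lambda_k^Z + \lambda_{i'j}^{XY} + \lambda_{i'k}^{XZ} + \lambda_{jk}^{YZ} + \lambda_{i'jk}^{XYZ}) \\
&\quad - (\lambda + \lambda_i^X + \lambda_{j'}^Y + \lambda_k^Z + \lambda_{ij'}^{XY} + \lambda_{ik}^{XZ} + \lambda_{j'k}^{YZ} + \lambda_{ij'k}^{XYZ}) \\
&= (\lambda_{ij}^{XY} + \lambda_{i'j'}^{XY} - \lambda_{i'j}^{XY} - \lambda_{ij'}^{XY}) \\
&\quad + (\lambda_{ijk}^{XYZ} + \lambda_{i'j'k}^{XYZ} - \lambda_{i'jk}^{XYZ} - \lambda_{ij'k}^{XYZ})
\end{aligned}$$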
3–way Association
A measure/definition of 3–way association is the ratio of partial odds
ratios (ratios of ratios of ratios),
$$\Theta_{ii',jj',kk'} = \theta_{XY(k)}/\theta_{XY(k')}$$
which in terms of our model parameters equals
$$\begin{aligned}
\Theta_{ii',jj',kk'} &= \frac{\theta_{ii',jj'(k)}}{\theta_{ii',jj'(k')}} \\
&= \exp(\lambda_{ijk}^{XYZ} + \lambda_{i'j'k}^{XYZ} + \lambda_{i'jk'}^{XYZ} + \lambda_{ij'k'}^{XYZ} \\
&\qquad - \lambda_{i'jk}^{XYZ} - \lambda_{ij'k}^{XYZ} - \lambda_{ijk'}^{XYZ} - \lambda_{i'j'k'}^{XYZ})
\end{aligned}$$
That is, the 3–way association is represented by the 3–way interaction
terms $\{\lambda_{ijk}^{XYZ}\}$.
There are analogous expressions for θXZ (j) and θYZ (i ) .
Summary of Hierarchy of Models

3–way Association:           (XYZ)
Homogeneous Association:     (XY, XZ, YZ)
Conditional Independence:    (XY, XZ)   (XY, YZ)   (XZ, YZ)
Joint Independence:          (XY, Z)    (XZ, Y)    (X, YZ)
Complete Independence:       (X, Y, Z)
Any model that lies below a given model may be a special case of the
more complex model(s).
Example: Blue Collar Workers

Fitted values and observed values from some models
Manage  Super  Worker  $n_{ijk}$   (M, S, W)   (MS, W)   (MS, MW)   (MS, MW, SW)
bad     low    low       103        50.15       71.78      97.16      102.26
bad     low    high       87        82.59      118.22      92.84       87.74
bad     high   low        32        49.59       27.96      37.84       32.74
bad     high   high       42        81.67       46.04      36.16       41.26
good    low    low        59        85.10       63.47      51.03       59.74
good    low    high      109       140.15      104.53     116.97      108.26
good    high   low        78        84.15      105.79      85.97       77.26
good    high   high      205       138.59      174.21     197.03      205.74
((MS, MW, SW) is the model with no MSW term.)
Example: (M, S, W) and partial odds ratio for S and W:
$\hat{\theta}_{SW} = (50.15)(81.67)/[(82.59)(49.59)] = 1.00$
Fitted (partial) Odds Ratios
                       Fitted Partial Odds Ratio
Model               W and S    M and W    M and S
(M, S, W)             1.00       1.00       1.00
(MS, W)               1.00       1.00       4.28
(MS, MW)              1.00       2.40       4.32
(MS, WS, MW)          1.47       2.11       4.04
(MSW)–level 1         1.55       2.19       4.26
(MSW)–level 2         1.42       2.00       3.90
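For the conditional independence model (MS, MW), the fitted values have the closed form $\hat{\mu}_{msw} = n_{ms+}n_{m+w}/n_{m++}$, so the fitted partial odds ratios can be checked without iterative fitting. A Python sketch, with index 0 meaning bad/low and index 1 meaning good/high (an indexing convention chosen here for illustration):

```python
# Observed counts n[m][s][w]: management, supervisor, worker.
n = [
    [[103, 87], [32, 42]],    # bad management: supervisor low, high
    [[59, 109], [78, 205]],   # good management: supervisor low, high
]


def fit_ms_mw(counts):
    """Fitted values n_{ms+} * n_{m+w} / n_{m++} for the model (MS, MW)."""
    fitted = []
    for table in counts:
        m_total = sum(sum(row) for row in table)
        s_totals = [sum(row) for row in table]
        w_totals = [sum(col) for col in zip(*table)]
        fitted.append([[s * w / m_total for w in w_totals] for s in s_totals])
    return fitted


mu = fit_ms_mw(n)

# Partial odds ratios computed from the fitted values.
sw_given_bad = mu[0][0][0] * mu[0][1][1] / (mu[0][0][1] * mu[0][1][0])
mw_given_low_s = mu[0][0][0] * mu[1][0][1] / (mu[0][0][1] * mu[1][0][0])
ms_given_low_w = mu[0][0][0] * mu[1][1][0] / (mu[0][1][0] * mu[1][0][0])

print(round(mu[0][0][0], 2))
print(round(sw_given_bad, 2), round(mw_given_low_s, 2), round(ms_given_low_w, 2))
```

The S–W partial odds ratio is exactly 1 because the model omits the SW term, while the M–W and M–S odds ratios equal the corresponding marginal-by-management odds ratios.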
Inference for Log-linear Models
1. Chi–squared goodness of fit tests.
2. Residuals.
3. Tests about partial associations (e.g., $H_O: \lambda_{ij}^{XY} = 0$ for all
i, j).
4. Confidence intervals for odds ratios.
Chi–squared goodness-of-fit tests

where HO is
◮ a model “holds”.
◮ a model gives a good (accurate) description/representation of the
data.
◮ log(µij ) = some model (i.e., expected frequencies given by loglinear
model).
For “large” samples, we use chi–squared statistics to test this hypothesis;
they compare the observed and estimated expected frequencies.
Likelihood ratio statistic:
$$G^2 = 2\sum_i \sum_j \sum_k n_{ijk} \log\left(\frac{n_{ijk}}{\hat{\mu}_{ijk}}\right)$$
Pearson statistic:
$$X^2 = \sum_i \sum_j \sum_k \frac{(n_{ijk} - \hat{\mu}_{ijk})^2}{\hat{\mu}_{ijk}}$$
Chi–squared goodness-of-fit tests

If $H_O$ is true, then for large samples these statistics are approximately
chi-squared distributed with degrees of freedom
df = (# cells) − (# non-redundant parameters)
   = (# cells) − (# parameters) + (# id constraints)
Blue collar worker data:

Model             df    $G^2$     p–value    $X^2$     p–value
(M, S, W)          4    118.00    < .001     128.09    < .001
(MS, W)            3     35.60    < .001      35.72    < .001
(MW, S)            3     87.79    < .001      85.02    < .001
(M, WS)            3    102.11    < .001      99.09    < .001
(MW, SW)           2     71.90    < .001      70.88    < .001
(MS, MW)           2      5.39      .07        5.41      .07
(MS, WS)           2     19.71    < .001      19.88    < .001
(MW, SW, MS)       1      .065      .80        .069      .80
These are all global tests.
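The global fit statistics for one row of the table, the model (MS, MW), can be reproduced from its closed-form fitted values. This Python sketch uses the fact that the upper tail of a chi-squared distribution with 2 df is exactly $e^{-x/2}$; variable names are illustrative.

```python
import math

# Observed counts in the order (manage, super, worker):
# bad-low-low, bad-low-high, bad-high-low, bad-high-high, then good likewise.
observed = [103, 87, 32, 42, 59, 109, 78, 205]

# Fitted values n_{ms+} * n_{m+w} / n_{m++} for the model (MS, MW).
fitted = [
    190 * 135 / 264, 190 * 129 / 264, 74 * 135 / 264, 74 * 129 / 264,
    168 * 137 / 451, 168 * 314 / 451, 283 * 137 / 451, 283 * 314 / 451,
]

g2 = 2 * sum(o * math.log(o / m) for o, m in zip(observed, fitted))
x2 = sum((o - m) ** 2 / m for o, m in zip(observed, fitted))

# Upper-tail probability for a chi-squared variable with df = 2.
p_value_g2 = math.exp(-g2 / 2)

print(round(g2, 2), round(x2, 2), round(p_value_g2, 2))
```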
Residuals
Local (mis)fit. A good model has small residuals.
We can use Pearson residuals
$$e_{ijk} = \frac{\text{observed} - \text{expected}}{\sqrt{\widehat{\text{Var}}(\text{expected})}} = \frac{n_{ijk} - \hat{\mu}_{ijk}}{\sqrt{\hat{\mu}_{ijk}}}$$
or
$$\text{adjusted residual} = \frac{e_{ijk}}{\sqrt{1 - h_{ijk}}}$$
where $h_{ijk}$ equals the leverage of cell (i, j, k).
If the model holds, then adjusted residuals ≈ N(0, 1)
Adjusted residuals suggest a lack of fit of the model
◮ When there are few cells (small N) and |adjusted residuals| > 2.
◮ When there are lots and lots of cells (larger N) and |adjusted
residuals| > 3.
Residuals & Blue Collar Data

                                     (MS, MW)                 (MS, MW, WS)
Manage  Super  Worker  $n_{ijk}$   $\hat{\mu}_{ijk}$  adj res    $\hat{\mu}_{ijk}$  adj res
bad     low    low       103         97.16     1.60        102.26      .25
bad     low    high       87         92.84    -1.60         87.74     -.25
bad     high   low        32         37.84    -1.60         32.74     -.25
bad     high   high       42         36.16     1.60         41.26      .25
good    low    low        59         51.03     1.69         59.74     -.25
good    low    high      109        116.97    -1.69        108.26      .25
good    high   low        78         85.97    -1.69         77.26      .25
good    high   high      205        197.03     1.69        205.74     -.25
◮ df for the model (MS, MW) equals 2 and therefore there are only 2
non-redundant residuals.
◮ df for the model (MS, MW, WS) equals 1 and therefore there is
only 1 non-redundant residual.
Hypothesis about partial association

The following are all equivalent statements of the null hypothesis
considered here:
◮ There is no partial association between two variables given the
level of the third variable.
e.g., There is no partial association between supervisor’s job
satisfaction and worker’s satisfaction given management
quality.
◮ The conditional or partial odds ratios equal 1.00.
e.g., $\theta_{SW(i)} = 1.00$.
◮ The two-way interaction terms equal zero.
e.g., $\lambda_{jk}^{SW} = 0$.
Tests about partial association

To test partial association, we use the likelihood ratio statistic
$-2(L_O - L_1)$, where $L_O$ and $L_1$ are the maximized log-likelihoods, to test
the difference between a restricted and a more complex model.
e.g., The restricted model $M_O$ is (MS, MW), or
$$\log(\mu_{ijk}) = \lambda + \lambda_i^M + \lambda_j^S + \lambda_k^W + \lambda_{ij}^{MS} + \lambda_{ik}^{MW}$$
and the more complex model $M_1$ is (MS, MW, SW),
$$\log(\mu_{ijk}) = \lambda + \lambda_i^M + \lambda_j^S + \lambda_k^W + \lambda_{ij}^{MS} + \lambda_{ik}^{MW} + \lambda_{jk}^{SW}$$
The likelihood ratio statistic $-2(L_O - L_1)$ equals the difference between
the deviances of the 2 models, or equivalently the difference in $G^2$ for
testing model fit.
e.g.,
$$G^2[(MS, MW)\,|\,(MS, MW, WS)] = G^2(MS, MW) - G^2(MS, MW, WS)$$
and df = df(MS, MW) − df(MS, MW, WS).
Example Testing partial association
Model             df    $G^2$     p         $H_O$                        ∆df    ∆$G^2$     p
(MW, SW, MS)       1     .065     .80       -                             -      -         -
(MW, SW)           2    71.90    < .001     $\lambda_{ij}^{MS} = 0$       1     71.835    < .01
(MS, MW)           2     5.39     .07       $\lambda_{jk}^{SW} = 0$       1      5.325     .02
(MS, WS)           2    19.71    < .001     $\lambda_{ik}^{MW} = 0$       1     19.645    < .01
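The conditional tests in this table are differences of $G^2$ values between nested models. The Python sketch below recomputes them from the reported fit statistics and obtains the 1-df p-value from the identity $P(\chi^2_1 > x) = \text{erfc}(\sqrt{x/2})$; the dictionary layout and function names are illustrative.

```python
import math


def chi2_upper_tail_df1(x):
    """Upper-tail probability of a chi-squared variable with 1 df."""
    return math.erfc(math.sqrt(x / 2.0))


FULL = "(MS, MW, SW)"

# Model: (df, G^2), as reported for the blue collar worker data.
fit = {
    FULL: (1, 0.065),
    "(MW, SW)": (2, 71.90),
    "(MS, MW)": (2, 5.39),
    "(MS, WS)": (2, 19.71),
}


def lr_test(restricted, full=FULL):
    """Return (delta df, delta G^2, p-value) for restricted vs. full model."""
    df_r, g2_r = fit[restricted]
    df_f, g2_f = fit[full]
    delta_df = df_r - df_f
    if delta_df != 1:
        raise ValueError("this sketch only handles 1-df comparisons")
    delta_g2 = g2_r - g2_f
    return delta_df, delta_g2, chi2_upper_tail_df1(delta_g2)


for model in ("(MW, SW)", "(MS, MW)", "(MS, WS)"):
    delta_df, delta_g2, p = lr_test(model)
    print(model, delta_df, round(delta_g2, 3), round(p, 4))
```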
Sample size and hypothesis tests:
◮ With small samples, “reality may be much more complex than
indicated by the simplest model that passed the goodness-of-fit test.”
◮ With large samples, “. . . statistically significant effects can be weak
and unimportant.”
Summary (hierarchy of models, from complex to simple)
Model             df    $G^2$      Compared with     ∆df    ∆$G^2$
(MSW)              0      0
(MS, MW, SW)       1      .065     (MSW)              1       .065
(MS, MW)           2     5.39      (MS, MW, SW)       1      5.33
(MS, SW)           2    19.71      (MS, MW, SW)       1     19.645
(MW, SW)           2    71.90      (MS, MW, SW)       1     71.84
(MS, W)            3    35.6
(MW, S)            3    87.79
(M, SW)            3   102.11
(M, S, W)          4   118.0
Confidence Intervals for Odds Ratios

. . . a bit more informative than hypothesis testing alone.
We know that odds ratios are direct functions of log-linear model
parameters.
e.g., Suppose that we are interested in an estimate of the SW partial
odds ratio. Using the estimated parameters of the homogeneous
association model (MS, MW, SW), estimated using SAS/GENMOD:
Parameter                            Estimate     ASE
$\hat{\lambda}^{SW}_{low,low}$         .3827      .1667
$\hat{\lambda}^{SW}_{low,hi}$          .0000      .0000
$\hat{\lambda}^{SW}_{hi,low}$          .0000      .0000
$\hat{\lambda}^{SW}_{hi,hi}$           .0000      .0000
R will yield different parameter estimates but same estimate of odds

ratio. What would R glm estimates be?
Confidence Intervals for Odds Ratios

$$\begin{aligned}
\log(\hat{\theta}_{SW(i)}) &= \hat{\lambda}^{SW}_{low,low} + \hat{\lambda}^{SW}_{hi,hi} - \hat{\lambda}^{SW}_{low,hi} - \hat{\lambda}^{SW}_{hi,low} \\
&= .3827 + 0.00 - 0.00 - 0.00 \\
&= .3827
\end{aligned}$$
and $\hat{\theta}_{SW(i)} = e^{.3827} = 1.4662$.

A $(1 - \alpha) \times 100\%$ confidence interval for $\log(\theta_{SW(i)})$ is
$$\log(\hat{\theta}_{SW(i)}) \pm z_{\alpha/2}(ASE)$$
e.g., A 95% confidence interval for the log of the supervisor by
worker satisfaction odds ratio is
$$.3827 \pm 1.96(.1667) \longrightarrow (.05596, .70943)$$
To get the confidence interval for the odds ratio, take the anti-log of the
interval for $\log(\theta_{SW(i)})$. So the 95% confidence interval for the
(partial) odds ratio $\theta_{SW(i)}$ is
$$(e^{.05596}, e^{.70943}) \longrightarrow (1.058, 2.033)$$
Identification constraints don’t matter for the end results.
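The interval arithmetic above is easy to script. This Python sketch takes the estimated log odds ratio and its ASE as inputs (here the values .3827 and .1667 used in the computation above) and returns the interval on both scales; the function name is illustrative.

```python
import math


def wald_ci_odds_ratio(log_or, ase, z=1.96):
    """Odds ratio, CI for the log odds ratio, and CI for the odds ratio."""
    lower = log_or - z * ase
    upper = log_or + z * ase
    return math.exp(log_or), (lower, upper), (math.exp(lower), math.exp(upper))


theta_hat, log_ci, ci = wald_ci_odds_ratio(0.3827, 0.1667)

print(round(theta_hat, 4))
print(tuple(round(v, 5) for v in log_ci))
print(tuple(round(v, 3) for v in ci))
```

Because the interval excludes 1, it agrees with the 1-df likelihood ratio test that rejects $\lambda_{jk}^{SW} = 0$ at the .05 level.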
Statistical versus Practical Significance and Large Samples
For the blue collar worker data, two models could each be a good model
(representation) of the data.
                                 In favor of                       In favor of
Criterion                        (MS, MW)                          (MS, MW, SW)
Model goodness of fit            $G^2$ = 5.39, df = 2, p = .07     $G^2$ = .065, df = 1, p = .80
Largest adjusted residual        1.69                              .25
Likelihood ratio test
of $\lambda_{jk}^{SW} = 0$       na                                $G^2$ = 5.325, df = 1, p = .02
Complexity                       simpler                           more complex
Question: Do we really need the SW partial association? It is a weak effect;
is it significant only because of the large sample size (n = 715) relative to
the table size (N = 2 × 2 × 2 = 8)?
Deciding on a Model: Subjective Judgment
When there is more than one reasonable model, additional things

to consider in choosing a single “best” model are
1. Substantive importance and considerations.

2. Closeness between observed and fitted odds ratios.
3. Dissimilarity index.
4. Correlations between observed and fitted values.
5. Information Criteria (AIC, BIC & others).
6. Analysis of association.
Let’s see if these help in making the decision between (MS, MW )
and (MS, MW , SW ).
Similarity between the Fitted Odds Ratios

If two models have nearly the same values for the odds ratio, then
choose the simpler one.
What constitutes “nearly the same values” is a subjective decision.
Fitted partial odds ratios based on the two best models and the observed
partial odds ratios for the worker satisfaction data:
                                     Fitted Odds Ratio
Model                            W–S      M–W      M–S
(MS, MW)                         1.00     2.40     4.32
(MS, WS, MW)                     1.47     2.11     4.04
Observed or (MSW)–level 1        1.55     2.19     4.26
            (MSW)–level 2        1.42     2.00     3.90
They seem similar. Whether they are “close” enough depends
on the purpose or uses you’ll make of the results.
Dissimilarity Index
For a table of any dimension (e.g., (I × J), (I × J × K), etc.) with
cell counts equal to $n_i = np_i$ and
fitted counts equal to $\hat{\mu}_i = n\hat{\pi}_i$,
the “Dissimilarity Index” is a summary statistic of how close the
fitted values of a model are to the data. It equals
$$D = \frac{\sum_i |n_i - \hat{\mu}_i|}{2n} = \frac{\sum_i |p_i - \hat{\pi}_i|}{2}$$
Properties of D:
◮ 0 ≤ D ≤ 1.
◮ D = the proportion of sample cases that need to move to a
different cell to have the model fit perfectly.
Properties of the Dissimilarity Index

◮ Small D means that there is little difference between fitted values
and observed counts.
◮ Larger D means that there is a big difference between fitted values
and observed counts.
◮ D is an estimate of the corresponding population value, ∆, which measures
the lack-of-fit of the model in the population.
When the model fits perfectly in the population,
◮ ∆=0
◮ D overestimates the lack-of-fit (especially for small samples).
◮ For large samples when the model does not fit perfectly,
◮ G 2 and X 2 will be large.
◮ D reveals when the lack-of-fit is important in a practical sense.
◮ Rule-of-thumb: D < .03 indicates non-important lack-of-fit.
Example of the Dissimilarity Index

Blue collar example: For the model (MW, MS),
$$D = \frac{55.2306}{2(715)} = .039$$
We would need to move 3.9% of the observations to
achieve a perfect fit of the model (MW, MS) to the observed (sample)
data.
For the model (MW, MS, SW),
$$D = \frac{5.8888}{2(715)} = .004$$
We would need to move .4% of the observations to achieve a
perfect fit of the model (MW, MS, SW) to the observed data.
Which one? Possibly the model of conditional independence.
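The dissimilarity index is a one-line computation once fitted values are available. In this Python sketch the fitted values for (MW, MS) come from the closed form $n_{ms+}n_{m+w}/n_{m++}$, and those for (MW, MS, SW) are the rounded values from the fitted-value table in these notes, so the second result is only approximate.

```python
def dissimilarity(observed, fitted):
    """Dissimilarity index D = sum |n_i - mu_i| / (2n)."""
    n = sum(observed)
    return sum(abs(o - m) for o, m in zip(observed, fitted)) / (2 * n)


# Cell order: bad-low-low, bad-low-high, bad-high-low, bad-high-high,
# then the same four (supervisor, worker) cells for good management.
observed = [103, 87, 32, 42, 59, 109, 78, 205]

# Closed-form fitted values for the conditional independence model (MW, MS).
fitted_cond_indep = [
    190 * 135 / 264, 190 * 129 / 264, 74 * 135 / 264, 74 * 129 / 264,
    168 * 137 / 451, 168 * 314 / 451, 283 * 137 / 451, 283 * 314 / 451,
]

# Rounded fitted values for homogeneous association, copied from the notes.
fitted_homogeneous = [102.26, 87.74, 32.74, 41.26, 59.74, 108.26, 77.26, 205.74]

d_cond_indep = dissimilarity(observed, fitted_cond_indep)
d_homogeneous = dissimilarity(observed, fitted_homogeneous)

print(round(d_cond_indep, 3), round(d_homogeneous, 3))
```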
Correlations between Counts and Fitted Counts
A large value indicates that the observed and fitted are “close”.
Worker satisfaction example:
For the model of conditional independence (MW , MS),
r = .9906
and for the model of homogeneous association
r = .9999
Information Criteria
◮ Indices (statistics) that weigh goodness-of-fit of model to

data, complexity of the model, and in some cases sample size.
◮ Good way to choose among reasonable models.
◮ Does not require the models be nested.
◮ Akaike Information Criterion (AIC):
AIC = −2 log(L) + 2(number of parameters)
◮ Bayesian Information Criteria (BIC):
Information Criteria References

References:
◮ Raftery, A.E. (1985). A note on Bayes factors for log-linear
contingency table models with vague prior information.
Journal of the Royal Statistical Society, Series B.
◮ Raftery, A. E. (1986). Choosing models for
cross-classifications. American Sociological Review, 51,
145–146.
◮ Spiegelhalter, D.J. and Smith, A.F.M. (1982). Bayes Factors
for linear and log-linear models with vague prior information.
Journal of the Royal Statistical Society, Series B, 44, 377–387.
◮ 1998 or 1999 special issue of Sociological Methodology &
Research on the BIC statistic.
The Bayesian Approach to Model Selection

Another way of making the trade-off between a simple parsimonious
model (practical significance) and a more complex and closer to reality
model (statistical significance), besides using just G 2 and df .
◮ Suppose you are considering a model, say Mo , and you are

comparing it to the saturated model, M1 .
◮ Which model gives a better description of the main features of the
reality as reflected in the data?
◮ More precisely, which of Mo and M1 is more likely to be the “true”
model?
◮ Answer: posterior odds
\[ B = \frac{P(M_o \mid X)}{P(M_1 \mid X)} \]
The BIC statistic

Skipping many details....for large samples:
\[ \text{BIC} = -2\log B = G^2 - (df)\log N \]
where N = total number of observations.

◮ If BIC is negative, accept Mo ; it’s preferable to the saturated
model.
◮ When comparing a set of models, choose the one with the
smallest BIC value. (The models do not have to be nested).
This procedure provides you with a consistent model in the
sense that in large samples, it chooses the correct model with
high probability.
Example of Information Criteria

Example: Worker Job Satisfaction × Supervisor’s Job Satisfaction ×
Quality of Management
N = 715

                                                 # of
Model            df      G²   p-value     BIC   Parameters      AIC
(MSW)             0    0.00    1.000     0.00       8             -
(MS)(MW)(SW)      1    0.06     .800    −6.51       7        −13.94
(MW)(SW)          2   71.90     .000    58.76       6         59.90
(MS)(WS)          2   19.71     .000     6.57       6          7.71
(SM)(WM)          2    5.39     .068    −7.75       6         −6.61
(MW)(S)           3   87.79     .000    68.07       5         77.79
(WS)(M)           3  102.11     .000    82.39       5         92.11
(MS)(W)           3   35.60     .000    15.88       5         25.60
(M)(S)(W)         4  118.00     .000    91.71       4        110.00
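The BIC column follows directly from BIC = G² − (df) log N. The short Python sketch below recomputes it for three rows of the table; the function name `bic` is a hypothetical helper, and the conversion back to approximate posterior odds uses B = exp(−BIC/2), which is just the large-sample relation BIC = −2 log B rearranged.

```python
import math

N = 715  # total number of observations in the worker satisfaction table

def bic(g2, df, n):
    """Large-sample BIC relative to the saturated model: G^2 - df * log(n)."""
    return g2 - df * math.log(n)

# (model, df, G^2) copied from three rows of the table above.
models = [
    ("(MS)(MW)(SW)", 1, 0.06),
    ("(SM)(WM)", 2, 5.39),
    ("(M)(S)(W)", 4, 118.00),
]
results = {name: bic(g2, df, N) for name, df, g2 in models}
best = min(results, key=results.get)

for name, value in results.items():
    # Approximate posterior odds of the model against the saturated model.
    odds = math.exp(-value / 2)
    print(f"{name:14s} BIC = {value:7.2f}  approx. odds vs saturated = {odds:.3g}")
print("smallest BIC:", best)
```

Among these three, the conditional independence model (SM)(WM) has the smallest BIC, in line with the table.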
“Analysis of Association” Table

Another aid in helping to decide between practical and statistical
significance. This comes from Leo Goodman’s toolbox.
This is useful even when the sampling distribution of G² is not well
approximated by the chi–squared distribution.
We use the G² from independence as a measure of the total association
in the data and see how much association is accounted for by certain
effects.

                                                            Cumul.
Effect  Models                    ∆df    ∆G²   p-value     %      %
MS      (M,S,W) − (MS,W)           1    82.40    .000    69.8%   69.8%
MW      (MS,W) − (MS,MW)           1    30.21    .000    25.6%   95.4%
SW      (MS,MW) − (MS,MW,SW)       1     5.33    .021     4.5%   99.9%
MSW     (MS,MW,SW) − (MSW)         1      .06    .800     0.1%  100.0%
total   (M,S,W)                    4   118.00
◮ ∆G² = the difference between goodness-of-fit statistics for the
models indicated.
◮ ∆df = the corresponding difference between the models’ df.
◮ The column labeled “p-value” really shouldn’t be in this table.
◮ Percent = ∆G²/118.00. Note: 118.00 is the G² from (M,S,W).
◮ Cumulative percent = sum of “Percent” of current and all rows
above the current one.
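The table entries are differences and ratios of the G² values reported earlier, so they can be rebuilt in a few lines. The following Python sketch is an illustration only (variable names are assumptions); it walks down the nested sequence from independence to the saturated model and accumulates the percentages.

```python
# G^2 for the nested sequence of models, from independence to saturated.
sequence = [
    ("(M,S,W)", 118.00),
    ("(MS,W)", 35.60),
    ("(MS,MW)", 5.39),
    ("(MS,MW,SW)", 0.06),
    ("(MSW)", 0.00),
]
effects = ["MS", "MW", "SW", "MSW"]
total = sequence[0][1]   # G^2 from independence = total association

rows = []
cumulative = 0.0
for effect, (_, g2_simple), (_, g2_complex) in zip(effects, sequence, sequence[1:]):
    delta = g2_simple - g2_complex          # Delta G^2 for adding this effect
    percent = 100 * delta / total
    cumulative += percent
    rows.append((effect, delta, percent, cumulative))
    print(f"{effect:4s} dG2 = {delta:6.2f}  {percent:5.1f}%  cumulative {cumulative:5.1f}%")
```

Because the ∆G² values telescope, the cumulative percentage necessarily ends at 100%.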
Log-linear Models for 4+–Way Tables

They are basically the same as models for 3–way tables, but just more
complex. They can have many more 2– and 3–way associations, as well
as higher–way associations.
Example: These data come from a study by Thornes & Collard (1979),
described by Gilbert (1981), and analyzed by others (including Agresti,
1990; Meulman & Heiser, 1996).
A sample of men and women who had petitioned for divorce (they
weren’t married to each other), and a similar sample of married people,
were asked
◮ “Before you were married to your (former) husband/wife, had you
ever made love to anyone else?”
◮ “During your (former) marriage, (did you have) have you had any
affairs or brief sexual encounters with another man/woman?”
Example of 4–Way Table

These data form a 4–way, 2 × 2 × 2 × 2 table with variables
G for gender.
E or EMS for whether extramarital sex was reported.
P or PMS for whether premarital sex was reported.
M for marital status.

                                 Gender
                        Women                  Men
Marital    PMS:      Yes        No          Yes        No
Status     EMS:    Yes   No   Yes   No    Yes   No   Yes   No
Divorced            17   54    36  214     28   60    17   68
Still Married        4   25     4  322     11   42     4  130

For these data, a good model (perhaps the best) is (GP, MEP):
G² = 8.15, df = 6, p = .23
(we’ll talk about how we arrived at this model later).
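The fit statistic can be checked without special software, because (GP, MEP) says that gender is independent of marital status and EMS given PMS, so the fitted values have a closed form: (gender × PMS count)(PMS × EMS × marital count)/(PMS count). The Python sketch below is a minimal illustration under that reading of the model; the dictionary layout and the `margin` helper are assumptions of this example.

```python
import math
from itertools import product

# Observed counts keyed by (gender, PMS, EMS, marital status), from the table above.
levels = {"G": ["women", "men"], "P": ["yes", "no"], "E": ["yes", "no"],
          "M": ["divorced", "married"]}
counts = [17, 54, 36, 214, 28, 60, 17, 68,      # divorced row
          4, 25, 4, 322, 11, 42, 4, 130]        # still married row
n = {}
values = iter(counts)
for m in levels["M"]:
    for g, p, e in product(levels["G"], levels["P"], levels["E"]):
        n[(g, p, e, m)] = next(values)

def margin(table, keep):
    """Sum a table over all variables except those at index positions in `keep`."""
    out = {}
    for cell, count in table.items():
        key = tuple(cell[i] for i in keep)
        out[key] = out.get(key, 0) + count
    return out

n_gp = margin(n, (0, 1))       # gender x PMS
n_pem = margin(n, (1, 2, 3))   # PMS x EMS x marital status
n_p = margin(n, (1,))          # PMS

# Closed-form fitted values under (GP, MEP).
mu = {(g, p, e, m): n_gp[(g, p)] * n_pem[(p, e, m)] / n_p[(p,)]
      for (g, p, e, m) in n}

g2 = 2 * sum(n[c] * math.log(n[c] / mu[c]) for c in n)
df = 16 - 10   # 16 cells minus 10 non-redundant parameters
print(round(mu[("women", "yes", "yes", "divorced")], 2), round(g2, 2), df)
```

The first fitted value is the one for divorced women who reported both PMS and EMS, and G² should agree with the value quoted above up to rounding.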
Parameter Estimates for example

Estimated parameters for the highest–way associations in the model
(from SAS/GENMOD)
Param                  df     Est.      ASE      Wald       p
PG   yes  women         1   −1.3106    .1530   73.4249   < .001
PG   yes  men           0    0.0000
PG   no   women         0    0.0000
PG   no   men           0    0.0000
MEP  div  yes  yes      1   −1.7955    .5121   12.2948   < .001
MEP  div  yes  no       0    0.0000
MEP  div  no   yes      0    0.0000
MEP  div  no   no       0    0.0000
MEP  mar  yes  yes      0    0.0000
MEP  mar  yes  no       0    0.0000
MEP  mar  no   yes      0    0.0000
MEP  mar  no   no       0    0.0000
Interpretation of GP Partial Association
Since \(\hat{\lambda}^{GP}_{\text{women,yes}} = -1.3106\), given EMS and marital status the
odds of PMS for women are
\[ e^{-1.3106} = .2696 \]
times the odds for men.
Alternatively, given EMS and marital status, the odds of PMS for
men are
\[ e^{1.3106} = 1/.2696 = 3.71 \]
times the odds for women.
Using the Fitted Values

. . . from the (GP, MEP) model. . .

                                  Gender
                      Women                          Men
PMS:            Yes            No             Yes            No
EMS:         Yes     No     Yes      No     Yes     No     Yes      No
Divorced   18.67  47.30   38.40  204.32   26.33  66.70   14.60   77.68
Married     6.22  27.80    5.80  327.49    8.78  39.20    2.20  124.51

For EMS=yes and marital status=divorced
\[ \frac{\text{odds(PMS for woman)}}{\text{odds(PMS for man)}} = \frac{(18.67)(14.60)}{(38.40)(26.33)} = .2696 \]
or for EMS=yes and marital status=married
\[ \frac{\text{odds(PMS for woman)}}{\text{odds(PMS for man)}} = \frac{(6.22)(2.20)}{(5.80)(8.78)} = .2696 \]
which also equals the value if we use EMS=no.
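Because the GP margin is fitted exactly and gender is conditionally independent of marital status and EMS given PMS, the conditional GP odds ratio is the same in every EMS × marital status combination and equals the odds ratio in the observed gender × PMS margin. The Python sketch below checks the three routes to the same number (observed margin, fitted values, parameter estimate); the variable names are assumptions of this example.

```python
import math

# Observed gender x PMS margin (4-way table summed over EMS and marital status).
pms_yes = {"women": 17 + 54 + 4 + 25, "men": 28 + 60 + 11 + 42}
pms_no = {"women": 36 + 214 + 4 + 322, "men": 17 + 68 + 4 + 130}

odds_women = pms_yes["women"] / pms_no["women"]
odds_men = pms_yes["men"] / pms_no["men"]
or_margin = odds_women / odds_men

# The same quantity from the fitted values (EMS = yes, divorced) and from
# the parameter estimate reported by SAS/GENMOD.
or_fitted = (18.67 * 14.60) / (38.40 * 26.33)
or_param = math.exp(-1.3106)
print(round(or_margin, 4), round(or_fitted, 4), round(or_param, 4))
```

The three values differ only through rounding of the fitted counts and the estimate.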
Marital–PMS–EMS partial association

We can use either the estimated parameters or the fitted values.
Fitted values from the (GP, MEP) model

                                  Gender
                      Women                          Men
PMS:            Yes            No             Yes            No
EMS:         Yes     No     Yes      No     Yes     No     Yes      No
Divorced   18.67  47.30   38.40  204.32   26.33  66.70   14.60   77.68
Married     6.22  27.80    5.80  327.49    8.78  39.20    2.20  124.51

The odds ratio for marital status and extramarital sex, for those who did
and those who did not have premarital sex:
PMS=yes: Of those who had premarital sex, the odds of divorce given
the person had extramarital sex are
\[ \hat{\theta}_{ME|PMS=yes} = \frac{(18.67)(27.80)}{(6.22)(47.30)} = 1.76 \]
times the odds of divorce given the person did not have extramarital sex.
Note: We could also use the fitted values for men
\[ \hat{\theta}_{ME|PMS=yes} = \frac{(26.33)(39.20)}{(8.78)(66.70)} = 1.76 \]
Marital–PMS–EMS partial association

PMS=no: Of those who did not have premarital sex, the odds of divorce
given the person had extramarital sex are
\[ \hat{\theta}_{ME|PMS=no} = \frac{(38.40)(327.49)}{(5.80)(204.32)} = 10.62 \]
times the odds of divorce given the person did not have extramarital sex.
3–way EMP association: The partial odds ratio for marital status and
extramarital sex given the person did not have premarital sex is
\[ \frac{10.62}{1.76} = 6.03 \]
times the partial odds ratio given the person did have premarital sex.
(We could have arrived at the same interpretation of the partial
associations by using the parameters of the log-linear model.)
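Since the (GP, MEP) model fits the marital status × EMS × PMS margin exactly, the same partial odds ratios can be read off the observed counts summed over gender. The Python sketch below is an illustration under that assumption; the dictionary keys and the `odds_ratio` helper are hypothetical names. The ratio of the two odds ratios computed from unrounded values can differ in the second decimal from 10.62/1.76.

```python
# Marital status x EMS counts within each PMS level, summed over gender.
mep = {
    "yes": {"div_ems": 17 + 28, "div_no": 54 + 60, "mar_ems": 4 + 11, "mar_no": 25 + 42},
    "no": {"div_ems": 36 + 17, "div_no": 214 + 68, "mar_ems": 4 + 4, "mar_no": 322 + 130},
}

def odds_ratio(t):
    """Odds of divorce for EMS = yes relative to EMS = no."""
    return (t["div_ems"] * t["mar_no"]) / (t["mar_ems"] * t["div_no"])

theta_yes = odds_ratio(mep["yes"])   # those who had premarital sex
theta_no = odds_ratio(mep["no"])     # those who did not
ratio = theta_no / theta_yes         # 3-way (MEP) association
print(round(theta_yes, 2), round(theta_no, 2), round(ratio, 2))
```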
MEP Partial association

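What do the odds ratios equal in terms of the model parameters?
Let i index M (marital status), j index EMS, k index PMS, and l
index Gender,
\[
\begin{aligned}
\log\frac{\mu_{11kl}\,\mu_{22kl}}{\mu_{12kl}\,\mu_{21kl}}
 &= \log(\mu_{11kl}) + \log(\mu_{22kl}) - \log(\mu_{12kl}) - \log(\mu_{21kl}) \\
 &= \lambda^{ME}_{11} + \lambda^{ME}_{22} - \lambda^{ME}_{12} - \lambda^{ME}_{21}
    + \lambda^{MEP}_{11k} + \lambda^{MEP}_{22k} - \lambda^{MEP}_{12k} - \lambda^{MEP}_{21k}
\end{aligned}
\]
\(\hat{\lambda}^{ME}_{11} = 2.3626\) (Divorced and had EMS),
all other \(\hat{\lambda}^{ME}_{ij}\)’s equal zero.
\(\hat{\lambda}^{MEP}_{111} = -1.7955\) (Divorced, had EMS & had PMS),
all other \(\hat{\lambda}^{MEP}_{ijk}\)’s equal zero.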
MEP Partial association

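Of those who had PMS (k = 1 = yes), the estimated odds ratio for
marital status and extramarital sex equals
\[ \exp(2.3626 - 1.7955) = \exp(.5671) = 1.76 \quad (95\%\ \text{CI}: 0.91,\ 3.40) \]
Of those who did not have PMS (k = 2 = no), the estimated odds ratio
for marital status and extramarital sex equals
\[ \exp(2.3626) = 10.62 \quad (95\%\ \text{CI}: 4.98,\ 22.66) \]
and the ratio of the odds ratios equals
\[ \exp(2.3626 - .5671) = \exp(+1.7955) = 6.03 \quad\text{or}\quad \exp(-1.7955) = 1/6.03 \]

Before summarizing the findings, how to compute CIs for these odds
ratios. . .

(1 − α) × 100% CI for MEP Partial association

Of those who had PMS (k = 1 = yes), the log odds ratio for marital
status and extramarital sex is the sum of two parameters, so the
estimated odds ratio equals
\[ \exp(\hat{\lambda}^{ME}_{11} + \hat{\lambda}^{MEP}_{111}) = \exp(2.3626 - 1.7955) = \exp(.5671) = 1.76 \]
Need the variances and covariance of the parameters:
\[
\Sigma = \begin{pmatrix} \sigma^2_{ME} & \sigma_{ME,MEP} \\ \sigma_{ME,MEP} & \sigma^2_{MEP} \end{pmatrix}
       = \begin{pmatrix} 0.14962 & -0.14962 \\ -0.14962 & 0.2622 \end{pmatrix}
\]
\[
se(\hat{\lambda}^{ME}_{11} + \hat{\lambda}^{MEP}_{111})
 = \sqrt{\sigma^2_{ME} + \sigma^2_{MEP} + 2\sigma_{ME,MEP}}
 = \sqrt{0.14962 + 0.2622 + 2(-0.14962)} = \sqrt{0.11258} = 0.3355
\]
So. . .

Computing CI for Partial association

So a 95% CI for the log of the MEP partial odds ratio for those
who had PMS is
\[ .5671 \pm 1.96(0.3355) \longrightarrow .5671 \pm .6576 \longrightarrow (-.0905,\ 1.2247) \]
and for the MEP partial odds ratio for those who had PMS
\[ (\exp(-.0905),\ \exp(1.2247)) \longrightarrow (0.91,\ 3.40) \]
For those who did not have PMS, only \(\hat{\lambda}^{ME}_{11}\) is involved, so
se = \(\sqrt{0.14962}\) = 0.3868 and the 95% CI is
\(\exp(2.3626 \pm 1.96(0.3868)) \longrightarrow (4.98,\ 22.66)\).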
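The standard error of a sum of two estimates follows the general rule Var(a′λ̂) = a′Σa, here with a = (1, 1) because the log odds ratio for PMS = yes is λ̂ME + λ̂MEP. The Python sketch below is a minimal illustration that takes the estimates and the variance–covariance entries printed on the slide as given; `wald_ci` is a hypothetical helper name. As a cross-check under the assumption that the MEP margin is fitted exactly, the variance for PMS = yes is also computed as the sum of reciprocal counts in the corresponding 2 × 2 table summed over gender.

```python
import math

# Estimates and (co)variances as reported on the slides.
lam_me, lam_mep = 2.3626, -1.7955
var_me, var_mep, cov = 0.14962, 0.2622, -0.14962

def wald_ci(estimate, variance, z=1.96):
    """Wald interval on the log scale, exponentiated to the odds-ratio scale."""
    half = z * math.sqrt(variance)
    return math.exp(estimate - half), math.exp(estimate + half)

# PMS = yes: log odds ratio is lam_me + lam_mep, so a = (1, 1) in a' Sigma a.
log_or_yes = lam_me + lam_mep
var_yes = var_me + var_mep + 2 * cov
ci_yes = wald_ci(log_or_yes, var_yes)

# PMS = no: the log odds ratio is lam_me alone.
ci_no = wald_ci(lam_me, var_me)

# Cross-check: sum of reciprocal counts in the marital x EMS table for PMS = yes.
var_yes_counts = 1 / (17 + 28) + 1 / (54 + 60) + 1 / (4 + 11) + 1 / (25 + 42)

print(round(math.sqrt(var_yes), 4), round(var_yes_counts, 5))
print([round(v, 2) for v in ci_yes], [round(v, 2) for v in ci_no])
```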
Logit–Log-linear Model Connection

Log-linear models:
◮ All variables are considered response variables; no distinction
is made between response and explanatory variables (in terms
of a variable’s role/treatment in an analysis).
◮ Distribution = Poisson.
◮ Link = Log.
Logit Models:
◮ Represent how a binary response variable depends on (or is
related to) a set of explanatory variables.
◮ Distribution = Binomial.
◮ Link = Logit.
Logit/Log-linear Model Connection
Logit and log-linear models are related

Logit models are equivalent to certain log-linear models.
Log-linear models are more general than logit models.
More specifically,
1. For a log-linear model, you can construct logits for 1 (binary)
response variable to help interpret the log-linear model.
2. Logit models with categorical explanatory variables have
equivalent log-linear models.
The relationship is useful. . . use Logit models to interpret log-linear
models
Using Logit models to interpret loglinear models
For 2–way tables: Interpret log-linear models by looking at
differences between λ’s, which equal logs of odds, and at functions of
λ’s, which equal (log) odds ratios.
For 3–way tables: The blue collar worker data and the
homogeneous association model (MW, MS, SW),
\[ \log\mu_{ijk} = \lambda + \lambda^M_i + \lambda^S_j + \lambda^W_k + \lambda^{MS}_{ij} + \lambda^{MW}_{ik} + \lambda^{SW}_{jk} \]

If we focus on worker’s job satisfaction, then we consider
\[ \pi_{ij} = \text{Prob}(\text{Hi worker satisfaction} \mid M = i, S = j) \]
and the logit model for worker job satisfaction is
\[
\begin{aligned}
\text{logit}(\pi_{ij})
 &= \log\frac{P(\text{Hi worker satisfaction} \mid M = i, S = j)}{P(\text{Lo worker satisfaction} \mid M = i, S = j)} \\
 &= \log(\mu_{ij2}/\mu_{ij1}) \\
 &= \log(\mu_{ij2}) - \log(\mu_{ij1})
\end{aligned}
\]

For 3–way tables (continued):
\[
\begin{aligned}
\text{logit}(\pi_{ij}) &= (\lambda^W_2 - \lambda^W_1) + (\lambda^{MW}_{i2} - \lambda^{MW}_{i1}) + (\lambda^{SW}_{j2} - \lambda^{SW}_{j1}) \\
 &= \alpha + \beta^M_i + \beta^S_j
\end{aligned}
\]
This is the additive effects logit model, where
◮ \(\alpha = (\lambda^W_2 - \lambda^W_1)\), a constant.
◮ \(\beta^M_i = (\lambda^{MW}_{i2} - \lambda^{MW}_{i1})\).
The effect of management quality on worker job satisfaction is
the same at each level of supervisor’s job satisfaction.
◮ \(\beta^S_j = (\lambda^{SW}_{j2} - \lambda^{SW}_{j1})\).
The effect of supervisor’s job satisfaction on worker job
satisfaction is the same at each level of management quality.
Example of 4-way Table

Marital status × EMS × PMS × Gender — A good model for these data
is (GP, MEP).
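We can use a logit model formulation to help interpret the results of the
(GP, MEP) log-linear model,
\[
\log\mu_{ijkl} = \lambda + \lambda^M_i + \lambda^E_j + \lambda^P_k + \lambda^G_l + \lambda^{ME}_{ij} + \lambda^{MP}_{ik}
 + \lambda^{EP}_{jk} + \lambda^{GP}_{kl} + \lambda^{MEP}_{ijk}
\]
We will focus on marital status and form (log) odds of divorce,
\[
\begin{aligned}
\log\left(\frac{\pi_{1jkl}}{\pi_{2jkl}}\right) &= \log(\pi_{1jkl}) - \log(\pi_{2jkl}) \\
 &= (\lambda^M_1 - \lambda^M_2) + (\lambda^{ME}_{1j} - \lambda^{ME}_{2j})
    + (\lambda^{MP}_{1k} - \lambda^{MP}_{2k}) + (\lambda^{MEP}_{1jk} - \lambda^{MEP}_{2jk}) \\
 &= \alpha + \beta^E_j + \beta^P_k + \beta^{EP}_{jk}
\end{aligned}
\]
and the estimated parameters for the logit model, obtained from those of the
log-linear model, are. . . .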
Marital status × EMS × PMS × Gender

                  Loglinear Model Parameters                      Logit
                        Marital Status                            Model
                  Divorced                Married                 Parameters
                  λ̂^M_1 = −.4718         λ̂^M_2 = 0.0000         α̂ = −.4718
EMS
yes               λ̂^ME_11 = 2.3626       λ̂^ME_21 = 0.0000       β̂^E_1 = 2.3626
no                λ̂^ME_12 = 0.0000       λ̂^ME_22 = 0.0000       β̂^E_2 = 0.0000
PMS
yes               λ̂^MP_11 = 1.0033       λ̂^MP_21 = 0.0000       β̂^P_1 = 1.0033
no                λ̂^MP_12 = 0.0000       λ̂^MP_22 = 0.0000       β̂^P_2 = 0.0000
EMS   PMS
yes   yes         λ̂^MEP_111 = −1.796     λ̂^MEP_211 = 0.0000     β̂^EP_11 = −1.796
yes   no          λ̂^MEP_112 = 0.0000     λ̂^MEP_212 = 0.0000     β̂^EP_12 = 0.0000
no    yes         λ̂^MEP_121 = 0.0000     λ̂^MEP_221 = 0.0000     β̂^EP_21 = 0.0000
no    no          λ̂^MEP_122 = 0.0000     λ̂^MEP_222 = 0.0000     β̂^EP_22 = 0.0000
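The logit parameters in the right-hand column can also be recovered directly from the data. Since this logit model contains EMS, PMS and their interaction but not gender, it reproduces the observed log odds of divorce in the EMS × PMS table summed over gender. The Python sketch below is an illustration under that assumption, using reference-cell coding with EMS = no and PMS = no as the baseline, as in the table; the variable names are hypothetical.

```python
import math

# Divorced and married counts by (EMS, PMS), summed over gender.
divorced = {("yes", "yes"): 17 + 28, ("no", "yes"): 54 + 60,
            ("yes", "no"): 36 + 17, ("no", "no"): 214 + 68}
married = {("yes", "yes"): 4 + 11, ("no", "yes"): 25 + 42,
           ("yes", "no"): 4 + 4, ("no", "no"): 322 + 130}

# Observed log odds of divorce in each EMS x PMS cell.
logit = {cell: math.log(divorced[cell] / married[cell]) for cell in divorced}

# Reference-cell coding with EMS = no, PMS = no as the baseline.
alpha = logit[("no", "no")]
beta_e = logit[("yes", "no")] - alpha
beta_p = logit[("no", "yes")] - alpha
beta_ep = logit[("yes", "yes")] - alpha - beta_e - beta_p
print(round(alpha, 4), round(beta_e, 4), round(beta_p, 4), round(beta_ep, 4))
```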
Loglinear–Logit Model Equivalence

Marital status seems like a response/outcome variable, while the others
seem to be more explanatory/predictor variables.
So rather than fitting a log-linear model, we could treat the data as if we
have independent Binomial samples, and fit a logit model where the
(binary) response variable is marital status and the explanatory variables
are Gender, EMS, and PMS.
                          Marital Status
Gender   PMS   EMS    Divorced   Married   total
Women    yes   yes        17         4       21
               no         54        25       79
         no    yes        36         4       40
               no        214       322      536
Men      yes   yes        28        11       39
               no         60        42      102
         no    yes        17         4       21
               no         68       130      198
Loglinear–Logit Model Equivalence (continued)

The saturated logit model for these data is
\[
\log\frac{P(\text{divorced}_{ijk})}{P(\text{married}_{ijk})}
 = \alpha + \beta^G_i + \beta^E_j + \beta^P_k + \beta^{GE}_{ij} + \beta^{GP}_{ik} + \beta^{EP}_{jk} + \beta^{GEP}_{ijk}
\]
Logit Model    df     G²       p
E,G,P           4   13.63    .009
GP,E            3   13.00    .005
EG,P            3   10.75    .013
EP,G            3     .70    .873
EG,GP           2   10.33    .006
EP,GP           2     .44    .803
EG,EP           2     .29    .865
EG,EP,GP        1     .15    .700
EGP             0    0.00   1.000
The “best” logit model is (EP, G),
\[ \text{logit}(\pi_{ijk}) = \alpha + \beta^G_i + \beta^E_j + \beta^P_k + \beta^{EP}_{jk}, \]
which is different from the logit model that we used to interpret
our log-linear model (GP, EMP), i.e.,
\[ \text{logit}(\pi_{ijk}) = \alpha + \beta^E_j + \beta^P_k + \beta^{EP}_{jk} \]
The (GP, EMP) log-linear model is not equivalent to any logit
model that we could fit to the data with marital status as the
response variable because:
◮ When we consider the data as 8 independent Binomial
samples, the “row” margin corresponding to the total number
of observations for each Gender × EMS × PMS combination
is “ fixed.”
◮ When we fit a log-linear model to the data, we should always
include parameters to ensure that the GEP margin is fit
perfectly.
◮ If marital status is our response variable, we are not interested
in the relationship between/among Gender, EMS, and PMS,
except with respect to how they are related to marital status.
The log-linear equivalent to (G , EP) logit
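\[ \text{logit}(\pi_{ijk}) = \alpha + \beta^G_i + \beta^E_j + \beta^P_k + \beta^{EP}_{jk} \]
is the (GEP, MEP, GM) log-linear model,
\[
\begin{aligned}
\log\mu_{ijkl} = {} & \lambda + \lambda^G_i + \lambda^E_j + \lambda^P_k + \lambda^{GE}_{ij} + \lambda^{GP}_{ik} + \lambda^{EP}_{jk} + \lambda^{GEP}_{ijk} \\
 & + \lambda^M_l + \lambda^{GM}_{il} + \lambda^{EM}_{jl} + \lambda^{PM}_{kl} + \lambda^{MEP}_{ljk}
\end{aligned}
\]
When odds are computed for marital status using a log-linear
model with \(\lambda^{GEP}_{ijk}\), all terms associated with this association and
lower order terms drop out; that is,
\[ \lambda,\ \lambda^G_i,\ \lambda^E_j,\ \lambda^P_k,\ \lambda^{GE}_{ij},\ \lambda^{GP}_{ik},\ \lambda^{EP}_{jk},\ \lambda^{GEP}_{ijk} \]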
The log-linear equivalent to (G , EP) logit
The log-linear model (GEP, MEP, GM) will have the exact same
df and fit statistics as the (EP, G ) logit model.
The estimated parameters of the logit model are equal to
differences of estimated log-linear model parameters.
The log-linear/logit equivalents

The logit models that we fit to these data and corresponding loglinear
models:
Logit        Loglinear
Model        Model                  df     G²       p
E,G,P        EGP,ME,MG,MP            4   13.63    .009
GP,E         EGP,MGP,ME              3   13.00    .005
EG,P         EGP,MEG,MP              3   10.75    .013
EP,G         EGP,MEP,MG              3     .70    .873
EG,GP        EGP,MEG,MGP             2   10.33    .006
EP,GP        EGP,MEP,MGP             2     .44    .803
EG,EP        EGP,MEG,MEP             2     .29    .865
EG,EP,GP     EGP,MEG,MEP,MGP         1     .15    .700
EGP          EGPM                    0    0.00   1.000
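One way to see the equivalence numerically is to fit the log-linear model and then form logits of marital status from its fitted values. The Python sketch below is a minimal illustration that fits (EGP, MEP, MG) by iterative proportional fitting (IPF), a standard algorithm for hierarchical log-linear models that is not described in these slides; the function names and the 0/1 coding are assumptions of this example. The slides report G² = .70 on 3 df for this model. The final lines check that the implied logits differ between women and men by the same amount in every EMS × PMS combination, which is the additive gender effect of the (EP, G) logit model.

```python
import math
from itertools import product

# Cells are (gender, PMS, EMS, marital) with 0/1 codes:
# 0 = women / yes / yes / divorced, 1 = men / no / no / married.
counts = [17, 4, 54, 25, 36, 4, 214, 322,     # women: (divorced, married) pairs
          28, 11, 60, 42, 17, 4, 68, 130]     # men
cells = list(product(range(2), repeat=4))
n = dict(zip(cells, counts))

def margin(table, keep):
    """Sum a table over all variables except those at positions in `keep`."""
    out = {}
    for cell, value in table.items():
        key = tuple(cell[i] for i in keep)
        out[key] = out.get(key, 0.0) + value
    return out

def ipf(table, margins, sweeps=200):
    """Rescale fitted values to match each observed margin in turn."""
    mu = {cell: 1.0 for cell in table}
    targets = [margin(table, keep) for keep in margins]
    for _ in range(sweeps):
        for keep, target in zip(margins, targets):
            current = margin(mu, keep)
            for cell in mu:
                key = tuple(cell[i] for i in keep)
                mu[cell] *= target[key] / current[key]
    return mu

def g2(table, mu):
    return 2 * sum(table[c] * math.log(table[c] / mu[c]) for c in table)

# Index positions: 0 = G, 1 = P, 2 = E, 3 = M.
loglinear = ipf(n, [(0, 1, 2), (1, 2, 3), (0, 3)])      # (EGP, MEP, MG)
g2_loglinear = g2(n, loglinear)
df_loglinear = 16 - (8 + 4 + 1)   # EGP block; M, ME, MP, MEP; then MG

# Logits of divorce implied by the log-linear fit, and the gender difference.
fitted_logit = {(g, p, e): math.log(loglinear[(g, p, e, 0)] / loglinear[(g, p, e, 1)])
                for g, p, e in product(range(2), repeat=3)}
gender_effect = [fitted_logit[(0, p, e)] - fitted_logit[(1, p, e)]
                 for p, e in product(range(2), repeat=2)]
print(round(g2_loglinear, 2), df_loglinear, [round(v, 4) for v in gender_effect])
```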
Strategies in (Log-linear) Model Selection

First, when to use logit models and when to use log-linear models.
◮ When one variable is a response variable and the rest are
explanatory variables, you can use either logit models or log-linear
models; however, the logit models are easier (better) to use.
◮ The logit models can be fit directly and are advantageous in this
situation in that the logit model is simpler; that is, the logit model
formulations have fewer parameters than the equivalent log-linear
model.
◮ If the response variable has more than 2 levels, you can use a
multicategory logit model (later lecture).
◮ If you use log-linear models, the highest–way associations among
the explanatory variables should be included in all models.
◮ The results will be the same whether you use the logit or the
log-linear formulation.
Two or More Response Variables

. . . Then the log-linear model should be used.
Log-linear models are more general than logit models.
In the Marital status × Gender × EMS ×PMS example, with the
log-linear models we can examine not only how marital status is
related to EMS, PMS and gender, but we can also examine
associations between (for example) gender and EMS or PMS.
There are classes of multivariate logit models:
◮ “Standard” type (see McCullagh & Nelder)

◮ IRT models are multivariate logit models.
◮ Other kinds (see Anderson & Böckenholt, 2000; Anderson &
Yu, 2007).
Model selection strategies with Log-linear models

The more variables there are, the more possible models exist.
We’ll talk about strategies for more of an exploratory study here and
later we’ll talk more specifically about strategies for
hypothesis/substantive theory guided studies (i.e., association graphs).
1. Determine whether some variables are responses and others are

explanatory variables.
◮ Terms for associations among the explanatory variables should
always be included in the model.
◮ Focus your model search on models that relate the responses
to explanatory variables.
2. If a margin is fixed by design, then a term corresponding to that
margin should always be included in the log-linear model (to ensure
that the marginal fitted values from the model equal the observed
margin). This reduces the set of models that need to be considered.
4. Try to determine the level of complexity that is necessary
by fitting models with
◮ marginal/main effects only.
◮ all 2–way associations.
◮ all 3–way associations.
..
.
◮ all highest–way associations.
You can use a backward elimination strategy (analogous to one we
discussed for logit models) or a stepwise procedure (but don’t use
computer algorithms for doing this).
Example of backward elimination and, as promised, how it was
decided that (GP, EMP) was a good log-linear model for the EMS
× PMS × Gender × Marital Status data. . .
Backward Elimination
Stage     Model                       G²     df   Best Model
Initial   (EMP,EGM,EGP,GMP)          0.15     1
1         (EMP,EGM,GMP)              0.19     2       ∗
          (EMP,EGM,EGP)              0.29     2
          (EMP,EGP,GMP)              0.44     2
          (EGM,EGP,GMP)             10.33     2
2         (GP,EMP,EGM)               0.37     3       ∗
          (EG,EMP,GMP)               0.46     3
          (EP,EGM,GMP)              10.47     3
3         (EG,GM,GP,EMP)             0.76     4       ∗
          (EP,GP,MP,EGM)            10.80     4
          (EMP,EGM)                 67.72     4
4         (GM,GP,EMP)                5.21     5       ∗
          (EG,GP,EMP)                5.25     5
          (EG,GM,GP,EM,EP,MP)       13.63     5
          (EG,GM,EMP)               70.10     5
5         (GP,EMP)                   8.15     6       ∗
          (GM,GP,EM,EP,MP)          18.13     6
          (GM,EMP)                  83.38     6
6         (GP,EM,EP,MP)             21.07     7       ∗
          (G,EMP)                   83.41     7
(from Agresti, 1990).
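Each step along the starred path compares two nested models, so the change in G² can be referred to a chi-squared distribution with df equal to the change in df. The Python sketch below is an illustration only: `chi2_sf` is a hypothetical helper that computes the upper-tail probability for integer df with the standard library (using the complementary error function for df = 1, the exponential for df = 2, and a recurrence for larger df), and the G² and df values are copied from the table.

```python
import math

def chi2_sf(x, df):
    """Upper-tail probability of a chi-squared variable with integer df >= 1."""
    if df % 2 == 1:
        q, k = math.erfc(math.sqrt(x / 2)), 1      # exact for df = 1
    else:
        q, k = math.exp(-x / 2), 2                 # exact for df = 2
    while k < df:
        # Recurrence: Q(k + 2) = Q(k) + (x/2)^(k/2) exp(-x/2) / Gamma(k/2 + 1)
        q += (x / 2) ** (k / 2) * math.exp(-x / 2) / math.gamma(k / 2 + 1)
        k += 2
    return q

# Starred ("best") model at each stage: (label, G^2, df), copied from the table.
path = [
    ("(EMP,EGM,EGP,GMP)", 0.15, 1),
    ("(EMP,EGM,GMP)", 0.19, 2),
    ("(GP,EMP,EGM)", 0.37, 3),
    ("(EG,GM,GP,EMP)", 0.76, 4),
    ("(GM,GP,EMP)", 5.21, 5),
    ("(GP,EMP)", 8.15, 6),
    ("(GP,EM,EP,MP)", 21.07, 7),
]
tests = []
for (big, g2_big, df_big), (small, g2_small, df_small) in zip(path, path[1:]):
    delta, ddf = g2_small - g2_big, df_small - df_big
    p = chi2_sf(delta, ddf)
    tests.append((small, delta, ddf, p))
    print(f"{small:18s} dG2 = {delta:5.2f}  ddf = {ddf}  p = {p:.4f}")
```

The last comparison, dropping the EMP term from (GP, EMP), produces a very large change in G² for one df, which is why the search stops at (GP, EMP).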

Why Stepwise is Bad

See Flom, P.L. & Cassell, D.L. (2009). Stopping stepwise: Why
stepwise and similar selection methods are bad, and what you
should use. Proceedings of NESUG.
http://www.nesug.org/proceedings/nesug07/sa/sa07.pdf.
The following problems are stated for normal linear regression
(mostly due to Harrell, 2001) but also apply to GLMs:
◮ R² values are biased high.
◮ Sampling distributions of F and χ² test statistics aren’t what
you would expect.
◮ Standard errors of parameters are too small.
◮ p values are too small.
◮ Parameter estimates are biased high in absolute value.
◮ Collinearity problems are exacerbated.
◮ Discourages thinking.
◮ May not get the best model.
◮ Better alternatives: LASSO, LARS, model averaging,
Ridge-Regression, and Elastic Nets.
LASSO
Least Absolute Shrinkage and Selection Operator.
A constrained regression that finds the βk s that solve:
\[
\min_{\beta_k \in \mathbb{R}} \left[ \frac{1}{2n} \sum_{i=1}^{n} \Big( y_i - \sum_{k=0}^{p} \beta_k x_{ki} \Big)^2 + t\,P(\beta) \right]
\]
where
◮ t is the “tuning” parameter.
◮ P(β) is the penalty
\[ P(\beta) = \|\beta\|_{\ell_1} = \sum_{k=1}^{p} |\beta_k| \]
◮ Shrinks parameters toward 0; ideal when many βk s are close to 0.
◮ Works as long as correlations between predictors are not too large.
◮ Breaks down when all predictors are equal.
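How the ℓ1 penalty shrinks a coefficient, and can set it exactly to 0, is easiest to see with a single predictor. Assuming the predictor is standardized so that (1/n)Σx² = 1 and there is no intercept, the minimizer of the criterion above is the least-squares slope pulled toward 0 by t (soft-thresholding). The Python sketch below illustrates that special case only; the data and function names are hypothetical.

```python
def soft_threshold(z, t):
    """S(z, t) = sign(z) * max(|z| - t, 0)."""
    if z > t:
        return z - t
    if z < -t:
        return z + t
    return 0.0

def lasso_one_predictor(x, y, t):
    """Minimizer of (1/2n) sum (y_i - b x_i)^2 + t|b| when (1/n) sum x_i^2 = 1."""
    n = len(x)
    z = sum(xi * yi for xi, yi in zip(x, y)) / n   # least-squares slope
    return soft_threshold(z, t)

# Hypothetical standardized predictor and response (illustration only).
x = [1.0, -1.0, 1.0, -1.0]
y = [2.0, -2.0, 2.0, -2.0]
estimates = {t: lasso_one_predictor(x, y, t) for t in (0.0, 0.5, 3.0)}
print(estimates)
```

With t = 0 the least-squares slope of 2 is returned, t = 0.5 shrinks it to 1.5, and t = 3 sets it to 0.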
Ridge Regression
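Finds the βk s that solve:
\[
\min_{\beta_k \in \mathbb{R}} \left[ \frac{1}{2n} \sum_{i=1}^{n} \Big( y_i - \sum_{k=0}^{p} \beta_k x_{ki} \Big)^2 + t\,P(\beta) \right]
\]
◮ t is the “tuning” parameter.
◮ P(β) is the penalty
\[ P(\beta) = \frac{1}{2}\|\beta\|^2_{\ell_2} = \sum_{k=1}^{p} \frac{1}{2}\beta_k^2 \]
◮ Shrinks the βk s toward each other, so it is ideal when many predictors
have non-zero values.
◮ Works well when predictors are correlated.
◮ Extreme case: when all p predictors are identical, each gets the same
coefficient (1/p of what a single one would get alone); any single
predictor is as good as another.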
Elastic Net
Elastic net is a compromise between ridge regression and lasso.
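It finds the βk s that solve:
\[
\min_{\beta_k \in \mathbb{R}} \left[ \frac{1}{2n} \sum_{i=1}^{n} \Big( y_i - \sum_{k=0}^{p} \beta_k x_{ki} \Big)^2 + t\,P_\alpha(\beta) \right]
\]
where
◮ Pα(β) is the elastic-net penalty
\[
P_\alpha(\beta) = (1 - \alpha)\frac{1}{2}\|\beta\|^2_{\ell_2} + \alpha\|\beta\|_{\ell_1}
 = \sum_{k=1}^{p} \left[ (1 - \alpha)\frac{1}{2}\beta_k^2 + \alpha|\beta_k| \right]
\]
◮ “Ridge regression” −→ α = 0 so \(P_\alpha(\beta) = \sum_{k=1}^{p} \frac{1}{2}\beta_k^2\)
◮ “lasso” −→ α = 1 so \(P_\alpha(\beta) = \sum_{k=1}^{p} |\beta_k|\)
◮ If α is close to 1, it performs like LASSO but without problems
caused by extreme correlations.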
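The penalty itself is a one-line function, and evaluating it at α = 0 and α = 1 recovers the ridge and lasso penalties. The Python sketch below uses a hypothetical coefficient vector (intercept excluded, as in the sums above) purely for illustration.

```python
def elastic_net_penalty(beta, alpha):
    """P_alpha(beta) = sum_k [(1 - alpha) * beta_k^2 / 2 + alpha * |beta_k|]."""
    return sum((1 - alpha) * 0.5 * b ** 2 + alpha * abs(b) for b in beta)

beta = [0.5, -2.0, 0.0]          # hypothetical coefficients (intercept excluded)
ridge = elastic_net_penalty(beta, alpha=0.0)   # 0.5 * (0.25 + 4.0) = 2.125
lasso = elastic_net_penalty(beta, alpha=1.0)   # 0.5 + 2.0 = 2.5
mixed = elastic_net_penalty(beta, alpha=0.5)   # halfway between the two
print(ridge, lasso, mixed)
```

Because the penalty is linear in α, intermediate values of α interpolate between the two extremes.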
GLM and Regularized Regressions
◮ The same logic is used for GLMs, except that rather than
minimizing least squares, we maximize the penalized likelihood.
◮ SAS PROC GLMSELECT is for normal regression. An ad hoc
method:
1. Transform data to approximate normality.
2. Use GLMSELECT.
◮ SAS PROC HPGENMOD is designed for generalized linear
models; however, the lasso doesn’t seem to be working on my
version of SAS. There is a suite of HP (high performance)
PROCs, which use multiple cores on your computer.
◮ In R there are multiple options, but the glmnet package is
probably your best option.
Penalized Regression for GLMs
◮ Friedman, J., Hastie, T., & Tibshirani, R. (2010).
Regularization Paths for Generalized Linear Models via
Coordinate Descent. Journal of Statistical Software, 33.
◮ Friedman, J., Hastie, T., Simon, N., & Tibshirani, R. (2015).
Package ‘glmnet’.
Next: strategies to use when guided more by theory.
