
Extensions beyond linear regression

Topics in Data Science

David Rossell, UPF


Reading material

• Section 1: Basu & Michailidis (LASSO theory for time series, at Box)
• Section 2: Rockova & George (reading paper at Box)
• Sections 3-4: Hastie, Tibshirani & Wainwright, Ch 3.1-3.4; Gelman et al, Ch 16.
References
Basu, Michailidis. Regularized estimation in sparse high-dimensional time series
models. The Annals of Statistics 2015
Gelman et al. Bayesian data analysis (3rd ed). CRC press.
Hastie, Tibshirani, Wainwright. Statistical Learning with Sparsity. CRC Press
Rockova, George. Fast Bayesian factor analysis with automatic rotations to sparsity.
JASA, 2015
Outline

1 Time series

2 Factor regression

3 Generalized linear models

4 Beyond standard GLMs

5 Application to polarization and segregation


Until now we assumed
$$y \sim N(X\beta, \phi I)$$
If $y = (y_1, \ldots, y_T)$ is a time series, independence is not reasonable

Example: AR(1) model
$$y_t = \eta y_{t-1} + x_t^\top \beta + \epsilon_t$$
where $|\eta| < 1$ and $\epsilon_t \sim N(0, \phi)$ indep $t = 1, \ldots, T$. Likelihood:
$$p(y \mid \eta, \beta, \phi, X) = \prod_{t=1}^T p(y_t \mid y_{t-1}, \eta, \beta, \phi, x_t) = N(y; \mu, \phi\Sigma)$$

Defining $y_0 = 0$, algebra shows
• $\mu = X\beta$
• $\sigma_{t,t} = \sum_{i=0}^{t-1} (\eta^2)^i \approx \frac{1}{1 - \eta^2}$
• $\sigma_{t,t-j} = \eta^j \sigma_{t-j,t-j} \Rightarrow \mathrm{cor}(y_t, y_{t-j}) \approx \eta^j$


Main novelty: η features in the covariance
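A quick simulation can confirm the approximate correlation structure. A minimal R sketch (eta, Tn and phi are arbitrary choices for illustration):

set.seed(1)
Tn <- 10^4; eta <- 0.7; phi <- 1
y <- numeric(Tn)
for (t in 2:Tn) y[t] <- eta * y[t-1] + rnorm(1, sd=sqrt(phi))  # AR(1) with x_t'beta = 0
acf(y, lag.max=5, plot=FALSE)$acf[2:6]  # empirical cor(y_t, y_{t-j}), j = 1,...,5
eta^(1:5)                               # theoretical approximation eta^j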
Penalized likelihood

$$\min_{\beta, \Sigma} \; (y - X\beta)^\top \Sigma^{-1} (y - X\beta) + h(\beta, \Sigma)$$

Easy to minimize over $\beta$ for given $\Sigma$, but we also need to minimize over $\Sigma$

• Not too hard for AR(r) models ($\Sigma^{-1}$ has 0's after r lags)
• Function convex in $\beta$, not necessarily in $(\beta, \Sigma)$
Bayesian inference

$$p(\gamma \mid y) \propto p(\gamma) \int N(y; X_\gamma \beta_\gamma, \phi\Sigma)\, p(\beta, \phi, \Sigma \mid \gamma)\, d\beta\, d\phi\, d\Sigma$$

Closed form given $\Sigma$, but we also need to integrate over $\Sigma$


Approx for AR(r) models

Let $D$ be the $(T-r) \times r$ matrix of lagged outcomes, $d_{tj} = y_{t-j}$

$$\begin{pmatrix} y_{r+1} \\ \vdots \\ y_T \end{pmatrix} = \begin{pmatrix} y_r & y_{r-1} & \cdots & y_1 \\ \vdots & & & \vdots \\ y_{T-1} & y_{T-2} & \cdots & y_{T-r} \end{pmatrix} \begin{pmatrix} \eta_1 \\ \vdots \\ \eta_r \end{pmatrix} + X\beta + \epsilon = D\eta + X\beta + \epsilon$$

where $\epsilon \sim N(0, \phi I)$
• Can infer r via variable selection on the $\eta$'s. Hierarchical restrictions?
• We lose the first r observations. OK if $r \ll T$.

Formally equivalent to

$$\log p(y_{r+1:T} \mid y_{1:r}, \beta, \phi, \eta) = \log p(y \mid \beta, \phi, \eta) - \log p(y_{1:r} \mid \beta, \phi, \eta)$$
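Conditional on the first r observations, the approximation is an ordinary linear regression, so standard penalties and BVS apply directly. A sketch of the construction (makeD is an illustrative helper name):

makeD <- function(y, r) {
  Tn <- length(y)
  D <- sapply(1:r, function(j) y[(r+1-j):(Tn-j)])  # column j holds y_{t-j}, rows t = r+1,...,T
  colnames(D) <- paste0("lag", 1:r)
  D
}
# Then regress y[(r+1):Tn] on cbind(D, X[(r+1):Tn, ]) with any penalty/prior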


Approx for MA models

Consider the MA(1) model
$$y_t = x_t^\top \beta + \alpha \epsilon_{t-1} + \epsilon_t, \quad \epsilon_t \sim N(0, \phi) \text{ iid}$$
Since $\epsilon_{t-1} = y_{t-1} - x_{t-1}^\top \beta - \alpha \epsilon_{t-2}$, this implies
$$y_t = x_t^\top \beta - \alpha x_{t-1}^\top \beta + \alpha y_{t-1} - \alpha^2 \epsilon_{t-2} + \epsilon_t = \ldots = \sum_{j=0}^{t} (-1)^j \alpha^j x_{t-j}^\top \beta + \sum_{j=1}^{t} (-1)^{j+1} \alpha^j y_{t-j} + \epsilon_t$$
This regresses on lagged x's and y's, imposing restrictions on the parameters.

Alternative: if $|\alpha| < 1$ this can be approximated with r lags
$$y = D_y \eta_y + X\beta + D_x \eta_x + \epsilon$$
where $\eta_y, \eta_x$ are now unrestricted and $D_y, D_x$ contain lagged y's and X's


Canada macroeconomy
(data(Canada) in R package vars)

Outcome: quarterly labour productivity in 1980-2000


• e: 100*ln(civil employment)
• U: unemployment rate
• rw: 100*ln(100* manufacturing real wage)

[Figure: residuals under the iid model (+ 3 covariates) vs. time]
Residuals from AR(5) model (+ 3 covariates)

[Figure: residuals vs. time and residual ACF]
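A sketch reproducing the iid-model diagnostics (variable roles as described above):

library(vars)
data(Canada)
canada <- as.data.frame(Canada)
fit.iid <- lm(prod ~ e + U + rw, data=canada)  # iid model + 3 covariates
plot(resid(fit.iid), type='l', ylab='Residuals')
acf(resid(fit.iid))  # check serial dependence; add 5 lags of prod for the AR(5) fit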

Let’s run BVS with Zellner + BetaBin(1,1) priors


iid model

Top models                     Marginal inclusion
Model        p(γ | y)          Variable   P(γj = 1 | y)
e, rw        0.461             e          0.991
e, rw, u     0.337             rw         0.807
e, u         0.156             u          0.502
u            0.037
rw, u        0.009
...

AR(5) model

Top models                     Marginal inclusion
Model         p(γ | y)         Variable   P(γj = 1 | y)
lag1          0.231            e          0.109
u, lag1       0.129            rw         0.222
lag1, lag2    0.065            u          0.282
lag1, lag3    0.054            lag1       1.000
lag1, lag4    0.054            lag2       0.178
lag1, lag5    0.050            lag3       0.148
...                            lag4       0.167
                               lag5       0.181
Extensions to panel data/VAR

Let $y_t \in \mathbb{R}^q$, $x_t \in \mathbb{R}^p$. Consider
$$y_t = A_1 y_{t-1} + \ldots + A_r y_{t-r} + B_0 x_t + \ldots + B_m x_{t-m} + \epsilon_t$$
$$\epsilon_t \sim N(0, \Sigma) \text{ indep } t = 1, \ldots, T$$
This implies $(y_1^\top, \ldots, y_T^\top)^\top$ is jointly Normal

• Dimension of the A's and B's is high ⇒ sparsity more important
• Algebra tractable, log-likelihood quadratic in the A's & B's
VAR & Graphical models

Let $y_t$ be endogenous, $x_t$ exogenous, $\epsilon_t \sim N(0, \phi I)$. Model
$$A_0 y_t = A_1 y_{t-1} + \ldots + A_r y_{t-r} + B_0 x_t + \ldots + B_m x_{t-m} + \epsilon_t$$
$$\Rightarrow y_t = \tilde{A}_1 y_{t-1} + \ldots + \tilde{A}_r y_{t-r} + \tilde{B}_0 x_t + \ldots + \tilde{B}_m x_{t-m} + \tilde{\epsilon}_t$$
where $\tilde{A}_j = A_0^{-1} A_j$, $\tilde{B}_j = A_0^{-1} B_j$, $\tilde{\epsilon}_t \sim N(0, \Sigma)$, $\Sigma = (A_0 A_0^\top)^{-1}$

• $\tilde{A}_1, \ldots, \tilde{A}_r, \tilde{B}_1, \ldots, \tilde{B}_m$ and $\Sigma$ are identifiable
• $A_0$ not identifiable ⇒ $A_j = A_0 \tilde{A}_j$ and $B_j = A_0 \tilde{B}_j$ not identifiable

Idea: if $A_0$ has enough 0's, then $\Sigma = (A_0 A_0^\top)^{-1}$ has a unique solution. Maximize
$$\log p(y \mid A_0, A_1, \ldots, A_r, B_0, \ldots, B_m) - h(A_0)$$
where $h()$ is a penalty inducing 0's in $A_0$
Example
(by Moneta 2007) Quarterly US data from 1947-1994 (T = 188)
• R: nominal interest rate
• I: investment per capita
• Y: GNP per capita
• C: consumption per capita
• M: real balances (ratio money/price level)
• ∆P: inflation

[Figure: graphical structure among R, I, Y, C, M, ∆P implied by the estimated $\hat{A}_0$]


Time-varying coefficients: $y_t = x_t^\top \beta_t + \epsilon_t$

Option 1. Penalize changes in $\beta_t$, say $\sum_{t=1}^{T-1} \|\beta_{t+1} - \beta_t\|_1$

Option 2. Changepoints in $\beta_t$: consider $t_1 < \ldots < t_K$, reparameterize
$$y_t = \sum_{k=1}^K x_t^\top \theta_k I(t > t_k) + \epsilon_t$$
• If $t \in [t_1, t_2)$ then $y_t = x_t^\top \theta_1 + \epsilon_t$
• If $t \in [t_2, t_3)$ then $y_t = x_t^\top (\theta_1 + \theta_2) + \epsilon_t$
• ...

$$\begin{pmatrix} y_1 \\ \vdots \\ y_T \end{pmatrix} = \begin{pmatrix} x_1^\top & 0 & \cdots & 0 \\ x_2^\top & 0 & \cdots & 0 \\ x_3^\top & x_3^\top & \cdots & 0 \\ \vdots & & & \vdots \\ x_T^\top & x_T^\top & \cdots & x_T^\top \end{pmatrix} \begin{pmatrix} \theta_1 \\ \theta_2 \\ \vdots \\ \theta_K \end{pmatrix} + \begin{pmatrix} \epsilon_1 \\ \vdots \\ \epsilon_T \end{pmatrix}$$
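A sketch building the expanded changepoint design above (makeCP and the changepoints are illustrative names/values):

makeCP <- function(X, tk) {
  Tn <- nrow(X)
  # block k equals x_t' for t > t_k and 0 otherwise
  do.call(cbind, lapply(tk, function(t0) X * (seq_len(Tn) > t0)))
}
# W <- makeCP(X, tk=c(0, 30, 60))  # t_1 = 0 gives the baseline theta_1
# lm(y ~ W - 1), possibly adding an L1 penalty on the theta's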
Outline

1 Time series

2 Factor regression

3 Generalized linear models

4 Beyond standard GLMs

5 Application to polarization and segregation


Factor regression

We have n observations, q outcomes, p known predictors, r factors
$$\underset{n \times q}{Y} = \underset{n \times p}{X}\,\underset{p \times q}{B} + \underset{n \times r}{Z}\,\underset{r \times q}{M} + \underset{n \times q}{E}$$
• X are p observed predictors, B regression coefficients
• Z are r unobserved factors, M are factor loadings
• E has i-th row $e_i \sim N(0, \Sigma)$ iid $i = 1, \ldots, n$, $\Sigma$ diagonal

The log-likelihood, given Z:
$$\log p(Y \mid Z, B, M, \Sigma) = -\frac{n}{2} \log(|\Sigma|) - \frac{1}{2} \sum_{i=1}^n \|y_i - B^\top x_i - M^\top z_i\|_2^2$$
Complete with $p(Z \mid B, M, \Sigma)$, add likelihood penalties / priors as usual
$$p(Y \mid B, M, \Sigma) = \int p(Y \mid Z, B, M, \Sigma)\, p(Z \mid B, M, \Sigma)\, dZ$$
Example (Rockova & George, JASA 2015)

Let $z_i \sim N(0, I)$ indep. To encourage loadings $\hat{m}_{ij} = 0$, set $g_0 < g_1$ in a mixture of two Laplace densities:
$$p(M) = \prod_{i,j} \left[ \pi L(m_{ij}; 0, g_0) + (1 - \pi) L(m_{ij}; 0, g_1) \right]$$

n = 48 job applicants, p = 15 characteristics (10-point scale). No X's


Factor 1 Factor 2 Factor 3 Factor 4 Factor 5 σ̂jj
Application 0.89 1.3 2.08 0 0 0.16
Appearance 1.01 0.02 0.14 1.6 0 0.18
Academic ability 0.2 0.5 -0.3 0.2 -0.1 1.84
Likability 1.41 0.03 0.46 0.35 2.29 0.16
Self-confidence 2.02 0 0 0 0 1.24
Lucidity 2.81 0 0 0 0 1.28
Honesty 0 0 0 0 0 2.49
Salesmanship 3.13 0 0 0 0 1.30
Experience 0.87 3.11 0 0 0 0.24
Drive 2.51 0 0 0 0 1.42
Ambition 2.61 0 0 0 0 1.20
Grasp 2.72 0 0 0 0 1.20
Potential 2.79 0 0 0 0 1.39
Keenness 1.67 0 0 0 0 2.00
Suitability 1.85 1.91 0 0 0 1.84
GDP nowcasting (Ferrara & Simoni 2018)

Outcome: Euro area GDP growth rate at quarter t (2005-2016)
$$y_t = \beta_0 + \beta_1 y_{t-1} + \beta_g^\top x_{t,g} + \beta_s x_{t,s} + \beta_h x_{t,h} + \epsilon_t$$
• $x_{t,g}$: weekly Google Trends data (searches in broad categories)
• $x_{t,s}$: monthly sentiment index (European Commission survey)
• $x_{t,h}$: biweekly industrial production (Eurostat)

Google data reduces MSE mainly in weeks 1-3 (relative to no Google data)

$x_{t,g}$ has p = 1776 de-seasonalized weekly changes per category/country
• 26 search categories, 296 subcategories
• Belgium, France, Germany, Italy, Netherlands, Spain

Strategy 1: add $x_{t,g}$ with LASSO penalties

Strategy 2: let $z_t$ be PCA factors from $x_{t,g}$
$$y_t = \beta_0 + \beta_1 y_{t-1} + \beta_g^\top z_t + \beta_s x_{t,s} + \beta_h x_{t,h} + \epsilon_t$$
Note: $z_t$ given by the eigendecomposition of $X^\top X$

Strategy 3: let $z_t$ be sparse PCA factors
$$\max_{Z,M} \log p(X_g \mid Z, M) = \min_{Z,M} \sum_{t=1}^T \|x_{g,t} - M^\top z_t\|_2^2 + \lambda_1 \sum_{i,j} m_{ij}^2 + \lambda_2 \sum_{i,j} |m_{ij}|$$

Note: $z_t$ given by an iterative LARS-LASSO algorithm
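A minimal sketch of Strategy 2; Xg, y and the other regressors are placeholders, and Strategy 3 would swap prcomp for a sparse PCA routine (e.g. elasticnet::spca):

r <- 3
pca <- prcomp(Xg, center=TRUE, scale.=TRUE)
Z <- pca$x[, 1:r]                     # factor scores z_t from the eigendecomposition
# fit <- lm(y ~ ylag1 + Z + xs + xh)  # second-stage nowcasting regression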


Predictive accuracy

Authors selected r = 3 factors (ad hoc). Data until 04/2014 used to fit the model, the rest to assess MSE

[Figure: out-of-sample MSE by week of the quarter for LASSO, PCA and SPCA]
Factor loadings

[Figure: loadings for factors 1-2 by category/country, SPCA vs. PCA]

Ideas on how to potentially improve this analysis?
Outline

1 Time series

2 Factor regression

3 Generalized linear models

4 Beyond standard GLMs

5 Application to polarization and segregation


In many problems $y_i$ is not a continuous variable
• Binary $y_i \in \{0, 1\}$
• Discrete $y_i \in \{1, \ldots, G\}$ (or ordinal)
• Counts $y_i \in \mathbb{N}$

GLMs: common framework to deal with these cases
$$p(y_i \mid x_i, \beta, \phi) = p(y_i \mid x_i^\top \beta, \phi)$$
$$g(E(y_i \mid x_i)) = x_i^\top \beta$$
• g: known invertible link function
• φ: nuisance parameter vector (may be absent)

Also, $p(y_i \mid x_i, \beta, \phi)$ is assumed to be in the exponential family of distributions (includes Normal, Binomial, Poisson etc.)
Examples

Logistic regression: $p(y_i \mid x_i, \beta) = \text{Bern}(y_i; \mu_i)$, $\mu_i = P(y_i = 1 \mid x_i^\top \beta)$
$$\log\left(\frac{\mu_i}{1 - \mu_i}\right) = x_i^\top \beta$$

Poisson regression: $p(y_i \mid x_i, \beta) = \text{Poisson}(y_i; \mu_i)$, $\mu_i = E(y_i \mid x_i)$
$$\log \mu_i = x_i^\top \beta$$
Can lack flexibility to capture the variance, since $\text{Var}(y_i \mid x_i, \beta) = \mu_i$

Negative Binomial regression: $p(y_i \mid x_i, \beta) = \text{NegBin}(y_i; \mu_i, \phi)$,
$$\mu_i = E(y_i \mid x_i), \quad \text{Var}(y_i \mid x_i) = \mu_i \phi$$

Multinomial regression, ordinal regression...


Penalized log-likelihood

$$\min_{\beta, \phi} \; -\log p(y \mid \beta, \phi) + h(\beta) = -\sum_{i=1}^n \log p(y_i \mid \beta, \phi) + h(\beta)$$

For most GLMs $\log p(y \mid \beta, \phi)$ is concave in $\beta$ and depends on X through a linear function

Theory: most LASSO, ALASSO, Bayesian results etc. carry over to GLMs: consistency, MSE, variable screening & selection

Bayesian framework: the integrated likelihood $p(y \mid \gamma)$ has no closed form, but convexity facilitates a Laplace approximation
Example: Poisson regression

Log-likelihood: sum of linear + concave functions in β

If $y_i \sim \text{Poisson}(y_i; \mu_i)$, $\mu_i = E(y_i \mid x_i)$ where $\log \mu_i = x_i^\top \beta$, then
$$\log p(y \mid \beta) = \log \prod_{i=1}^n \frac{\mu_i^{y_i}}{y_i!} e^{-\mu_i} = c + \sum_{i=1}^n y_i x_i^\top \beta - e^{x_i^\top \beta}$$

Example: data PoissonExample (glmnet package)
• n = 500, p = 20
• Simulate Poisson y with $\beta^* = (-0.75, -0.5, 0.5, 0.75, 1, 0, \ldots, 0)$
• Add LASSO penalty (convex optimization; see the sketch below)
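A sketch matching this simulation (the true β* as above):

library(glmnet)
set.seed(1)
n <- 500; p <- 20
beta <- c(-0.75, -0.5, 0.5, 0.75, 1, rep(0, p-5))
x <- matrix(rnorm(n*p), n, p)
y <- rpois(n, exp(drop(x %*% beta)))
cvfit <- cv.glmnet(x, y, family='poisson')        # cross-validated lambda
b <- as.numeric(coef(cvfit, s='lambda.min'))[-1]  # drop the intercept
which(b != 0)                                     # selected variables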
LASSO analysis
(glmnet with family='poisson')

[Figure: histogram of $y_i$ and LASSO coefficient path vs. L1 norm]

Cross-validated λ: correctly selects the first 5 variables


Bayesian analysis
(modelSelection with family='poisson')

[Figure: BMA and BMS point estimates for the 20 coefficients]

Model γ              p(γ | y)
1, 2, 3, 4, 5        0.721
1, 2, 3, 4, 5, 10    0.128
1, 2, 3, 4, 5, 12    0.018
...                  ...
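A hedged sketch of the Bayesian fit with mombf, using the simulated data above (argument names may vary across package versions):

library(mombf)
fit <- modelSelection(y, x=x, family='poisson',
                      priorCoef=zellnerprior(tau=n),  # Zellner prior
                      priorDelta=modelbbprior(1,1))   # BetaBin(1,1) on models
head(postProb(fit))  # posterior model probabilities p(gamma | y)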
Outline

1 Time series

2 Factor regression

3 Generalized linear models

4 Beyond standard GLMs

5 Application to polarization and segregation


Limitations of standard GLMs

Data may be over- or under-dispersed relative to what the model predicts
• Linear models: $E(y_i) = x_i^\top \beta$, $\text{Var}(y_i) = \phi$. β and φ are "decoupled"
• Poisson regression: $\log E(y_i) = x_i^\top \beta$, $\text{Var}(y_i) = E(y_i)$
• Binomial regression: $\text{logit}\, E(y_i) = x_i^\top \beta$, $\text{Var}(y_i) = n_i E(y_i)(1 - E(y_i))$

Data may have more 0's than predicted by the model. Let $\mu_i = E(y_i \mid x_i)$
• Poisson: $P(y_i = 0 \mid x_i) = e^{-\mu_i}$
• Binomial: $P(y_i = 0 \mid x_i) = (1 - \mu_i/n_i)^{n_i}$

Common solutions
1 Models with an additional variance parameter, e.g. NegBinom
2 Zero-inflated and hurdle models

Why is it a problem in practice?

Dispersion. If $z_{ij} \sim \text{Bern}(\mu_i)$ iid then $y_i = \sum_{j=1}^{n_i} z_{ij} \sim \text{Bin}(n_i, \mu_i)$
• What if not independent across j? e.g. measures from the same school, company, region...
• What if $\mu_i$ is not constant across j? e.g. unobserved covariates

Zero counts
• Zeroes may depend on x's in a different way, e.g. never visiting a doctor, never applying for welfare
• Under-reporting (e.g. self-reported depression), lack of sensitivity (e.g. counting people in satellite images)
Vaccine adverse events
(Rose et al, J Biopharm Stat 2006, 16: 463-81)

Goal: assess safety of Anthrax vaccine from a randomized clinical trial
• n = 1005 study participants, 4 study injections (0, 2, 4, 24 weeks), giving 4020 observations
• 5 treatments: full-dose SQ or IM, reduced dose, placebo SQ or IM
• Outcome: number of systemic adverse events after each injection (participants experienced 0-12 unique events per injection)

Observed mean 1.51 and variance 2.90 across the 4020 observations, a variance-to-mean ratio of 1.92, indicating some over-dispersion
Negative Binomial

Def. $y_i$ = number of successes in $\text{Bern}(\mu_i)$ trials until r failures ⇒ $y_i \sim \text{NegBin}(r, \mu_i)$
$$E(y_i) = \frac{r \mu_i}{1 - \mu_i}; \quad \text{Var}(y_i) = \frac{E(y_i)}{1 - \mu_i}$$
$$p(y_i \mid r, \mu_i) = \binom{y_i + r - 1}{y_i} (1 - \mu_i)^r \mu_i^{y_i}$$

Often parameterized as $\log E(y_i \mid x_i) = x_i^\top \beta$, $\text{Var}(y_i) = \phi E(y_i \mid x_i)$

Note that $\phi = 1/(1 - \mu_i) > 1$, i.e. NegBin captures over-dispersion

MLE. Initialize $\hat{\phi} = 1$, iterate until convergence (see the sketch below)
1 $\hat{\beta} = \arg\max_\beta p(y \mid \beta, \hat{\phi})$ (convex optimization)
2 $\hat{\phi} = \arg\max_\phi p(y \mid \hat{\beta}, \phi)$ (univariate optimization)

MCMC. Gibbs or HMC easy to apply
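A sketch of the alternating MLE using MASS (glm.nb wraps essentially this loop; y and x are placeholders):

library(MASS)
theta <- 1  # init dispersion parameter
for (it in 1:25) {
  fit <- glm(y ~ x, family=negative.binomial(theta))  # beta-step: convex GLM fit
  theta <- theta.ml(y, fitted(fit))                   # phi-step: univariate MLE
}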
Gibbs sampling

Trick: suppose we model
$$y_i \mid \lambda_i \sim \text{Poi}(\lambda_i), \quad \lambda_i \mid r, \mu_i \sim \text{Gamma}(r, \mu_i/(1 - \mu_i)) \text{ (shape-scale)}$$
This implies $E(y_i) = E(\lambda_i) = r \mu_i / (1 - \mu_i)$. Further,
$$p(y_i \mid r, \mu_i) = \int \text{Poi}(y_i; \lambda_i)\, p(\lambda_i \mid r, \mu_i)\, d\lambda_i = \text{NegBin}(y_i; r, \mu_i)$$

Let $\log E(y_i) = x_i^\top \beta$. Gibbs on the augmented parameter $(\lambda_1, \ldots, \lambda_n, \beta, r)$: writing $s_i = \mu_i/(1 - \mu_i)$ for the prior scale,
$$p(\lambda_i \mid y_i, r, \mu_i) \propto \text{Poi}(y_i; \lambda_i)\, G(\lambda_i; r, s_i) \propto G\left(\lambda_i;\; r + y_i,\; \frac{s_i}{1 + s_i}\right)$$

Sample from $p(\beta, r \mid \lambda_1, \ldots, \lambda_n, y)$ using Gamma regression methods
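A one-line sketch of the λ-update under the shape-scale parameterization above (names illustrative):

# s = mu/(1-mu) is the prior scale; posterior is Gamma(r + y, scale s/(1+s))
update_lambda <- function(y, r, s) rgamma(length(y), shape=r+y, scale=s/(1+s))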


Example: Twitter German elections 2017

(with R. Knudsen & S. Majó, Oxford Reuters Institute for Investigative Journalism)

Data: n = 11,503 tweets (21/08-25/09) by candidates/parties

Outcomes: number of favourites, number of retweets

Predictors
• Number of followers
• Party affiliation
• Party vs. candidate
• % of tweet talking about 40 topics (Supervised LDA, to be seen)
Outcomes vs. number of followers and party

[Figure: favourites and retweets vs. number of followers, by party (AfD, CDU, CSU, FDP, Gruene, Linke, SPD) and by account type (candidate vs. party)]

• Many 0's, strong asymmetry. Normal model inadequate
• BIC strongly prefers NegBin over Poisson
• Also use BIC to select variables
MLE results (SE). Topic 1 uses words related to crime, violence & migration

                          Favourites        Retweets
log-followers:candidate 0.745 (0.013) 0.648 (0.011)
log-followers:party 0.445 (0.013) 0.384 (0.011)
AfD - -
CDU −3.028 (0.107) −3.181 (0.094)
CSU −2.884 (0.193) −3.096 (0.173)
FDP −1.929 (0.115) −2.350 (0.100)
Gruene −2.490 (0.100) −2.541 (0.087)
Linke −1.996 (0.099) −1.961 (0.085)
SPD −1.903 (0.100) −2.172 (0.087)
topic1 −5.573 (0.450) −5.436 (0.396)
CDU:topic1 8.829 (2.464) 5.186 (2.124)
CSU:topic1 13.533 (3.369) 10.679 (2.902)
FDP:topic1 10.180 (2.321) 12.270 (2.021)
Gruene:topic1 1.092 (1.759) 3.037 (1.547)
Linke:topic1 −2.784 (1.185) 0.650 (0.933)
SPD:topic1 2.019 (1.408) 2.313 (1.208)

+ 10 other topics, candidate vs. party, candidate:party interaction


Outcomes vs. % of topic 1

[Figure: favourite and retweet counts vs. proportion of topic 1, by party and by account type (candidate vs. party)]
Models for zeroes

Zero-inflated Poisson. With prob $1 - \eta_i$ we observe $y_i = 0$, where
$$\text{logit}(\eta_i) = x_i^\top \mu$$
and with prob $\eta_i$ we observe $y_i \sim \text{Poi}(\mu_i)$, $\log(\mu_i) = x_i^\top \beta$

Hurdle models. With prob $1 - \eta_i$ we get $y_i = 0$, else $y_i \sim \text{Poi}(\mu_i) I(y_i \geq 1)$

Key distinction
• Zero-inflated: zeroes can be "structural" or from the non-zero process
$$P(y_i = 0) = 1 - \eta_i + \eta_i e^{-\mu_i}$$
• Hurdle: all zeroes are "structural"
$$P(y_i = 0) = 1 - \eta_i$$

Binomial/NegBin versions available
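A sketch of fitting both with pscl (formula syntax "count part | zero part"; mydata and the covariates are placeholders):

library(pscl)
zip <- zeroinfl(y ~ x1 + x2 | x1 + x2, data=mydata, dist='poisson')  # zero-inflated
hur <- hurdle(y ~ x1 + x2 | x1 + x2, data=mydata, dist='poisson')    # hurdle
c(AIC(zip), AIC(hur))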


Example: Vaccine adverse events

Models for number of adverse events vs. treatment group + covariates

                        # Param    BIC      AIC
Poisson                 13         14033    13951
Zero-inflated Poisson   26         13549    13386
Hurdle Poisson          26         13550    13386
NegBin                  14         13303    13215
Zero-infl NegBin        27         13332    13162
Hurdle NegBin           27         13340    13170

[Figure from Rose et al: observed vs. predicted frequencies]
Health care usage
(dataset NMES1988 from AER R package)
(example from http://data.library.virginia.edu/getting-started-with-hurdle-models)

n = 4,406 individuals aged ≥ 66 covered by Medicare


• Outcome: number of physician visits
• Predictors: gender, years of education, number of chronic conditions, hospital stays, private insurance, health (3 categories)
Poisson regression
MLE SE P-value
Intercept 1.029 0.024 <0.0001
hospital 0.165 0.006 <0.0001
health - poor 0.248 0.018 <0.0001
health - excellent -0.362 0.030 <0.0001
chronic 0.147 0.005 <0.0001
gender - male -0.112 0.013 <0.0001
school 0.026 0.002 <0.0001
insurance - yes 0.202 0.017 <0.0001
Histogram of number of visits

Poisson regression. Observed 0's: 683. Predicted 0's: 47

[Figure: histogram of the number of physician visits (0 to 89)]
Poisson hurdle model
Model for counts. SE same as before, but MLE differs
MLE SE P-value
Intercept 1.406 0.024 < 2e-16
hospital 0.159 0.006 < 2e-16
health - poor 0.254 0.018 < 2e-16
health - excellent -0.304 0.031 < 2e-16
chronic 0.102 0.005 < 2e-16
gender - male -0.062 0.013 1.86e-06
school 0.019 0.002 < 2e-16
insurance - yes 0.081 0.017 2.37e-06

Model for P(yi > 0 | xi ) (logistic regression)


MLE SE P-value
Intercept 0.043 0.140 0.7577
hospital 0.312 0.091 0.0006
health - poor -0.009 0.161 0.9568
health - excellent -0.290 0.143 0.0424
chronic 0.535 0.045 < 2e-16
gender - male -0.416 0.088 2.09e-06
school 0.059 0.012 1.05e-06
insurance - yes 0.747 0.101 1.30e-13
Assess model fit

Red line: counts predicted by model; bars: observed counts

[Figure: observed vs. predicted frequencies for the Poisson hurdle and NegBin hurdle models]


Compare inference

Model for counts


Hurdle-Poisson Hurdle-NB
MLE SE P-value MLE SE P-value
Intercept 1.406 0.024 <0.0001 1.198 0.059 <0.0001
hospital 0.159 0.006 <0.0001 0.212 0.021 <0.0001
health - poor 0.254 0.018 <0.0001 0.316 0.048 <0.0001
health - excellent -0.304 0.031 <0.0001 -0.332 0.066 <0.0001
chronic 0.102 0.005 <0.0001 0.126 0.012 <0.0001
gender - male -0.062 0.013 <0.0001 -0.068 0.032 0.0351
school 0.019 0.002 <0.0001 0.021 0.005 <0.0001
insurance - yes 0.081 0.017 <0.0001 0.100 0.043 0.0188

• Higher uncertainty in Hurdle-NB


• Same model for P(yi > 0 | xi ) (logistic regression, as before)
R code

library(AER)
data("NMES1988")
nmes= NMES1988[, c(1,6:8,13,15,18)]
plot(table(nmes$visits), ylab='Frequency', xlab='Number of visits')

mod1= glm(visits ~ ., data = nmes, family = "poisson")
summary(mod1)
mu= predict(mod1, type="response") # predicted mean count for each individual
exp= sum(dpois(x=0, lambda=mu))    # sum prob of a 0 count for each mean
round(exp)                         # predicted 0's
sum(nmes$visits < 1)               # observed 0's

#Poisson hurdle model
library(pscl)
library(countreg)
mod.hurdle= hurdle(visits ~ ., data=nmes, dist="poisson")
mod.hurdle.nb= hurdle(visits ~ ., data=nmes, dist="negbin")
summary(mod.hurdle)
summary(mod.hurdle.nb)

rootogram(mod.hurdle, max=80, xlab='Nb of visits', style='standing', scale='raw')
rootogram(mod.hurdle.nb, max=80, xlab='Nb of visits', style='standing', scale='raw')
R software for count data

Zeileis A, Kleiber C, Jackman S (2008). Regression Models for Count Data in R. Journal of Statistical Software, 27(8). https://www.jstatsoft.org/article/view/v027i08

R package mpath: $L_1$, $L_2$, SCAD and MCP penalties for Poisson/NegBin and zero-inflated Poisson/NegBin
Outline

1 Time series

2 Factor regression

3 Generalized linear models

4 Beyond standard GLMs

5 Application to polarization and segregation


(Gentzkow, Shapiro, Taddy. Measuring polarization in high-dim data. NBER 2017)

Regression framework for high-dimensional count outcomes


• Political polarization: partisanship of USA congressional speech
1873-2016
• Residential segregation by political affiliation
Consider polarization. Parties employ different terms
• Democrats: estate taxes, undocumented workers, tax breaks for the
wealthy...
• Republicans: death taxes, illegal aliens, tax reform...
• Orlando nightclub killing of 49 people in 2016: “mass shooting” for
Democrats, “terrorism” for Republicans
Has partisanship evolved in time? Can we predict party from speech?
Data from US congress records 1873-2016 (digital text from HeinOnline)
• Select speeches by Democrats/Republicans (7,732 speakers), 36,161 unique speaker-sessions in total
• Count two-word phrases (bigrams): p = 508,351 phrases with count ≥ 10 in at least one session
• Predictors: party, state, chamber, gender

Methodology can affect the findings


• Jensen et al (2012): partisanship increased recently but was even
higher in the past. Analysis based on correlations
• Peterson & Spirling (2016): measure partisanship via
machine-learning. Suggests polarization even in fictitious data
• Gentzkow et al show that MLE may lead to misleading results
Note: in 1996-2014 speech polarization > voting polarization (Lauderdale
& Herzog 2016)
Model

• $y_{itj}$: phrase count for speaker i, session t, phrase j
• $m_{it} = \sum_j y_{itj}$: total speech of i at session t
• $z_i = 1$ if speaker i is Republican, $z_i = -1$ if Democrat
• $x_{it}$: speaker characteristics (possibly time-varying)

Let $y_{it} \sim \text{Mult}(m_{it}, q_t(z_i, x_{it}))$. The probability of uttering phrase j is
$$q_{tj}(z_i, x_{it}) = \frac{e^{u_{itj}}}{\sum_l e^{u_{itl}}} \quad \text{(multinomial logistic regression)}, \qquad u_{itj} = \alpha_{jt} + z_i \varphi_{jt} + x_{it}^\top \beta_{jt}$$

Methodological issues
• Sparse counts, many parameters: regularization
• Computational: multinomial likelihood hard, find a proxy
Measuring partisanship

Idea: if $q_t(1, x)$ is very different from $q_t(-1, x)$ ⇒ partisanship

Suppose a neutral observer hears a word; what probability would she assign to the speaker's true party?
• Prior P(Republican) = P(Democrat) = 0.5
• If the speaker is Republican, she chooses phrase j with prob $q_{tj}(1, x)$
• Given word j, the probability of the speaker being Republican is
$$\rho_{tj}(x) = \frac{q_{tj}(1, x)}{q_{tj}(1, x) + q_{tj}(-1, x)}$$

Define partisanship of speech at x as the expected $\rho_{tj}$:
$$\pi_t(x) = \frac{1}{2} q_t(1, x)^\top \rho_t(x) + \frac{1}{2} q_t(-1, x)^\top (1 - \rho_t(x))$$
Let $s_t$ be the number of speakers at session t. Average partisanship:
$$\bar{\pi}_t = \frac{1}{s_t} \sum_{i=1}^{s_t} \pi_t(x_{it})$$
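A minimal sketch computing the plug-in partisanship for one session from the two phrase-probability vectors (qR = $q_t(1,x)$, qD = $q_t(-1,x)$; names illustrative):

partisanship <- function(qR, qD) {
  rho <- qR / (qR + qD)  # posterior P(Republican | phrase j)
  0.5 * sum(qR * rho) + 0.5 * sum(qD * (1 - rho))
}
partisanship(qR=c(.5,.5), qD=c(.5,.5))  # identical speech gives 0.5 (no partisanship)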
MLE: $\hat{q}_t$ from multinomial-logistic regression, $\hat{\rho}_t$ trivial. Then
$$\hat{\pi}_t(x) = 0.5\, \hat{q}_t(1, x)^\top \hat{\rho}_t(x) + 0.5\, \hat{q}_t(-1, x)^\top (1 - \hat{\rho}_t(x))$$
• $\hat{\pi}_t$ is biased (Jensen's inequality), important for low counts
• $\hat{q}_t$ and $\hat{\rho}_t$ are correlated ⇒ inflates $\text{Var}(\hat{\pi}_t)$

Option 1: use different data for $\hat{q}_t$ and $\hat{\rho}_t$, then average
Option 2: regularization. The authors use the LASSO penalty
$$\sum_j \lambda_j |\varphi_{jt}| + 10^{-5} \lambda_j \left( |\alpha_{jt}| + \|\beta_{jt}\|_1 \right)$$
• $\lambda_j$ chosen to minimize BIC (also tried 5-fold CV)
• The $10^{-5}$ factor means $x_{it}$ is less penalized than party. Alternatives to $10^{-5}$ were also tried
Computation

The multinomial likelihood requires $q_{itj}(z_i, x_{it}) = e^{u_{itj}} / \sum_l e^{u_{itl}}$, $u_{itj} = \alpha_{jt} + z_i \varphi_{jt} + x_{it}^\top \beta_{jt}$. Instead, the authors use a Poisson approximation
$$y_{itj} \sim \text{Poisson}\left( e^{\alpha_{jt} + z_i \varphi_{jt} + x_{it}^\top \beta_{jt}} \right)$$
• Avoids computing sums over phrases, i.e. $\sum_l e^{u_{itl}}$
• Parallel computation of $u_{itj}$ across phrases j
Other tricks to speed up the LASSO path etc.

Remark. If $v_j \sim \text{Poi}(\mu_j)$ indep $j = 1, \ldots, J$, then
$$v_1, \ldots, v_J \,\Big|\, \sum_j v_j \sim \text{Mult}\left( \sum_j v_j,\; \frac{\mu}{\sum_j \mu_j} \right)$$
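A quick simulation check of the remark (mu is an arbitrary choice):

set.seed(1)
mu <- c(2, 5, 3)
v <- replicate(10^5, rpois(3, mu))
cond <- v[, colSums(v) == 10]  # condition on the total being 10
rowMeans(cond) / 10            # approx mu/sum(mu) = (0.2, 0.5, 0.3)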
MLE

Partisanship: original & "random" data (permuted party affiliation $z_i$)

[Figure 1, Panel A: average partisanship from the plug-in MLE, real vs. random series, 1870-2010. Bands are 90% intervals obtained by sub-sampling; the "random" series assigns each speaker's party at random with probability equal to the share of Republicans in the sessions in which the speaker is active]

[Figure 1, Panel B: standardized polarization from Jensen et al (2012), real vs. random series, 1870-2010]
Option 1. Obtain $\hat{\rho}$, $\hat{q}$ independently

For each individual i: $\hat{q}$ uses only data from all $j \neq i$, $\hat{\rho}$ from i

[Figure 2: average partisanship from the leave-out estimator, real vs. random series, 1870-2010]
Option 2. LASSO

[Figure 3, Panel A: average partisanship from the preferred penalized estimator, real vs. random series, 1870-2010. Same pattern as the MLE, but smaller variance]

[Figure 3, Panel B: post-1976 series with key events marked: C-SPAN, C-SPAN2, Contract with America]

In 1994 Republicans won. Newt Gingrich led the platform Contract with America, which used focus groups/polling to set rhetoric resonating with voters
[Figure 3, Panel B detail: average partisanship 1976-2016, with presidential terms (Ford, Carter, Reagan, Bush, Clinton, Bush, Obama) and key events (C-SPAN, C-SPAN2, Contract with America) marked]

Consultant Frank Luntz was involved in the 1994 campaign. When asked if "language can change a paradigm", he answered:

"I don't believe it – I know it. I've seen it with my own eyes... I watched in 1994 when the group of Republicans got together and said: 'We're going to do this completely differently than it's ever been done before...'" (Luntz 2004)
Partisanship using multiple words

Partisanship measures the predictive power of hearing 1 phrase. What if we used > 1 phrase?

[Figure 4: expected posterior probability that a neutral observer would place on a speaker's true party vs. number of phrases heard, for sessions 1873-1874, 1989-1990 and 2007-2008, with one minute of speech marked; computed for each speaker i and session t given characteristics $x_{it}$]
Partisanship by topics

Partisanship if we only used words from one (manually defined) topic

[Figure 7: average partisanship 1870-2010 by topic (alcohol, budget, business, crime, defense, economy, education, elections, environment, federalism, foreign, government, health, immigration, justice), each panel also showing topic frequency]
Residential segregation

Goal: relationship between where an individual buys a property and
• Party affiliation (from American National Election Studies 2015 & Pew Research Center 2016)
• Political contributions (from Federal Election Commission 2015)

Same model as before
• $y_{itj} = 1$ if individual i buys a property at location j at time t ($m_{it} = 1$ properties per individual)
• $z_i$: party affiliation / party donated to
• No covariates $x_{it}$
[Figure, Panel A: average residential partisanship by party identification (j indexing counties), 1955-2010; left: MLE, right: preferred penalized estimator, real vs. random series]

[Figure, Panel B: average residential partisanship by party campaign contributions (j indexing zipcodes), 1980-2015; left: MLE, right: penalized estimator without covariates, real vs. random series]

Notes: the "random" series assigns each respondent/contributor a party at random with probability equal to the share of respondents/contributors who are Republican in that year.