Survey Data Analysis in Stata: Jeff Pitblado
Jeff Pitblado
Associate Director, Statistical Software
StataCorp LP
Stata Conference DC 2009
J. Pitblado (StataCorp) Survey Data Analysis DC 2009 1 / 44
Outline
1. Types of data
2. Survey data characteristics
3. Variance estimation
4. Estimation for subpopulations
5. Summary
Why survey data?
Collecting data can be expensive and time consuming.
Consider how you would collect the following data:
Smoking habits of teenagers
Birth weights for expectant mothers with high blood pressure
Using stages of clustered sampling can help cut down on the
expense and time.
Types of data
Simple random sample (SRS) data
Observations are "independently" sampled from a data generating
process.
Typical assumption: independent and identically distributed (iid)
Make inferences about the data generating process
Sample variability is explained by the statistical model attributed to
the data generating process
Standard data
We'll use this term to distinguish this data from survey data.
Types of data
Correlated data
Individuals are not assumed to be independent.
Cause:
Observations are taken over time
Random effects assumptions
Cluster sampling
Treatment:
Time-series models
Longitudinal/panel data models
cluster() option
Types of data
Survey data
Individuals are sampled from a fixed population according to a survey
design.
Distinguishing characteristics:
Complex nature under which individuals are sampled
Make inferences about the fixed population
Sample variability is attributed to the survey design
Types of data
Standard data
Estimation commands for standard data:
proportion
regress
We'll refer to these as standard estimation commands.
Survey data
Survey estimation commands are governed by the svy prefix.
svy: proportion
svy: regress
svy requires that the data is svyset.
Survey data characteristics
Multistage syntax
svyset [psu] [weight] [, strata(varname) fpc(varname)]
       [|| ssu [, strata(varname) fpc(varname)]]
       [|| ssu [, strata(varname) fpc(varname)]] ...
Stages are delimited by ||
SSU: secondary/subsequent sampling units
FPC is required at stage s for stage s + 1 to play a role in the
linearized variance estimator
Survey data characteristics
Poststratification
A method for adjusting sampling weights, usually to account for
underrepresented groups in the population.
Adjusts weights to sum to the poststratum sizes in the population
Reduces bias due to nonresponse and underrepresented groups
Can result in smaller variance estimates
Syntax
svyset ... poststrata(varname) postweight(varname)
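The weight adjustment described above can be sketched numerically. This is a minimal illustration, not Stata's implementation: it assumes each adjusted weight is the original weight times the known poststratum size divided by the estimated poststratum size (the sum of the original weights). The function name and the data are hypothetical.

```python
from collections import defaultdict

def poststratify(weights, poststrata, pop_sizes):
    """Rescale sampling weights so they sum to the known population
    size within each poststratum (illustrative helper, not a Stata API)."""
    # Estimated poststratum sizes: sum of the original weights.
    est_sizes = defaultdict(float)
    for w, k in zip(weights, poststrata):
        est_sizes[k] += w
    # Adjusted weight for observation j in poststratum k:
    # w_j * N_k / N-hat_k.
    return [w * pop_sizes[k] / est_sizes[k]
            for w, k in zip(weights, poststrata)]

# Hypothetical data: four respondents in two poststrata.
weights = [10.0, 10.0, 10.0, 10.0]
poststrata = ["a", "a", "a", "b"]
pop_sizes = {"a": 24.0, "b": 16.0}  # made-up population counts

adjusted = poststratify(weights, poststrata, pop_sizes)
print(adjusted)  # [8.0, 8.0, 8.0, 16.0]
```

Poststratum "b" is underrepresented in this made-up sample, so its single respondent receives a larger weight, and the adjusted weights sum to 24 and 16 within the two poststrata.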
Survey data characteristics
Example: svyset for poststratification
Strata with a single sampling unit
Big problem for variance estimation
Consider a sample with only 1 observation
svy reports missing standard error estimates by default
Finding these lonely sampling units
Use svydes:
Describes the strata and sampling units
Helps find strata with a single sampling unit
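To make concrete what a "lonely" sampling unit is, the following sketch counts the distinct sampling units per stratum and reports the strata that have exactly one. It does not reproduce svydes or its output; the function name and the identifiers are hypothetical.

```python
from collections import defaultdict

def single_unit_strata(strata, psus):
    """Return the strata that contain exactly one distinct sampling unit."""
    units = defaultdict(set)
    for h, i in zip(strata, psus):
        units[h].add(i)
    return sorted(h for h, ids in units.items() if len(ids) == 1)

# Hypothetical observation-level stratum and PSU identifiers.
strata = [1, 1, 1, 2, 2, 3, 3, 3]
psus = [1, 1, 2, 1, 1, 1, 2, 3]

lonely = single_unit_strata(strata, psus)
print(lonely)  # [2]
```

Stratum 2 has two observations but only one PSU, so there is no between-PSU variation in that stratum to estimate from.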
Strata with a single sampling unit
Example: svydes
Strata with a single sampling unit
Handling lonely sampling units
1. Drop them from the estimation sample.
2. Use svyset to choose one of the ad hoc adjustments in the singleunit() option.
3. Somehow combine them with other strata.
Certainty units
Sampling units that are guaranteed to be chosen by the design.
Certainty units are handled by treating each one as its own
stratum with an FPC of 1.
Variance estimation
Stata has three variance estimation methods for survey data:
Linearization
Balanced repeated replication
The jackknife
Variance estimation
Linearization
A method for deriving a variance estimator using a first-order Taylor
approximation of the point estimator of interest.
Foundation: Variance of the total estimator
Syntax
svyset ... [vce(linearized)]
Delta method
Huber/White/robust/sandwich estimator
Variance estimation
Total estimator: Stratified two-stage design
$y_{hijk}$: observed value from a sampled individual
Strata: $h = 1, \ldots, L$
PSU: $i = 1, \ldots, n_h$
SSU: $j = 1, \ldots, m_{hi}$
Individual: $k = 1, \ldots, m_{hij}$
$$\hat{Y} = \sum w_{hijk}\, y_{hijk}$$
$$\hat{V}(\hat{Y}) = \sum_h (1 - f_h) \frac{n_h}{n_h - 1} \sum_i (y_{hi} - \bar{y}_h)^2 + \sum_h f_h \sum_i (1 - f_{hi}) \frac{m_{hi}}{m_{hi} - 1} \sum_j (y_{hij} - \bar{y}_{hi})^2$$
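The two-stage variance formula for the total can be evaluated by hand on a tiny design. The sketch below makes three assumptions the slide does not spell out: the PSU-level and SSU-level y terms are weighted totals, the barred terms are their within-stratum and within-PSU means, and the f terms are the sampling fractions at each stage. The nested dictionary layout and all numbers are hypothetical.

```python
def total_and_variance(design):
    """Estimate a total and its linearized variance for a stratified
    two-stage design.

    design: {h: {"f": f_h,
                 "psus": {i: {"f": f_hi,
                              "ssus": {j: [(w, y), ...]}}}}}
    Needs at least two PSUs per stratum and two SSUs per PSU.
    """
    total = 0.0
    variance = 0.0
    for stratum in design.values():
        psu_totals = []
        stage2 = 0.0
        for psu in stratum["psus"].values():
            # Weighted SSU totals within this PSU.
            ssu_totals = [sum(w * y for w, y in obs)
                          for obs in psu["ssus"].values()]
            m_hi = len(ssu_totals)
            y_hi = sum(ssu_totals)
            ybar_hi = y_hi / m_hi
            ss = sum((t - ybar_hi) ** 2 for t in ssu_totals)
            stage2 += (1 - psu["f"]) * m_hi / (m_hi - 1) * ss
            psu_totals.append(y_hi)
        n_h = len(psu_totals)
        ybar_h = sum(psu_totals) / n_h
        ss1 = sum((t - ybar_h) ** 2 for t in psu_totals)
        # First-stage term plus the FPC-scaled second-stage term.
        variance += ((1 - stratum["f"]) * n_h / (n_h - 1) * ss1
                     + stratum["f"] * stage2)
        total += sum(psu_totals)
    return total, variance

# Hypothetical design: one stratum, two PSUs, two SSUs per PSU,
# sampling fraction 0.5 at each stage, so every weight is 4.
design = {
    1: {"f": 0.5, "psus": {
        1: {"f": 0.5, "ssus": {1: [(4.0, 1.0)], 2: [(4.0, 3.0)]}},
        2: {"f": 0.5, "ssus": {1: [(4.0, 2.0)], 2: [(4.0, 4.0)]}},
    }},
}

total, variance = total_and_variance(design)
print(total, variance)
```

Setting the stratum-level fraction to 0 in this sketch removes the second-stage term entirely, which is one way to see why an FPC is needed at the first stage for later stages to contribute.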
Variance estimation
Example: svy: total
Variance estimation
Linearized variance for regression models
Model is fit using estimating equations.
$$\hat{G}(\beta) = \sum_j w_j s_j x_j = 0$$
$$\hat{V}(\hat{\beta}) = D \, \hat{V}\{\hat{G}(\beta)\}\big|_{\beta = \hat{\beta}} \, D'$$
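Balanced repeated replication
$\hat{\theta}$: point estimates
$\hat{\theta}_{(i)}$: $i$th replicate of the point estimates
$\bar{\theta}_{(.)}$: average of the replicates
Default variance formula:
$$\hat{V}(\hat{\theta}) = \frac{1}{r} \sum_{i=1}^{r} \{\hat{\theta}_{(i)} - \bar{\theta}_{(.)}\}\{\hat{\theta}_{(i)} - \bar{\theta}_{(.)}\}'$$
Mean squared error (MSE) formula:
$$\hat{V}(\hat{\theta}) = \frac{1}{r} \sum_{i=1}^{r} \{\hat{\theta}_{(i)} - \hat{\theta}\}\{\hat{\theta}_{(i)} - \hat{\theta}\}'$$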
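The two replicate-based formulas differ only in where the replicates are centered. The sketch below handles the scalar case and assumes the replicate estimates have already been computed; it does not build balanced half-sample replicate weights. The function name and numbers are hypothetical.

```python
def replicate_variance(replicates, point_estimate=None):
    """Variance from r replicate estimates (scalar case).

    With point_estimate=None the replicates are centered at their own
    average; otherwise they are centered at the full-sample point
    estimate (the MSE form).
    """
    r = len(replicates)
    if point_estimate is None:
        center = sum(replicates) / r
    else:
        center = point_estimate
    return sum((t - center) ** 2 for t in replicates) / r

# Hypothetical replicate estimates and full-sample estimate.
replicates = [9.0, 11.0, 10.0, 12.0]
theta_hat = 10.0

v_default = replicate_variance(replicates)
v_mse = replicate_variance(replicates, theta_hat)
print(v_default, v_mse)  # 1.25 1.5
```

Because the average of the replicates minimizes the sum of squared deviations, the MSE form is never smaller than the default form.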
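The jackknife
$\hat{\theta}_{(h,i)}$: replicate of the point estimates from stratum $h$, PSU $i$
$\bar{\theta}_h$: average of the replicates from stratum $h$
$m_h = (n_h - 1)/n_h$: delete-1 multiplier for stratum $h$
Default variance formula:
$$\hat{V}(\hat{\theta}) = \sum_{h=1}^{L} (1 - f_h)\, m_h \sum_{i=1}^{n_h} \{\hat{\theta}_{(h,i)} - \bar{\theta}_h\}\{\hat{\theta}_{(h,i)} - \bar{\theta}_h\}'$$
Mean squared error (MSE) formula:
$$\hat{V}(\hat{\theta}) = \sum_{h=1}^{L} (1 - f_h)\, m_h \sum_{i=1}^{n_h} \{\hat{\theta}_{(h,i)} - \hat{\theta}\}\{\hat{\theta}_{(h,i)} - \hat{\theta}\}'$$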
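A delete-1 jackknife for a stratified single-stage design can be sketched as follows. It assumes that each replicate drops one PSU and rescales the weights of the remaining PSUs in the same stratum by n_h/(n_h - 1), and it handles only a scalar estimator. The data layout, function names, and numbers are hypothetical.

```python
def weighted_total(obs):
    """Scalar estimator used in the example: a weighted total."""
    return sum(w * y for w, y in obs)

def jackknife_variance(design, estimator, fpc=None, mse=False):
    """Delete-1 jackknife variance for {stratum: {psu: [(w, y), ...]}}.

    fpc maps a stratum to its sampling fraction f_h (0 if omitted).
    """
    fpc = fpc or {}
    full = [o for psus in design.values()
            for obs in psus.values() for o in obs]
    theta = estimator(full)
    variance = 0.0
    for h, psus in design.items():
        n_h = len(psus)
        scale = n_h / (n_h - 1)  # reweight the PSUs that remain in h
        reps = []
        for dropped in psus:
            sample = []
            for g, g_psus in design.items():
                for i, obs in g_psus.items():
                    if g == h and i == dropped:
                        continue  # delete this PSU
                    adj = scale if g == h else 1.0
                    sample.extend((w * adj, y) for w, y in obs)
            reps.append(estimator(sample))
        center = theta if mse else sum(reps) / n_h
        m_h = (n_h - 1) / n_h  # delete-1 multiplier
        variance += ((1 - fpc.get(h, 0.0)) * m_h
                     * sum((t - center) ** 2 for t in reps))
    return theta, variance

# Hypothetical design: two strata with 2 and 3 PSUs.
design = {
    1: {1: [(2.0, 1.0), (2.0, 3.0)], 2: [(2.0, 2.0), (2.0, 4.0)]},
    2: {1: [(3.0, 5.0)], 2: [(3.0, 7.0)], 3: [(3.0, 6.0)]},
}

theta, v_jk = jackknife_variance(design, weighted_total)
print(theta, v_jk)
```

For a total, the result (about 43 here) coincides with the single-stage linearized variance, which makes the example easy to check by hand.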
Estimation for subpopulations
Focus on a subset of the population
Subpopulation variance estimation:
Assumes the same survey design for subsequent data collection.
The subpop() option.
Restricted-sample variance estimation:
Assumes the identified subset for subsequent data collection.
Ignores the fact that the sample size is a random quantity.
The if and in restrictions.
Estimation for subpopulations
Total from SRS data
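Data is $y_1, \ldots, y_n$ and $S$ is the subset of observations.
$$\delta_j(S) = \begin{cases} 1, & \text{if } j \in S \\ 0, & \text{otherwise} \end{cases}$$
Subpopulation (or restricted-sample) total:
$$\hat{Y}_S = \sum_{j=1}^{n} \delta_j(S)\, w_j y_j$$
Sampling weight and subpopulation size:
$$w_j = \frac{N}{n}, \qquad \hat{N}_S = \sum_{j=1}^{n} \delta_j(S)\, w_j = \frac{N}{n}\, n_S$$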
Estimation for subpopulations
Variance of a subpopulation total
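Sample $n$ without replacement from a population comprised of the $N_S$
subpopulation values with $N - N_S$ additional zeroes.
$$\hat{V}(\hat{Y}_S) = \left(1 - \frac{n}{N}\right) \frac{n}{n - 1} \sum_{j=1}^{n} \left\{ \delta_j(S)\, w_j y_j - \frac{1}{n} \hat{Y}_S \right\}^2$$
Variance of a restricted-sample total
Sample $n_S$ without replacement from the subpopulation of $N_S$ values.
$$\hat{V}(\hat{Y}_S) = \left(1 - \frac{n_S}{\hat{N}_S}\right) \frac{n_S}{n_S - 1} \sum_{j=1}^{n} \delta_j(S) \left\{ w_j y_j - \frac{1}{n_S} \hat{Y}_S \right\}^2$$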
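The difference between the two variance estimators is easiest to see on a small SRS example. The sketch below assumes equal weights N/n and applies both formulas to the same data. The function name and data are hypothetical.

```python
def subpop_total_variances(y, in_subpop, N):
    """Subpopulation total from SRS data with both variance estimates.

    y: observed values; in_subpop: 0/1 indicators of membership in S;
    N: population size. Returns (total, subpopulation variance,
    restricted-sample variance).
    """
    n = len(y)
    w = N / n                  # SRS sampling weight
    n_S = sum(in_subpop)       # sample size within S
    N_S_hat = w * n_S          # estimated subpopulation size
    Y_S = sum(d * w * yj for yj, d in zip(y, in_subpop))

    # Subpopulation variance: all n observations contribute, with
    # zeros standing in for observations outside S.
    v_sub = (1 - n / N) * n / (n - 1) * sum(
        (d * w * yj - Y_S / n) ** 2 for yj, d in zip(y, in_subpop))

    # Restricted-sample variance: only observations in S contribute,
    # and n_S is treated as fixed.
    v_res = (1 - n_S / N_S_hat) * n_S / (n_S - 1) * sum(
        (w * yj - Y_S / n_S) ** 2
        for yj, d in zip(y, in_subpop) if d)
    return Y_S, v_sub, v_res

# Hypothetical SRS of n = 6 from N = 60; the first three are in S.
y = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
in_subpop = [1, 1, 1, 0, 0, 0]

Y_S, v_sub, v_res = subpop_total_variances(y, in_subpop, 60)
print(Y_S, v_sub, v_res)
```

Both approaches give the same total (60 here), but the subpopulation variance (about 864) is much larger than the restricted-sample variance (about 270) in this example, because the restricted-sample version ignores the randomness in how many sampled observations fall in S.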
Estimation for subpopulations
Example: svy, subpop()
Summary
1. Use svyset to specify the survey design for your data.
2. Use svydes to find strata with a single PSU.
3. Choose your variance estimation method; you can svyset it.
4. Use the svy prefix with estimation commands.
5. Use subpop() instead of if and in.