Survey Data Analysis in Stata

Jeff Pitblado
Associate Director, Statistical Software
StataCorp LP
Stata Conference DC 2009
J. Pitblado (StataCorp) Survey Data Analysis DC 2009 1 / 44
Outline
1. Types of data
2. Survey data characteristics
3. Variance estimation
4. Estimation for subpopulations
5. Summary
Why survey data?
Collecting data can be expensive and time consuming.
Consider how you would collect the following data:
Smoking habits of teenagers
Birth weights for expectant mothers with high blood pressure
Using stages of clustered sampling can help cut down on the
expense and time.
Types of data
Simple random sample (SRS) data
Observations are "independently" sampled from a data generating
process.
Typical assumption: independent and identically distributed (iid)
Make inferences about the data generating process
Sample variability is explained by the statistical model attributed to
the data generating process
Standard data
We'll use this term to distinguish these data from survey data.
Types of data
Correlated data
Individuals are assumed not independent.
Cause:
Observations are taken over time
Random effects assumptions
Cluster sampling
Treatment:
Time-series models
Longitudinal/panel data models
cluster() option
Types of data
Survey data
Individuals are sampled from a fixed population according to a survey
design.
Distinguishing characteristics:
Complex nature under which individuals are sampled
Make inferences about the fixed population
Sample variability is attributed to the survey design
Types of data
Standard data
Estimation commands for standard data:
proportion
regress
We'll refer to these as standard estimation commands.
Survey data
Survey estimation commands are governed by the svy prefix.
svy: proportion
svy: regress
svy requires that the data is svyset.
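A minimal sketch of the contrast between standard and survey estimation commands. The data are simulated and every variable name (stratum, psu, pw, x, y, smoker) is hypothetical, chosen only for illustration.

```stata
* Simulated data; all variable names are hypothetical
clear
set seed 2009
set obs 200
generate stratum = ceil(_n/100)        // 2 strata
generate psu     = ceil(_n/10)         // 20 clusters, 10 per stratum
generate pw      = 5                   // each observation represents 5 individuals
generate x       = rnormal()
generate y       = 1 + 0.5*x + rnormal()
generate smoker  = runiform() < 0.3

* Standard estimation commands: observations treated as iid
proportion smoker
regress y x

* Survey estimation: declare the design once, then add the svy prefix
svyset psu [pweight=pw], strata(stratum)
svy: proportion smoker
svy: regress y x
```

With constant weights the point estimates match in this sketch; only the standard errors change once the design is declared.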
Survey data characteristics
Single-stage syntax
svyset [psu] [weight] [, strata(varname) fpc(varname)]
Primary sampling units (PSU)
Sampling weights: pweight
Strata
Finite population correction (FPC)
Survey data characteristics
Sampling unit
An individual or collection of individuals from the population that can
be selected for observation.
Sampling groups of individuals is synonymous with cluster
sampling.
Cluster sampling usually results in inflated variance estimates
compared to SRS.
Survey data characteristics
Sampling weight
The reciprocal of the probability for an individual to be sampled.
Probabilities are derived from the survey design.
Sampling units
Strata
Typically considered to be the number of individuals in the
population that a sampled individual represents.
Reduces bias induced by the sampling design.
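As a worked illustration, assume the simple random sample of 200 drawn from the population of 1000 that this deck uses in its sampling figures. The symbol $\pi_j$ for the selection probability is notation introduced here, not taken from the slides.

```latex
\[
\pi_j = \frac{n}{N} = \frac{200}{1000} = 0.2, \qquad
w_j = \frac{1}{\pi_j} = 5, \qquad
\sum_{j=1}^{200} w_j = 200 \times 5 = 1000 = N
\]
```

Each sampled individual represents 5 individuals in the population, and the weights sum to the population size.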
Survey data characteristics
Strata
In stratified designs, the population is partitioned into well-defined
groups, called strata.
Sampling units are independently sampled from within each
stratum.
Stratification usually results in smaller variance estimates
compared to SRS.
Survey data characteristics
Finite population correction (FPC)
An adjustment applied to the variance due to sampling without
replacement.
Sampling without replacement from a finite population reduces
sampling variability.
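A minimal sketch of the effect of the FPC, assuming a simple random sample of 200 from a population of 1000. The data are simulated and the variable names (y, pw, N) are hypothetical.

```stata
* Simulated SRS of n = 200 from N = 1000; names are hypothetical
clear
set seed 2009
set obs 200
generate y  = rnormal(50, 10)
generate pw = 1000/200              // sampling weight N/n = 5
generate N  = 1000                  // population size, supplied as the FPC

svyset _n [pweight=pw]              // SRS, no FPC
svy: mean y

svyset _n [pweight=pw], fpc(N)      // SRS without replacement
svy: mean y
```

Under these assumptions the second standard error should be the first multiplied by sqrt(1 - 200/1000), roughly 0.894.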
Survey data characteristics
Example: svyset for single-stage designs
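A sketch of how the four single-stage designs might be declared. The data are simulated and the variable names (stratum, cluster, pw, y) are hypothetical; the same sample is reused for every declaration only to show the syntax, so the weights are not meant to be realistic for each design.

```stata
* Simulated data; all variable names are hypothetical
clear
set seed 2009
set obs 200
generate stratum = ceil(_n/100)     // 2 strata
generate cluster = ceil(_n/10)      // 20 clusters nested in the strata
generate pw      = 5                // sampling weight, assumed constant
generate y       = rnormal()

svyset _n [pweight=pw]                          // SRS: observations are the sampling units
svyset cluster [pweight=pw]                     // cluster sample
svyset _n [pweight=pw], strata(stratum)         // stratified sample
svyset cluster [pweight=pw], strata(stratum)    // stratified-cluster sample

svy: mean y                                     // uses the most recent declaration
```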
Survey data characteristics
Figures (images not preserved): Population 1000; SRS sample 200; Cluster sample 20 (208 obs); Stratified sample 198; Stratified-cluster sample 20 (215 obs)
Survey data characteristics
Multistage syntax
svyset psu [weight] [, strata(varname) fpc(varname)]
    [|| ssu [, strata(varname) fpc(varname)]]
    [|| ssu [, strata(varname) fpc(varname)]] ...
Stages are delimited by ||
SSU: secondary/subsequent sampling units
FPC is required at stage s for stage s + 1 to play a role in the
linearized variance estimator
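A minimal two-stage sketch of the syntax above. The data are simulated, and the variable names and population counts (25 PSUs per stratum, 10 SSUs per PSU) are assumptions made for illustration.

```stata
* Simulated stratified two-stage sample; names and counts are hypothetical
clear
set seed 2009
set obs 120
generate stratum = ceil(_n/60)          // 2 strata
generate psu     = ceil(_n/12)          // 10 PSUs, 5 sampled per stratum
generate ssu     = ceil(_n/4)           // 30 SSUs, 3 sampled per PSU
generate fpc1    = 25                   // PSUs per stratum in the population (assumed)
generate fpc2    = 10                   // SSUs per PSU in the population (assumed)
generate pw      = (fpc1/5)*(fpc2/3)    // inverse of the selection probability
generate y       = rnormal(50, 10)

* Stage 1 carries an FPC, so stage 2 contributes to the linearized variance
svyset psu [pweight=pw], strata(stratum) fpc(fpc1) || ssu, fpc(fpc2)
svy: total y
```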
Survey data characteristics
Poststratification
A method for adjusting sampling weights, usually to account for
underrepresented groups in the population.
Adjusts weights to sum to the poststratum sizes in the population
Reduces bias due to nonresponse and underrepresented groups
Can result in smaller variance estimates
Syntax
svyset ... poststrata(varname) postweight(varname)
Survey data characteristics
Example: svyset for poststratification
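A sketch of a poststratified declaration, assuming a sample of 100 from a population of 1000 in which one group is underrepresented. The data are simulated; grp, postsize, pw, and y are hypothetical names, and the population counts are invented for illustration.

```stata
* Simulated data; names and population counts are hypothetical
clear
set seed 2009
set obs 100
generate grp      = 1 + (_n > 30)           // 30 sampled from group 1, 70 from group 2
generate pw       = 10                      // design weight N/n = 1000/100
generate postsize = 500                     // assumed population count in each group
generate y        = rnormal(50 + 5*grp, 10)

* poststrata() names the groups; postweight() holds their population sizes
svyset _n [pweight=pw], poststrata(grp) postweight(postsize)
svy: mean y
```

The weights are adjusted so that each group's weights sum to its assumed population size of 500.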
Strata with a single sampling unit
Big problem for variance estimation
Consider a sample with only 1 observation
svy reports missing standard error estimates by default
Finding these lonely sampling units
Use svydes:
Describes the strata and sampling units
Helps find strata with a single sampling unit
Strata with a single sampling unit
Example: svydes
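A sketch of how svydes can expose a lonely sampling unit. The data are simulated with one stratum deliberately reduced to a single PSU; all variable names are hypothetical.

```stata
* Simulated data; all variable names are hypothetical
clear
set seed 2009
set obs 50
generate stratum = ceil(_n/10)          // 5 strata
generate psu     = ceil(_n/5)           // 2 PSUs per stratum ...
replace  psu     = 9 if stratum == 5    // ... except stratum 5, left with one PSU
generate pw      = 20
generate y       = rnormal()

svyset psu [pweight=pw], strata(stratum)
svydes                                  // describes strata and sampling units
svy: mean y                             // standard error is reported as missing
```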
Strata with a single sampling unit
Handling lonely sampling units
1. Drop them from the estimation sample.
2. Use svyset to specify one of the ad hoc adjustments in the singleunit() option.
3. Somehow combine them with other strata.
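A sketch of the singleunit() adjustments, using simulated data in which stratum 5 has a single PSU. All variable names are hypothetical.

```stata
* Simulated data with one single-PSU stratum; names are hypothetical
clear
set seed 2009
set obs 50
generate stratum = ceil(_n/10)
generate psu     = ceil(_n/5)
replace  psu     = 9 if stratum == 5    // stratum 5 has a single PSU
generate pw      = 20
generate y       = rnormal()

* Treat the lonely PSU as a certainty unit (no contribution to the variance)
svyset psu [pweight=pw], strata(stratum) singleunit(certainty)
svy: mean y

* Scaled version of the certainty adjustment
svyset psu [pweight=pw], strata(stratum) singleunit(scaled)
svy: mean y

* Center the lonely PSU at the grand mean instead of its stratum mean
svyset psu [pweight=pw], strata(stratum) singleunit(centered)
svy: mean y
```

The point estimate is the same in all three runs; only the standard error depends on the adjustment chosen.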
Certainty units
Sampling units that are guaranteed to be chosen by the design.
Certainty units are handled by treating each one as its own
stratum with an FPC of 1.
Variance estimation
Stata has three variance estimation methods for survey data:
Linearization
Balanced repeated replication
The jackknife
Variance estimation
Linearization
A method for deriving a variance estimator using a first-order Taylor
approximation of the point estimator of interest.
Foundation: Variance of the total estimator
Syntax
svyset ... [vce(linearized)]
Delta method
Huber/White/robust/sandwich estimator
Variance estimation
Total estimator: Stratified two-stage design
$y_{hijk}$: observed value from a sampled individual
Strata: $h = 1, \ldots, L$
PSU: $i = 1, \ldots, n_h$
SSU: $j = 1, \ldots, m_{hi}$
Individual: $k = 1, \ldots, m_{hij}$
\[ \hat{Y} = \sum w_{hijk}\, y_{hijk} \]
\[ \hat{V}(\hat{Y}) = \sum_h (1 - f_h) \frac{n_h}{n_h - 1} \sum_i (y_{hi} - \bar{y}_h)^2 + \sum_h f_h \sum_i (1 - f_{hi}) \frac{m_{hi}}{m_{hi} - 1} \sum_j (y_{hij} - \bar{y}_{hi})^2 \]
Variance estimation
Example: svy: total
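A minimal sketch of svy: total under a stratified cluster design with an FPC. The data are simulated; the variable names and the assumed 50 PSUs per stratum in the population are hypothetical.

```stata
* Simulated data; names and population counts are hypothetical
clear
set seed 2009
set obs 200
generate stratum = ceil(_n/100)     // 2 strata
generate psu     = ceil(_n/10)      // 10 sampled PSUs per stratum
generate npsu    = 50               // PSUs per stratum in the population (assumed)
generate pw      = npsu/10          // weight = 50/10 = 5, whole clusters observed
generate y       = rnormal(50, 10)

svyset psu [pweight=pw], strata(stratum) fpc(npsu)
svy: total y

* The point estimate is the weighted sum of y; weights are constant here
quietly summarize y
display pw[1]*r(sum)
```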
Variance estimation
Linearized variance for regression models
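Model is fit using estimating equations.
$\hat{G}(\beta)$ is a total estimator; use Taylor expansion to get $\hat{V}(\hat{\beta})$.
\[ \hat{G}(\beta) = \sum_j w_j s_j x_j = 0 \]
\[ \hat{V}(\hat{\beta}) = D\, \hat{V}\{\hat{G}(\beta)\}\big|_{\beta = \hat{\beta}}\, D' \]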
Variance estimation
Example: svy: logit
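A minimal sketch of a survey logit fit with linearized standard errors. The data are simulated and all variable names are hypothetical.

```stata
* Simulated data; all variable names are hypothetical
clear
set seed 2009
set obs 400
generate stratum = ceil(_n/100)                     // 4 strata
generate psu     = ceil(_n/20)                      // 5 PSUs per stratum
generate pw      = 5
generate x       = rnormal()
generate z       = runiform() < invlogit(-0.5 + x)  // binary outcome

svyset psu [pweight=pw], strata(stratum)            // vce(linearized) is the default
svy: logit z x
```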
Variance estimation
Balanced repeated replication
For designs with two PSUs in each of L strata.
Compute replicates by dropping a PSU from each stratum.
Find a balanced subset of the $2^L$ replicates: $L \le r < L + 4$.
The replicates are used to estimate the variance.
Syntax
svyset ... vce(brr) [mse]
Variance estimation
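BRR variance formulas
$\hat{\theta}$: point estimates
$\hat{\theta}_{(i)}$: $i$th replicate of the point estimates
$\bar{\theta}_{(.)}$: average of the replicates
Default variance formula:
\[ \hat{V}(\hat{\theta}) = \frac{1}{r} \sum_{i=1}^{r} \{\hat{\theta}_{(i)} - \bar{\theta}_{(.)}\}\{\hat{\theta}_{(i)} - \bar{\theta}_{(.)}\}' \]
Mean squared error (MSE) formula:
\[ \hat{V}(\hat{\theta}) = \frac{1}{r} \sum_{i=1}^{r} \{\hat{\theta}_{(i)} - \hat{\theta}\}\{\hat{\theta}_{(i)} - \hat{\theta}\}' \]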
Variance estimation
Example: svy brr: logit
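A sketch of BRR with the svy brr prefix, assuming a design with exactly two PSUs in each of four strata. The data are simulated, all variable names are hypothetical, and the Hadamard matrix h4 is built here by a Kronecker product purely for illustration.

```stata
* Simulated data with 2 PSUs in each of 4 strata; names are hypothetical
clear
set seed 2009
set obs 400
generate stratum = ceil(_n/100)                     // 4 strata
generate psu     = ceil(_n/50)                      // 2 PSUs per stratum
generate pw      = 5
generate x       = rnormal()
generate z       = runiform() < invlogit(-0.5 + x)

* Order-4 Hadamard matrix: r = 4 replicates for L = 4 strata
matrix h2 = (1, 1 \ 1, -1)
matrix h4 = h2 # h2

svyset psu [pweight=pw], strata(stratum)
svy brr, hadamard(h4): logit z x
```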
Variance estimation
The jackknife
A replication method for variance estimation. Not restricted to a
specific survey design.
Delete-1 jackknife: drop 1 PSU
Delete-k jackknife: drop k PSUs within a stratum
Syntax
svyset ... vce(jackknife) [mse]
Variance estimation
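Jackknife variance formulas
$\hat{\theta}_{(h,i)}$: replicate of the point estimates from stratum $h$, PSU $i$
$\bar{\theta}_h$: average of the replicates from stratum $h$
$m_h = (n_h - 1)/n_h$: delete-1 multiplier for stratum $h$
Default variance formula:
\[ \hat{V}(\hat{\theta}) = \sum_{h=1}^{L} (1 - f_h)\, m_h \sum_{i=1}^{n_h} \{\hat{\theta}_{(h,i)} - \bar{\theta}_h\}\{\hat{\theta}_{(h,i)} - \bar{\theta}_h\}' \]
Mean squared error (MSE) formula:
\[ \hat{V}(\hat{\theta}) = \sum_{h=1}^{L} (1 - f_h)\, m_h \sum_{i=1}^{n_h} \{\hat{\theta}_{(h,i)} - \hat{\theta}\}\{\hat{\theta}_{(h,i)} - \hat{\theta}\}' \]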
Variance estimation
Example: svy jackknife: logit
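A minimal sketch of the delete-1 jackknife with the svy jackknife prefix. The data are simulated and all variable names are hypothetical.

```stata
* Simulated data; all variable names are hypothetical
clear
set seed 2009
set obs 400
generate stratum = ceil(_n/100)                     // 4 strata
generate psu     = ceil(_n/20)                      // 5 PSUs per stratum
generate pw      = 5
generate x       = rnormal()
generate z       = runiform() < invlogit(-0.5 + x)

svyset psu [pweight=pw], strata(stratum)
svy jackknife: logit z x                            // one replicate per PSU (20 here)
```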
Variance estimation
Replicate weight variable
A variable in the dataset that contains sampling weight values that
were adjusted for resampling the data using BRR or the jackknife.
Typically used to protect the privacy of the survey participants.
Eliminates the need to svyset the strata and PSU variables.
Syntax
svyset ... brrweight(varlist)
svyset ... jkrweight(varlist [, ... multiplier(#)])
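A sketch of how jackknife replicate weights relate to the design variables, assuming an unstratified sample of 10 PSUs. The data are simulated; the replicate weight variables jkw_1 to jkw_10 are constructed by hand here and their names are hypothetical.

```stata
* Simulated data; all variable names are hypothetical
clear
set seed 2009
set obs 100
generate psu = ceil(_n/10)          // 10 PSUs, no strata
generate pw  = 5
generate y   = rnormal(50, 10)

* Delete-1 replicate weights: zero out one PSU, rescale the others by n/(n-1)
forvalues i = 1/10 {
    generate jkw_`i' = cond(psu == `i', 0, pw*10/9)
}

* Design declared through replicate weights only; multiplier is (n-1)/n = 0.9
svyset [pweight=pw], jkrweight(jkw_*, multiplier(0.9)) vce(jackknife)
svy: mean y

* Same design declared through the PSU variable, for comparison
svyset psu [pweight=pw], vce(jackknife)
svy: mean y
```

Under these assumptions the two standard errors should agree, which is why a data provider can release replicate weights in place of the strata and PSU identifiers.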
Estimation for subpopulations
Focus on a subset of the population
Subpopulation variance estimation:
Assumes the same survey design for subsequent data collection.
The subpop() option.
Restricted-sample variance estimation:
Assumes the identified subset for subsequent data collection.
Ignores the fact that the sample size is a random quantity.
The if and in restrictions.
Estimation for subpopulations
Total from SRS data
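Data is $y_1, \ldots, y_n$ and $S$ is the subset of observations.
\[ \delta_j(S) = \begin{cases} 1, & \text{if } j \in S \\ 0, & \text{otherwise} \end{cases} \]
Subpopulation (or restricted-sample) total:
\[ \hat{Y}_S = \sum_{j=1}^{n} \delta_j(S)\, w_j y_j \]
Sampling weight and subpopulation size:
\[ w_j = \frac{N}{n}, \qquad N_S = \sum_{j=1}^{n} \delta_j(S)\, w_j = \frac{N}{n}\, n_S \]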
Estimation for subpopulations
Variance of a subpopulation total
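Sample $n$ without replacement from a population composed of the $N_S$ subpopulation values with $N - N_S$ additional zeroes.
\[ \hat{V}(\hat{Y}_S) = \left(1 - \frac{n}{N}\right) \frac{n}{n - 1} \sum_{j=1}^{n} \left\{ \delta_j(S)\, w_j y_j - \frac{1}{n} \hat{Y}_S \right\}^2 \]
Variance of a restricted-sample total
Sample $n_S$ without replacement from the subpopulation of $N_S$ values.
\[ \hat{V}(\hat{Y}_S) = \left(1 - \frac{n_S}{N_S}\right) \frac{n_S}{n_S - 1} \sum_{j=1}^{n} \delta_j(S) \left\{ w_j y_j - \frac{1}{n_S} \hat{Y}_S \right\}^2 \]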
Estimation for subpopulations
Example: svy, subpop()
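A sketch contrasting subpopulation estimation with restricted-sample estimation. The data are simulated and all variable names, including the indicator female, are hypothetical.

```stata
* Simulated data; all variable names are hypothetical
clear
set seed 2009
set obs 200
generate stratum = ceil(_n/100)         // 2 strata
generate psu     = ceil(_n/10)          // 10 PSUs per stratum
generate pw      = 5
generate female  = runiform() < 0.5     // subpopulation indicator (0/1)
generate y       = rnormal(50, 10)

svyset psu [pweight=pw], strata(stratum)

* Subpopulation estimation: full design kept, subpopulation size treated as random
svy, subpop(female): mean y

* Restricted-sample estimation: observations outside the subset are dropped
svy: mean y if female
```

The two point estimates coincide, while the standard errors can differ because the if restriction treats the subset's sample size as fixed.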
Summary
1. Use svyset to specify the survey design for your data.
2. Use svydes to find strata with a single PSU.
3. Choose your variance estimation method; you can svyset it.
4. Use the svy prefix with estimation commands.
5. Use subpop() instead of if and in.