research methodology series

An Introduction
to Secondary Data Analysis
Natalie Koziol, MA
CYFS Statistics and Measurement Consultant

Ann Arthur, MS
CYFS Statistics and Measurement Consultant
Overview of Secondary Data Analysis
Understanding & Preparing Secondary Data
Brief Overview of Sampling Design
Analyzing Secondary Data
Illustration of Secondary Data Analysis
Other Logistical Considerations
Overview of
Secondary Data Analysis
What is Secondary Data Analysis?
In the broadest sense, analysis of data collected by someone else
(p. ix; Boslaugh, 2007)
Analysis of secondary data, where secondary data can include any
data that are examined to answer a research question other than the
question(s) for which the data were initially collected
(p. 3; Vartanian, 2010)
In contrast to primary data analysis in which the same individual/team
of researchers designs, collects, and analyzes the data
Local Examples of Research Involving
Secondary Data Analysis
Starting Off Right: Effects of Rurality on Parents' Involvement in
Children's Early Learning (Sue Sheridan, PPO)
Data from the Early Childhood Longitudinal Study Birth Cohort
(ECLS-B) were used to examine the influence of setting on parental
involvement in preschool and the effects of involvement on
Kindergarten school readiness.
Testing Thresholds of Quality Care on Child Outcomes Globally & in
Subgroups: Secondary Analysis of QUINCE and Early Head Start Data
(Helen Raikes, PPO)
Data from two secondary datasets were used to examine the
potentially non-linear relationship between quality of child care and
children's development
What are Secondary Data?
Come from many sources
Large government-funded datasets (the focus of this presentation)
University/college records
Statewide or district-level K-12 school records
Journal supplements
Authors' websites
Available for a seemingly unlimited number of subject areas
Quantitative (the focus of this presentation) and qualitative
Restricted and public-use
Direct (e.g., biomarker data) and indirect observation (e.g., self-report)
Where Can I Find Secondary Data?
Searching for secondary datasets:
Inter-University Consortium for Political and Social Research
National Center for Education Statistics
U.S. Census Bureau
Simple Online Data Archive for Population Studies (SodaPop)
Examples of Large Secondary Datasets for
Education & Social Sciences Research
Common Core of Data (CCD)
Current Population Survey (CPS)
Early Childhood Longitudinal Study (ECLS): Birth (ECLS-B) and Kindergarten (ECLS-K) Cohort
General Social Survey (GSS)
Head Start Family and Child Experiences Survey (FACES)
Monitoring the Future (MTF)
National Assessment of Educational Progress (NAEP)
National Education Longitudinal Study (NELS)
National Household Education Surveys (NHES)
National Longitudinal Study of Adolescent Health (Add Health)
National Longitudinal Survey of Youth (NLSY)
National Survey of American Families (NSAF)
National Survey of Child and Adolescent Well-Being (NSCAW)
National Survey of Families and Households (NSFH)
NICHD Study of Early Child Care and Youth Development (SECCYD)
Programme for International Student Assessment (PISA)
Progress in International Reading Literacy Study (PIRLS)
Trends in International Mathematics and Science Study (TIMSS)
U.S. Panel Study of Income Dynamics (PSID): Child Development Supplement (CDS)
Advantages of Secondary Data Analysis
Study design and data collection already completed
Saves time and money
Access to international and cross-historical data that would
otherwise take several years and millions of dollars to collect
Ideal for use in classroom examples, semester projects, master's
theses, dissertations, supplemental studies
Data may be of higher quality
Studies funded by the government generally involve larger samples
that are more representative of the target population (greater
external validity!)
Oversampling of low prevalence groups/behaviors allows for
increased statistical precision
Datasets often contain considerable breadth (thousands of variables)
Disadvantages of Secondary Data Analysis
Study design and data collection already completed
Data may not facilitate particular research question
Information regarding study design and data collection procedures
may be scarce
Data may potentially lack depth (the greater the breadth the harder it is
to measure any one construct in depth)
Constructs may be operationally defined by a single survey item or
a subset of test items, which can lead to reliability and validity problems
Post hoc attempts to construct measurement models may be
unsuccessful (survey items may not hang together)
Certain fields or departments (e.g., experimental programs) may place
less value on secondary data analysis
May require knowledge of survey statistics/methods which is not
generally provided by basic graduate statistics courses
Understanding & Preparing
Secondary Data
Understanding Secondary Data
Familiarize yourself with the original study and data!
Read all User's/Technical manuals
To whom are the results generalizable?
E.g., ECLS-B analyses involving data from kindergarten wave
can be used to make inferences about children born in the U.S.
in 2001 as they enter kindergarten (not to make inferences
about U.S. kindergarteners)
How are missing data handled?
What are the appropriate analysis weights?
What is the appropriate method (and what variables are necessary)
for computing adjusted standard errors?
What composite variables are available and how are they constructed?
Examine questionnaires and interview protocols when available
Identify skip patterns to determine coding of missing data;
example from the ECLS-B preschool parent interview:
For examining trends or growth, determine whether the same
construct is being measured across time
Interview questions may be modified across time
Example from an Opinion Research Business (ORB) survey
on conflict deaths in Iraq (Spagat & Dougherty, 2010):
Yes/No: There has been a murder of a member of my
family/relative (February 2007)
Yes/No: There has been a death as a result of
conflict/violence of a household member (August 2007)
Respondents (e.g., parent/guardian) may change over time
Different scales may be used across time (e.g., different cognitive
measures are used for infants and kindergarteners)
Check study website frequently for errors and/or updates
Example from

Ongoing panel (i.e., longitudinal) studies generally provide new
datasets after each wave of data collection
Always use the most up-to-date file! Scores developed using
item response theory may be recalibrated at each wave to
permit investigation of growth
Preparing Secondary Data
Document everything!
Save all syntax
Create an abridged codebook describing the original and recoded
variables of interest
Step 1: Transfer all potential data of interest to a new file in preferred
base program
Electronic codebooks (ECBs) greatly facilitate this process
Never alter the original datafile!
Step 2: Address missing data
Identify/label missing values in software program
When possible, use knowledge of skip patterns to recode missing
data as meaningful values
Select method for handling missing data (e.g., multiple imputation,
full-information maximum likelihood [FIML])
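As a rough sketch of Step 2, the Python snippet below uses a skip pattern to turn a "not applicable" code into a meaningful zero and flags the remaining missing codes. The item names (reads_to_child, times_per_week) and the negative codes are hypothetical, not taken from any particular dataset; check the codebook for the real values.

```python
# Hypothetical codes: -1 = not applicable (legitimate skip),
# -7 = refused, -8 = don't know, -9 = not ascertained.
NOT_APPLICABLE = -1
MISSING_CODES = {-7, -8, -9}


def recode_item(gate_answer, follow_up):
    """Recode a follow-up item using its gate (skip-pattern) question.

    gate_answer: 1 = yes, 2 = no (hypothetical coding of the gate item).
    follow_up:   reported count, or one of the negative codes above.
    """
    if follow_up == NOT_APPLICABLE and gate_answer == 2:
        # Respondent answered "no" to the gate item, so the skipped
        # follow-up is a true zero rather than missing.
        return 0
    if follow_up in MISSING_CODES or follow_up == NOT_APPLICABLE:
        # Genuinely missing: keep as None for imputation or FIML later.
        return None
    return follow_up


records = [
    {"id": 1, "reads_to_child": 1, "times_per_week": 5},
    {"id": 2, "reads_to_child": 2, "times_per_week": -1},
    {"id": 3, "reads_to_child": 1, "times_per_week": -9},
]
recoded = [recode_item(r["reads_to_child"], r["times_per_week"]) for r in records]
print(recoded)
```

Only the second record is recoded to 0; the third stays missing because the respondent was eligible for the question but no answer was recorded.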
Preparing Secondary Data
Step 3: Recode variables
Reverse code negatively worded items if creating scale scores
Dummy code dichotomous variables into values of 0, 1 (original
dataset may use values of 1, 2)
Recode other categorical variables (e.g., dummy or effect coding)
Combine separate but like variables
E.g., ECLS-B contained 2 kindergarten waves (only 75% of
children were in kindergarten in 2006); to analyze
kindergarteners, need to combine variables from waves 4 and 5
using if-else commands
Recode variables so that all responses are based on the same units
Example from ECLS-B Preschool Center Director Questionnaire:
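Separately from the questionnaire example, the Step 3 recodes can be sketched in Python as small helper functions. All function names, scale ranges, and unit labels here are hypothetical, and the 5-day week used for unit conversion is an assumption.

```python
def reverse_code(score, low=1, high=4):
    """Reverse a Likert-type item scored low..high (e.g., 1..4 becomes 4..1)."""
    return high + low - score


def dummy_code(value, one_value=1, zero_value=2):
    """Map a 1/2-coded dichotomy to 1/0; any other code stays missing (None)."""
    if value == one_value:
        return 1
    if value == zero_value:
        return 0
    return None


def combine_waves(in_kindergarten_wave4, score_wave4, score_wave5):
    """Take the score from whichever wave the child was in kindergarten."""
    return score_wave4 if in_kindergarten_wave4 else score_wave5


def to_hours_per_week(amount, unit):
    """Put responses reported in different units on one scale (hours per week)."""
    factors = {"hours_per_day": 5.0, "hours_per_week": 1.0}  # assumes a 5-day week
    return amount * factors[unit]


print(reverse_code(1), dummy_code(2), combine_waves(False, None, 42.0))
print(to_hours_per_week(6, "hours_per_day"))
```

Keeping each recode in its own function makes it easier to document in the abridged codebook and to rerun when an updated datafile is released.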
Preparing Secondary Data
Step 4: Create new variables
May need to recreate composite variables if you disagree with the original construction
E.g., An SES variable in the original datafile may be constructed
from income and parent education variables; secondary
researcher may want to construct new SES variable
Psychometric work
Create scores from individual items using factor analysis or item
response theory
Unfortunately, individual survey items do not always hang together
To avoid potentially biased variance estimates,
(a) incorporate measurement models directly into analysis, or
(b) output plausible values (e.g., Mislevy et al., 1992)
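One simple way to rebuild a composite such as SES is to average standardized components. The sketch below assumes two hypothetical components (income and years of parent education) with made-up values. It produces an observed composite only, so it does not address the measurement-error concern raised above.

```python
from statistics import mean, pstdev


def z_scores(values):
    """Standardize values to mean 0 and (population) standard deviation 1."""
    m, s = mean(values), pstdev(values)
    return [(v - m) / s for v in values]


# Hypothetical component variables for four families.
income = [20000, 40000, 60000, 80000]
parent_educ_years = [10, 12, 16, 18]

# Composite = mean of the standardized components for each family.
ses = [mean(pair) for pair in zip(z_scores(income), z_scores(parent_educ_years))]
print([round(s, 3) for s in ses])
```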
A Brief Overview of
Sampling Design
Sampling Design
Ideally, we want a sample that is perfectly representative of our target
population (we want to use sample results to make inferences, or
generalizations, about a larger population)
Types of probability sampling
Simple random sampling
Randomly sample individuals
Stratified sampling
Divide population into strata (groups); within each stratum,
randomly sample individuals
Cluster sampling
Population contains naturally occurring groups (e.g.,
classrooms); randomly sample groups
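The three designs can be sketched in Python with the standard random module. The population (9 classrooms of 20 students, 3 classrooms per grade) and all names are hypothetical.

```python
import random

rng = random.Random(2024)  # fixed seed so the draw is reproducible

# Hypothetical population: classrooms 1-3 are grade 1, 4-6 grade 2, 7-9 grade 3.
population = [
    {"student": f"c{c}s{s}", "classroom": c, "grade": (c - 1) // 3 + 1}
    for c in range(1, 10)
    for s in range(1, 21)
]


def simple_random_sample(units, n):
    """Randomly sample n individuals from the whole population."""
    return rng.sample(units, n)


def stratified_sample(units, stratum_key, n_per_stratum):
    """Divide units into strata, then randomly sample within each stratum."""
    strata = {}
    for u in units:
        strata.setdefault(u[stratum_key], []).append(u)
    sample = []
    for members in strata.values():
        sample.extend(rng.sample(members, n_per_stratum))
    return sample


def cluster_sample(units, cluster_key, n_clusters):
    """Randomly sample whole clusters and keep every unit inside them."""
    clusters = sorted({u[cluster_key] for u in units})
    chosen = set(rng.sample(clusters, n_clusters))
    return [u for u in units if u[cluster_key] in chosen]


srs = simple_random_sample(population, 30)
strat = stratified_sample(population, "grade", 10)
clus = cluster_sample(population, "classroom", 3)
print(len(srs), len(strat), len(clus))
```

Note that the stratified draw guarantees 10 students per grade, whereas the simple random sample leaves the grade mix to chance.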
Sampling Design

[Figure: simple random sampling and cluster sampling illustrated with Classes 1-9; stratified sampling illustrated with Grades 1-3]
Sampling Design
Simple random sampling
Assumed when performing conventional statistical analyses
No guarantee of a representative sample
May not be feasible (e.g., costly, impractical)
Stratified sampling
More control over representativeness
Allows for intentional oversampling which permits greater statistical
precision (i.e., decreases standard errors)
Cluster sampling
May be necessary (e.g., educational interventions may only be
possible at the classroom level)
Decreases statistical precision (individuals within groups tend to be
more similar so we have less unique information)
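The loss of precision under cluster sampling is often summarized with Kish's approximate design effect, deff = 1 + (m - 1) * ICC, where m is the average cluster size and ICC is the intraclass correlation. The sketch below assumes equal-sized clusters, and the numbers are hypothetical.

```python
def design_effect(avg_cluster_size, icc):
    """Kish's approximate design effect for equal-sized clusters."""
    return 1 + (avg_cluster_size - 1) * icc


def effective_sample_size(n, avg_cluster_size, icc):
    """Size of a simple random sample carrying the same information."""
    return n / design_effect(avg_cluster_size, icc)


# Hypothetical: 1,000 students in classrooms of 20, ICC = 0.10.
deff = design_effect(avg_cluster_size=20, icc=0.10)
n_eff = effective_sample_size(1000, 20, 0.10)
print(round(deff, 2), round(n_eff, 1))
```

Under these assumed values, the clustered sample of 1,000 carries roughly the information of a simple random sample about a third that size.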
Sampling Design
Statistical analyses should reflect sampling design
Point estimates (e.g., means) should be adjusted to take into account
unequal sampling probabilities
Standard errors should be adjusted to ensure correct level of
confidence in point estimates
Different statistical approaches exist for handling complex sampling
Multilevel modeling
Application of weights and alternative methods of variance estimation
Common approach when analyzing large secondary datasets due
to complexity of sampling design
Combination of approaches
Sampling Design
Sampling weights
The reciprocal of the inclusion probability; the number of
population units represented by unit i (p. 39; Lohr, 2010)
$w_i = 1/\pi_i$, where $\pi_i$ is the probability that unit i is in the sample

Necessary for obtaining accurate/generalizable point estimates
Construction of sampling weights is complex (based on multiple
stages of sampling, non-response, post-stratification, etc.)
Thankfully, large secondary datasets generally have
pre-constructed weights
However, multiple weights may exist for any one dataset
Appropriate selection and application of weights is the
responsibility of the secondary data analyst!
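A minimal Python sketch of why weights matter for point estimates: each weight is the reciprocal of the unit's inclusion probability, and the weighted mean lets each sampled unit stand in for the population units it represents. The scores and inclusion probabilities are hypothetical. Real datasets supply pre-constructed weights that also reflect non-response and post-stratification adjustments.

```python
def sampling_weight(inclusion_prob):
    """Weight = reciprocal of the probability that the unit is sampled."""
    return 1.0 / inclusion_prob


def weighted_mean(values, weights):
    """Weighted mean: sum of w*y divided by sum of w."""
    return sum(w * y for y, w in zip(values, weights)) / sum(weights)


# Hypothetical sample: three units from an oversampled group (1 in 10)
# and two units from a large group sampled at 1 in 100.
scores = [50, 52, 54, 70, 72]
incl_probs = [0.10, 0.10, 0.10, 0.01, 0.01]
weights = [sampling_weight(p) for p in incl_probs]

unweighted = sum(scores) / len(scores)
weighted = weighted_mean(scores, weights)
print(round(unweighted, 2), round(weighted, 2))
```

The unweighted mean is pulled toward the oversampled group. The weighted mean moves back toward the group that makes up most of the population.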
Sampling Design
Variance estimation
Alternative estimation necessary for computing correct standard
errors which influence tests of statistical significance
Does not influence point estimates
Multiple approaches
Taylor series linearization method
Involves specifying cluster and stratum variables
Replication methods
Balanced repeated replication (BRR)
Jackknife replication (JK1, JK2, JKn)
Choice of method depends on sampling design
Involves specifying series of replicate weights
Other methods (e.g., use of generalized variance functions)
Use the approach recommended in the User's manual
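To illustrate the replication idea, the Python sketch below builds delete-one-PSU (JK1-style) replicate weights for a toy sample and computes a standard error from them. It is a simplified sketch with hypothetical data. Large datasets ship their own replicate weights, and the multiplying coefficient differs across jackknife variants, so follow the manual rather than this example.

```python
from math import sqrt


def weighted_mean(values, weights):
    """Weighted mean: sum of w*y divided by sum of w."""
    return sum(w * y for y, w in zip(values, weights)) / sum(weights)


def jk1_replicate_weights(weights, psu_ids):
    """Delete one PSU per replicate and rescale the remaining weights."""
    psus = sorted(set(psu_ids))
    k = len(psus)
    return [
        [0.0 if p == dropped else w * k / (k - 1) for w, p in zip(weights, psu_ids)]
        for dropped in psus
    ]


def replicate_variance(values, full_weights, rep_weights, coef):
    """coef * sum over replicates of (replicate estimate - full estimate)^2."""
    full = weighted_mean(values, full_weights)
    reps = [weighted_mean(values, rw) for rw in rep_weights]
    return coef * sum((r - full) ** 2 for r in reps)


# Hypothetical data: 6 students in 3 PSUs (e.g., schools), equal weights.
y = [10, 12, 20, 22, 30, 32]
w = [1.0] * 6
psu = [1, 1, 2, 2, 3, 3]

rep_w = jk1_replicate_weights(w, psu)
n_reps = len(rep_w)
se = sqrt(replicate_variance(y, w, rep_w, coef=(n_reps - 1) / n_reps))
print(round(weighted_mean(y, w), 2), round(se, 3))
```

The point estimate uses only the full-sample weight. The replicate weights affect only the standard error, consistent with the note that variance estimation does not influence point estimates.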
Analyzing Secondary Data
Analyzing Secondary Data
Based on research question, identify appropriate statistical analysis
Select software package that will implement analysis and account for
complex sampling
Examine unweighted descriptive statistics to identify coding errors and
determine adequacy of sample size
Identify weights
Make sure missing weights are set to 0
Identify variance estimation method (and corresponding variables)
Conduct diagnostic analyses (identify outliers, non-normality, etc.)
Conduct primary analysis and interpret results!
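Two of the steps above (unweighted descriptives and setting missing weights to 0) can be sketched in Python as follows. The records, variable names, and the out-of-range score are hypothetical.

```python
from statistics import mean

rows = [
    {"score": 48.0, "wt": 120.5},
    {"score": 51.0, "wt": None},    # no value on this weight
    {"score": 55.0, "wt": 98.2},
    {"score": 950.0, "wt": 110.0},  # suspicious value, possibly a coding error
]

# Unweighted descriptives first: they expose coding errors and sample size.
scores = [r["score"] for r in rows]
summary = {
    "n": len(scores),
    "min": min(scores),
    "max": max(scores),
    "mean": mean(scores),
}

# Set missing weights to 0 instead of deleting the rows.
for r in rows:
    if r["wt"] is None:
        r["wt"] = 0.0

n_with_positive_weight = sum(1 for r in rows if r["wt"] > 0)
print(summary, n_with_positive_weight)
```

A zero weight keeps the case in the file while giving it no influence on weighted estimates.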
Analyzing Secondary Data
Other considerations
Inclusion of covariates
E.g., Age at time of child assessment (not always possible to
collect data at target age)
Analysis of subpopulations (see Lohr, 2010 for more information)
Additional specifications necessary (e.g., domain, subpop)
Don't delete cases
Protecting confidentiality
Some restricted-use datasets require unweighted sample sizes to
be rounded and/or estimates based on small sample sizes to be suppressed
Analysis involving multiple imputation or plausible values
(Asparouhov & Muthén, 2010; Enders, 2010)
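When an analysis is repeated over several imputed datasets or sets of plausible values, the results are commonly pooled with Rubin's rules. The Python sketch below assumes five hypothetical regression estimates and their squared standard errors. Software that handles imputations or plausible values directly generally does this pooling for you.

```python
from math import sqrt
from statistics import mean, variance


def combine_rubin(estimates, squared_ses):
    """Pool M estimates and their sampling variances using Rubin's rules."""
    m = len(estimates)
    pooled = mean(estimates)
    within = mean(squared_ses)     # average sampling variance
    between = variance(estimates)  # variance across datasets (divisor m - 1)
    total = within + (1 + 1 / m) * between
    return pooled, sqrt(total)


# Hypothetical results from 5 imputed datasets (or 5 plausible values).
est = [0.50, 0.54, 0.46, 0.52, 0.48]
sq_se = [0.010, 0.011, 0.009, 0.010, 0.010]

pooled, pooled_se = combine_rubin(est, sq_se)
print(round(pooled, 3), round(pooled_se, 4))
```

The pooled standard error is larger than any single-dataset standard error would suggest, because it adds the between-dataset variability to the average sampling variance.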
Analyzing Secondary Data
Software specifically developed for analyzing complex survey data
Generally free
Generally user-friendly but may lack flexibility (limited to certain
datasets, limited statistical analyses)
Useful for initial data exploration (particularly restricted data)
NCES tools for computing descriptive statistics, regressions
Data Analysis System (DAS):
AM Statistical Software
Descriptives, regression, some latent variable estimation
Relatively easy to incorporate plausible values
Analyzing Secondary Data
General-purpose software that can account for complex sampling
Can be expensive (R is free)
Generally syntax-based rather than drop-down menu
More flexible
SAS (certain analyses require SUDAAN add-on)
Analyzing Secondary Data
General-purpose software that can account for complex sampling
SPSS (requires Complex Samples add-on)
Users%20Guide%20v6.pdf (pp. 499-505, 521)
Other software options
Example SAS Syntax
*Taylor series linearization method;
PROC SURVEYMEANS data=yourdata varmethod=taylor;
strata stratavar;
cluster clustervar;
var varofinterest;
weight wtvar;
run;

PROC SURVEYREG data=yourdata varmethod=taylor;
strata stratavar;
cluster clustervar;
model outcomevar=predictorvar;
weight wtvar;
run;

*Jackknife method;
PROC SURVEYMEANS data=yourdata varmethod=jk;
repweights repwt1-repwtn;
var varofinterest;
weight wtvar;
run;

PROC SURVEYREG data=yourdata varmethod=jk;
model outcomevar=predictorvar;
repweights repwt1-repwtn;
weight wtvar;
run;

*Jackknife syntax varies by type (JK 1, 2, or n)

Example Stata Syntax
/*Taylor series linearization method*/
svyset clustervar [pweight = wtvar], strata(stratavar) vce(linearized)
svy: mean varofinterest
svy: regress outcomevar predictorvar

/*Jackknife method*/
svyset [pweight = wtvar], jkrweight(repwt1-repwtn) vce(jackknife) mse
svy: mean varofinterest
svy: regress outcomevar predictorvar

*Jackknife syntax varies by type (JK 1, 2, or n)

Example R Syntax
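library(survey)

#Taylor series linearization method
#data=yourdatafile completes the truncated call (assumed)
design.1 <- svydesign(id=~clustervar, strata=~stratavar, weights=~wtvar,
                      data=yourdatafile)
svymean(~varofinterest, design.1)
regmodel <- svyglm(outcome~predictor, design=design.1)

#Jackknife method
#repweights is a regular expression matching the replicate weight columns;
#weights= and data= complete the truncated call (assumed)
design.2 <- svrepdesign(repweights="repwt[0-9]+", type="JK1",
                        weights=~wtvar, data=yourdatafile)
svymean(~varofinterest, design.2)
regmodel <- svyglm(outcome~predictor, design=design.2)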

*Jackknife syntax varies by type (JK 1, 2, or n)

Example Mplus Syntax
!Taylor series linearization method;
TITLE: Example complex sample syntax;
DATA: FILE=yourfile;
VARIABLE:
NAMES=outcomevar predictorvar stratavar clustervar wtvar;
USEVARIABLES=outcomevar predictorvar;
STRATIFICATION=stratavar;
CLUSTER=clustervar;
WEIGHT=wtvar;
ANALYSIS: TYPE=COMPLEX;
MODEL: outcomevar ON predictorvar;

!Jackknife method;
TITLE: Example complex sample syntax;
DATA: FILE=yourfile;
VARIABLE:
NAMES=outcomevar predictorvar wtvar repwt1-repwtn;
USEVARIABLES=outcomevar predictorvar;
WEIGHT=wtvar;
REPWEIGHTS=repwt1-repwtn;
ANALYSIS: TYPE=COMPLEX; REPSE=JACKKNIFE1;
MODEL: outcomevar ON predictorvar;
*Jackknife syntax varies by type (JK 1, 2, or n)
An Illustration of
Secondary Data Analysis
Early Childhood Longitudinal Study Kindergarten Class of 1998-99
Two research questions:
Does the number of children's books in the home predict a child's
Tell Stories score as measured in the fall of kindergarten?
What is the average trajectory of math achievement as measured in
kindergarten through 8th grade?
Illustration: Download Data & Import into
Base Program
*Download data from

Change directory
to match location
of datafile
Illustration: Identify Weights

*Descriptions from ECLS-K User's Manual

Illustration: Determine Variance
Estimation Method

*Description from ECLS-K User's Manual

Illustration: Research Question 1
Conduct simple linear regression in SAS
LIBNAME spsslib SPSS "C:\Users\nkoziol\Documents\CYFS\ECLSK_SAS_File.por";
DATA work.eclsk;
SET spsslib.spssfile;
RUN;

PROC SURVEYMEANS data=eclsk varmethod=jk;
/* jkcoefs is necessary for the JK2 method */
repweights c1pw1-c1pw90 / jkcoefs = 0.999999999;
var c1scsto p1chlboo;
weight c1pw0;
RUN;

PROC SURVEYREG data=eclsk varmethod=jk;
/* p1chlboo is the number of children's books variable */
model c1scsto=p1chlboo;
repweights c1pw1-c1pw90 / jkcoefs = 0.999999999;
weight c1pw0;
RUN;
Illustration: Research Question 1
Results with weighting only (no variance adjustment)

Results with no weighting and no variance adjustment

*Results are for illustration purposes only; please do not cite or distribute.
Illustration: Research Question 2
Conduct 2nd order (quadratic) latent growth model in Mplus
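As a rough sketch of what a quadratic growth model implies, the Python snippet below builds the intercept, linear, and quadratic loadings from a set of time scores and computes the model-implied mean trajectory. The time scores (years since fall of kindergarten) and the factor means are hypothetical values chosen for illustration. They are not ECLS-K estimates, and the snippet does not fit a latent growth model.

```python
def implied_mean(time, intercept, linear, quadratic):
    """Model-implied mean at a time score under quadratic growth."""
    return intercept + linear * time + quadratic * time ** 2


# Hypothetical time scores for six assessment waves.
time_scores = [0.0, 0.5, 1.5, 3.5, 5.5, 8.5]

# Loadings: each row is (intercept, linear, quadratic) for one wave.
loadings = [(1.0, t, t ** 2) for t in time_scores]

# Hypothetical growth factor means.
growth_means = {"intercept": 25.0, "linear": 20.0, "quadratic": -1.0}

trajectory = [implied_mean(t, **growth_means) for t in time_scores]
print(trajectory)
```

A negative quadratic mean, as assumed here, describes growth that continues but slows over time.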
*Results are for illustration purposes only; please do not cite or distribute.
Other Considerations
Training Opportunities
2-5 day government- or other institution-sponsored workshops
AERA Institute on Statistical Analysis & AERA Faculty Institute
IES sponsored workshops
ICPSR 1 week offerings
1 day pre-annual meeting workshops
E.g., Early Childhood Surveys at NCES: The ECLS and NHES Data
Users Workshop (2011 SRCD biennial meeting)
Funding Opportunities
AERA Dissertation and Research Grants
NIH Grants
E.g., R40 Maternal & Child Health Research Secondary Data
Analysis Studies Grants
R21 grant mechanism (exploration)
IES Grants (exploration; Goal 1)
NAEP Secondary Analysis Grants
RFPs that encourage use of secondary data, for example, the
Social and Behavioral Context for Academic Learning RFP
AIR Grants
AIR, NCES, & NSF (2011). National Summer Data Policy Institute training materials.
Washington, D.C.
Asparouhov, T., & Muthén, B. (2010). Plausible values for latent variables using Mplus.
Technical Report.
Boslaugh, S. (2007). Secondary data sources for public health: A practical guide. New
York, NY: Cambridge.
Enders, C. K. (2010). Applied missing data analysis. New York, NY: Guilford.
Lohr, S. L. (2010). Sampling: Design and analysis (2nd Ed.). Boston, MA: Brooks/Cole.
Kiecolt, K. J., & Nathan, L. E. (1985). Secondary analysis of survey data. Newbury Park,
CA: Sage.
Kish, L. (1965). Survey sampling. New York: Wiley.
McCall, R. B., & Appelbaum, M. I. (1991). Some issues of conducting secondary analyses.
Developmental Psychology, 27, 911-917.
Mislevy, R. J., Beaton, A. E., Kaplan, B., & Sheehan, K. M. (1992). Estimating population
characteristics from sparse matrix samples of item responses. Journal of Educational
Measurement, 29, 133-161.
NCES (2011). ECLS-B database training seminar materials. Washington, D.C.
Spagat, M., & Dougherty, J. (2010). Conflict deaths in Iraq: A methodological critique of
the ORB survey estimate. Survey Research Methods, 4, 3-15.
Thomas, S. L., & Heck, R. H. (2001). Analysis of large-scale secondary data in higher
education research: Potential perils associated with complex sampling designs.
Research in Higher Education, 42, 517-540.
Tourangeau, K., Nord, C., Lê, T., Sorongon, A. G., & Najarian, M. (2009). Early Childhood
Longitudinal Study, Kindergarten Class of 1998-99 (ECLS-K), Combined User's
Manual for the ECLS-K Eighth-Grade and K-8 Full Sample Data Files and Electronic
Codebooks (NCES 2009-004). National Center for Education Statistics, Institute of
Education Sciences, U.S. Department of Education. Washington, DC.
Trzesniewski, K. H., Donnellan, M. B., & Lucas, R. E. (Eds.) (2011). Secondary data
analysis: An introduction for psychologists. Washington, D.C.: APA.
Vartanian, T. P. (2011). Secondary data analysis. New York, NY: Oxford.
For more information, please contact:
Natalie Koziol, [email protected]

