Using Plausible Values for Missing Data
Inga Laukaityte & Marie Wiberg (2016). Using plausible values in secondary analysis in large-scale assessments. Communications in Statistics - Theory and Methods.
Corresponding author: Inga Laukaityte, Umeå School of Business and Economics, Department of Statistics, Umeå University.
Abstract

Plausible values are typically used in large-scale assessment studies, in particular in the Trends in International Mathematics and Science Study and the Programme for International Student Assessment. Despite their widespread use, there are still open questions regarding how plausible values should be used and how such use affects statistical analyses. The aim of this paper is to demonstrate the role of plausible values in large-scale assessment surveys when multilevel modelling is used. Different user strategies concerning plausible values for multilevel models, as well as for means and variances, are examined. The results show that some commonly used strategies give incorrect results, while others give reasonable estimates but incorrect standard errors. These findings are important for researchers performing secondary analyses of large-scale assessment data.
Introduction
Large-scale assessment surveys contain large numbers of items, while testing time and the number of students are limited. Due to these time limitations, students receive only a subset (block) of all assessment items. For this reason, individual proficiency is measured with measurement error (von Davier, Gonzalez & Mislevy, 2009). In order to reflect the uncertainty of the measurement, several scores, or imputations, called plausible values (PVs), are provided for each individual. PVs have been successfully used to improve inference about latent variables in large-scale assessments, and research is ongoing to improve the practice. There are two issues of particular interest: how many PVs are needed, and how the specification of the imputation model affects subsequent analyses. This paper addresses an issue at the intersection of these two, namely the effect of the number of PVs when inferences about multilevel models (MLMs) are desired but an MLM has not been used to construct the PVs. Although publications on how one could create an MLM PV system are now beginning to appear (Mislevy, 1991; Li, 2012; Yang and Seltzer, 2016; Kuhfeld, 2016; and Rijmen, Jeon, von Davier, and Rabe-Hesketh, 2013), the problem addressed here remains important because the methods required are more complex and not readily accomplished by most secondary analysts, and many large-scale assessments provide public-use data with PVs generated with a single-level rather than a multilevel model. The aim of this paper is to analyze the role of PVs in large-scale assessment surveys when MLMs are used. In order to reach this goal, different user strategies (i.e., different numbers of PVs) for MLMs, as well as for means and variances, were examined using both simulations and real data from the Trends in
International Mathematics and Science Study (TIMSS) 2011. In line with the setup of large-scale
Most large-scale assessment databases provide five PVs (only NAEP, the National Assessment of Educational Progress, deviates from this), although there are no particularly strong reasons for choosing five values in the literature (Wu, 2005). Wu (2005)
showed in simulations that very often even one PV is sufficient to adequately recover the
population parameter when examining means and variances. However, it is important to note that if only one PV is used, the analyst has no information about the component of uncertainty in an estimate that is due to the latent nature of the proficiency variable. Having more than one PV improves both the accuracy of the estimate and the accuracy of its standard error. The general use of five PVs probably dates back to Rubin's (1987) relative efficiency formula, $1/(1+F/M)$, where F is the fraction of missing information and M the number of imputations. With 50% missing information, five imputations give point estimates that are about 91% as efficient as those based on an infinite number of imputations. Graham, Olchowski & Gilreath (2007) noted that
this might be too few if we are interested in other estimates, and recommended the use of 20 imputations for 10-30% missing information when examining loss of power in hypothesis testing. Bodner (2008) gave similar recommendations for the estimation of null hypothesis significance test p-values and confidence interval half-widths. It is important to keep in mind that there are different types of missingness and thus different ways to compute a proportion of missing information. Missingness can be planned (by design) or unplanned. This paper deals only with a planned missing data design, and thus the missingness is understood in terms
of information missing about the latent variables given the observed responses, as it is defined by
Orchard & Woodbury (1972). Mislevy (2013) showed how to calculate the proportion of
missingness in classical test theory when responses are inherently unobserved. It is also
important to mention that the secondary biases entailed by under-specification of the imputation model tend to decline as the amount of observed information increases.
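For a concrete sense of what the relative efficiency formula implies, it can be evaluated for a few values of the missing-information fraction F and the number of imputations M; the numbers below are a worked illustration, not results reported in the paper:

$\mathrm{RE}(M, F) = \left(1 + \frac{F}{M}\right)^{-1}, \qquad \mathrm{RE}(5,\, 0.5) = \left(1 + \frac{0.5}{5}\right)^{-1} \approx 0.91, \qquad \mathrm{RE}(20,\, 0.5) = \left(1 + \frac{0.5}{20}\right)^{-1} \approx 0.98.$

Moving from five to twenty imputations thus recovers only a few additional percentage points of efficiency for point estimates, which is why Graham et al. (2007) and Bodner (2008) motivate larger numbers of imputations by other criteria, such as power and the stability of standard errors.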
The framework and use of PVs was first developed for the analyses of US NAEP data in
1983-84 (Mislevy, 1991; Mislevy, Beaton, Kaplan, & Sheehan, 1992; Beaton & Gonzalez,
1995). The theory of PVs is based on Rubin’s (1987) work on multiple imputations. PVs are now
used in all NAEP surveys, the TIMSS and the Programme for International Student Assessment
(PISA). The role of PVs in large-scale assessments is not unambiguous, and the use of PVs has recently been discussed. For instance, Wu (2005) and von Davier et al. (2009) have presented
theoretical studies to illustrate the advantages of PVs over maximum likelihood estimates for
estimating a range of population statistics. Carstens & Hastedt (2010) have also illustrated the practical importance of PVs using TIMSS 2007 data. They mainly analyzed the effect on estimated means, standard deviations, and standard errors when PVs are used incorrectly or other IRT estimates are employed. Nevertheless, large-scale assessment data are not limited to the estimation of population statistics. This paper differs from the articles by Wu (2005), von Davier et al. (2009), and Carstens & Hastedt (2010), as the focus here is on MLMs (as described in, e.g., Gelman & Hill, 2006), which none of them covered. MLMs are becoming a common way of analyzing
complex survey data as these models take a multistage sampling design into account and enable
us to study the effect of cluster level variables on the individual outcomes. Monseur & Adams
(2009) explored different types of estimators for recovering variance components and a latent
correlation with PISA data. They showed that it is important to take the hierarchical structure of
data into account when secondary analyses are performed, and when student proficiency
estimates are generated. There is also a growing interest in fully Bayesian models (e.g.,
Gelman et al., 2013) as Bayesian models can take into account complex clustered structure
inherent in the sampling design that standard methods for multiple imputation can fail to capture.
They can incorporate multiple levels, covariates at different levels, and multivariate latent
variable models for cognitive responses (e.g., Johnson & Jenkins, 2005; Si & Reiter, 2013). All
previously mentioned studies have shown that PVs are the most efficient way of recovering different parameter estimates, such as means, variances, and correlations. However, to the best of our knowledge, there are no studies that show how many PVs are actually needed for the most common types of secondary analyses.
This paper is structured as follows. The next section describes proficiency estimation,
followed by a description of the data used and the set-up of the empirical and simulation studies. The fourth section presents the results from the empirical and simulation studies, and the last section contains a discussion with some concluding remarks.
Proficiency estimation
Due to the large number of items (to ensure reliable measurement of achievement) and time
limitations, students receive only a subset of all assessment items in studies like TIMSS, PISA,
NAEP, etc. In TIMSS 2011, all 434 mathematics and science items in grade 8 are partitioned
into a set of 14 student booklets. Student booklets are assembled from various combinations of
item blocks containing two blocks from mathematics and two blocks from science with 12-18
items per block (Mullis, Martin, Ruddock, O'Sullivan, & Preuschoff, 2011). The students’
responses from various booklets are linked together through the items because each item appears
in two booklets. Each student completes only one booklet, resulting in a large number of
responses that are missing by design. Students’ proficiencies must thus be estimated taking the
incomplete information into account. Several methods exist for estimating or drawing proficiencies; in this section, the maximum likelihood estimator (MLE), the weighted maximum likelihood estimator (WLE), the expected a posteriori estimator (EAP), and the PVs approach are described.
Assume we have a test of n binary scored items, which we model with an item response model. Let θ represent the proficiency (ability) of a student, let $P_l(\theta)$ denote the probability of a correct response to item l, and let $Q_l(\theta) = 1 - P_l(\theta)$. If $\mathbf{x} = (x_1, \dots, x_n)$ is the vector of n binary scored item responses, then the maximum likelihood estimator (MLE) of θ is the value that maximizes the likelihood

$L(\mathbf{x} \mid \theta) = \prod_{l=1}^{n} P_l(\theta)^{x_l} Q_l(\theta)^{1 - x_l}$.   (1)
The obtained MLE is a point estimator, which provides the same MLE proficiency estimate for every student with the same total score (Wu, 2005). One disadvantage of the MLE is that it has an asymptotic bias when item response models are used (Lord, 1983). Another disadvantage, when it is used with a Rasch item response model, is that an examinee with a perfect (or zero) score has no finite MLE. The weighted maximum likelihood estimator (WLE) was introduced by Warm (1989) in order to reduce this bias. It can be written as
$\hat{\theta}_{WLE} \approx \hat{\theta}_{MLE} + \frac{J}{2I^{2}}$,

where $I = \sum_{l=1}^{n} \frac{P_l'(\theta)^{2}}{P_l(\theta)Q_l(\theta)}$ is the test information, $J = \sum_{l=1}^{n} \frac{P_l'(\theta)P_l''(\theta)}{P_l(\theta)Q_l(\theta)}$, and $P_l'(\theta)$ and $P_l''(\theta)$ denote the first and second derivatives of $P_l(\theta)$ with respect to θ, the same derivatives that enter the first derivative of the likelihood function in Equation (1). The WLE is also a point estimator and provides a proficiency estimate for each total score. It is further less biased than the MLE and, unlike the MLE under the Rasch model, it provides finite estimates for perfect and zero scores.
The Bayesian expected a posteriori estimator (EAP) (Bock & Aitkin, 1981) is the mean of the posterior distribution of θ for each student. Let g(θ) be a chosen prior distribution of the proficiency. Then

$\hat{\theta}_{EAP}(\mathbf{x}) = \frac{\int \theta\, g(\theta) \prod_{l=1}^{n} P_{X_{il}}(\theta)\, d\theta}{\int g(\theta) \prod_{l=1}^{n} P_{X_{il}}(\theta)\, d\theta}$,

where $P_{X_{il}}(\theta)$ is the probability of student i scoring $X_{il}$ on item l conditional on the proficiency θ (Uebersax, 2002). Given a Rasch item response theory model, the EAP also provides one estimate for each total score, as do the previously mentioned likelihood estimators. A disadvantage of the EAP is that, in a given population, the variability of the EAP estimates of proficiency is typically less than the variability of θ because of shrinkage toward the mean (Kolen, 2006).
The EAP for a student can also be seen as the expected value of the PVs for that student, assuming the EAP is computed with a prior distribution containing the same information and conditioning variables as the model used to generate the PVs.
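To make the three point estimators concrete, the following sketch computes the MLE, Warm's WLE correction, and the EAP for a single simulated response vector under a Rasch model. It is an illustrative implementation with assumed item difficulties and a standard normal prior, not the operational code used by TIMSS or in this study.

```python
import numpy as np
from scipy.optimize import minimize_scalar

rng = np.random.default_rng(1)
b = rng.uniform(-2, 2, size=20)        # assumed item difficulties for a 20-item test
theta_true = 0.5                       # assumed true proficiency
x = (rng.random(20) < 1 / (1 + np.exp(-(theta_true - b)))).astype(int)  # Rasch responses

def p(theta):
    """Rasch item response probabilities P_l(theta)."""
    return 1 / (1 + np.exp(-(theta - b)))

def neg_log_lik(theta):
    """Negative log of the likelihood in Equation (1)."""
    pr = p(theta)
    return -np.sum(x * np.log(pr) + (1 - x) * np.log(1 - pr))

# MLE: maximize the likelihood in Equation (1)
theta_mle = minimize_scalar(neg_log_lik, bounds=(-6, 6), method="bounded").x

# WLE: Warm's first-order correction theta_MLE + J / (2 I^2)
pr = p(theta_mle)
q = 1 - pr
I = np.sum(pr * q)                     # P_l'^2 / (P_l Q_l) reduces to P_l Q_l under Rasch
J = np.sum(pr * q * (1 - 2 * pr))      # P_l' P_l'' / (P_l Q_l) reduces to P_l Q_l (1 - 2 P_l)
theta_wle = theta_mle + J / (2 * I**2)

# EAP: posterior mean under a N(0, 1) prior, computed by numerical integration on a grid
grid = np.linspace(-6, 6, 601)
log_post = -0.5 * grid**2 + np.array([-neg_log_lik(t) for t in grid])
post = np.exp(log_post - log_post.max())
theta_eap = np.sum(grid * post) / np.sum(post)

print(theta_mle, theta_wle, theta_eap)
```

With a well-targeted test, the three estimates are close; the differences grow for extreme scores and short tests, which is where the bias and shrinkage properties discussed above matter.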
Plausible values
To estimate students' proficiency, TIMSS uses the multiple imputation method, i.e. the PVs approach (Rubin, 1987; Mislevy, 1991). The PVs are generated using students' responses to the assessment items together with their background data, so that relationships between the background variables and the estimated proficiencies are appropriately accounted for in the PVs (Mislevy et al., 1992).

Denote the responses of student i to the background questions by $y_i$, and student i's item responses by $x_i$. The five PVs in TIMSS for each student i are then drawn from the conditional distribution

$P(\theta_i \mid x_i, y_i, \Gamma, \Sigma) \propto P(x_i \mid \theta_i)\, P(\theta_i \mid y_i, \Gamma, \Sigma)$,

where $P(x_i \mid \theta_i)$ is any chosen item response model, $P(\theta_i \mid y_i, \Gamma, \Sigma)$ is the regression of the proficiency on the background variables, Γ is a matrix of regression coefficients for the background variables, and Σ is a common variance matrix of the residuals. The result of the PVs approach is a set of M drawn values (PVs) for each student, where M is the number of PVs drawn. The analysis of interest is carried out separately with each of the M PV data sets, and the resulting estimates $\hat{D}_m$, m = 1, ..., M, are then combined into final point
estimates:
$\bar{D} = \frac{1}{M} \sum_{m=1}^{M} \hat{D}_m$.

The within-imputation variance $\bar{V} = \frac{1}{M} \sum_{m=1}^{M} V_m$ is obtained by averaging the estimated variances $V_m$ of the $\hat{D}_m$, and the total variance of $\bar{D}$ combines $\bar{V}$ and the between-imputation variance $B_M$:

$\mathrm{Var}(\bar{D}) = \bar{V} + \left(1 + \frac{1}{M}\right) B_M$,

where $B_M = \frac{1}{M-1} \sum_{m=1}^{M} (\hat{D}_m - \bar{D})^2$ (Mislevy, 1991; Schafer, 1997). It is important to note that the PVs approach gives consistent estimates of population parameters only if the PVs were generated using an imputation model that is compatible with the subsequent data analysis. Although the design of most large-scale assessments is hierarchical in nature, the population model underlying the generation of the PVs is typically a single-level model that does not identify the clustering of students within schools.
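As a concrete illustration of the combining rules above, the following sketch pools M parameter estimates and their estimated sampling variances across PV data sets. The estimates fed in would come from whatever analysis is repeated once per PV; the numbers below are placeholders, not values from the paper.

```python
import numpy as np

def pool_pv_estimates(estimates, variances):
    """Combine M PV-based estimates with the combining (Rubin's) rules.

    estimates : the D_hat_m, one per PV data set
    variances : the V_m, the estimated sampling variance of each D_hat_m
    Returns the pooled estimate, its total variance, and its standard error.
    """
    est = np.asarray(estimates, dtype=float)
    var = np.asarray(variances, dtype=float)
    M = est.size
    d_bar = est.mean()                      # pooled point estimate
    v_bar = var.mean()                      # within-imputation variance
    b_m = est.var(ddof=1)                   # between-imputation variance
    total_var = v_bar + (1 + 1 / M) * b_m   # total variance of the pooled estimate
    return d_bar, total_var, np.sqrt(total_var)

# Hypothetical regression coefficient estimated once per PV data set
d_hat = [0.52, 0.47, 0.55, 0.49, 0.51]
v_m = [0.010, 0.011, 0.009, 0.010, 0.012]
print(pool_pv_estimates(d_hat, v_m))
```

Averaging the PVs first and analyzing the averaged data set, by contrast, drops the between-imputation term and therefore understates the total variance, which is exactly the standard-error problem examined later in the paper.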
Conditioning variables
To estimate the characteristics of student populations and subpopulations (or subgroups, e.g. by gender, ethnicity, or socioeconomic status), the drawing of the PVs (or the EAP) must take the group structure into account (Wu, 2005). The population model can include several conditioning variables, which yields a population distribution that is a mixture of many normal distributions, with conditional distribution

$g(\theta) \sim N(\beta_1 z_1 + \beta_2 z_2 + \cdots,\ \sigma^2)$,
where $z_1, z_2, \dots$ are conditioning or background variables and $\beta_1, \beta_2, \dots$ the corresponding regression coefficients. By including all available background
data in the model and correctly specifying the relationships to be addressed in secondary
analysis, relationships between the background variables and the estimated proficiencies will be
appropriately accounted for in the PVs (Mislevy et al., 1992; Martin & Mullis, 2012).
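The sketch below illustrates, in simplified form, how PVs could be drawn for one student: the item response likelihood is combined with a normal latent regression on conditioning variables, and M values are sampled from the resulting posterior on a grid. The regression coefficients, residual variance, item difficulties, and responses are made-up numbers for illustration only.

```python
import numpy as np

rng = np.random.default_rng(7)

b = rng.uniform(-2, 2, size=20)            # assumed item difficulties
x = rng.integers(0, 2, size=20)            # one student's (fake) scored responses
z = np.array([1.0, 1, 0, 1])               # conditioning variables (intercept + 3 dummies)
gamma = np.array([0.1, 0.4, 0.3, -0.2])    # assumed latent-regression coefficients
sigma2 = 0.8                               # assumed residual variance

grid = np.linspace(-5, 5, 1001)

# Item response likelihood under a Rasch model (Equation (1))
p = 1 / (1 + np.exp(-(grid[:, None] - b[None, :])))
log_lik = np.sum(x * np.log(p) + (1 - x) * np.log(1 - p), axis=1)

# Latent regression prior P(theta | y, Gamma, Sigma): N(z'gamma, sigma2)
mu = z @ gamma
log_prior = -0.5 * (grid - mu) ** 2 / sigma2

# Posterior on the grid, normalized, from which M plausible values are drawn
post = np.exp(log_lik + log_prior - np.max(log_lik + log_prior))
post /= post.sum()
M = 5
pvs = rng.choice(grid, size=M, p=post)
print(pvs)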
Statistical analysis
Data
The publicly available data from the international large-scale assessment study TIMSS 2011, grade 8 mathematics (IEA, 2011), were used. Three countries were chosen for the real data
analysis based on their average mathematics achievement. These countries were chosen since
they represent different parts of the achievement scale, i.e. below the international mathematics
average score (Sweden), close to the average score (Slovenia), and above the average score
(USA). Simulated data, which mimic the TIMSS database, were also used, as described in detail in the simulation study subsection below.
A full MLM was used in the empirical and simulation study. Constructed student level
and school level factors were used. The mathematics achievement for each student was estimated
as a function of the school factors controlling for student level factors. The student level and
school level factors were of secondary interest and were thus chosen based on a previous study
(Wiberg, Rolfsman & Laukaityte, 2013). The factors used, derived from the student and school
questionnaires, are presented below. The response variable is the students' mathematics achievement.
[ATM] Attitude towards mathematics: This was based on students' responses to: (1) I enjoy learning mathematics, (2) Mathematics is boring, and (3) I like mathematics. Possible responses were: 1 = Agree a lot, 2 = Agree a little, 3 = Disagree a little, and 4 = Disagree a lot. Note that (2) was reverse coded. The responses were averaged and classified into three categories: 1 = Low, average was greater than or equal to 3; 2 = Medium, average was greater than 2 and less than 3; 3 = High, average was less than or equal to 2.
[SES] Socioeconomic status: This was based on students' responses to the following indicators: (1) Books at home: 1 = 0-10 books, 2 = 11-25 books, 3 = 26-100 books, 4 = 101-200 books, and 5 = more than 200 books, recoded into 1 = Low (0-25 books), 2 = Medium (26-200 books), and 3 = High (more than 200 books); (2) Possession of educational home resources (computer and study desk), categorized as 1 = student had none or one of the items and 2 = student had both items. The two indicators were averaged and classified into: 1 = Low, average was equal to 1; 2 = Medium, average was greater than 1 and less than or equal to 2; and 3 = High, average was greater than 2.
[GA] Good attendance: The school principals' answers to how severe each of two negative student behaviors, arriving late at school and absenteeism, is among eighth-grade students at the school. Possible responses were: 1 = Not a problem, 2 = Minor problem, 3 = Moderate problem, 4 = Serious problem. The responses were summed and classified
into three categories: 1 = Low, sum was greater than 6; 2 = Medium, sum was greater
than 3 and less than or equal to 6; 3 = High, sum was less than or equal to 3.
[SLOC] School location: The school principals' answers regarding the number of people living in the area where the school is located. Responses were classified into two categories: 0 = non-urban area and 1 = urban area.
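As an illustration of how such factors can be constructed from the raw questionnaire responses, the sketch below recodes the three ATM items into the Low/Medium/High categories described above. The column names and the small data frame are hypothetical; the cut points follow the description in the text.

```python
import numpy as np
import pandas as pd

# Hypothetical raw responses on the 1 = Agree a lot ... 4 = Disagree a lot scale
df = pd.DataFrame({
    "enjoy_math":  [1, 4, 2, 3],
    "math_boring": [4, 1, 3, 2],   # item (2), worded negatively
    "like_math":   [1, 4, 2, 3],
})

df["math_boring_rev"] = 5 - df["math_boring"]                  # reverse code item (2)
avg = df[["enjoy_math", "math_boring_rev", "like_math"]].mean(axis=1)

# 1 = Low (avg >= 3), 2 = Medium (2 < avg < 3), 3 = High (avg <= 2)
df["ATM"] = np.select([avg >= 3, avg > 2], [1, 2], default=3)
print(df["ATM"].tolist())
```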
The characteristics of the constructed factors are given in detail in Table 1. These factors
are used in the empirical and simulation studies that follow. In the simulation study, the factors are used as conditioning variables. If conditioning variables are categorical and have more than two categories, they must be recoded into dummy variables. In order to simplify the simulation study, the factors ATM, SES and GA were simulated as binary conditioning variables. In large-scale
assessments, sampling weights are used to avoid bias in the parameter estimates, which can arise due to unequal probabilities of selecting a school, a class and a student, or due to unit non-response. The full MLM contained student-related factors weighted with student weights at the student level. Student weights were calculated by multiplying the student and classroom weighting factors by their respective weighting adjustments. At the school level, school-related factors and aggregated means of the student level measures (aATM and aSES), weighted with school weights, were included. The full MLM can be written as
$Y_{ij} = \beta_{0j} + \beta_{1j}\,\mathrm{ATM}_{ij} + \beta_{2j}\,\mathrm{SES}_{ij} + r_{ij}$,
$\beta_{0j} = \gamma_{00} + \gamma_{01}\,\mathrm{GA}_{j} + \gamma_{02}\,\mathrm{SLOC}_{j} + \gamma_{03}\,\mathrm{aATM}_{j} + \gamma_{04}\,\mathrm{aSES}_{j} + u_{0j}$,

where $Y_{ij}$ denotes mathematics achievement for student i within school j, $r_{ij}$ is the error term representing a unique effect associated with student i in school j, $u_{0j}$ is the error term representing a unique effect associated with school j, and $\beta_{1j}$, $\beta_{2j}$ and $\gamma_{0k}$ (k = 1, ..., 4) are the regression coefficients of the student level and school level factors, respectively.
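A minimal sketch of fitting a model of this form is given below, assuming a data frame with one row per student, a school identifier, and the constructed factors; it uses a random-intercept specification in statsmodels and ignores the TIMSS sampling weights, which the operational analysis (run in MPLUS) does take into account. In practice the fit is repeated once per PV and the results are pooled with the combining rules shown earlier.

```python
import pandas as pd
import statsmodels.formula.api as smf

def fit_one_pv(df: pd.DataFrame, pv_column: str):
    """Fit the random-intercept MLM for a single plausible value column.

    df is assumed to contain the columns ATM, SES, GA, SLOC, aATM, aSES,
    a school identifier 'school', and the PV column named by pv_column.
    """
    formula = f"{pv_column} ~ ATM + SES + GA + SLOC + aATM + aSES"
    model = smf.mixedlm(formula, data=df, groups=df["school"])  # random intercept per school
    return model.fit()

# results = [fit_one_pv(df, f"math_pv{m}") for m in range(1, 6)]
# The fixed-effect estimates and their variances from the five fits would then be
# pooled with the combining rules (see the pooling sketch in the Plausible values section).
```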
Empirical study
Real data from TIMSS 2011 were used to construct the student level and school level factors,
which were modeled with the full MLM for the three countries using MPLUS 7 (Muthén &
Muthén, 2012). The main interest was to examine how the MLM analysis was affected by using
different PV strategies. The TIMSS 2011 database provides five PVs for mathematics
achievement. The amount of missing data at the student level was low, ranging from a minimum of 1% for SES in Slovenia to a maximum of 5% for ATM in Sweden. For the sake of simplicity, listwise deletion was therefore used (Tabachnick and Fidell, 2007). The full-information
maximum likelihood procedure was used to handle the missing data at school level because
excluding a school means excluding all students at that school (Schafer & Graham, 2002).
Simulation study
The purpose of the simulation study was to investigate properties of PVs in MLMs and to
compare them with other estimators such as WLE, MLE and EAP. The simulation study
consisted of two parts. In the first part, the intention was to compare the effectiveness of
recovering the population mean and variance using a different number of PVs (1, 5, 7, 10, 20, 40,
and 100), and also different proficiency estimators (WLE, MLE, EAP, and PVs). For this
purpose, a population model with four simulated binary conditioning variables (ATM, SES, GA
and SLOC) was used. The conditioning variables were simulated using a random generation function for the Bernoulli distribution in the R software (R Development Core Team, 2014). The proportions of the categories were taken from the real data for Sweden. Using the Conquest software, the population mean and variance for each of the earlier mentioned estimators were then estimated, with the proficiency (ability) mean set at 0 or 2 at both the student and school levels. Item parameters were randomly generated from a uniform distribution U(-2, 2). The simulations in this part were repeated 100 times.
In the second part of the simulation study, the full MLMs were used to compare parameter estimates obtained by using different numbers of PVs in two different cases.

Case 1: using 50% of the items. The responses of 4,000 students (800 schools with 50 students in each) were simulated for a 40-item test with difficulty distribution U(-2, 2), using a simple Rasch item response theory model. A part of the responses was then deleted according to a booklet design in which the 40 items were divided into four blocks (A, B, C, D) with 10 items per block. Every student received two blocks, i.e. 20 items out of 40. Blocks were combined in the following way: (AB), (BC), (CD), (DA). Under this design,
every student responds to 50% of the items. Proficiencies were drawn from the population model described in the first part of the simulation study.
Case 2: using 25% of the items. In this case, the same responses simulated for the previous case (before the elimination of some responses) were used. A part of the responses was then deleted according to a booklet design in which the 40 items were divided into eight blocks (A to H) with 5 items per block. Every student received two blocks, i.e. 10 items out of 40. Blocks were combined in the following way: (AB), (BC), (CD), (DE), (EF), (FG), (GH), (HA). Under this design, every student responds to 25% of the items. Proficiencies were drawn in the same way as in the previous case.
In addition to the simulated items, four binary conditioning variables, ATM, SES, GA
and SLOC, simulated in the first part of the simulation study, were used. Simulations were
repeated 30 times, as they are very time consuming to conduct. For each of the 30 replicates an
item response model was estimated from the data using Conquest 3.0 (Wu, Adams, Wilson &
Haldane, 2007), and different numbers (1, 3, 5, 7, 10, 20, 40, 50, 60, and 100) of PVs, as well as the MLE, WLE and EAP estimates, were drawn. The PVs and EAP were drawn considering the simulated
item responses and taking into account the four conditioning variables in the same way as in
large--scale assessments (using a single-level population model). The MLE and WLE estimates
depend only on the item response data, and are thus not influenced by the conditioning variables.
After proficiencies had been drawn, the full MLM for each replicate was run using MPLUS 7 (Muthén & Muthén, 2012).
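The rotated block design of Case 1 can be sketched as follows: complete Rasch responses are generated for clustered proficiencies and four Bernoulli conditioning variables, and the responses outside a student's two assigned blocks are then set to missing. The cluster sizes, variances, and proportions below are illustrative placeholders rather than the exact values used in the study.

```python
import numpy as np

rng = np.random.default_rng(2011)

n_schools, n_per_school, n_items = 80, 50, 40      # illustrative cluster structure
n_students = n_schools * n_per_school

# School- and student-level proficiency components (variance 1 at each level)
school_effect = rng.normal(0, 1, n_schools).repeat(n_per_school)
theta = school_effect + rng.normal(0, 1, n_students)

# Four binary conditioning variables with illustrative proportions
props = {"ATM": 0.5, "SES": 0.4, "GA": 0.6, "SLOC": 0.3}
conditioning = {name: rng.binomial(1, p, n_students) for name, p in props.items()}

# Complete Rasch responses for a 40-item test with difficulties from U(-2, 2)
b = rng.uniform(-2, 2, n_items)
p_correct = 1 / (1 + np.exp(-(theta[:, None] - b[None, :])))
responses = (rng.random((n_students, n_items)) < p_correct).astype(float)

# Booklet design: blocks A-D of 10 items, booklets (AB), (BC), (CD), (DA)
blocks = [np.arange(k * 10, (k + 1) * 10) for k in range(4)]
booklets = [(0, 1), (1, 2), (2, 3), (3, 0)]
assigned = rng.integers(0, 4, n_students)           # booklet assigned to each student

for i, booklet in enumerate(assigned):
    keep = np.concatenate([blocks[j] for j in booklets[booklet]])
    mask = np.ones(n_items, dtype=bool)
    mask[keep] = False
    responses[i, mask] = np.nan                      # items not administered: missing by design

print(np.isnan(responses).mean())                    # about 0.5 of responses missing by design
```

Case 2 follows the same logic with eight 5-item blocks, giving roughly 75% of the responses missing by design.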
Results
Empirical study
The empirical study shows that the results obtained using only one PV out of five differ greatly
from one another. The parameter estimates are also different from those obtained using all five
PVs, see Table 2. By saying that all five PVs are used, we mean that the analysis is carried out
separately with each of the five imputed PV data sets. Results are then combined at the level of
the targeted inference, as is shown in the subsection about PVs. The case of using the average of
five PVs gives the closest results to the case of using all five PVs. The parameter estimates for
the student level factors are even identical. The largest difference between the average of the PVs
and all five PVs is in the estimation of the within-school variance. In the models for all three countries, the within-school variances are much smaller when the average of the PVs is used. When
using a single PV, the parameter estimates and standard errors are smaller or larger than the ones obtained with all five PVs, depending on which PV is chosen. We should also note that there is no systematic pattern as to which single PV performs best.
Simulation study
The simulation results for the 20-item test are presented in Table 3. When abilities are drawn
by setting the mean at 0 and the between- and within-variances at 1, the PVs give estimates
closest to the generating values, especially with regard to variance. Increasing the number of
students to 8,000 by doubling the number of schools yields better variance estimation but more biased means. If, instead of increasing the number of schools, we double the school size, the mean and variance estimates become more biased. It is also of interest to
note that, comparing the different numbers of PVs used, there is no clear evidence that a larger number of PVs gives more accurate results. If we instead examine a test where the average ability of the population is higher than the average item difficulty, the estimates become more biased, especially the variances. As in the previous case, the increase in
the number of students slightly reduced the variance estimation bias, but increased the bias for
the mean.
The increase in test length to 40 items did reduce the bias in the estimates when 8,000
students were used but estimation became less precise for a test with 4,000 students and abilities
with a mean of 0. The PVs again recover the variances better than the point estimators do.
Results for a test with 40 items have been omitted, as they are similar to those for the 20-item test. The analysis of the real and simulated data showed that using a single PV is not the best
choice for the estimation of the population parameters. Estimates vary from one PV to another.
The obvious question is therefore what number of PVs should be chosen to get reliable results.
The case when students get a test containing 50% of all possible items is presented in
Table 4. The generating values for the coefficients of the four conditioning variables and for the within- and between-school variances were set equal to 1. As Table 4 shows, in all presented cases with PVs
we have very similar estimates, with slightly larger differences in the within-school variance.
The standard errors of the estimates become relatively stable starting with ten PVs. In this case,
ten PVs therefore appear to be sufficient to obtain reliable estimates. If we compare the PVs with
other estimators, we can see that the MLE and EAP estimates of the student- and school-level
factors are very close to those of the PVs. However, the EAP underestimates the within- and overestimates the between-school variance. The MLE overestimates both the within- and between-school variances. The within-school variance closest to the true value was obtained by the WLE.

In the case when students get a test containing 25% of all possible items, more than
20 PVs should be drawn (see Table 5). The decrease in number of provided items did not change
the relations between the PVs, MLE, WLE and EAP estimators; however, the estimation of the variance components became less precise.
Discussion

The interest in international large-scale assessment databases is increasing rapidly around the
world. Most such databases include PVs in order to enable appropriate calculations and thereby valid conclusions. Because of the complexity of the modeling or software limitations,
some researchers tend to use only the first or a randomly selected PV, which leads to biased
results. Analysis of the real data showed that biased results are obtained if PVs are used
inappropriately in the analysis. When using only one or a few PVs, parameter estimates vary
greatly and the quality of estimation very much depends on which PV is chosen. Further, our
study showed that the estimation results in MLM using the average of PVs are very close to
those obtained using all five PVs when analyzing TIMSS 2011 data, but as expected the standard
errors and the within-school variance differ. Our results are in line with those previously
obtained by Carstens & Hastedt (2010), who examined user strategies for means and variances.
From the simulation results, it was evident that PV-based estimates had a better recovery of the population parameters than any of the point estimators, although in general the differences
between all estimates were quite small. The MLM analysis showed that when tests contained 50% and 25% of all possible items, only small differences were observed between the different numbers of PVs. It appears that using about ten and twenty PVs, respectively, is sufficient to obtain reliable estimates. Note that we used a 40-item test as the full-length test, and removing 50% or 75% of the items may mean that too much information was removed, which would explain this behavior. On the other hand, in TIMSS 2007 every student receives only 11-17% of all possible items, which indicates that the examined cases are realistic.
Also note, as previously mentioned, that the secondary biases caused by under-specification of the imputation model (in this case, the PV generation model) tend to decline as the amount of information in the observed item responses increases.
In a nutshell, researchers who analyze large-scale assessments should not use only one PV or the average of the PVs, in order to avoid an underestimation of the standard errors. This study has
shown that this is not only true when they are interested in means and variances but also when
using MLM. This is important to emphasize in the light of the increasing popularity of using
MLM with large-scale assessment data. From the simulation study, we could also conclude that it is possible to increase the precision of the estimates in some cases if more than five PVs are used. It is not, however, entirely clear which these cases are. In the future, one should examine in more detail under which conditions more PVs are needed and exactly how many PVs are
actually needed under these conditions to obtain reliable parameter estimates. It would also be of
great interest to perform a similar simulation study with PVs generated using a multilevel latent regression model.
References
Beaton, A. E., & Gonzalez, E. (1995). The NAEP primer. Chestnut Hill, MA: Boston College.
Bock, R. D., & Aitkin, M. (1981). Marginal maximum likelihood estimation of item parameters: Application of an EM algorithm. Psychometrika, 46, 443–459.
Bodner, T. E. (2008). What improves with increased missing data imputations? Structural Equation Modeling, 15, 651–675.
Carstens, R., & Hastedt, D. (2010). The effect of not using plausible values when they should be: An illustration using TIMSS 2007 grade 8 mathematics data. Proceedings of the IRC-2010. Retrieved from http://www.iea.nl/fileadmin/user_upload/IRC/IRC_2010/Papers/IRC2010_Carstens_Hastedt.pdf
Gelman, A., Carlin, J. B., Stern, H. S., Dunson, D. B., Vehtari, A., & Rubin, D. B. (2013).
Bayesian data analysis (3rd ed.). Boca Raton: Chapman and Hall/CRC.
Gelman, A., & Hill, J. (2006). Data analysis using regression and multilevel/hierarchical models. Cambridge: Cambridge University Press.
Graham, J. W., Olchowski, A. E., & Gilreath, T. D. (2007). How many imputations are really needed? Some practical clarifications of multiple imputation theory. Prevention Science, 8, 206–213.
Johnson, M. S., & Jenkins, F. (2005). A Bayesian hierarchical model for large-scale educational surveys: An application to the National Assessment of Educational Progress (ETS Research Report). Princeton, NJ: Educational Testing Service.
IEA. (2011). TIMSS 2011 international database and user guide. Retrieved from
http://timssandpirls.bc.edu/timss2011/international-database.html
Kolen, M. J. (2006). Item response theory. In R. L. Brennan (Ed.), Educational measurement (4th ed.). Westport, CT: Praeger.
Kuhfeld, M. R. (2016). Multilevel item factor analysis and student perceptions of teacher effectiveness (Doctoral dissertation, University of California, Los Angeles). Retrieved from http://escholarship.org/uc/item/076175k5
Li, T. (2012). Randomization-based inference about latent variables from complex samples (Doctoral dissertation, University of Maryland). Retrieved from http://drum.lib.umd.edu/ (Li_umd_0117E_12668.pdf?sequence=1&isAllowed=y)
Lord, F. M. (1983). Unbiased estimators of ability parameters, of their variance, and of their parallel-forms reliability. Psychometrika, 48, 233–245.
Martin, M. O., & Mullis, I. V. S. (Eds.). (2012). Methods and procedures in TIMSS and PIRLS
2011. Chestnut Hill, MA: TIMSS & PIRLS International Study Center, Boston College.
Mislevy, R. J. (1991). Randomization-based inference about latent variables from complex samples. Psychometrika, 56, 177–196.
Mislevy, R. J. (2013). On the proportion of missing data in classical test theory. Research report.
Mislevy, R. J., Beaton, A. E., Kaplan, B., & Sheehan, K. M. (1992). Estimating population characteristics from sparse matrix samples of item responses. Journal of Educational Measurement, 29, 133–161.
Monseur, C., & Adams, R. (2009). Plausible values: How to deal with their limitations. Journal of Applied Measurement, 10, 320–334.
Muthén, L. K., & Muthén, B. O. (2012). Mplus (Version 7) [Computer software]. Los Angeles, CA: Muthén & Muthén.
Mullis, I. V. S., Martin, M. O., Ruddock, G. J., O'Sullivan, C. Y., & Preuschoff, C. (2011).
TIMSS 2011 Assessment Frameworks. Chestnut Hill, MA: TIMSS & PIRLS International Study Center, Boston College.
Orchard, T., & Woodbury, M. A. (1972). A missing information principle: Theory and applications. In Proceedings of the Sixth Berkeley Symposium on Mathematical Statistics and Probability (Vol. 1). Berkeley, CA: University of California Press.
R Development Core Team (2014). R: A language and environment for statistical computing [Computer software]. Vienna, Austria: R Foundation for Statistical Computing. Retrieved from http://www.R-project.org/
Rijmen, F., Jeon, M., von Davier, M., & Rabe-Hesketh, S. (2014). A general psychometric approach for educational survey assessments: Flexible statistical models and efficient estimation methods.
Rubin, D. B. (1987). Multiple imputation for nonresponse in surveys. New York: Wiley.
Schafer, J. L. (1997). Analysis of incomplete multivariate data. London: Chapman & Hall.
Schafer, J. L., & Graham, J. W. (2002). Missing data: Our view of the state of the art. Psychological Methods, 7, 147–177.
Si, Y., & Reiter, J. P. (2013). Nonparametric Bayesian multiple imputation for incomplete categorical variables in large-scale assessment surveys. Journal of Educational and Behavioral Statistics, 38, 499–521.
Tabachnick, B. G., & Fidell, L. S. (2007). Using multivariate statistics (5th ed.). Boston, MA: Pearson Education.
Von Davier, M., Gonzalez, E., & Mislevy, R. J. (2009). What are plausible values and why are they useful? In IERI Monograph Series: Issues and Methodologies in Large-Scale Assessments (Vol. 2, pp. 9–36). Retrieved from http://www.ierinstitute.org/fileadmin/Documents/IERI_Monograph/IERI_Monograph_Volume_02_Chapter_01.pdf
Warm, T. A. (1989). Weighted likelihood estimation of ability in item response theory. Psychometrika, 54, 427–450.
Wiberg, M., Rolfsman, E., & Laukaityte, I. (2013). School effectiveness in mathematics in
Sweden and Norway 2003, 2007 and 2011. Proceedings of the IRC-2013. Retrieved from http://www.iea.nl/fileadmin/user_upload/IRC/IRC_2013/Papers/IRC-2013_Wiberg_etal.pdf
Wu, M. (2005). The role of plausible values in large-scale surveys. Studies in Educational Evaluation, 31, 114–128.
Wu, M. L., Adams, R. J., Wilson, M. R., & Haldane, S. A. (2007). ConQuest: Generalised item response modelling software (Version 2.0) [Computer program and manual]. Camberwell, Victoria: Australian Council for Educational Research.
Yang, J. S., & Seltzer, M. (2016). Handling measurement error in predictors with a multilevel latent variable approach.
Yen, W. M., & Fitzpatrick, A. R. (2006). Item response theory. In R. L. Brennan (Ed.), Educational measurement (4th ed.). Westport, CT: Praeger.
Table 1. Student level and school level characteristics of the investigated factors from simulated and real data: the student factors (ATM, SES), the school factors (GA, SLOC), and mean mathematics achievement for the simulated data and for Sweden, Slovenia, and the USA.

Mathematics achievement (standard error in parentheses): Sweden 484 (1.9), Slovenia 505 (2.2), USA 509 (2.6).

Note: Factors are described by the mean and standard deviation within parentheses for the first three factors and by the proportion for SLOC (% of urban schools). Simulated factors are based on the category proportions from the Swedish data.
Table 2. Estimates (standard errors in parentheses) from the full MLMs for Sweden, Slovenia, and the USA using different PV strategies: all 5 PVs, the average of the 5 PVs, and each single PV (1st to 5th). For each country, the table reports the fixed effects and the variance components.
Table 3. Estimated population mean and variance of proficiency (standard errors in parentheses) for a 20-item test with difficulties from U(-2, 2) and the between- and within-school variances set to 1; estimates averaged over 100 replications. Columns: WLE, MLE, EAP, 1 PV, and 5, 7, 10, 20, 40, and 100 PVs, together with the generating value. Rows: the estimated mean and variance under each simulated condition (4,000 or 8,000 students; proficiency means of 0 or 2 at the student and school levels).
Table 4. Estimates (standard errors in parentheses) obtained from the full MLM using a 40-item test containing 50% of all possible items; averaged over 30 simulations. Columns: the MLE, WLE, and EAP point estimators, and PVs with M = 1, 3, 5, 7, 10, 20, 40, 50, 60, and 100. Rows: intercept; fixed effects of ATM, SES, GA, and SLOC; within- and between-school variances.

Note: The true values for the ATM, SES, GA, and SLOC effects and for the within- and between-school variances were set equal to 1.
Table 5. Estimates (standard errors in parentheses) obtained from the full MLM using a 40-item test containing 25% of all possible items; averaged over 30 simulations. Columns and rows as in Table 4.

Note: The true values for the ATM, SES, GA, and SLOC effects and for the within- and between-school variances were set equal to 1.