Using Plausible Values for Missing Data
Inga Laukaityte & Marie Wiberg (2016). Using plausible values in secondary analysis in large-scale assessments. Communications in Statistics - Theory and Methods.
Corresponding author: Inga Laukaityte, Umeå School of Business and Economics, Department of Statistics, Umeå University.
Abstract

Plausible values are typically used in large-scale assessment studies, in particular in the Trends in International Mathematics and Science Study and the Programme for International Student Assessment. Despite their widespread use, there are still open questions regarding how plausible values should be used and how such use affects statistical analyses. The aim of this paper is to demonstrate the role of plausible values in large-scale assessment surveys when multilevel modelling is used. Different user strategies concerning plausible values for multilevel models, as well as for means and variances, are examined. The results show that some commonly used strategies give incorrect results, while others give reasonable estimates but incorrect standard errors. These findings are important for researchers performing secondary analyses of large-scale assessment data.
Introduction
Large-scale assessment surveys contain large numbers of items, while testing time and the number of students are limited. Due to these time limitations, students receive only a subset (block) of all assessment items. For this reason, individual proficiency is measured with measurement error (von Davier, Gonzalez & Mislevy, 2009). In order to reflect the uncertainty of the measurement, several scores, or imputations, called plausible values (PVs), are provided for each individual. PVs have been successfully used to improve inference about latent variables in large-scale assessments, and research is ongoing to improve the practice. There are two issues of particular interest: how many PVs are needed, and how the specification of the imputation model affects subsequent analyses. This paper addresses an issue at the intersection of these two, namely the effect of the number of PVs when inferences about multilevel models (MLMs) are desired but an MLM has not been used to construct the PVs. Although publications on how one could create an MLM PV system are now beginning to appear (Mislevy, 1991; Li, 2012; Yang and Seltzer, 2016; Kuhfeld, 2016; and Rijmen, Jeon, von Davier, and Rabe-Hesketh, 2013), the problem addressed here remains important because the methods required are more complex and not readily accomplished by most secondary analysts, and many large-scale assessments provide public-use data with PVs generated with a single-level rather than a multilevel model. The aim of this paper is to analyze the role of PVs in large-scale assessment surveys when MLMs are used. In order to reach this goal, different user strategies (i.e., different numbers of PVs) for MLMs, as well as for means and variances, were examined using both simulations and real data from the Trends in
International Mathematics and Science Study (TIMSS) 2011. In line with the setup of large-scale
Most large-scale assessment databases provide five PVs (only NAEP, the National Assessment of Educational Progress, deviates from this), although there are no particularly strong reasons for choosing five values in the literature (Wu, 2005). Wu (2005)
showed in simulations that very often even one PV is sufficient to adequately recover the
population parameter when examining means and variances. However, it is important to note that if only one PV is used, the analyst has no information about the component of uncertainty in an estimate that is due to the latent nature of the proficiency variable. Having more than one PV improves both the accuracy of the estimate and the accuracy of its standard error. The general use of five PVs probably dates back to Rubin's (1987) relative efficiency formula, $1/(1+F/M)$, where F is the fraction of missing information and M the number of imputations. With 50% missing information, five imputations give point estimates that are about 91% as efficient as those based on an infinite number of imputations. Graham, Olchowski & Gilreath (2007) noted that
this might be too few if we are interested in other estimates, and recommended the use of 20 imputations for 10-30% missing information when examining loss of power in hypothesis testing. Bodner (2008) gave similar recommendations for the estimation of null hypothesis significance test p-values and confidence interval half-widths. It is important to keep in mind that there are different types of missingness and thus different ways to compute a proportion of missing information. Missingness can be planned (by design) or unplanned. This paper deals only with a planned missing data design, and thus the missingness is understood in terms
of information missing about the latent variables given the observed responses, as it is defined by
Orchard & Woodbury (1972). Mislevy (2013) showed how to calculate the proportion of
missingness in classical test theory when responses are inherently unobserved. It is also
important to mention that the secondary biases entailed by under-specification of the imputation model tend to decline as the amount of observed information increases.
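For a concrete sense of what the relative efficiency formula implies, it can be evaluated for a few values of the missing-information fraction F and the number of imputations M; the numbers below are a worked illustration, not results reported in the paper:

$\mathrm{RE}(M, F) = \left(1 + \frac{F}{M}\right)^{-1}, \qquad \mathrm{RE}(5,\, 0.5) = \left(1 + \frac{0.5}{5}\right)^{-1} \approx 0.91, \qquad \mathrm{RE}(20,\, 0.5) = \left(1 + \frac{0.5}{20}\right)^{-1} \approx 0.98.$

Moving from five to twenty imputations thus recovers only a few additional percentage points of efficiency for point estimates, which is why Graham et al. (2007) and Bodner (2008) motivate larger numbers of imputations by other criteria, such as power and the stability of standard errors.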
The framework and use of PVs was first developed for the analyses of US NAEP data in
1983-84 (Mislevy, 1991; Mislevy, Beaton, Kaplan, & Sheehan, 1992; Beaton & Gonzalez,
1995). The theory of PVs is based on Rubin’s (1987) work on multiple imputations. PVs are now
used in all NAEP surveys, the TIMSS and the Programme for International Student Assessment
(PISA). The role of PVs in large-scale assessments is not unambiguous, and the use of PVs has recently been discussed. For instance, Wu (2005) and von Davier et al. (2009) have presented
theoretical studies to illustrate the advantages of PVs over maximum likelihood estimates for
estimating a range of population statistics. Carstens & Hastedt (2010) have also illustrated the practical importance of PVs using TIMSS 2007 data. They mainly analyzed the effect on estimated means, standard deviations, and standard errors when PVs are used incorrectly or other IRT estimates are employed. Nevertheless, large-scale assessment data are not limited to the estimation of population statistics. This paper differs from the articles by Wu (2005), von Davier et al. (2009), and Carstens & Hastedt (2010), as the focus here is on MLMs (as described in, e.g., Gelman & Hill, 2006), which none of them covered. MLMs are becoming a common way of analyzing
complex survey data as these models take a multistage sampling design into account and enable
us to study the effect of cluster level variables on the individual outcomes. Monseur & Adams
(2009) explored different types of estimators for recovering variance components and a latent
correlation with PISA data. They showed that it is important to take the hierarchical structure of
data into account when secondary analyses are performed, and when student proficiency
estimates are generated. There is also a growing interest in fully Bayesian models (e.g.,
Gelman et al., 2013) as Bayesian models can take into account complex clustered structure
inherent in the sampling design that standard methods for multiple imputation can fail to capture.
They can incorporate multiple levels, covariates at different levels, and multivariate latent
variable models for cognitive responses (e.g., Johnson & Jenkins, 2005; Si & Reiter, 2013). All
previously mentioned studies have shown that PVs are the most efficient way of recovering different parameter estimates, such as means, variances, and correlations. However, to the best of our knowledge, there are no studies that show how many PVs are actually needed for the most common types of secondary analyses.
This paper is structured as follows. The next section describes proficiency estimation,
followed by a description of the data used and the set-up of the empirical and simulation studies. The fourth section presents the results from the empirical and simulation studies, and the last section contains a discussion with some concluding remarks.
Proficiency estimation
Due to the large number of items (to ensure reliable measurement of achievement) and time
limitations, students receive only a subset of all assessment items in studies like TIMSS, PISA,
NAEP, etc. In TIMSS 2011, all 434 mathematics and science items in grade 8 are partitioned
into a set of 14 student booklets. Student booklets are assembled from various combinations of
item blocks containing two blocks from mathematics and two blocks from science with 12-18
items per block (Mullis, Martin, Ruddock, O'Sullivan, & Preuschoff, 2011). The students’
responses from various booklets are linked together through the items because each item appears
in two booklets. Each student completes only one booklet, resulting in a large number of
responses that are missing by design. Students’ proficiencies must thus be estimated taking the
incomplete information into account. Several methods exist for estimating or drawing proficiencies; in this section, the maximum likelihood estimator (MLE), the weighted maximum likelihood estimator (WLE), the expected a posteriori estimator (EAP), and the PVs approach are described.
Assume we have a test of n binary scored items, which we model with an item response model. Let θ represent the proficiency (ability) of a student, let $P_l(\theta)$ denote the probability of a correct response to item l, and let $Q_l(\theta) = 1 - P_l(\theta)$. If $\mathbf{x} = (x_1, \dots, x_n)$ is the vector of n binary scored item responses, then the maximum likelihood estimator (MLE) of θ is the value that maximizes the likelihood

$L(\mathbf{x} \mid \theta) = \prod_{l=1}^{n} P_l(\theta)^{x_l} Q_l(\theta)^{1 - x_l}$.   (1)
The obtained MLE is a point estimator, which provides the same MLE proficiency estimate for every student with the same total score (Wu, 2005). One disadvantage of the MLE is that it has an asymptotic bias when item response models are used (Lord, 1983). Another disadvantage, when it is used with a Rasch item response model, is that an examinee with a perfect (or zero) score has no finite MLE. The weighted maximum likelihood estimator (WLE) was introduced by Warm (1989) in order to reduce this bias. It can be written as
$\hat{\theta}_{WLE} \approx \hat{\theta}_{MLE} + \frac{J}{2I^{2}}$,

where $I = \sum_{l=1}^{n} \frac{P_l'(\theta)^{2}}{P_l(\theta)Q_l(\theta)}$ is the test information, $J = \sum_{l=1}^{n} \frac{P_l'(\theta)P_l''(\theta)}{P_l(\theta)Q_l(\theta)}$, and $P_l'(\theta)$ and $P_l''(\theta)$ denote the first and second derivatives of $P_l(\theta)$ with respect to θ, the same derivatives that enter the first derivative of the likelihood function in Equation (1). The WLE is also a point estimator and provides a proficiency estimate for each total score. It is further less biased than the MLE and, unlike the MLE under the Rasch model, it provides finite estimates for perfect and zero scores.
The Bayesian expected a posteriori estimator (EAP) (Bock & Aitkin, 1981) is the mean of the posterior distribution of θ for each student. Let g(θ) be a chosen prior distribution of the proficiency. Then

$\hat{\theta}_{EAP}(\mathbf{x}) = \frac{\int \theta\, g(\theta) \prod_{l=1}^{n} P_{X_{il}}(\theta)\, d\theta}{\int g(\theta) \prod_{l=1}^{n} P_{X_{il}}(\theta)\, d\theta}$,

where $P_{X_{il}}(\theta)$ is the probability of student i scoring $X_{il}$ on item l conditional on the proficiency θ (Uebersax, 2002). Given a Rasch item response theory model, the EAP also provides one estimate for each total score, as do the previously mentioned likelihood estimators. A disadvantage of the EAP is that, in a given population, the variability of the EAP estimates of proficiency is typically less than the variability of θ because of shrinkage toward the mean (Kolen, 2006).
The EAP for a student can also be seen as the expected value of the PVs for that student, assuming the EAP is computed with a prior distribution containing the same information and conditioning variables as the model used to generate the PVs.
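To make the three point estimators concrete, the following sketch computes the MLE, Warm's WLE correction, and the EAP for a single simulated response vector under a Rasch model. It is an illustrative implementation with assumed item difficulties and a standard normal prior, not the operational code used by TIMSS or in this study.

```python
import numpy as np
from scipy.optimize import minimize_scalar

rng = np.random.default_rng(1)
b = rng.uniform(-2, 2, size=20)        # assumed item difficulties for a 20-item test
theta_true = 0.5                       # assumed true proficiency
x = (rng.random(20) < 1 / (1 + np.exp(-(theta_true - b)))).astype(int)  # Rasch responses

def p(theta):
    """Rasch item response probabilities P_l(theta)."""
    return 1 / (1 + np.exp(-(theta - b)))

def neg_log_lik(theta):
    """Negative log of the likelihood in Equation (1)."""
    pr = p(theta)
    return -np.sum(x * np.log(pr) + (1 - x) * np.log(1 - pr))

# MLE: maximize the likelihood in Equation (1)
theta_mle = minimize_scalar(neg_log_lik, bounds=(-6, 6), method="bounded").x

# WLE: Warm's first-order correction theta_MLE + J / (2 I^2)
pr = p(theta_mle)
q = 1 - pr
I = np.sum(pr * q)                     # P_l'^2 / (P_l Q_l) reduces to P_l Q_l under Rasch
J = np.sum(pr * q * (1 - 2 * pr))      # P_l' P_l'' / (P_l Q_l) reduces to P_l Q_l (1 - 2 P_l)
theta_wle = theta_mle + J / (2 * I**2)

# EAP: posterior mean under a N(0, 1) prior, computed by numerical integration on a grid
grid = np.linspace(-6, 6, 601)
log_post = -0.5 * grid**2 + np.array([-neg_log_lik(t) for t in grid])
post = np.exp(log_post - log_post.max())
theta_eap = np.sum(grid * post) / np.sum(post)

print(theta_mle, theta_wle, theta_eap)
```

With a well-targeted test, the three estimates are close; the differences grow for extreme scores and short tests, which is where the bias and shrinkage properties discussed above matter.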
Plausible values
To estimate students' proficiency, TIMSS uses the multiple imputation method, i.e. the PVs approach (Rubin, 1987; Mislevy, 1991). The PVs are generated using students' responses to the assessment items together with their background data, so that relationships between the background variables and the estimated proficiencies are appropriately accounted for in the PVs (Mislevy et al., 1992).

Denote the responses of student i to the background questions by $y_i$, and student i's item responses by $x_i$. The five PVs in TIMSS for each student i are then drawn from the conditional distribution

$P(\theta_i \mid x_i, y_i, \Gamma, \Sigma) \propto P(x_i \mid \theta_i)\, P(\theta_i \mid y_i, \Gamma, \Sigma)$,

where $P(x_i \mid \theta_i)$ is any chosen item response model, $P(\theta_i \mid y_i, \Gamma, \Sigma)$ is the regression of the proficiency on the background variables, Γ is a matrix of regression coefficients for the background variables, and Σ is a common variance matrix of the residuals. The result of the PVs approach is a set of M drawn values (PVs) for each student, where M is the number of PVs drawn. The analysis of interest is carried out separately with each of the M PV data sets, and the resulting estimates $\hat{D}_m$, m = 1, ..., M, are then combined into final point
estimates:
$\bar{D} = \frac{1}{M} \sum_{m=1}^{M} \hat{D}_m$.

The within-imputation variance $\bar{V} = \frac{1}{M} \sum_{m=1}^{M} V_m$ is obtained by averaging the estimated variances $V_m$ of the $\hat{D}_m$, and the total variance of $\bar{D}$ combines $\bar{V}$ and the between-imputation variance $B_M$:

$\mathrm{Var}(\bar{D}) = \bar{V} + \left(1 + \frac{1}{M}\right) B_M$,

where $B_M = \frac{1}{M-1} \sum_{m=1}^{M} (\hat{D}_m - \bar{D})^2$ (Mislevy, 1991; Schafer, 1997). It is important to note that the PVs approach gives consistent estimates of population parameters only if the PVs were generated using an imputation model that is compatible with the subsequent data analysis. Although the design of most large-scale assessments is hierarchical in nature, the population model underlying the generation of the PVs is typically a single-level model that does not identify the clustering of students within schools.
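As a concrete illustration of the combining rules above, the following sketch pools M parameter estimates and their estimated sampling variances across PV data sets. The estimates fed in would come from whatever analysis is repeated once per PV; the numbers below are placeholders, not values from the paper.

```python
import numpy as np

def pool_pv_estimates(estimates, variances):
    """Combine M PV-based estimates with the combining (Rubin's) rules.

    estimates : the D_hat_m, one per PV data set
    variances : the V_m, the estimated sampling variance of each D_hat_m
    Returns the pooled estimate, its total variance, and its standard error.
    """
    est = np.asarray(estimates, dtype=float)
    var = np.asarray(variances, dtype=float)
    M = est.size
    d_bar = est.mean()                      # pooled point estimate
    v_bar = var.mean()                      # within-imputation variance
    b_m = est.var(ddof=1)                   # between-imputation variance
    total_var = v_bar + (1 + 1 / M) * b_m   # total variance of the pooled estimate
    return d_bar, total_var, np.sqrt(total_var)

# Hypothetical regression coefficient estimated once per PV data set
d_hat = [0.52, 0.47, 0.55, 0.49, 0.51]
v_m = [0.010, 0.011, 0.009, 0.010, 0.012]
print(pool_pv_estimates(d_hat, v_m))
```

Averaging the PVs first and analyzing the averaged data set, by contrast, drops the between-imputation term and therefore understates the total variance, which is exactly the standard-error problem examined later in the paper.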
Conditioning variables
To estimate the characteristics of student populations and subpopulations (or subgroups, e.g. by gender, ethnicity, or socioeconomic status), the drawing of the PVs (or the EAP) must take the group structure into account (Wu, 2005). The population model can include several conditioning variables, which yields a population distribution that is a mixture of many normal distributions, with conditional distribution

$g(\theta) \sim N(\beta_1 z_1 + \beta_2 z_2 + \cdots,\ \sigma^2)$,
where $z_1, z_2, \dots$ are conditioning or background variables and $\beta_1, \beta_2, \dots$ the corresponding regression coefficients. By including all available background
data in the model and correctly specifying the relationships to be addressed in secondary
analysis, relationships between the background variables and the estimated proficiencies will be
appropriately accounted for in the PVs (Mislevy et al., 1992; Martin & Mullis, 2012).
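The sketch below illustrates, in simplified form, how PVs could be drawn for one student: the item response likelihood is combined with a normal latent regression on conditioning variables, and M values are sampled from the resulting posterior on a grid. The regression coefficients, residual variance, item difficulties, and responses are made-up numbers for illustration only.

```python
import numpy as np

rng = np.random.default_rng(7)

b = rng.uniform(-2, 2, size=20)            # assumed item difficulties
x = rng.integers(0, 2, size=20)            # one student's (fake) scored responses
z = np.array([1.0, 1, 0, 1])               # conditioning variables (intercept + 3 dummies)
gamma = np.array([0.1, 0.4, 0.3, -0.2])    # assumed latent-regression coefficients
sigma2 = 0.8                               # assumed residual variance

grid = np.linspace(-5, 5, 1001)

# Item response likelihood under a Rasch model (Equation (1))
p = 1 / (1 + np.exp(-(grid[:, None] - b[None, :])))
log_lik = np.sum(x * np.log(p) + (1 - x) * np.log(1 - p), axis=1)

# Latent regression prior P(theta | y, Gamma, Sigma): N(z'gamma, sigma2)
mu = z @ gamma
log_prior = -0.5 * (grid - mu) ** 2 / sigma2

# Posterior on the grid, normalized, from which M plausible values are drawn
post = np.exp(log_lik + log_prior - np.max(log_lik + log_prior))
post /= post.sum()
M = 5
pvs = rng.choice(grid, size=M, p=post)
print(pvs)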
Statistical analysis
Data
The publicly available data from the international large-scale assessment study TIMSS 2011, grade 8 mathematics (IEA, 2011), were used. Three countries were chosen for the real data
analysis based on their average mathematics achievement. These countries were chosen since
they represent different parts of the achievement scale, i.e. below the international mathematics
average score (Sweden), close to the average score (Slovenia), and above the average score
(USA). Simulated data, which mimic the TIMSS database, were also used, as described in detail in the simulation study subsection below.
A full MLM was used in the empirical and simulation study. Constructed student level
and school level factors were used. The mathematics achievement for each student was estimated
as a function of the school factors controlling for student level factors. The student level and
school level factors were of secondary interest and were thus chosen based on a previous study
(Wiberg, Rolfsman & Laukaityte, 2013). The factors used, derived from the student and school
questionnaires, are presented below. The response variable is the students' mathematics achievement.
[ATM] Attitude towards mathematics: This was based on students' responses to: (1) I enjoy learning mathematics, (2) Mathematics is boring, and (3) I like mathematics. Possible responses were: 1 = Agree a lot, 2 = Agree a little, 3 = Disagree a little, and 4 = Disagree a lot. Note that (2) was reverse coded. The responses were averaged and classified into three categories: 1 = Low, average was greater than or equal to 3; 2 = Medium, average was greater than 2 and less than 3; 3 = High, average was less than or equal to 2.
[SES] Socioeconomic status: This was based on students' responses to the following indicators: (1) Books at home: 1 = 0-10 books, 2 = 11-25 books, 3 = 26-100 books, 4 = 101-200 books, and 5 = more than 200 books, recoded into 1 = Low (0-25 books), 2 = Medium (26-200 books), and 3 = High (more than 200 books); (2) Possession of educational home resources (computer and study desk), categorized as 1 = student had none or one of the items and 2 = student had both items. The two indicators were averaged and classified into: 1 = Low, average was equal to 1; 2 = Medium, average was greater than 1 and less than or equal to 2; and 3 = High, average was greater than 2.
[GA] Good attendance: The school principals' answers to how severe each of two negative student behaviors, arriving late at school and absenteeism, is among eighth-grade students at the school. Possible responses were: 1 = Not a problem, 2 = Minor problem, 3 = Moderate problem, 4 = Serious problem. The responses were summed and classified
into three categories: 1 = Low, sum was greater than 6; 2 = Medium, sum was greater
than 3 and less than or equal to 6; 3 = High, sum was less than or equal to 3.
[SLOC] School location: The school principals' answers regarding the number of people living in the area where the school is located. Responses were classified into two categories: 0 = non-urban area and 1 = urban area.
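As an illustration of how such factors can be constructed from the raw questionnaire responses, the sketch below recodes the three ATM items into the Low/Medium/High categories described above. The column names and the small data frame are hypothetical; the cut points follow the description in the text.

```python
import numpy as np
import pandas as pd

# Hypothetical raw responses on the 1 = Agree a lot ... 4 = Disagree a lot scale
df = pd.DataFrame({
    "enjoy_math":  [1, 4, 2, 3],
    "math_boring": [4, 1, 3, 2],   # item (2), worded negatively
    "like_math":   [1, 4, 2, 3],
})

df["math_boring_rev"] = 5 - df["math_boring"]                  # reverse code item (2)
avg = df[["enjoy_math", "math_boring_rev", "like_math"]].mean(axis=1)

# 1 = Low (avg >= 3), 2 = Medium (2 < avg < 3), 3 = High (avg <= 2)
df["ATM"] = np.select([avg >= 3, avg > 2], [1, 2], default=3)
print(df["ATM"].tolist())
```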
The characteristics of the constructed factors are given in detail in Table 1. These factors
are used in the empirical and simulation studies that follow. In the simulation study, the factors are used as conditioning variables. If conditioning variables are categorical and have more than two categories, they must be recoded into dummy variables. In order to simplify the simulation study, the factors ATM, SES and GA were simulated as binary conditioning variables. In large-scale
assessments, sampling weights are used to avoid bias in the parameter estimates, which can arise due to unequal probabilities of selecting a school, a class and a student, or due to unit non-response. The full MLM contained student-related factors weighted with student weights at the student level. Student weights were calculated by multiplying the student and classroom weighting factors by their respective weighting adjustments. At the school level, school-related factors and aggregated means of the student level measures (aATM and aSES), weighted with school weights, were included. The full MLM can be written as
$Y_{ij} = \beta_{0j} + \beta_{1j}\,\mathrm{ATM}_{ij} + \beta_{2j}\,\mathrm{SES}_{ij} + r_{ij}$,
$\beta_{0j} = \gamma_{00} + \gamma_{01}\,\mathrm{GA}_{j} + \gamma_{02}\,\mathrm{SLOC}_{j} + \gamma_{03}\,\mathrm{aATM}_{j} + \gamma_{04}\,\mathrm{aSES}_{j} + u_{0j}$,

where $Y_{ij}$ denotes mathematics achievement for student i within school j, $r_{ij}$ is the error term representing a unique effect associated with student i in school j, $u_{0j}$ is the error term representing a unique effect associated with school j, and $\beta_{1j}$, $\beta_{2j}$ and $\gamma_{0k}$ (k = 1, ..., 4) are the regression coefficients of the student level and school level factors, respectively.
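A minimal sketch of fitting a model of this form is given below, assuming a data frame with one row per student, a school identifier, and the constructed factors; it uses a random-intercept specification in statsmodels and ignores the TIMSS sampling weights, which the operational analysis (run in MPLUS) does take into account. In practice the fit is repeated once per PV and the results are pooled with the combining rules shown earlier.

```python
import pandas as pd
import statsmodels.formula.api as smf

def fit_one_pv(df: pd.DataFrame, pv_column: str):
    """Fit the random-intercept MLM for a single plausible value column.

    df is assumed to contain the columns ATM, SES, GA, SLOC, aATM, aSES,
    a school identifier 'school', and the PV column named by pv_column.
    """
    formula = f"{pv_column} ~ ATM + SES + GA + SLOC + aATM + aSES"
    model = smf.mixedlm(formula, data=df, groups=df["school"])  # random intercept per school
    return model.fit()

# results = [fit_one_pv(df, f"math_pv{m}") for m in range(1, 6)]
# The fixed-effect estimates and their variances from the five fits would then be
# pooled with the combining rules (see the pooling sketch in the Plausible values section).
```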
Empirical study
Real data from TIMSS 2011 were used to construct the student level and school level factors,
which were modeled with the full MLM for the three countries using MPLUS 7 (Muthén &
Muthén, 2012). The main interest was to examine how the MLM analysis was affected by using
different PV strategies. The TIMSS 2011 database provides five PVs for mathematics
achievement. The amount of missing data at the student level was low, ranging from a minimum of 1% for SES in Slovenia to a maximum of 5% for ATM in Sweden. For the sake of simplicity, listwise deletion was therefore used (Tabachnick and Fidell, 2007). The full-information
maximum likelihood procedure was used to handle the missing data at school level because
excluding a school means excluding all students at that school (Schafer & Graham, 2002).
Simulation study
The purpose of the simulation study was to investigate properties of PVs in MLMs and to
compare them with other estimators such as WLE, MLE and EAP. The simulation study
consisted of two parts. In the first part, the intention was to compare the effectiveness of
recovering the population mean and variance using a different number of PVs (1, 5, 7, 10, 20, 40,
and 100), and also different proficiency estimators (WLE, MLE, EAP, and PVs). For this
purpose, a population model with four simulated binary conditioning variables (ATM, SES, GA
and SLOC) was used. The conditioning variables were simulated using a random generation function for the Bernoulli distribution in the R software (R Development Core Team, 2014). The proportions of the categories were taken from the real data for Sweden. Using the Conquest software, the population mean and variance for each of the earlier mentioned estimators were then estimated, with the proficiency (ability) mean set at 0 or 2 at both the student and school levels. Item parameters were randomly generated from a uniform distribution U(-2, 2). The simulations in this part were repeated 100 times.
In the second part of the simulation study, the full MLMs were used to compare parameter estimates obtained by using different numbers of PVs in two different cases.

Case 1: using 50% of the items. The responses of 4,000 students (800 schools with 50 students in each) were simulated for a 40-item test with difficulty distribution U(-2, 2), using a simple Rasch item response theory model. A part of the responses was then deleted according to a booklet design in which the 40 items were divided into four blocks (A, B, C, D) with 10 items per block. Every student received two blocks, i.e. 20 items out of 40. Blocks were combined in the following way: (AB), (BC), (CD), (DA). Under this design,
every student responds to 50% of the items. Proficiencies were drawn from the population model described in the first part of the simulation study.
Case 2: using 25% of the items. In this case, the same responses simulated for the previous case (before the elimination of some responses) were used. A part of the responses was then deleted according to a booklet design in which the 40 items were divided into eight blocks (A to H) with 5 items per block. Every student received two blocks, i.e. 10 items out of 40. Blocks were combined in the following way: (AB), (BC), (CD), (DE), (EF), (FG), (GH), (HA). Under this design, every student responds to 25% of the items. Proficiencies were drawn in the same way as in the previous case.
In addition to the simulated items, four binary conditioning variables, ATM, SES, GA
and SLOC, simulated in the first part of the simulation study, were used. Simulations were
repeated 30 times, as they are very time consuming to conduct. For each of the 30 replicates an
item response model was estimated from the data using Conquest 3.0 (Wu, Adams, Wilson &
Haldane, 2007), and different numbers (1, 3, 5, 7, 10, 20, 40, 50, 60, and 100) of PVs, as well as the MLE, WLE and EAP estimates, were drawn. The PVs and EAP were drawn considering the simulated
item responses and taking into account the four conditioning variables in the same way as in
large--scale assessments (using a single-level population model). The MLE and WLE estimates
depend only on the item response data, and are thus not influenced by the conditioning variables.
After proficiencies had been drawn, the full MLM for each replicate was run using MPLUS 7 (Muthén & Muthén, 2012).
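The rotated block design of Case 1 can be sketched as follows: complete Rasch responses are generated for clustered proficiencies and four Bernoulli conditioning variables, and the responses outside a student's two assigned blocks are then set to missing. The cluster sizes, variances, and proportions below are illustrative placeholders rather than the exact values used in the study.

```python
import numpy as np

rng = np.random.default_rng(2011)

n_schools, n_per_school, n_items = 80, 50, 40      # illustrative cluster structure
n_students = n_schools * n_per_school

# School- and student-level proficiency components (variance 1 at each level)
school_effect = rng.normal(0, 1, n_schools).repeat(n_per_school)
theta = school_effect + rng.normal(0, 1, n_students)

# Four binary conditioning variables with illustrative proportions
props = {"ATM": 0.5, "SES": 0.4, "GA": 0.6, "SLOC": 0.3}
conditioning = {name: rng.binomial(1, p, n_students) for name, p in props.items()}

# Complete Rasch responses for a 40-item test with difficulties from U(-2, 2)
b = rng.uniform(-2, 2, n_items)
p_correct = 1 / (1 + np.exp(-(theta[:, None] - b[None, :])))
responses = (rng.random((n_students, n_items)) < p_correct).astype(float)

# Booklet design: blocks A-D of 10 items, booklets (AB), (BC), (CD), (DA)
blocks = [np.arange(k * 10, (k + 1) * 10) for k in range(4)]
booklets = [(0, 1), (1, 2), (2, 3), (3, 0)]
assigned = rng.integers(0, 4, n_students)           # booklet assigned to each student

for i, booklet in enumerate(assigned):
    keep = np.concatenate([blocks[j] for j in booklets[booklet]])
    mask = np.ones(n_items, dtype=bool)
    mask[keep] = False
    responses[i, mask] = np.nan                      # items not administered: missing by design

print(np.isnan(responses).mean())                    # about 0.5 of responses missing by design
```

Case 2 follows the same logic with eight 5-item blocks, giving roughly 75% of the responses missing by design.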
Results
Empirical study
The empirical study shows that the results obtained using only one PV out of five differ greatly
from one another. The parameter estimates are also different from those obtained using all five
PVs, see Table 2. By saying that all five PVs are used, we mean that the analysis is carried out
separately with each of the five imputed PV data sets. Results are then combined at the level of
the targeted inference, as is shown in the subsection about PVs. The case of using the average of
five PVs gives the closest results to the case of using all five PVs. The parameter estimates for
the student level factors are even identical. The largest difference between the average of the PVs
and all five PVs is in the estimation of the within-school variance. In the models for all three countries, the within-school variances are much smaller when the average of the PVs is used. When
using a single PV, the parameter estimates and standard errors are smaller or larger than the ones obtained with all five PVs, depending on which PV is chosen. We should also note that there is no systematic pattern as to which single PV performs best.
Simulation study
The simulation results for the 20-item test are presented in Table 3. When abilities are drawn
by setting the mean at 0 and the between- and within-variances at 1, the PVs give estimates
closest to the generating values, especially with regard to variance. Increasing the number of
students to 8,000 by doubling the number of schools yields better variance estimation but more biased means. If, instead of increasing the number of schools, we double the school size, the mean and variance estimates become more biased. It is also of interest to
note that, comparing the different numbers of PVs used, there is no clear evidence that a larger number of PVs gives more accurate results. If we instead examine a test where the average ability of the population is higher than the average item difficulty, the estimates become more biased, especially the variances. As in the previous case, the increase in
the number of students slightly reduced the variance estimation bias, but increased the bias for
the mean.
The increase in test length to 40 items did reduce the bias in the estimates when 8,000
students were used but estimation became less precise for a test with 4,000 students and abilities
with a mean of 0. The PVs again recover the variances better than the point estimators do.
Results for a test with 40 items have been omitted, as they are similar to those for the 20-item test. The analysis of the real and simulated data showed that using a single PV is not the best
choice for the estimation of the population parameters. Estimates vary from one PV to another.
The obvious question is therefore what number of PVs should be chosen to get reliable results.
The case when students get a test containing 50% of all possible items is presented in
Table 4. The generating values for the coefficients of the four conditioning variables and for the within- and between-school variances were set equal to 1. As Table 4 shows, in all presented cases with PVs
we have very similar estimates, with slightly larger differences in the within-school variance.
The standard errors of the estimates become relatively stable starting with ten PVs. In this case,
ten PVs therefore appear to be sufficient to obtain reliable estimates. If we compare the PVs with
other estimators, we can see that the MLE and EAP estimates of the student- and school-level
factors are very close to those of the PVs. However, the EAP underestimates the within- and overestimates the between-school variance. The MLE overestimates both the within- and between-school variances. The within-school variance closest to the true value was obtained by the WLE.

In the case when students get a test containing 25% of all possible items, more than
20 PVs should be drawn (see Table 5). The decrease in number of provided items did not change
the relations between the PVs, MLE, WLE and EAP estimators; however, the estimation of the variance components became less precise.
Discussion

The interest in international large-scale assessment databases is increasing rapidly around the
world. Most such databases include PVs in order to enable appropriate calculations and thereby valid conclusions. Because of the complexity of the modeling or software limitations,
some researchers tend to use only the first or a randomly selected PV, which leads to biased
results. Analysis of the real data showed that biased results are obtained if PVs are used
inappropriately in the analysis. When using only one or a few PVs, parameter estimates vary
greatly and the quality of estimation very much depends on which PV is chosen. Further, our
study showed that the estimation results in MLM using the average of PVs are very close to
those obtained using all five PVs when analyzing TIMSS 2011 data, but as expected the standard
errors and the within-school variance differ. Our results are in line with those previously
obtained by Carstens & Hastedt (2010), who examined user strategies for means and variances.
From the simulation results, it was evident that PV-based estimates had a better recovery of the population parameters than any of the point estimators, although in general the differences
between all estimates were quite small. The MLM analysis showed that when tests contained 50% and 25% of all possible items, only small differences were observed between the different numbers of PVs. It appears that using about ten and twenty PVs, respectively, is sufficient to obtain reliable estimates. Note that we used a 40-item test as the full-length test, and removing 50% or 75% of the items may mean that too much information was removed, which would explain this behavior. On the other hand, in TIMSS 2007 every student receives only 11-17% of all possible items, which indicates that the examined cases are realistic.
Also note, as previously mentioned, that the secondary biases caused by under-specification of the imputation model (in this case, the PV generation model) tend to decline as the amount of information in the observed item responses increases.
In a nutshell, researchers who analyze large-scale assessments should not use only one PV or the average of the PVs, in order to avoid an underestimation of the standard errors. This study has
shown that this is not only true when they are interested in means and variances but also when
using MLM. This is important to emphasize in the light of the increasing popularity of using
MLM with large-scale assessment data. From the simulation study, we could also conclude that it is possible to increase the precision of the estimates in some cases if more than five PVs are used. It is not, however, entirely clear which these cases are. In the future, one should examine in more detail under which conditions more PVs are needed and exactly how many PVs are
actually needed under these conditions to obtain reliable parameter estimates. It would also be of
great interest to perform a similar simulation study with PVs generated using a multilevel latent regression model.
References
Beaton, A. E., & Gonzalez, E. (1995). The NAEP primer. Chestnut Hill, MA: Boston College.
Bock, R. D., & Aitkin, M. (1981). Marginal maximum likelihood estimation of item parameters: Application of an EM algorithm. Psychometrika, 46, 443–459.
Bodner, T. E. (2008). What improves with increased missing data imputations? Structural Equation Modeling, 15, 651–675.
Carstens, R., & Hastedt, D. (2010). The effect of not using plausible values when they should be: An illustration using TIMSS 2007 grade 8 mathematics data. Proceedings of the IRC-2010. Retrieved from http://www.iea.nl/fileadmin/user_upload/IRC/IRC_2010/Papers/IRC2010_Carstens_Hastedt.pdf
Gelman, A., Carlin, J. B., Stern, H. S., Dunson, D. B., Vehtari, A., & Rubin, D. B. (2013).
Bayesian data analysis (3rd ed.). Boca Raton: Chapman and Hall/CRC.
Gelman, A., & Hill, J. (2006). Data analysis using regression and multilevel/hierarchical models. Cambridge: Cambridge University Press.
Graham, J. W., Olchowski, A. E., & Gilreath, T. D. (2007). How many imputations are really needed? Some practical clarifications of multiple imputation theory. Prevention Science, 8, 206–213.
Johnson, M. S., & Jenkins, F. (2005). A Bayesian hierarchical model for large-scale educational surveys: An application to the National Assessment of Educational Progress (ETS Research Report). Princeton, NJ: Educational Testing Service.
IEA. (2011). TIMSS 2011 international database and user guide. Retrieved from
http://timssandpirls.bc.edu/timss2011/international-database.html
Kolen, M. J. (2006). Item response theory. In R. L. Brennan (Ed.), Educational measurement (4th ed.). Westport, CT: Praeger.
Kuhfeld, M. R. (2016). Multilevel item factor analysis and student perceptions of teacher effectiveness (Doctoral dissertation, University of California, Los Angeles). Retrieved from http://escholarship.org/uc/item/076175k5
Li, T. (2012). Randomization-based inference about latent variables from complex samples (Doctoral dissertation, University of Maryland). Retrieved from http://drum.lib.umd.edu/ (Li_umd_0117E_12668.pdf?sequence=1&isAllowed=y)
Lord, F. M. (1983). Unbiased estimators of ability parameters, of their variance, and of their parallel-forms reliability. Psychometrika, 48, 233–245.
Martin, M. O., & Mullis, I. V. S. (Eds.). (2012). Methods and procedures in TIMSS and PIRLS
2011. Chestnut Hill, MA: TIMSS & PIRLS International Study Center, Boston College.
Mislevy, R. J. (1991). Randomization-based inference about latent variables from complex samples. Psychometrika, 56, 177–196.
Mislevy, R. J. (2013). On the proportion of missing data in classical test theory. Research report.
Mislevy, R. J., Beaton, A. E., Kaplan, B., & Sheehan, K. M. (1992). Estimating population characteristics from sparse matrix samples of item responses. Journal of Educational Measurement, 29, 133–161.
Monseur, C., & Adams, R. (2009). Plausible values: How to deal with their limitations. Journal of Applied Measurement, 10, 320–334.
Muthén, L. K., & Muthén, B. O. (2012). Mplus (Version 7) [Computer software]. Los Angeles, CA: Muthén & Muthén.
Mullis, I. V. S., Martin, M. O., Ruddock, G. J., O'Sullivan, C. Y., & Preuschoff, C. (2011).
TIMSS 2011 Assessment Frameworks. Chestnut Hill, MA: TIMSS & PIRLS International Study Center, Boston College.
Orchard, T., & Woodbury, M. A. (1972). A missing information principle: Theory and applications. In Proceedings of the Sixth Berkeley Symposium on Mathematical Statistics and Probability (Vol. 1). Berkeley, CA: University of California Press.
R Development Core Team (2014). R: A language and environment for statistical computing [Computer software]. Vienna, Austria: R Foundation for Statistical Computing. Retrieved from http://www.R-project.org/
Rijmen, F., Jeon, M., von Davier, M., & Rabe-Hesketh, S. (2014). A general psychometric approach for educational survey assessments: Flexible statistical models and efficient estimation methods.
Rubin, D. B. (1987). Multiple imputation for nonresponse in surveys. New York: Wiley.
Schafer, J. L. (1997). Analysis of incomplete multivariate data. London: Chapman & Hall.
Schafer, J. L., & Graham, J. W. (2002). Missing data: Our view of the state of the art. Psychological Methods, 7, 147–177.
Si, Y., & Reiter, J. P. (2013). Nonparametric Bayesian multiple imputation for incomplete categorical variables in large-scale assessment surveys. Journal of Educational and Behavioral Statistics, 38, 499–521.
Tabachnick, B. G., & Fidell, L. S. (2007). Using multivariate statistics (5th ed.). Boston, MA: Pearson Education.
Von Davier, M., Gonzalez, E., & Mislevy, R. J. (2009). What are plausible values and why are they useful? In IERI Monograph Series: Issues and Methodologies in Large-Scale Assessments (Vol. 2, pp. 9–36). Retrieved from http://www.ierinstitute.org/fileadmin/Documents/IERI_Monograph/IERI_Monograph_Volume_02_Chapter_01.pdf
Warm, T. A. (1989). Weighted likelihood estimation of ability in item response theory. Psychometrika, 54, 427–450.
Wiberg, M., Rolfsman, E., & Laukaityte, I. (2013). School effectiveness in mathematics in
Sweden and Norway 2003, 2007 and 2011. Proceedings of the IRC-2013. Retrieved from http://www.iea.nl/fileadmin/user_upload/IRC/IRC_2013/Papers/IRC-2013_Wiberg_etal.pdf
Wu, M. (2005). The role of plausible values in large-scale surveys. Studies in Educational Evaluation, 31, 114–128.
Wu, M. L., Adams, R. J., Wilson, M. R., & Haldane, S. A. (2007). ConQuest: Generalised item response modelling software (Version 2.0) [Computer program and manual]. Camberwell, Victoria: Australian Council for Educational Research.
Yang, J. S., & Seltzer, M. (2016). Handling measurement error in predictors with a multilevel latent variable approach.
Yen, W. M., & Fitzpatrick, A. R. (2006). Item response theory. In R. L. Brennan (Ed.), Educational measurement (4th ed.). Westport, CT: Praeger.
Table 1. Student level and school level characteristics of the investigated factors from simulated and real data: the student factors (ATM, SES), the school factors (GA, SLOC), and mean mathematics achievement for the simulated data and for Sweden, Slovenia, and the USA.

Mathematics achievement (standard error in parentheses): Sweden 484 (1.9), Slovenia 505 (2.2), USA 509 (2.6).

Note: Factors are described by the mean and standard deviation within parentheses for the first three factors and by the proportion for SLOC (% of urban schools). Simulated factors are based on the category proportions from the Swedish data.
Table 2. Estimates (standard errors in parentheses) from the full MLMs for Sweden, Slovenia, and the USA using different PV strategies: all 5 PVs, the average of the 5 PVs, and each single PV (1st to 5th). For each country, the table reports the fixed effects and the variance components.
Table 3. Estimated population mean and variance of proficiency (standard errors in parentheses) for a 20-item test with difficulties from U(-2, 2) and the between- and within-school variances set to 1; estimates averaged over 100 replications. Columns: WLE, MLE, EAP, 1 PV, and 5, 7, 10, 20, 40, and 100 PVs, together with the generating value. Rows: the estimated mean and variance under each simulated condition (4,000 or 8,000 students; proficiency means of 0 or 2 at the student and school levels).
Table 4. Estimates (standard errors in parentheses) obtained from the full MLM using a 40-item test containing 50% of all possible items; averaged over 30 simulations. Columns: the MLE, WLE, and EAP point estimators, and PVs with M = 1, 3, 5, 7, 10, 20, 40, 50, 60, and 100. Rows: intercept; fixed effects of ATM, SES, GA, and SLOC; within- and between-school variances.

Note: The true values for the ATM, SES, GA, and SLOC effects and for the within- and between-school variances were set equal to 1.
Table 5. Estimates (standard errors in parentheses) obtained from the full MLM using a 40-item test containing 25% of all possible items; averaged over 30 simulations. Columns and rows as in Table 4.

Note: The true values for the ATM, SES, GA, and SLOC effects and for the within- and between-school variances were set equal to 1.