
Communications in Statistics - Theory and Methods

ISSN: 0361-0926 (Print) 1532-415X (Online) Journal homepage: http://www.tandfonline.com/loi/lsta20

Using plausible values in secondary analysis in large-scale assessments

Inga Laukaityte & Marie Wiberg

To cite this article: Inga Laukaityte & Marie Wiberg (2016): Using plausible values in secondary
analysis in large–scale assessments, Communications in Statistics - Theory and Methods

To link to this article: http://dx.doi.org/10.1080/03610926.2016.1267764

Accepted author version posted online: 08 Dec 2016.


ACCEPTED MANUSCRIPT

USING PLAUSIBLE VALUES IN SECONDARY ANALYSIS

Using plausible values in secondary analysis in large-scale assessments

Inga Laukaityte, Marie Wiberg

Department of Statistics, USBE, Umeå University, Umeå, Sweden

Corresponding author: Inga Laukaityte, Umeå School of Business and Economics, Department

of Statistics, Umeå University, Sweden. E-mail: [email protected]

Plausible values are typically used in large-scale assessment studies, in particular in the Trends in International Mathematics and Science Study and the Programme for International Student Assessment. Despite their widespread use, there are still open questions about how the use of plausible values affects statistical analyses. The aim of this paper is to demonstrate the role of plausible values in large-scale assessment surveys when multilevel modelling is used. Different user strategies concerning plausible values for multilevel models, as well as for means and variances, are examined. The results show that some commonly used strategies give incorrect results, while others give reasonable estimates but incorrect standard errors. These findings are important for anyone wishing to perform secondary analyses of large-scale assessment data, especially those interested in using multilevel models to analyze the data.

Keywords: Achievement, design study, multilevel modelling, simulation studies, testing.


Introduction

Large-scale assessment surveys contain large numbers of items but limited testing time and numbers of students. Due to time limitations, each student receives only a subset (block) of all assessment items. For this reason, individual proficiency is measured with error (von Davier, Gonzalez & Mislevy, 2009). In order to reflect the uncertainty of the measurement, several scores or imputations, called plausible values (PVs), are provided for each individual. PVs have been successfully used to improve inference about latent variables in large-scale assessments, and research is ongoing to improve the practice. There are two issues to consider: the number of imputations to use, and the consequences of misspecification of the imputation model. This paper addresses a question at the intersection of these issues, namely the effect of the number of PVs when inferences about multilevel models (MLMs) are desired but an MLM has not been used to construct the PVs. Although publications on how one could create an MLM-based PV system are beginning to appear (Mislevy, 1991; Li, 2012; Yang and Seltzer, 2016; Kuhfeld, 2016; Rijmen, Jeon, von Davier, and Rabe-Hesketh, 2013), the problem addressed here remains important because the methods required are complex and not readily accomplished by most secondary analysts, and many large-scale assessments provide public-use data with PVs generated with a single-level rather than a multilevel model. The aim of this paper is to analyze the role of PVs in large-scale assessment surveys when MLMs are used. In order to reach this goal, different user strategies (different numbers) of PVs for MLMs, as well as for means and variances, were examined using both simulations and real data from the Trends in International Mathematics and Science Study (TIMSS) 2011. In line with the setup of large-scale assessments, the PVs are generated from single-level imputation models.


Most large-scale assessment databases provide five PVs (only the NAEP (National Assessment of Educational Progress) database provides 20), although the literature offers no particularly strong reason for choosing five values (Wu, 2005). Wu (2005) showed in simulations that even one PV is often sufficient to adequately recover the population parameters when examining means and variances. However, it is important to note that if only one PV is used, an analyst has no information about the component of uncertainty in an estimate that is due to the latent nature of the proficiency variable. Having more than one PV improves both the accuracy of the estimate and the accuracy of its standard error. The general use of five PVs probably dates back to Rubin's (1987) relative efficiency formula, $1/(1+F/M)$, where $F$ is the fraction of missing information and $M$ the number of imputations. With 50% missing information, five imputations yield point estimates that are 91% as efficient as those based on an infinite number. Graham, Olchowski & Gilreath (2007) noted that this might be too few if we are interested in other estimates and recommended the use of 20 imputations for 10--30% missing information when examining loss of power in hypothesis testing. Bodner (2008) gave similar recommendations for the estimation of null hypothesis significance test p-values and confidence interval half-widths. It is important to keep in mind that there are different types of missingness and thus different ways to compute a proportion of missing information. Missingness can be planned (by design) or unplanned. This paper deals only with a planned missing information design, and thus missingness is understood in terms of information missing about the latent variables given the observed responses, as defined by Orchard & Woodbury (1972). Mislevy (2013) showed how to calculate the proportion of missingness in classical test theory when responses are inherently unobserved. It is also important to mention that the secondary biases entailed by under-specification of the imputation model decline roughly in proportion to test reliability.
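Rubin's relative efficiency formula is simple enough to check numerically. A minimal sketch (the function name is ours):

```python
def relative_efficiency(F: float, M: int) -> float:
    """Efficiency of estimates from M imputations relative to infinitely
    many, 1 / (1 + F/M), where F is the fraction of missing information."""
    return 1.0 / (1.0 + F / M)

# With 50% missing information, the customary five PVs are about 91% efficient:
print(round(relative_efficiency(0.5, 5), 3))   # 0.909
```

The formula also makes Graham et al.'s point visible: with the same $F$, efficiency rises only slowly beyond a handful of imputations, which is why their argument for 20 imputations rests on power rather than on point-estimate efficiency.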

The framework and use of PVs was first developed for the analyses of US NAEP data in 1983--84 (Mislevy, 1991; Mislevy, Beaton, Kaplan, & Sheehan, 1992; Beaton & Gonzalez, 1995). The theory of PVs is based on Rubin's (1987) work on multiple imputation. PVs are now used in all NAEP surveys, in TIMSS, and in the Programme for International Student Assessment (PISA). The role of PVs in large-scale assessments is not unambiguous, and their use has recently been discussed. For instance, Wu (2005) and von Davier et al. (2009) have presented theoretical studies illustrating the advantages of PVs over maximum likelihood estimates for estimating a range of population statistics. Carstens & Hastedt (2010) have also shown the practical meaning of PVs using TIMSS 2007 data. They mainly analyzed the effect on estimated means, standard deviations, and standard errors when PVs are used incorrectly or other IRT estimates are employed. Nevertheless, analyses of large-scale assessment data are not limited to the estimation of population statistics. This paper differs from the articles of Wu (2005), von Davier et al. (2009), and Carstens & Hastedt (2010) in that the focus is on MLMs (as described in, e.g., Gelman & Hill, 2006), which none of them covered. MLMs are becoming a common way of analyzing complex survey data, as these models take a multistage sampling design into account and enable us to study the effect of cluster-level variables on individual outcomes. Monseur & Adams (2009) explored different types of estimators for recovering variance components and a latent correlation with PISA data. They showed that it is important to take the hierarchical structure of the data into account both when secondary analyses are performed and when student proficiency estimates are generated. There is also a growing interest in fully Bayesian models (e.g.,


Gelman et al., 2013), as Bayesian models can take into account the complex clustered structure inherent in the sampling design that standard methods for multiple imputation can fail to capture. They can incorporate multiple levels, covariates at different levels, and multivariate latent variable models for cognitive responses (e.g., Johnson & Jenkins, 2005; Si & Reiter, 2013). All previously mentioned studies have shown that PVs are the most efficient way of recovering different parameter estimates such as the mean, variance, and correlation. However, to the best of our knowledge, no studies show how many PVs are actually needed for optimal and reliable parameter estimation in such large-scale assessments.

This paper is structured as follows. The next section describes proficiency estimation, followed by a description of the data used and the setup of the empirical and simulation studies. The fourth section presents the results from the empirical and simulation studies, and the last section contains conclusions and implications.

Proficiency estimation

Due to the large number of items (needed to ensure reliable measurement of achievement) and time limitations, students receive only a subset of all assessment items in studies like TIMSS, PISA, and NAEP. In TIMSS 2011, all 434 mathematics and science items in grade 8 are partitioned into a set of 14 student booklets. Student booklets are assembled from various combinations of item blocks, containing two blocks from mathematics and two blocks from science with 12--18 items per block (Mullis, Martin, Ruddock, O'Sullivan, & Preuschoff, 2011). The students' responses from the various booklets are linked together through the items because each item appears in two booklets. Each student completes only one booklet, resulting in a large number of responses that are missing by design. Students' proficiencies must thus be estimated taking the incomplete information into account. Several methods exist for drawing proficiencies; in this paper, four methods were used.

Maximum likelihood estimator

Assume we have a test of $n$ binary-scored items, and that we model items $l = 1, \ldots, n$ with an item response model. Let $\theta$ represent the proficiency (ability) of a student, let $P_l(\theta)$ be the probability of answering item $l$ correctly, and let $Q_l(\theta) = 1 - P_l(\theta)$ be the probability of answering it incorrectly. If $x$ is the vector of $n$ binary-scored item responses, then the maximum likelihood estimator (MLE) of $\theta$ is the value that maximizes the likelihood function

$$L(x \mid \theta) = \prod_{l=1}^{n} P_l(\theta)^{x_l} Q_l(\theta)^{1-x_l}. \quad (1)$$

The obtained MLE is a point estimator, which provides the same MLE proficiency estimate for every student with the same total score (Wu, 2005). One disadvantage of the MLE is that it is asymptotically biased when item response models are used (Lord, 1983). Another disadvantage, if it is used with a Rasch item response model, is that an examinee with a perfect number-correct score or a number-correct score of 0 will have an MLE proficiency estimate of $+\infty$ or $-\infty$, respectively (Yen & Fitzpatrick, 2006).

Weighted maximum likelihood estimator

The weighted maximum likelihood estimator (WLE) was introduced by Warm (1989) in order to correct the bias of the MLE and is defined as

$$\hat\theta_{\mathrm{WLE}} = \hat\theta_{\mathrm{MLE}} + \frac{J}{2I^2},$$

where $I$ is the test information and $J$ involves the first and second derivatives of the item response functions:

$$I = \sum_{l=1}^{n} \frac{P_l'(\theta)^2}{P_l(\theta)Q_l(\theta)}, \qquad J = \sum_{l=1}^{n} \frac{P_l'(\theta)P_l''(\theta)}{P_l(\theta)Q_l(\theta)}.$$

The WLE is also a point estimator and provides one proficiency estimate for each total score. It is asymptotically normally distributed with variance equal to the variance of the MLE.
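For the Rasch model the derivatives simplify ($P' = PQ$ and $P'' = PQ(Q - P)$), so the correction term $J/(2I^2)$ is cheap to compute. A sketch under that assumption, with hypothetical item difficulties:

```python
import math

def p(theta, b):
    # Rasch model probability of a correct response
    return 1.0 / (1.0 + math.exp(-(theta - b)))

def wle_correction(theta, bs):
    # Rasch model: P' = PQ and P'' = PQ(Q - P), so I and J reduce to sums
    I = sum(p(theta, b) * (1.0 - p(theta, b)) for b in bs)      # test information
    J = sum(p(theta, b) * (1.0 - p(theta, b)) * (1.0 - 2.0 * p(theta, b))
            for b in bs)
    return J / (2.0 * I ** 2)

bs = [-1.0, 0.0, 1.0]            # hypothetical item difficulties
theta_mle = 0.5                  # suppose this was the MLE
theta_wle = theta_mle + wle_correction(theta_mle, bs)
```

For a difficulty set symmetric around 0, the correction vanishes at $\theta = 0$ and is negative above it, pulling extreme MLEs back toward the center of the scale.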

Expected a-posteriori estimator

The Bayesian expected a-posteriori estimator (EAP) (Bock & Aitkin, 1981) is the mean of the posterior distribution for each student. Let $g(\theta)$ be a chosen prior distribution of the proficiency $\theta$; then we can define the EAP as

$$\hat\theta_{\mathrm{EAP}}(x) = \frac{\displaystyle\int \theta \, g(\theta) \prod_{l=1}^{n} P_{X_{il}}(\theta) \, d\theta}{\displaystyle\int g(\theta) \prod_{l=1}^{n} P_{X_{il}}(\theta) \, d\theta},$$

where $P_{X_{il}}(\theta)$ is the probability of student $i$ scoring $X_{il}$ on item $l$ conditional on the proficiency $\theta$ (Uebersax, 2002). Given a Rasch item response theory model, the EAP, like the previously mentioned likelihood estimators, also provides one estimate for each total score. A disadvantage of the EAP is that in a given population the variability of the EAP estimates of proficiency is typically less than the variability of $\theta$, because the EAP estimates are shrunk toward the mean (Kolen, 2006).


The EAP for a student can also be seen as the expected value of the PVs for that student, assuming the EAP is computed with a prior distribution containing the same information and model as the PV imputation model. If background information is not used in the estimation of the EAPs, they will be shrunk toward the population mean.
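The posterior mean in the EAP formula can be approximated by simple quadrature. A sketch with a standard normal prior and hypothetical Rasch difficulties (function names are ours):

```python
import math

def p(theta, b):
    # Rasch model probability of a correct response
    return 1.0 / (1.0 + math.exp(-(theta - b)))

def likelihood(theta, x, bs):
    # Product over items of P_l or Q_l, depending on the response
    return math.prod(p(theta, b) if xl else 1.0 - p(theta, b)
                     for xl, b in zip(x, bs))

def eap(x, bs, mu=0.0, sigma=1.0, nodes=2001, lo=-6.0, hi=6.0):
    # Posterior mean of theta under a N(mu, sigma^2) prior, via a plain grid
    step = (hi - lo) / (nodes - 1)
    num = den = 0.0
    for k in range(nodes):
        t = lo + k * step
        w = math.exp(-0.5 * ((t - mu) / sigma) ** 2) * likelihood(t, x, bs)
        num += t * w
        den += w
    return num / den

bs = [-1.0, -0.5, 0.0, 0.5, 1.0]     # hypothetical item difficulties
# Unlike the MLE, a perfect score still yields a finite estimate:
print(eap([1, 1, 1, 1, 1], bs))
```

The finite value for a perfect score, and the fact that it sits well inside the prior's range, are both consequences of the shrinkage toward the prior mean discussed above.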

Plausible values

To estimate students' proficiency, TIMSS uses the multiple imputation method, i.e. the PVs approach (Rubin, 1987; Mislevy, 1991). The PVs are generated using students' responses to the items and by conditioning on all available background data. By conditioning on all background data, relationships between these background variables and the estimated proficiencies are appropriately accounted for in the PVs (Mislevy et al., 1992).

Denote the responses of student $i$ to background questions by $y_i$, and student $i$'s item responses by $x_i$. The five PVs in TIMSS for each student $i$ are then drawn from the conditional distribution (Martin & Mullis, 2012)

$$P(\theta_i \mid x_i, y_i, \Gamma, \Sigma) \propto P(x_i \mid \theta_i, y_i, \Gamma, \Sigma)\, P(\theta_i \mid y_i, \Gamma, \Sigma) = P(x_i \mid \theta_i)\, P(\theta_i \mid y_i, \Gamma, \Sigma), \quad (2)$$

where $P(x_i \mid \theta_i)$ is any chosen item response model, $P(\theta_i \mid y_i, \Gamma, \Sigma)$ is the regression on the background variables, $\Gamma$ is a matrix of regression coefficients for the background variables, and $\Sigma$ is a common variance matrix of residuals. The result of the PVs approach is a set of drawn values, i.e. PVs, which we denote by $\hat{D}_m$, $m = 1, \ldots, M$, where $M \geq 1$ is the number of PVs drawn. Thus, the analysis must be carried out for every $\hat{D}_m$, and the final estimate is obtained by averaging all $M$ estimates:

$$\bar{D} = \frac{1}{M} \sum_{m=1}^{M} \hat{D}_m.$$

The variance of $\bar{D}$ is estimated by summing two components: the within-imputation variance $\bar{V} = \frac{1}{M}\sum_{m} V_m$, obtained by averaging the estimated variances $V_m$ of the PVs $\hat{D}_m$, and the between-imputation variance $B_M = \frac{1}{M-1}\sum_{m} (\hat{D}_m - \bar{D})^2$:

$$\mathrm{Var}(\bar{D}) = \bar{V} + \left(1 + \frac{1}{M}\right) B_M$$

(Mislevy, 1991; Schafer, 1997). It is important to note that the PVs approach gives consistent estimates of population parameters only if the PVs are generated using an imputation model that is compatible with the subsequent data analysis. Although the design of most large-scale assessments is hierarchical in nature, the population model underlying the generation of PVs is a single-level model that does not identify the hierarchical structure of the data (Monseur & Adams, 2009).
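The combining rules above are straightforward to apply. A minimal sketch in which a hypothetical regression coefficient has been estimated once per PV (all numbers invented for illustration):

```python
def combine(estimates, variances):
    """Pool per-PV estimates D_m and their sampling variances V_m into
    D_bar and Var(D_bar) using the combining rules above (Rubin, 1987)."""
    M = len(estimates)
    D_bar = sum(estimates) / M                                 # pooled estimate
    V_bar = sum(variances) / M                                 # within-imputation variance
    B = sum((d - D_bar) ** 2 for d in estimates) / (M - 1)     # between-imputation variance
    return D_bar, V_bar + (1.0 + 1.0 / M) * B                  # total variance

# Hypothetical coefficient estimates from five separate PV analyses:
ests = [0.52, 0.48, 0.55, 0.50, 0.45]
vars_ = [0.010, 0.011, 0.009, 0.010, 0.012]
D_bar, total_var = combine(ests, vars_)
print(round(D_bar, 3), round(total_var, 5))   # 0.5 0.01214
```

Note that the total variance exceeds the average within-imputation variance; ignoring the between-imputation term is exactly the standard-error underestimation that the user strategies examined later fall into.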

Conditioning variables

To estimate the characteristics of student populations and subpopulations (or subgroups, e.g. by gender, ethnicity, or socioeconomic status), the drawing of the PVs (or EAPs) must take the group structure into account (Wu, 2005). The population model can include several conditioning variables, which can yield a population distribution that is a mixture of many normal distributions. Such a distribution can be specified as

$$g(\theta) \sim N(\Gamma_0 + \Gamma_1 z_1 + \Gamma_2 z_2 + \cdots, \sigma^2),$$

where $z_1, z_2, \ldots$ are conditioning or background variables. By including all available background data in the model and correctly specifying the relationships to be addressed in secondary analysis, relationships between the background variables and the estimated proficiencies will be appropriately accounted for in the PVs (Mislevy et al., 1992; Martin & Mullis, 2012).

Statistical analysis

Data

The publicly available data from the international large-scale assessment study TIMSS 2011, grade 8 mathematics (IEA, 2011), were used. Three countries were chosen for the real data analysis based on their average mathematics achievement, since they represent different parts of the achievement scale: below the international average mathematics score (Sweden), close to the average score (Slovenia), and above the average score (USA). Simulated data, which mimic the TIMSS database, were also used, as described in detail later in the paper.

A full MLM was used in the empirical and simulation studies. Constructed student-level and school-level factors were used. The mathematics achievement of each student was modeled as a function of the school factors, controlling for the student-level factors. The student-level and school-level factors were of secondary interest and were thus chosen based on a previous study (Wiberg, Rolfsman & Laukaityte, 2013). The factors used, derived from the student and school questionnaires, are presented below. The response variable is the students' mathematics achievement, described by different combinations of the PVs.


Student level factors

[ATM] Attitude towards mathematics: This was based on students' responses to: (1) I enjoy learning mathematics, (2) Mathematics is boring, and (3) I like mathematics. Possible responses were: 1 = Agree a lot, 2 = Agree a little, 3 = Disagree a little, 4 = Disagree a lot. Note that (2) was reverse coded. The responses were averaged and classified into three categories: 1 = Low, if the average was greater than or equal to 3; 2 = Medium, if the average was greater than 2 and less than 3; 3 = High, if the average was less than or equal to 2.

[SES] Socioeconomic status: This was based on students' responses to the following indicators: (1) Books at home: 1 = 0--10 books, 2 = 11--25 books, 3 = 26--100 books, 4 = 101--200 books, and 5 = more than 200 books; recoded into 1 = Low (0--25 books), 2 = Medium (26--200 books), and 3 = High (more than 200 books). (2) Possession of educational home resources (computer and study desk), categorized as 1 = student had none or one item and 2 = student had both items. The two indicators were averaged and classified into: 1 = Low, if the average was equal to 1; 2 = Medium, if the average was greater than 1 and less than or equal to 2; 3 = High, if the average was greater than 2.
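The ATM classification rule can be written down directly. A sketch (the function name and numeric category codes follow the coding above; item (2) is assumed to be reverse coded already):

```python
def atm_category(r1, r2_reversed, r3):
    """Classify attitude towards mathematics from three 1-4 responses
    (1 = Agree a lot ... 4 = Disagree a lot); returns 1=Low, 2=Medium, 3=High."""
    avg = (r1 + r2_reversed + r3) / 3.0
    if avg >= 3.0:
        return 1          # Low attitude: mostly disagrees
    if avg <= 2.0:
        return 3          # High attitude: mostly agrees
    return 2              # Medium

print(atm_category(1, 2, 1))   # 3 (High): the student mostly agrees a lot
print(atm_category(4, 3, 4))   # 1 (Low)
```

The boundary conventions match the text: an average of exactly 2 counts as High and an average of exactly 3 counts as Low.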

School level factors

[GA] Good attendance: The school principals' answers to how severe each of two negative student behaviors, arriving late at school and absenteeism, is among eighth-grade students at the school. Possible responses were: 1 = Not a problem, 2 = Minor problem, 3 = Moderate problem, 4 = Serious problem. The responses were summed and classified into three categories: 1 = Low, if the sum was greater than 6; 2 = Medium, if the sum was greater than 3 and less than or equal to 6; 3 = High, if the sum was less than or equal to 3.

[SLOC] School location: The school principals' answers regarding the number of people living in the area where the school is located. Responses were classified into two categories: 0 = Rural areas (0--50,000) and 1 = Urban areas (more than 50,000).

The characteristics of the constructed factors are given in detail in Table 1. These factors are used in the empirical and simulation studies that follow. In the simulation study, the factors are used as conditioning variables. If conditioning variables are categorical and have more than two categories, they must be recoded into dummy variables. In order to simplify the simulation study, the factors ATM, SES and GA were simulated as binary conditioning variables. In large-scale assessments, sampling weights are used to avoid bias in the parameter estimates, which can arise due to unequal probabilities of selecting a school, a class, and a student, or due to some units' non-response. The full MLM contained student-related factors weighted with student weights on the student level. Student weights were calculated by multiplying student and classroom weighting factors by their respective weighting adjustments. On the school level, school-related factors and aggregated means of the student-level measures (aATM and aSES), weighted with school weights, were used. Thus, the full MLM was defined as

Level 1 (within schools):

$$Y_{ij} = \beta_{0j} + \beta_{1j}(ATM)_{ij} + \beta_{2j}(SES)_{ij} + r_{ij}, \quad i = 1, \ldots, N,$$

Level 2 (between schools):

$$\beta_{0j} = \gamma_{00} + \gamma_{01}(GA)_j + \gamma_{02}(SLOC)_j + \gamma_{03}(aATM)_j + \gamma_{04}(aSES)_j + u_{0j}, \quad j = 1, \ldots, J,$$
$$\beta_{1j} = \gamma_{10}, \quad \beta_{2j} = \gamma_{20},$$


where $Y_{ij}$ denotes the mathematics achievement of student $i$ within school $j$, $r_{ij}$ is the error term representing a unique effect associated with student $i$ in school $j$, $u_{0j}$ is the error term representing a unique effect associated with school $j$, and $\beta_{1j}$, $\beta_{2j}$ and $\gamma_{0k}$ ($k = 1, \ldots, 4$) are the regression parameters for the level 1 and level 2 explanatory variables, respectively.
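To make the two-level structure concrete, it can be simulated directly. A sketch with invented coefficient values, binary predictors standing in for the constructed factors, and the aggregated means and sampling weights omitted for simplicity:

```python
import random

random.seed(2011)

def simulate_mlm(J=80, n=50, g00=0.0, g01=0.3, g02=0.2, g10=0.25, g20=0.15,
                 tau=1.0, sigma=1.0):
    """Draw data from the level-1/level-2 model sketched above:
    school intercepts b0j = g00 + g01*GA + g02*SLOC + u_0j,
    student outcomes y = b0j + g10*ATM + g20*SES + r_ij."""
    rows = []
    for j in range(J):
        ga, sloc = random.randint(0, 1), random.randint(0, 1)
        b0j = g00 + g01 * ga + g02 * sloc + random.gauss(0.0, tau)      # + u_0j
        for i in range(n):
            atm, ses = random.randint(0, 1), random.randint(0, 1)
            y = b0j + g10 * atm + g20 * ses + random.gauss(0.0, sigma)  # + r_ij
            rows.append((j, ga, sloc, atm, ses, y))
    return rows

data = simulate_mlm()
print(len(data))    # 4000 students in 80 schools
```

Fitting slopes as fixed ($\beta_{1j} = \gamma_{10}$, $\beta_{2j} = \gamma_{20}$) while letting only the intercept vary by school mirrors the specification above.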

Empirical study

Real data from TIMSS 2011 were used to construct the student-level and school-level factors, which were modeled with the full MLM for the three countries using MPLUS 7 (Muthén & Muthén, 2012). The main interest was to examine how the MLM analysis was affected by using different PV strategies. The TIMSS 2011 database provides five PVs for mathematics achievement. Missing data rates at the student level were low, ranging from a minimum of 1% for SES in Slovenia to a maximum of 5% for ATM in Sweden. For the sake of simplicity, listwise deletion was therefore used (Tabachnik and Fidell, 2007). The full-information maximum likelihood procedure was used to handle the missing data at the school level, because excluding a school means excluding all students at that school (Schafer & Graham, 2002).

Simulation study

The purpose of the simulation study was to investigate the properties of PVs in MLMs and to compare them with other estimators such as the WLE, MLE and EAP. The simulation study consisted of two parts. In the first part, the intention was to compare the effectiveness of recovering the population mean and variance using different numbers of PVs (1, 5, 7, 10, 20, 40, and 100) and different proficiency estimators (WLE, MLE, EAP, and PVs). For this purpose, a population model with four simulated binary conditioning variables (ATM, SES, GA and SLOC) was used. The conditioning variables were simulated using a random generation function for the Bernoulli distribution in the R software (R Development Core Team, 2014). The proportions of the categories were taken from the real data from Sweden. Using the Conquest software, the population mean and variance for each of the earlier mentioned estimators were then compared under the following conditions:

- mean set at 0 or 2 on both (student and school) levels of the proficiency (ability) distributions, with the between- and within-variances set at 1,

- different test lengths (20 and 40 items),

- different numbers of test takers (4,000 and 8,000 students).

Item parameters were randomly generated from a uniform distribution U(-2, 2). The simulations were repeated 100 times.

In the second part of the simulation study, the full MLMs were used to compare parameter estimates obtained using different numbers of PVs in two different cases:

- using 50% of the items. The responses of 4,000 students (80 schools with 50 students in each) were simulated for a 40-item test with difficulty distribution U(-2, 2), using a simple Rasch item response theory model. Part of the responses was then deleted following the TIMSS matrix-sampling design, as follows. The 40 simulated dichotomous items were randomly assigned to four blocks, A, B, C, and D, with 10 items per block. Every student received two blocks, i.e. 20 items out of 40. Blocks were combined in the following way: (AB), (BC), (CD), (DA). Under this design, every student responds to 50% of the items. Proficiencies were drawn from the conditional distribution in Equation (2).

- using 25% of the items. In this case, the same responses simulated for the previous case (before the elimination of some responses) were used. Part of the responses was deleted following the TIMSS matrix-sampling design. The 40 simulated dichotomous items were randomly assigned to eight blocks, A, B, C, D, E, F, G and H, with 5 items per block. Every student received two blocks, i.e. 10 items out of 40. Blocks were combined in the following way: (AB), (BC), (CD), (DE), (EF), (FG), (GH), (HA). Under this design, every student responds to 25% of the items. Proficiencies were drawn from the conditional distribution in Equation (2).
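The 50% matrix-sampling scheme above is easy to reproduce. A sketch of the block rotation (the item numbering and the rotation rule for assigning booklets to students are our own choices):

```python
import random

random.seed(1)

items = list(range(40))
random.shuffle(items)                      # random assignment of items to blocks
blocks = {name: items[k * 10:(k + 1) * 10] for k, name in enumerate("ABCD")}
booklets = ["AB", "BC", "CD", "DA"]        # each block appears in two booklets

def administered(student):
    # Students are rotated through the four booklets
    name = booklets[student % 4]
    return sorted(blocks[name[0]] + blocks[name[1]])

# Every student answers 20 of the 40 items; the rest are missing by design.
# Two students with non-adjacent booklets jointly cover the whole item pool:
print(len(administered(0)), len(set(administered(0)) | set(administered(2))))
```

Because each block sits in exactly two booklets, every item is answered by half of the students, which is what links the booklets together on a common scale.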

In addition to the simulated items, the four binary conditioning variables ATM, SES, GA and SLOC, simulated in the first part of the simulation study, were used. The simulations were repeated 30 times, as they are very time-consuming to conduct. For each of the 30 replicates, an item response model was estimated from the data using Conquest 3.0 (Wu, Adams, Wilson & Haldane, 2007), and different numbers (1, 3, 5, 7, 10, 20, 40, 50, 60, and 100) of PVs, as well as the MLE, WLE and EAP, were drawn. The PVs and EAPs were drawn from the simulated item responses, taking into account the four conditioning variables in the same way as in large-scale assessments (using a single-level population model). The MLE and WLE estimates depend only on the item response data and are thus not influenced by the conditioning variables. After the proficiencies had been drawn, the full MLM was run for each replicate using MPLUS 7 (Muthén & Muthén, 2012).


Results

Empirical study

The empirical study shows that the results obtained using only one PV out of five differ greatly from one another. The parameter estimates are also different from those obtained using all five PVs; see Table 2. By saying that all five PVs are used, we mean that the analysis is carried out separately with each of the five imputed PV data sets, and the results are then combined at the level of the targeted inference, as shown in the subsection on PVs. Using the average of the five PVs gives the results closest to those from using all five PVs; the parameter estimates for the student-level factors are even identical. The largest difference between the average of the PVs and all five PVs is in the estimation of the within-school variance: in the models for all three countries, the within-school variances are much smaller when the average of the PVs is used. When using a single PV, the parameter estimates and standard errors are smaller or larger than those obtained using all five PVs, depending on which PV is chosen. We should also note that there is no clear difference in the PV estimation between the differently performing countries.

Simulation study

The simulation results for the 20-item test are presented in Table 3. When abilities are drawn by setting the mean at 0 and the between- and within-variances at 1, the PVs give estimates closest to the generating values, especially with regard to the variance. Increasing the number of students to 8,000 by doubling the number of schools yields better variance estimation but more biased means. If, instead of increasing the number of schools, we double the size of the schools, both the mean and variance estimates become more biased. It is also of interest to note that, comparing different numbers of PVs, there is no clear evidence that a larger number of PVs gives more accurate results.

If we increase the mean of the proficiency distribution to 2 on both levels, i.e. if we examine a test where the average ability of the population is higher than the average item difficulty, the estimates become more biased, especially the variances. As in the previous case, the increase in the number of students slightly reduced the bias of the variance estimates but increased the bias of the mean.

Increasing the test length to 40 items reduced the bias in the estimates when 8,000 students were used, but estimation became less precise for a test with 4,000 students and abilities with a mean of 0. The PVs again recover the variances better than the point estimators do. Results for the 40-item test have been omitted, as they are similar to those for the 20-item test. The analysis of the real and simulated data showed that using a single PV is not the best choice for the estimation of the population parameters, as estimates vary from one PV to another. The obvious question is therefore how many PVs should be chosen to get reliable results.

The obvious question is therefore what number of PVs should be chosen to get reliable results.

The case in which students get a test containing 50% of all possible items is presented in Table 4. The generating values for the means of the four conditioning variables and for the within- and between-variances were set equal to 1. As Table 4 shows, in all presented cases with PVs we obtain very similar estimates, with slightly larger differences in the within-school variance. The standard errors of the estimates become relatively stable starting with ten PVs; in this case, ten PVs therefore appear to be sufficient to obtain reliable estimates. If we compare the PVs with the other estimators, we can see that the MLE and EAP estimates of the student- and school-level factors are very close to those of the PVs. However, the EAP underestimates the within- and between-school variances, the WLE underestimates the between-school variance and overestimates the within-school variance, and the MLE overestimates both the within- and between-school variances. The within-school variance closest to the true value was obtained with the WLE estimator, the between-school variance with the MLE estimator.

In the case when students get a test containing 25% of all possible items, more than 20 PVs should be drawn (see Table 5). The decrease in the number of provided items did not change the relations between the PVs, MLE, WLE and EAP estimators; however, the parameter estimates, and especially the variances, became worse.

Conclusion and Implications

The interest in international large-scale assessment databases is increasing rapidly around the world. Most of these databases include PVs so that appropriate calculations can be performed and valid conclusions drawn. Because of the complexity of the modeling or because of software limitations, some researchers tend to use only the first or a randomly selected PV, which leads to biased results. The analysis of the real data showed that biased results are obtained if PVs are used inappropriately. When only one or a few PVs are used, the parameter estimates vary greatly and the quality of estimation depends heavily on which PV is chosen. Further, our study showed that the MLM estimation results based on the average of the PVs are very close to those obtained using all five PVs when analyzing the TIMSS 2011 data, but, as expected, the standard errors and the within-school variance differ. Our results are in line with those previously obtained by Carstens and Hastedt (2010), who examined strategies for means and variances, although they did not analyze MLM.
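The shrinkage that makes the averaged-PV within-school variance too small is easy to reproduce in a single-level sketch; the reliability, error variance, and number of students below are illustrative assumptions. Averaging the M PVs per student before computing the variance shrinks the imputation noise by a factor of 1/M, whereas analyzing each PV separately and then averaging the results does not:

```python
import random
import statistics

random.seed(0)
M = 5                    # number of plausible values, as in TIMSS
n = 50_000               # students (illustrative)
s2 = 0.5                 # measurement-error variance (assumed)
rho = 1 / (1 + s2)       # reliability under a N(0, 1) ability prior

theta = [random.gauss(0, 1) for _ in range(n)]
x = [t + random.gauss(0, s2 ** 0.5) for t in theta]
# M posterior draws (plausible values) per student
pvs = [[random.gauss(rho * xi, (rho * s2) ** 0.5) for xi in x] for _ in range(M)]

# Correct strategy: estimate the variance from each PV, then average the results.
var_pooled = statistics.mean(statistics.pvariance(pv) for pv in pvs)

# Biased strategy: average the M PVs per student first, then analyze once.
avg_pv = [statistics.mean(vals) for vals in zip(*pvs)]
var_avg = statistics.pvariance(avg_pv)

print(round(var_pooled, 2))   # close to the true value of 1
print(round(var_avg, 2))      # about 0.73: the variance is underestimated
```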

From the simulation results, it was evident that the PV-based estimates recovered the population parameters better than any of the point estimators, although in general the differences between all estimates were quite small. The MLM analysis showed that when the tests contained 50% and 25% of all possible items, only small differences were observed across the different numbers of PVs; using about ten and twenty PVs, respectively, appears to be sufficient to obtain reliable estimates. Note that we used a 40-item test as the full-length test, and removing 50% or 75% of the items may mean that we removed too much information, which could cause this behavior. On the other hand, in TIMSS 2007 every student receives only 11-17% of all possible items, which indicates that the examined cases are realistic. Also note, as previously mentioned, that the secondary biases caused by under-specification of the imputation model (in this case, the PV generation model) tend to decline roughly in proportion to the test reliability.

In a nutshell, researchers who analyze large-scale assessments should use neither a single PV nor the average of the PVs, in order to avoid an underestimation of the standard errors. This study has shown that this holds not only when means and variances are of interest but also when MLM is used, which is important to emphasize in light of the increasing popularity of MLM with large-scale assessment data. From the simulation study, we could also conclude that in some cases the precision of the estimates can be increased if more than five PVs are used. It is not, however, clear exactly which these cases are. In the future, one should investigate further under which conditions more PVs are needed, and exactly how many PVs are then required to obtain reliable parameter estimates. It would also be of great interest to perform a similar simulation study with PVs generated using a multilevel latent variable PVs approach.
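The appropriate way to combine results across PVs is given by Rubin's (1987) rules: average the per-PV point estimates, and add the between-PV variance, inflated by 1 + 1/M, to the average sampling variance. A minimal sketch with purely illustrative numbers:

```python
import math
import statistics

def pool_pv_results(estimates, sampling_variances):
    """Combine M per-PV analysis results via Rubin's (1987) rules."""
    M = len(estimates)
    q_bar = statistics.mean(estimates)            # pooled point estimate
    W = statistics.mean(sampling_variances)       # average within-PV sampling variance
    B = statistics.variance(estimates)            # between-PV variance
    T = W + (1 + 1 / M) * B                       # total variance
    return q_bar, math.sqrt(T)                    # estimate and its standard error

# Five hypothetical per-PV mean achievements and their squared standard errors.
est, se = pool_pv_results([505.1, 506.4, 504.8, 505.9, 505.3],
                          [4.0, 4.1, 3.9, 4.0, 4.2])
print(est)             # 505.5
print(round(se, 2))    # 2.13: larger than any single-PV standard error (about 2.0)
```

The between-PV component is exactly the imputation uncertainty that is lost when only one PV, or the average of the PVs, is analyzed.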


References

Beaton, A. E., & Gonzalez, E. (1995). NAEP primer. Chestnut Hill, MA: Boston College.

Bock, R. D., & Aitkin, M. (1981). Marginal maximum likelihood estimation of item parameters:

Application of an EM algorithm. Psychometrika, 46, 443–459.

Bodner, T. E. (2008). What improves with increased missing data imputations? Structural

Equation Modeling: A Multidisciplinary Journal, 15, 651–675.

Carstens, R., & Hastedt, D. (2010). The effect of not using plausible values when they should be: An illustration using TIMSS 2007 grade 8 mathematics data. Proceedings of the IRC-2010. Retrieved from http://www.iea.nl/fileadmin/user_upload/IRC/IRC_2010/Papers/IRC2010_Carstens_Hastedt.pdf

Gelman, A., Carlin, J. B., Stern, H. S., Dunson, D. B., Vehtari, A., & Rubin, D. B. (2013).

Bayesian data analysis (3rd ed.). Boca Raton: Chapman and Hall/CRC.

Gelman, A., & Hill, J. (2006). Data analysis using regression and multilevel/hierarchical

models. New York, NY: Cambridge University Press.

Graham, J. W., Olchowski, A. E., & Gilreath, T. D. (2007). How many imputations are really

needed? Some practical clarifications of multiple imputation theory. Prevention Science,

8, 206–213.

Johnson, M. S., & Jenkins, F. (2005). A Bayesian hierarchical model for large-scale educational surveys: An application to the National Assessment of Educational Progress. Research Report RR-04-38. Princeton, NJ: ETS.

IEA. (2011). TIMSS 2011 international database and user guide. Retrieved from

http://timssandpirls.bc.edu/timss2011/international-database.html

Kolen, M. J. (2006). Item Response Theory. In R. L., Brennan (Ed.), Educational Measurement

(Fourth Edition) (pp. 155–186). Westport, CT: Praeger.

Kuhfeld, M. R. (2016). Multilevel item factor analysis and student perceptions of teacher effectiveness (Doctoral dissertation). UCLA. Retrieved from http://escholarship.org/uc/item/076175k5

Li, T. (2012). Randomization-based inference about latent variables from complex samples: The case of two-stage sampling (Doctoral dissertation). University of Maryland, College Park. Retrieved from http://drum.lib.umd.edu/bitstream/handle/1903/12514/Li_umd_0117E_12668.pdf?sequence=1&isAllowed=y

Lord, F. M. (1983). Unbiased estimators of ability parameters, of their variance, and of their

parallel–forms reliability. Psychometrika, 48, 233–245.

Martin, M. O., & Mullis, I. V. S. (Eds.). (2012). Methods and procedures in TIMSS and PIRLS

2011. Chestnut Hill, MA: TIMSS & PIRLS International Study Center, Boston College.

Mislevy, R. J. (1991). Randomization–based inference about latent variables from complex

samples. Psychometrika, 56, 177–196.

Mislevy, R. J. (2013). On the proportion of missing data in classical test theory. Research

Memorandum ETS RM-13-06. Princeton, NJ: Educational Testing Service.

Mislevy, R. J., Beaton, A. E., Kaplan, B., & Sheehan, K. M. (1992). Estimating population

characteristics from sparse matrix samples of item responses. Journal of Educational

Measurement, 29, 133–161.


Monseur, C., & Adams, R. (2009). Plausible values: How to deal with their limitations. Journal of Applied Measurement, 10(3), 320–334.

Muthén, L. K., & Muthén, B. O. (2012). Mplus (Version 7) [Computer software]. Los Angeles, CA: Muthén & Muthén.

Mullis, I. V. S., Martin, M. O., Ruddock, G. J., O'Sullivan, C. Y., & Preuschoff, C. (2011).

TIMSS 2011 Assessment Frameworks. Chestnut Hill, MA: TIMSS & PIRLS International

Study Center, Boston College.

Orchard, T., & Woodbury, M. A. (1972). A missing information principle: Theory and

applications. In L. M. Le Cam, J. Neyman, & E. L. Scott (Eds.), Proceedings of the 6th

Berkeley Symposium on Mathematical Statistics and Probability (Vol. 1, pp. 697–715).

Berkeley: University of California Press.

R Development Core Team (2014) [Computer software]. R: A language and environment for

statistical computing. R Foundation for Statistical Computing, Vienna, Austria. Retrieved

from http://www.R-project.org/.

Rijmen, F., Jeon, M., von Davier, M., & Rabe-Hesketh, S. (2014). A General Psychometric

Approach for Educational Survey Assessments: Flexible Statistical Models and Efficient

Estimation Methods. In L. Rutkowski, M. von Davier, & D. Rutkowski (eds.), Handbook

of International Large-Scale Assessment: Background, Technical Issues, and Methods of

Data Analysis (pp. 583–606). Boca Raton: CRC Press.

Rubin, D. B. (1987). Multiple imputation for nonresponse in surveys. New York: Wiley.

Schafer, J. L. (1997). Analysis of incomplete multivariate data. London: Chapman & Hall.

Schafer, J. L., & Graham, J.W. (2002). Missing data: Our view of the state of the art.

Psychological Methods, 7(2), 147–177.

Si, Y., & Reiter, J. P. (2013). Nonparametric Bayesian Multiple Imputation for Incomplete

Categorical Variables in Large-Scale Assessment Surveys. Journal of Educational and

Behavioral Statistics, 38(5), 499–521. DOI: 10.3102/1076998613480394

Tabachnick, B. G., & Fidell, L. S. (2007). Using multivariate statistics (5th ed.). Pearson Education, Inc.

Uebersax, J. S. (2002). Expected A Posteriori (EAP) Measures. Rasch Measurement

Transactions, 16(3), 891.

Von Davier, M., Gonzalez, E., & Mislevy, R. J. (2009). What are plausible values and why are they useful? IERI Monograph Series: Issues and Methodologies in Large-Scale Assessments, 2, 9–36. Retrieved from http://www.ierinstitute.org/fileadmin/Documents/IERI_Monograph/IERI_Monograph_Volume_02_Chapter_01.pdf

Warm, T. A. (1989). Weighted likelihood estimation of ability in item response theory. Psychometrika, 54, 427–450.

Wiberg, M., Rolfsman, E., & Laukaityte, I. (2013). School effectiveness in mathematics in Sweden and Norway 2003, 2007 and 2011. Proceedings of the IRC-2013. Retrieved from http://www.iea.nl/fileadmin/user_upload/IRC/IRC_2013/Papers/IRC-2013_Wiberg_etal.pdf

Wu, M. (2005). The role of plausible values in large–scale surveys. Studies in Educational

Evaluation, 31, 114–128.


Wu, M. L., Adams, R. J., Wilson, M. R., & Haldane, S. A. (2007). ConQuest: Generalised item

response modeling software. Version 2.0 [Computer program & Manual], Camberwell:

Australian Council for Educational Research.

Yang, J. S., & Seltzer, M. (2016). Handling measurement error in predictors with a multilevel

latent variable plausible values approach. In S. N. Beretvas, J. Harring, & L. Stapleton

(Eds.), Advances in Multilevel Modeling for Educational Research: Addressing Practical

Issues Found in Real-World Applications. Charlotte, NC: Information Age Publishing,

Inc.

Yen, W. M., & Fitzpatrick, A. R. (2006). Item Response Theory. In R. L., Brennan (Ed.),

Educational Measurement (4th ed.) (pp. 112–153). Westport, CT: Praeger.


Table 1. Student-level and school-level characteristics of the investigated factors from the simulated and real data.

                          Simulated data   Real data
                                           Sweden        Slovenia      USA
Student factors
  ATM                     78               2.15 (0.01)   1.75 (0.02)   2.28 (0.01)
  SES                     26               2.20 (0.01)   2.09 (0.01)   2.05 (0.01)
School factors
  GA                      62               2.14 (0.07)   2.57 (0.04)   2.31 (0.05)
  SLOC                    42               33.3          14.0          41.9
Mathematics achievement                    484 (1.9)     505 (2.2)     509 (2.6)

Note: The first three factors are described by the mean with the standard deviation in parentheses, and SLOC by the proportion of urban schools (%). Simulated factors are described by the proportion (%) of High.


Table 2. Estimates (standard errors in parentheses) with the full MLMs using different PV strategies (real data).

Parameter        5 PVs      1st PV     2nd PV     3rd PV     4th PV     5th PV     Average of 5 PVs

Sweden
Intercept        244.51***  228.37***  265.52***  258.27***  231.99***  240.68***  244.22***
                 (49.65)    (45.57)    (42.72)    (47.03)    (45.58)    (50.84)    (45.47)
Fixed effects
ATM              26.27***   26.55***   26.00***   25.89***   25.77***   27.16***   26.27***
                 (1.85)     (1.56)     (1.63)     (1.78)     (1.81)     (1.87)     (1.64)
SES              28.66***   28.11***   27.99***   27.62***   29.02***   30.58***   28.66***
                 (2.76)     (2.20)     (2.23)     (2.56)     (2.49)     (2.68)     (2.30)
aATM             14.48      14.68      9.60       15.77      16.49      15.56      14.78
                 (14.48)    (15.34)    (13.02)    (13.91)    (13.91)    (16.11)    (14.02)
aSES             72.76***   78.48***   68.82***   66.05***   76.79***   73.16***   72.41***
                 (15.93)    (13.94)    (14.53)    (14.58)    (15.20)    (16.00)    (14.69)
GA               23.90***   25.49***   23.06***   22.96***   23.23***   24.45***   24.07***
                 (5.54)     (5.59)     (4.45)     (5.80)     (5.29)     (5.88)     (5.24)
SLOC             -1.51      -2.34      -0.36      -1.73      -1.45      -1.54      -1.49
                 (4.82)     (4.80)     (4.61)     (4.75)     (4.76)     (4.84)     (4.70)
Variance
Within schools   3543.362   3477.103   3495.71    3598.32    3569.83    3575.55    3134.89
Between schools  264.106    269.534    260.96     269.49     262.65     260.77     265.39

Slovenia
Intercept        363.08***  359.93***  372.86***  347.80***  360.31***  374.38***  362.71***
                 (61.55)    (61.25)    (57.65)    (60.90)    (60.53)    (61.57)    (59.94)
Fixed effects
ATM              21.79***   21.38***   20.23***   21.20***   22.45***   23.67***   21.78***
                 (2.32)     (1.85)     (1.84)     (1.93)     (1.66)     (1.79)     (1.73)
SES              34.28***   32.68***   35.96***   32.74***   34.98***   35.05***   34.28***
                 (3.36)     (2.82)     (2.79)     (3.09)     (2.84)     (3.15)     (2.78)
aATM             3.12       -1.03      4.47       5.78       5.84       0.68       3.07
                 (10.48)    (10.13)    (9.86)     (10.42)    (9.39)     (9.73)     (9.70)
aSES             63.32*     68.05**    58.20*     67.95**    61.41*     60.90*     63.48**
                 (24.64)    (24.63)    (22.87)    (24.41)    (24.47)    (24.33)    (23.90)
GA               1.92       1.84       1.44       2.56       2.54       1.23       1.95
                 (3.78)     (3.72)     (3.68)     (3.95)     (3.58)     (3.69)     (3.67)
SLOC             -5.77      -5.90      -5.68      -7.73      -3.55      -6.01      -5.80
                 (7.95)     (7.87)     (7.48)     (7.75)     (7.93)     (7.88)     (7.71)
Variance
Within schools   4059.69    3981.17    4031.81    4146.52    4035.34    4102.21    3679.95
Between schools  390.99     410.13     362.66     421.29     377.39     390.35     392.29

USA
Intercept        220.19***  199.58***  199.64***  210.00***  260.46***  231.34***  220.89***
                 (40.11)    (23.18)    (23.48)    (25.35)    (35.64)    (31.67)    (26.70)
Fixed effects
ATM              15.80***   14.83***   16.91***   15.42***   15.44***   16.38***   15.80***
                 (1.90)     (1.69)     (1.52)     (1.65)     (1.74)     (1.75)     (1.58)
SES              8.72***    9.79***    9.25***    7.95***    7.51***    9.09***    8.72***
                 (2.32)     (2.14)     (2.40)     (2.01)     (1.66)     (2.08)     (1.92)
aATM             1.26       4.57       5.07       2.83       -6.03      -0.15      1.10
                 (13.55)    (12.47)    (12.30)    (12.58)    (12.71)    (12.89)    (12.45)
aSES             145.95***  152.13***  152.32***  147.17***  135.76***  142.31***  145.73***
                 (15.97)    (13.63)    (13.35)    (13.80)    (14.60)    (14.57)    (13.70)
GA               -4.00      -4.26      -5.01      -1.99      -4.39      -4.33      -3.96
                 (7.78)     (7.64)     (7.53)     (7.76)     (7.62)     (7.84)     (7.64)
SLOC             -6.87      -6.78      -5.99      -7.36      -8.54      -5.66      -6.84
                 (8.73)     (8.27)     (8.55)     (8.82)     (8.96)     (8.60)     (8.60)
Variance
Within schools   2407.17    2383.80    2433.81    2424.12    2370.40    2423.70    2005.14
Between schools  1652.78    1579.69    1610.71    1677.85    1737.79    1658.27    1647.42

*** p-value < 0.001; ** p-value < 0.01; * p-value < 0.05.


Table 3. Estimated population mean and variance of proficiency (standard errors in parentheses) for a 20-item test with difficulties from U(-2, 2), the between- and within-school variances set to 1, and using four conditional variables; estimates are averaged over 100 replications.

Estimate   MLE       WLE       EAP       1 PV      5 PVs     7 PVs     10 PVs    20 PVs    40 PVs    100 PVs   Generating value

4,000 students, abilities with the mean set at 0 on both levels
Mean       2.0844    1.9283    2.0372    2.0542    2.0295    2.0293    2.0302    2.0303    2.0302    2.0299    2.02
           (0.0289)  (0.0264)  (0.0247)  (0.0314)  (0.0291)  (0.0291)  (0.0291)  (0.0288)  (0.0290)  (0.0288)
Variance   3.3405    2.7962    2.4450    2.9283    2.8572    2.8579    2.8615    2.8591    2.8586    2.8588    ~2.90
           (0.0747)  (0.0625)  (0.0547)  (0.0767)  (0.0776)  (0.0771)  (0.0773)  (0.0752)  (0.0755)  (0.0754)

4,000 students, abilities with the mean set at 2 on both levels
Mean       4.1538    3.6998    5.0546    5.0484    5.0330    5.0329    5.0328    5.0329    5.0329    5.0329    ~6.0
           (0.0071)  (0.0063)  (0.0102)  (0.0101)  (0.0100)  (0.0100)  (0.0100)  (0.0100)  (0.0100)  (0.0100)
Variance   0.2008    0.1580    0.4191    0.4002    0.3974    0.3984    0.3984    0.3984    0.3984    0.3984    ~2.90
           (0.0045)  (0.0035)  (0.0094)  (0.0091)  (0.0090)  (0.0090)  (0.0090)  (0.0090)  (0.0090)  (0.0090)

8,000 students, abilities with the mean set at 0 on both levels
Mean       3.3360    3.0394    3.3702    3.3634    3.3593    3.3596    3.3591    3.3596    3.3597    3.3595    2.02
           (0.0185)  (0.0166)  (0.0173)  (0.0217)  (0.0210)  (0.0209)  (0.0207)  (0.0207)  (0.0206)  (0.0205)
Variance   2.7423    2.2230    2.4039    2.8850    2.8665    2.8677    2.8673    2.8673    2.8683    2.8676    ~2.90
           (0.0434)  (0.0352)  (0.0380)  (0.0550)  (0.0533)  (0.0527)  (0.0524)  (0.0518)  (0.0517)  (0.0517)

8,000 students, abilities with the mean set at 2 on both levels
Mean       4.2924    3.8585    4.9011    4.9107    4.9168    4.9167    4.9166    4.9167    4.9168    4.9167    ~6.0
           (0.0061)  (0.0054)  (0.0041)  (0.0061)  (0.0063)  (0.0063)  (0.0063)  (0.0063)  (0.0063)  (0.0063)
Variance   0.2935    0.2362    0.1332    0.2878    0.2899    0.2899    0.2899    0.2899    0.2899    0.2899    ~2.90
           (0.0046)  (0.0037)  (0.0021)  (0.0051)  (0.0051)  (0.0051)  (0.0051)  (0.0050)  (0.0050)  (0.0050)


Table 4. Estimates (standard errors in parentheses) obtained from the full MLM using a 40-item test containing 50% of all possible items; averaged over 30 simulations.

Parameter  MLE      WLE      EAP      PV1      PV3      PV5      PV7      PV10     PV20     PV40     PV50     PV60     PV100

Intercept  0.303    0.295    0.195    0.199    0.197    0.195    0.192    0.202    0.198    0.198    0.198    0.198    0.200
           (0.220)  (0.203)  (0.176)  (0.178)  (0.179)  (0.178)  (0.180)  (0.180)  (0.179)  (0.179)  (0.179)  (0.179)  (0.179)
Fixed effects
ATM        1.028    0.939    1.038    1.037    1.036    1.035    1.042    1.035    1.036    1.039    1.037    1.037    1.037
           (0.050)  (0.046)  (0.039)  (0.050)  (0.053)  (0.050)  (0.051)  (0.053)  (0.052)  (0.052)  (0.052)  (0.052)  (0.052)
SES        0.938    0.847    0.992    0.981    0.990    0.993    0.990    0.991    0.990    0.991    0.991    0.993    0.991
           (0.049)  (0.045)  (0.038)  (0.051)  (0.053)  (0.056)  (0.056)  (0.055)  (0.054)  (0.053)  (0.054)  (0.053)  (0.054)
GA         0.565    0.519    0.600    0.595    0.602    0.601    0.600    0.594    0.599    0.598    0.598    0.599    0.597
           (0.256)  (0.234)  (0.206)  (0.207)  (0.207)  (0.208)  (0.208)  (0.208)  (0.208)  (0.208)  (0.208)  (0.208)  (0.208)
SLOC       1.067    0.961    1.128    1.121    1.134    1.133    1.128    1.133    1.131    1.132    1.132    1.132    1.132
           (0.243)  (0.221)  (0.198)  (0.200)  (0.200)  (0.199)  (0.200)  (0.200)  (0.200)  (0.200)  (0.200)  (0.200)  (0.200)
Variance
Within     1.439    1.183    0.888    1.329    1.357    1.352    1.350    1.353    1.354    1.349    1.353    1.349    1.355
           (0.037)  (0.032)  (0.031)  (0.037)  (0.047)  (0.043)  (0.041)  (0.041)  (0.042)  (0.042)  (0.042)  (0.042)  (0.041)
Between    1.029    0.852    0.662    0.658    0.656    0.658    0.662    0.662    0.661    0.659    0.661    0.659    0.661
           (0.154)  (0.128)  (0.101)  (0.102)  (0.102)  (0.103)  (0.103)  (0.103)  (0.103)  (0.103)  (0.102)  (0.102)  (0.103)

Note: The true values for ATM, SES, GA, and SLOC, and the within- and between-school variances, were all set equal to 1.


Table 5. Estimates (standard errors in parentheses) obtained from the full MLM using a 40-item test containing 25% of all possible items; averaged over 30 simulations.

Parameter  MLE      WLE      EAP      PV1      PV3      PV5      PV7      PV10     PV20     PV40     PV50     PV60     PV100

Intercept  0.326    0.298    0.176    0.179    0.181    0.178    0.179    0.173    0.176    0.176    0.175    0.173    0.176
           (0.207)  (0.188)  (0.146)  (0.151)  (0.154)  (0.153)  (0.153)  (0.152)  (0.152)  (0.152)  (0.152)  (0.152)  (0.152)
Fixed effects
ATM        0.960    0.866    1.030    1.028    1.024    1.025    1.029    1.031    1.028    1.030    1.030    1.030    1.029
           (0.053)  (0.047)  (0.036)  (0.054)  (0.058)  (0.054)  (0.056)  (0.056)  (0.054)  (0.055)  (0.055)  (0.055)  (0.055)
SES        0.831    0.748    0.984    0.985    0.979    0.982    0.984    0.982    0.983    0.982    0.984    0.984    0.985
           (0.050)  (0.045)  (0.035)  (0.058)  (0.061)  (0.058)  (0.059)  (0.058)  (0.057)  (0.057)  (0.057)  (0.057)  (0.057)
GA         0.500    0.450    0.565    0.564    0.567    0.565    0.561    0.567    0.564    0.565    0.565    0.566    0.564
           (0.233)  (0.210)  (0.169)  (0.172)  (0.173)  (0.174)  (0.173)  (0.173)  (0.173)  (0.173)  (0.173)  (0.173)  (0.173)
SLOC       0.991    0.892    1.154    1.160    1.157    1.163    1.160    1.161    1.162    1.160    1.162    1.161    1.162
           (0.219)  (0.197)  (0.162)  (0.166)  (0.167)  (0.167)  (0.167)  (0.167)  (0.167)  (0.166)  (0.166)  (0.166)  (0.166)
Variance
Within     1.478    1.207    0.749    1.427    1.421    1.420    1.418    1.418    1.420    1.423    1.421    1.421    1.421
           (0.055)  (0.044)  (0.031)  (0.045)  (0.049)  (0.047)  (0.046)  (0.047)  (0.047)  (0.046)  (0.046)  (0.047)  (0.046)
Between    0.840    0.682    0.443    0.438    0.442    0.441    0.442    0.441    0.441    0.442    0.441    0.442    0.442
           (0.123)  (0.100)  (0.068)  (0.070)  (0.072)  (0.072)  (0.071)  (0.071)  (0.071)  (0.071)  (0.071)  (0.071)  (0.071)

Note: The true values for ATM, SES, GA, and SLOC, and the within- and between-school variances, were all set equal to 1.
