GMM in R - Jan2021
Klaas J. Wardenaar
Department of Psychiatry,
University Medical Center Groningen
Groningen, the Netherlands
Summary
Latent Class Growth Analysis (LCGA) and Growth Mixture Modeling (GMM) are used to explain
between-subject heterogeneity in growth on an outcome by identifying latent classes with different
growth trajectories. Dedicated software packages are available to estimate these models, with Mplus
(Muthén & Muthén, 2013) being widely used¹. Although this and other commercial software packages
are of good quality, very flexible and rich in options, they can be costly and fit poorly into the
analytical workflow of researchers who increasingly depend on the open-source R platform.
Interestingly, although plenty of R packages for conducting mixture analyses are available, there is
little documentation on how to conduct LCGA/GMM in R. Therefore, the current paper aims to
provide applied researchers with a tutorial and coding examples for conducting LCGA and GMM in
R. Furthermore, it evaluates how the modeling approaches of the used R packages (e.g., default
settings, model configuration) and the results obtained with them compare to each other and to Mplus.
Correspondence
K.J. Wardenaar PhD
University Medical Center Groningen
Department of Psychiatry (CC-72)
Hanzeplein 1
9713 GZ Groningen
Email: [email protected]
1 Some major statistical software packages also offer the possibility to estimate LCGA and GMM, including SAS (PROC TRAJ) and Stata/GLLAMM (StataCorp, TX).
1. Introduction
Latent Class Growth Analysis (LCGA) and Growth Mixture Modeling (GMM) are widely used
statistical models in the social, behavioural and medical sciences. They can be used to identify latent
subgroups, classes or clusters of individuals based on their common growth trajectories over time.
This tutorial focuses on illustrating the procedures to carry out the actual analyses and thus assumes
basic knowledge of longitudinal data analysis and linear mixed models (i.e., the distinction between
fixed and random effects). Absolute beginners could start with some introductory texts on growth
curve modeling (e.g., Curran et al., 2010) and LCGA/GMM (Jung & Wickrama, 2008; Ram & Grimm,
2009) before starting the actual analyses.
It is useful to look briefly at what LCGA and GMM are and how they are related. Both techniques
can be seen as extensions of the growth model. Growth models are used to quantify temporal trends or
growth patterns in longitudinal data. Often, such models describe the outcome as a linear function of
time, but differently shaped functions are also possible (e.g., quadratic, logarithmic, cubic). The most
restrictive type of growth model focuses on estimating the average growth, quantified by a fixed
intercept and slope describing the average trend over time in the data. The resulting fixed effects model
has the advantage of being easy to interpret and estimate, but often shows suboptimal fit to the data,
especially in case of substantial between-subject heterogeneity in change over time. It is possible to
address this limitation by using a less restrictive growth model that accounts for between-subject
heterogeneity by including random effects (e.g., a random intercept and, sometimes, a random slope).
This means that, on top of the fixed effects (intercept and slope), normally distributed variances around
the fixed intercept and slope are estimated. As a result, a collection of individual growth trajectories is
estimated instead of a single mean trend, accounting for between-subject heterogeneity in growth and
allowing for explanation of between-subject variance (e.g., by including covariates). Such random
effects models² allow for much more flexibility, explain both within- and between-subject variance in
the outcome, and generally show better statistical fit to real-life data than fixed effects models. Some
disadvantages of random effects models are that they are less parsimonious (more parameters) and
harder to estimate than fixed effects models.
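To make this distinction concrete, the sketch below fits a fixed-effects-only growth model and a growth model with a random intercept and slope, using the hlme() function from the 'lcmm' package that is introduced later in this tutorial. It is a minimal sketch, assuming long-format data with columns y, time and ID as in the tutorial dataset; neither model includes latent classes (ng = 1).

# minimal sketch: fixed-effects-only vs. random-effects growth model (no latent classes)
library(lcmm)
# fixed effects only: one average intercept and slope for all subjects
fixed_only <- hlme(y ~ time, subject = "ID", ng = 1, data = mydata)
# random intercept and slope: individual trajectories vary around the mean trend
random_is <- hlme(y ~ time, random = ~ 1 + time, subject = "ID", ng = 1,
                  data = mydata)
# compare the fit of the two single-class growth models
summarytable(fixed_only, random_is)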
LCGA can roughly be seen as an extension of a fixed effects growth model, whereas GMM can be
seen as an extension of a random effects growth model. Both techniques belong to the latent class or
mixture family, with latent classes being identified based specifically on growth characteristics (i.e.,
growth model parameters). In this context, the terms 'latent class' and 'mixture' can be used
interchangeably; both refer to the concept that the total observed data consist of a finite number of
mixtures or latent ('invisible') classes, each of which is characterized by greater homogeneity in the
parameters of the underlying model³. The mixture approach is used in many different statistical
contexts, ranging from relatively simple models (e.g., Latent Class Analysis (LCA); Lazarsfeld, 1950;
Lazarsfeld & Henry, 1968) to quite complex models (e.g., Mixed Measurement Item Response Theory
(MM-IRT); Mislevy & Verhelst, 1990). However, for all mixture models the optimal number of classes
is determined using roughly the same approach: models with increasing numbers of classes are
estimated and their fit is compared to select the model that best describes the data. This is also the
preferred approach in LCGA/GMM.
This paper focuses on using the R platform (R Core Team, 2010) for estimating LCGA and GMM.
R has several advantages over widely used, commercially available software such as Mplus. R
packages are free and all code is open source, making the software broadly accessible and transparent.
In addition, the R platform and (many of) its packages are continually maintained and improved by a
very active community of researchers and programmers. A downside is that the learning curve for R is
steep and a huge number of packages exist (of varying provenance and quality). The latter makes it
hard to know where to start and which package to use for a given task. In addition, for less experienced
analysts it can be a challenge to find out whether a chosen package actually does what you want it to
do. R packages generally tend to cater to more advanced analysts: relatively standard procedures like
LCGA/GMM can be carried out, but are often not thoroughly documented, and R documentation in
general can be quite minimal or technical. Therefore, this paper aims to provide applied researchers
with the information and coding examples needed to use R to estimate LCGA/GMM models. The use
of two user-friendly R packages that both offer some mixture growth modeling functionality as part of
their options will be demonstrated. In addition, as software packages often differ in terms of the
configuration/operationalization of the estimated model(s) and with regard to default settings, these
factors will be compared between the packages and their effects on the results evaluated. Finally, the
general consistency of the obtained results across the R packages and Mplus will be evaluated. In the
end, this should provide applied researchers with a clear entry point when embarking on LCGA/GMM
in R, as well as coding examples for the typical model configurations.

2 Confusingly, many different terms are used in the literature to refer to such models, including Multilevel Models, Hierarchical Linear Models and Mixed Effects Models.
3 For novice users it may be useful to know from the start that, in statistical jargon, the term mixture model refers to models used to identify latent classes (by parametric methods), whereas the term mixed model refers to models that include both fixed and random effects. Confusingly, a GMM with random effects can in fact be seen as a type of 'mixture mixed model'.
2. Methods
2.1 The dataset
The dataset used here (download here) was simulated based on a linear mixed growth model with a
random intercept and slope and a dichotomous class covariate. The dataset contains data for 100 cases,
each with 5 equally spaced repeated measures on a continuous outcome scale (y1-y5). The data were
simulated with a grouping variable that splits the sample into two classes of 50 subjects with differing
mean growth. The simulation code is provided in Appendix 1. Figure 1 shows the data and makes it
clearly visible that there is considerable heterogeneity in growth, both within and between classes. Note
that these data are model-driven and that real data will often contain more outliers and look messier
overall. Of course, in real data class membership is never known a priori.
With regard to structure, data can have a 'long' format, with multiple rows per subject and the
number of rows per subject corresponding to the number of repeated measurements, or a 'wide'
format, with a single row per subject and a variable (column) for each repeated measurement of the
outcome. The R packages use the long format and Mplus uses the wide format.
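As an illustration, the sketch below converts a wide data frame with one row per subject and columns ID and y1-y5 to the long format used by the R packages; the object name mydata_wide and the time coding 0-4 are assumptions for this example.

# minimal sketch: wide (ID, y1-y5) to long (ID, time, y), assuming 'mydata_wide' exists
mydata_long <- reshape(mydata_wide,
                       varying = paste0("y", 1:5), v.names = "y",
                       timevar = "time", times = 0:4,
                       idvar = "ID", direction = "long")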
Figure 1: Simulated data used in the tutorial; the different colors denote the two simulated subgroups.
2.2 Software
There are several commercially available software packages to conduct LCGA or GMM. Probably the
most widely used are SAS (PROC TRAJ; Jones et al., 2001; mainly for LCGA) and Mplus (Muthén &
Muthén, 2013; for both LCGA and GMM). Mplus in particular offers a broad range of modeling
options, covering most structural equation modeling (SEM) approaches, latent variable models and
various item response theory models. Given that Mplus is so widely used for LCGA and GMM, it will
be used to provide the reference models in the current paper⁴.
For R, there are several packages that allow for growth modeling and/or mixture modeling. Of
these, several allow users to fit LCGA and/or GMM models; two such packages will be illustrated here
in detail. First, the 'lcmm' package offers many advanced options for mixture modeling with
longitudinal data, and provides the options needed to conduct LCGA and GMM through the 'hlme'
function (Proust-Lima et al., 2017; https://cran.r-project.org/web/packages/lcmm/index.html). Second,
the 'flexmix' package (Leisch, 2004; https://cran.r-project.org/web/packages/flexmix/index.html)
provides functionality for flexible mixture regression modeling, including LCGA and GMM models.
The versions used for this tutorial are Mplus 5.0, R v3.6.2, RStudio v1.2.5033, 'lcmm' v1.8.1, and
'flexmix' v2.3-15.
4 It is important to point out that other commercial packages also offer possibilities to estimate LCGA, GMM and more. These include Latent Gold (Vermunt & Magidson, 2016) and Stata (GLLAMM; Rabe-Hesketh et al., 2004).
Table 1: Model fit statistics for latent class growth models (LCGA) estimated with different
software packages using the same simulated dataset.

Software   Number of classes   Free parameters   LL   AIC   BIC   Class sizes
Table 2: Model fit statistics for growth mixture models (GMM) estimated with different software
packages using the same simulated dataset.

Software   Model                      Classes   Free parameters   Log-likelihood   AIC      BIC      Class sizes
lcmm       Random intercept           1         4                 -1042.0          2091.9   2102.4   100
                                      2         8                 -855.3           1726.7   1747.5   50, 50
                                      3         12                -838.7           1701.4   1732.7   35, 45, 20
           Random intercept & slope   1         6                 -857.5           1727.1   1742.7   100
# open package and set the random seed
library(lcmm)
set.seed(2002)
# run models with 1-3 classes, each with 100 random starts,
# using the 1-class model to set initial start values:
lcga1 <- hlme(y ~ time, subject = "ID", ng = 1, data = mydata)
lcga2 <- gridsearch(rep = 100, maxiter = 10, minit = lcga1,
                    hlme(y ~ time, subject = "ID",
                         ng = 2, data = mydata, mixture = ~ time))
lcga3 <- gridsearch(rep = 100, maxiter = 10, minit = lcga1,
                    hlme(y ~ time, subject = "ID",
                         ng = 3, data = mydata, mixture = ~ time))
We start by opening the package and setting the seed of the random number generator, which allows
us to exactly reproduce the results of the procedures with multiple random starts if we rerun the
analyses. The hlme() function starts with a formula statement ('y ~ time') that defines the growth
model. The subject = "ID" statement identifies the unit (subject) within which the repeated measures
are nested. The hlme() function also allows for the specification of random effects (intercept and
slope). In the code above, these are omitted, leading to the estimation of LCGA models with a fixed
intercept and slope per class. Here, hlme() is used as a sub-function of the gridsearch() function.
The latter is used to run each hlme model 100 times (rep = 100) with different start values to avoid
solutions at local maxima; the start values are drawn based on the 1-class model (minit = lcga1),
each run uses at most 10 iterations (maxiter = 10), and the best solution is then used as the starting
point for the final estimation. All of the user-specified settings can of course be adjusted to fit one's aims.
Running the code on the simulated data yields individual model objects (lcga1-lcga3). The
summarytable() function can be used to tabulate all models for comparison on the number of
estimated parameters (npm), fit indices (AIC and BIC) and class sizes.
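For example, a minimal sketch (assuming that the 'which' options listed here are supported by the installed lcmm version):

# tabulate the LCGA models; 'which' selects the columns to display
summarytable(lcga1, lcga2, lcga3,
             which = c("npm", "loglik", "AIC", "BIC", "%class"))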
To inspect a specific model more closely, we can look at its model summary, using
summary(lcga2), which provides the following, more extensive, information:
Goodness-of-fit statistics:
maximum log-likelihood: -875.46
AIC: 1762.91
BIC: 1778.54
coef Se
Residual standard error: 1.21375 0.03844
In addition to the estimation details and goodness-of-fit information, this output also shows the class-
specific intercepts and slopes (i.e., 'time class1' and 'time class2') as well as the residual
standard error, providing insight into the overall unexplained variance (1.21375² ≈ 1.47). Next, the
LCGA can be extended to a growth mixture model by adding a random intercept (GMM-1):
# GMM-1: growth mixture models with a random intercept (1-3 classes)
set.seed(2002)
gmm1 <- hlme(y ~ time, subject = "ID", random = ~ 1, ng = 1, data = mydata)
gmm2 <- gridsearch(rep = 100, maxiter = 10, minit = gmm1,
                   hlme(y ~ time, subject = "ID", random = ~ 1,
                        ng = 2, data = mydata, mixture = ~ time,
                        nwg = T))
gmm3 <- gridsearch(rep = 100, maxiter = 10, minit = gmm1,
                   hlme(y ~ time, subject = "ID", random = ~ 1,
                        ng = 3, data = mydata, mixture = ~ time,
                        nwg = T))
The main difference with the LCGA code lies in the addition of the 'random = ~ 1' part to the model
definition. This indicates that we want to include a random intercept: i.e., a normal distribution of
individual intercepts around the fixed mean level. The 'nwg = T' statement indicates that we want the
variance of the random intercept to differ across classes (lcmm implements this by estimating a class-
specific proportional coefficient, as shown further below). Again, the 1-class model is used for the
initial start values and each model is estimated 100 times using the gridsearch() function. The
GMM models can then be tabulated with summarytable() in the same way as the LCGA models;
the summarized GMM-1 results show the following.
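A minimal sketch of the tabulation (again assuming the 'which' options are supported by the installed lcmm version):

# tabulate the GMM-1 models
summarytable(gmm1, gmm2, gmm3,
             which = c("npm", "loglik", "AIC", "BIC", "%class"))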
Here, we can see that compared to the LCGA models, additional free parameters are estimated per
model (one additional parameter per class). These are the class-specific random intercept variances.
We can also see that the BIC values are considerably lower than for the LCGAs, indicating that the
GMMs describe the data considerably better. Next, a specific model can be inspected more closely, for
instance, by typing summary(gmm2).
Goodness-of-fit statistics:
maximum log-likelihood: -855.34
AIC: 1726.68
BIC: 1747.52
coef Se
Proportional coefficient class1 0.88937 0.22389
Residual standard error: 1.06407 0.03767
The output is very similar to the LCGA output, the main difference being the added estimation of a
class-specific random intercept variance, displayed under 'Variance-covariance matrix of
the random-effects'. In lcmm, the random effect variance is estimated for the highest numbered
class (here: class 2), and the variances for the other classes are derived by multiplying the estimated
(co)variance matrix by a class-specific proportional coefficient, which is provided at the bottom of the
output (i.e., 'Proportional coefficient class1'). In the example output, the estimated
intercept variance is 0.39 for class 2, and the intercept variance for class 1 is obtained as
0.3874 × 0.889 ≈ 0.34. Next, a random slope can be added to the model (GMM-2):
# GMM-2: growth mixture models with a random intercept and slope (1-3 classes)
set.seed(2002)
gmm1_2 <- hlme(y ~ time, subject = "ID", random = ~ 1 + time, ng = 1,
               data = mydata)
gmm2_2 <- gridsearch(rep = 100, maxiter = 10, minit = gmm1_2,
                     hlme(y ~ time, subject = "ID", random = ~ 1 + time,
                          ng = 2, data = mydata, mixture = ~ time, nwg = T))
gmm3_2 <- gridsearch(rep = 100, maxiter = 10, minit = gmm1_2,
                     hlme(y ~ time, subject = "ID", random = ~ 1 + time,
                          ng = 3, data = mydata, mixture = ~ time, nwg = T))
Here, the term ‘random=~1+time’ indicates that we want to estimate a random intercept and slope
(‘random=~time’ would accomplish the same; the intercept is included by default). After running
the model with our simulated data, the summarized results look like this:
We can see that each model has two more freely estimated parameters than the corresponding GMM
with only a random intercept, because a random slope variance and a random intercept-slope
covariance are also estimated in this model configuration. Again, it can be seen that the fit of these
models is better than that of the less complex versions (LCGA and GMM-1). In addition, we can see
that the 2-class model, unsurprisingly given the simulation model used, best describes the data, as
indicated by the lowest AIC and BIC values (also see Table 2). The class sizes remain the same as in
the LCGA and GMM-1 models. The output for this 2-class model can be further inspected
('summary(gmm2_2)') and looks as follows:
Goodness-of-fit statistics:
maximum log-likelihood: -828.77
AIC: 1677.54
BIC: 1703.6
coef Se
Proportional coefficient class1 1.11805 0.22247
Residual standard error: 0.92713 0.03775
# open package:
library(flexmix)
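The LCGA models are fitted with the stepFlexmix() function. A minimal sketch of the call, mirroring the settings described below (the object name lcgaMix is used later on; the exact arguments of the original call may differ):

# LCGA with 1-3 classes and 50 random starts per model;
# varFix = TRUE constrains the residual variance to be equal across classes
set.seed(2002)
lcgaMix <- stepFlexmix(. ~ . | ID, k = 1:3, nrep = 50, data = mydata,
                       model = FLXMRglmfix(y ~ time, varFix = TRUE),
                       control = list(iter.max = 500, minprior = 0))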
The stepFlexmix() function call starts by defining the model formula. Here, a default .~. is entered,
which is updated with the formula provided in the model statement. The added element '|ID'
indicates that the repeated measures are nested within subjects. The k = 1:3 statement indicates that
we want to estimate models with 1 to 3 classes, and 'nrep = 50' indicates that we want to run each
model 50 times (to avoid solutions at local maxima). The model statement is used to specify the
growth model. Here, we make use of the function FLXMRglmfix(), which allows users to specify
generalized linear models for mixture analyses while fixing certain parameters to be equal across
classes. Here, we specify that we want the residual term to be equal across classes with
varFix = TRUE. This leads to estimation of an LCGA with a single residual term that reflects the
overall unexplained variance in the data and does not vary across classes, which is analogous to the
defaults in Mplus and lcmm. If no such constraints are required, FLXMRglm() can be used instead of
FLXMRglmfix(). The final statement, control = list(iter.max = 500, minprior = 0), is
used to (re)set relevant estimation parameters: here, the maximum number of EM iterations and the
minimum class size below which a class would be removed during estimation.
When run, this code yields an object that contains information on all estimated models. Here, we named
this object lcgaMix, which contains the following results:
Here, we can see the fit statistics for the LCGA models with 1-3 classes. Using the getModel()
function, we can then take a closer look at the 2-class model. First, we get the model summary, which
shows the individual class proportions, the log-likelihood value, the degrees of freedom and the BIC.
Interestingly, the AIC values found with flexmix correspond to those found using the other software,
whereas the BIC values are different, suggesting that the latter are calculated differently in flexmix.
Indeed, as with the class size calculations, the package uses the total number of data points
(subjects × repeated measures) as the sample size (n) in the calculation of the BIC
(BIC = ln(n)·k − 2·ln(L̂)), whereas the other covered packages use the number of subjects⁵. Note that,
in line with the BIC calculation, the presented class sizes reflect the number of data points per class
rather than the number of subjects. The number of subjects is easily calculated by dividing the size
column (lcgaMix@models$`2`@size) by the number of repeated measurements (here 5), as shown below.
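For example:

# class sizes in subjects rather than data points (5 measurements per subject)
lcga2@size / 5
# class-specific intercepts, slopes and residual standard deviation
parameters(lcga2)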
Comp.1 Comp.2
coef.(Intercept) 9.9558053 10.498145
coef.time 0.2656535 1.766542
sigma 1.2149648 1.214965
Here, we see the class-specific intercepts and slopes (i.e. coef.time) and the residual term
(sigma), which is the same across classes, as requested in the code. Unsurprisingly, the overall
LCGA results are very similar to those found using the other software packages (see Tables 1 and 3).
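The flexmix LCGA can be extended to a GMM with a class-specific random intercept by using the FLXMRlmm() component driver. A minimal sketch of the call described below (the object names gmmMix and gmm2 are assumptions; the exact arguments of the original call may differ):

# GMM with a class-specific random intercept; the residual variance is
# constrained to be equal across classes
set.seed(2002)
gmmMix <- stepFlexmix(. ~ . | ID, k = 1:3, nrep = 50, data = mydata,
                      model = FLXMRlmm(y ~ time, random = ~ 1,
                                       varFix = c(Random = FALSE, Residual = TRUE)),
                      control = list(iter.max = 500, minprior = 0))
gmm2 <- getModel(gmmMix, which = "2")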
The general structure of the code is similar to that for the LCGA, except for the use of the
FLXMRlmm() function, which fits a linear mixed model within each class and thus allows for the
inclusion of both fixed and random effects. The code for the GMM includes the term 'random = ~ 1'
to indicate that we want to estimate a random intercept⁶. The element varFix = c(Random = FALSE,
Residual = TRUE) is used to indicate whether the random effect variance and the residual variance
should be fixed across classes or estimated for each class separately. Here, the intercept variances are
estimated class-specifically by stating Random = FALSE and the residual variance is kept constant
across classes by stating Residual = TRUE. After running this code, the model output looks like this:
5 For instance, for the 2-class LCGA model, the BIC calculation used in lcmm is ln(100) × 6 − 2(−875.46) = 1778.5, whereas the calculation used in flexmix is ln(500) × 6 − 2(−875.46) = 1788.2.
The AIC and BIC are notably lower than those for the LCGA models, indicating that these models
generally provide a better description of the data. If we select the 2-class model for further inspection,
we get the following information:
We can see that the class sizes are the same as those found in the LCGA. The model parameter
estimates can be inspected next (‘parameters(gmm2)’):
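The random intercept model can in turn be extended with a class-specific random slope by extending the random-effects formula with time. A minimal sketch (the object names gmmMix_2 and gmm2_2 follow those used below; the exact arguments of the original call may differ):

# GMM with a class-specific random intercept and slope; equal residual variance
set.seed(2002)
gmmMix_2 <- stepFlexmix(. ~ . | ID, k = 1:3, nrep = 50, data = mydata,
                        model = FLXMRlmm(y ~ time, random = ~ 1 + time,
                                         varFix = c(Random = FALSE, Residual = TRUE)),
                        control = list(iter.max = 500, minprior = 0))
gmm2_2 <- getModel(gmmMix_2, which = "2")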
After running the code, we can again take a look at the summary of the model-fit indices in the model
output:
iter converged k k0 logLik AIC BIC ICL
1 32 TRUE 1 1 -857.5481 1727.096 1752.384 1752.384
2 611 TRUE 2 2 -828.3619 1680.724 1731.299 1732.411
3 827 TRUE 3 3 -824.5532 1685.106 1760.969 1787.593
The models took many more iterations to converge than the LCGA and GMM-1 models, but all
estimations did eventually reach convergence, as shown in the second column ('converged'). When
comparing the AIC and BIC, we can see that the lowest values are obtained for the 2-class model,
indicating, not surprisingly, that this model best describes the simulated data. Further inspection of
this two-class model (i.e., 'summary(gmm2_2)') shows us:
Here, we can see that both estimated classes are of equal size, similar to the LCGA and GMM-1
results. Finally, we can take a look at the estimated model parameters (i.e.
‘parameters(gmm2_2)’), which look like this:
Comp.1 Comp.2
coef.(Intercept) 10.49933978 9.95371893
coef.time 1.76293406 0.26677424
sigma2.Random1 0.28890857 0.09849563
sigma2.Random2 -0.07222707 -0.04435989
sigma2.Random3 -0.07222707 -0.04435989
sigma2.Random4 0.11462830 0.10657086
sigma2.Residual 0.86561728 0.86561728
The intercepts ('coef.(Intercept)') and slopes ('coef.time') as well as the residual variance
('sigma2.Residual') are easily identified and similar in magnitude to those in the LCGA and
GMM-1 results. The other parameters represent the estimated class-specific elements of the intercept
and slope (co)variance matrix, which has the following structure:

            Intercept        Slope
Intercept   sigma2.Random1   sigma2.Random2
Slope       sigma2.Random3   sigma2.Random4

The first random term ('sigma2.Random1') represents the class-specific intercept variance. The
entries 'sigma2.Random2' and 'sigma2.Random3' represent the off-diagonal elements of the
matrix (i.e., the random intercept-slope covariances); these have the same value because the matrix is
symmetric within each class. Finally, the term 'sigma2.Random4' represents the class-specific slope
variance. The parameters are presented in a more comprehensible way in Table 3.
Table 3: Estimated coefficients for the 2-class LCGA and GMM models estimated in the simulated data.

                                                Software package
Model    Class     Parameter                    lcmm     flexmix   Mplus
LCGA     Class 1   Intercept                    10.5     10.0      10.5
                   Slope                        1.8      0.3       1.8
         Class 2   Intercept                    10.0     10.5      10.0
                   Slope                        0.3      1.8       0.3
GMM-1    Class 1   Intercept                    10.0     10.5      10.0
                   Intercept variance           0.3      0.4       0.3
                   Slope                        0.3      0.3       0.3
         Class 2   Intercept                    10.5     10.0      10.5
                   Intercept variance           0.4      0.3       0.4
                   Slope                        1.8      1.8       1.8
GMM-2    Class 1   Intercept                    10.5     10.5      10.0
                   Intercept variance           0.2      0.3       0.1
                   Slope                        1.8      1.8       0.3
                   Slope variance               0.1      0.1       0.1
                   Slope-intercept covariance   -0.06    -0.07     -0.05
         Class 2   Intercept                    10.0     10.0      10.5
                   Intercept variance           0.2      0.1       0.3
                   Slope                        0.3      0.3       1.8
                   Slope variance               0.1      0.1       0.1
                   Slope-intercept covariance   -0.06    -0.04     -0.08
Looking at the tutorials and the results obtained with the different packages on the same data, we can
see that the same general results are obtained with each of the covered packages and that the R
packages yield results that are similar to those found with Mplus. However, each package takes a
slightly different approach to the default configuration of the estimated models and to the calculation
of the fit indices, which can lead to differences in the results. Some notable differences are listed below:
- The model residuals can be estimated in different ways when doing LCGA or GMM, and this
is reflected in the packages' default settings. In lcmm, a single residual is estimated for the
complete model. The default in flexmix (FLXMRglmfix and FLXMRlmm) is to estimate
residuals for each class. In Mplus, the default is to estimate a single residual for each time
point, but not for each class separately. In each package, the estimation of the residuals can be
adjusted to a user's needs.
- The estimated log-likelihood values are similar across packages, but the calculation of the
fit indices differs slightly. Lcmm and Mplus compute the BIC using the total number of
subjects as n. Flexmix uses the total number of measurements (i.e., subjects × time points) as n,
leading to comparatively higher BIC values. In either case, it is very easy to (re)calculate the
BIC with a different n if this is required (see the sketch after this list).
- The class-specific random effects covariance matrices are estimated differently by lcmm
compared to flexmix and Mplus. Lcmm estimates a single matrix for the highest numbered
class and yields proportional coefficients that can be used to derive the matrices for the other
classes. Flexmix and Mplus estimate the full matrix for each class. As a result, computation
times are generally longer for the latter packages.
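As an illustration of the BIC recalculation mentioned in the second point, a minimal sketch using the 2-class LCGA values reported above:

# BIC = ln(n)*k - 2*logLik, with k = 6 free parameters and logLik = -875.46
ll <- -875.46
k <- 6
log(100) * k - 2 * ll   # n = number of subjects (lcmm, Mplus): 1778.5
log(500) * k - 2 * ll   # n = number of data points (flexmix): 1788.2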
4. Final remarks
The current tutorial presented relatively straightforward LCGA and GMM examples. In many cases,
additional steps or model elements may be required, such as the fitting of non-linear growth functions
or the use of alternative link functions (e.g., for count or binary outcomes). In addition, many real
datasets will contain missing values that need to be dealt with in some way. These topics were outside
the scope of the current tutorial, but are briefly discussed below.
4.2 Covariates
It is possible to add covariates to LCGA and GMM models. Such variables can be added to predict (1)
class-membership probabilities (e.g., a higher baseline severity score makes one more likely to belong
to class x) and/or (2) variation in the intercept or slope within classes (e.g., a higher score on a severity
scale makes a higher intercept more probable). In lcmm, covariates with fixed or random effects in the
growth model can easily be added as part of the model formula arguments in hlme(). In addition, the
classmb argument can be used to add covariates that predict class membership. In flexmix, covariates
can also be included with fixed or random effects, but there is no dedicated argument for adding
covariates of class membership. A sketch of a covariate model in lcmm is given below.
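As a minimal sketch, assuming a baseline covariate named x in the dataset (hypothetical; the trimmed tutorial dataset contains no such named covariate), a 2-class GMM in which x predicts both the class-specific trajectories and class membership could look like this:

# hypothetical example: covariate x in the trajectory model (fixed effect per class)
# and as a predictor of class membership (classmb)
gmm2_cov <- hlme(y ~ time + x, mixture = ~ time + x, random = ~ 1,
                 classmb = ~ x, subject = "ID", ng = 2, data = mydata, nwg = T)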
References
Benaglia T, Chauveau D, Hunter DR, Young D (2009). mixtools: An R package for analyzing finite
mixture models. Journal of Statistical Software, 32(6), 1-29.
Collins LM, Schafer JL, et al. (2001). A comparison of inclusive and restrictive strategies in modern
missing data procedures. Psychological Methods, 6(4), 330-351.
Curran PJ, Obeidat K, Losardo D (2010). Twelve frequently asked questions about growth curve
modeling. Journal of Cognition and Development, 11(2), 121-136.
Jones BL, Nagin DS, Roeder K (2001). A SAS procedure based on mixture models for estimating
developmental trajectories. Sociological Methods & Research, 29, 374-393.
Jung T, Wickrama KAS (2008). An introduction to latent class growth analysis and growth mixture
modeling. Social and Personality Psychology Compass, 2(1), 302-317.
Komárek A, Komárková L (2014). Capabilities of R package mixAK for clustering based on
multivariate continuous and discrete longitudinal data. Journal of Statistical Software, 59(12).
Lazarsfeld PF (1950). The interpretation and computation of some latent structures. In SA Stouffer
et al., Measurement and Prediction. Princeton: Princeton University Press, Ch. 11.
Lazarsfeld PF, Henry P (1968). Latent Structure Analysis. New York: Houghton Mifflin.
Leisch F (2004). FlexMix: A general framework for finite mixture models and latent class regression
in R. Journal of Statistical Software, 11(8), 1-18.
Mislevy RJ, Verhelst N (1990). Modeling item responses when different subjects employ different
solution strategies. Psychometrika, 55, 195-215.
Muthén LK, Muthén BO (2013). Mplus: Statistical Analysis with Latent Variables. User's Guide.
Los Angeles, CA: Muthén & Muthén.
Neale MC, Hunter MD, Pritikin JN, Zahery M, Brick TR, Kirkpatrick RM, Estabrook R, Bates TC,
Maes HH, Boker SM (2016). OpenMx 2.0: Extended structural equation and statistical
modeling. Psychometrika, 81(2), 535-549.
Proust-Lima C, Philipps V, Liquet B (2017). Estimation of extended mixed models using latent classes
and latent processes: the R package lcmm. Journal of Statistical Software, 78, 1-56.
Rabe-Hesketh S, Skrondal A, Pickles A (2004). GLLAMM Manual. U.C. Berkeley Division of
Biostatistics Working Paper Series, Working Paper 160.
Ram N, Grimm KJ (2009). Methods and measures: Growth mixture modeling. A method for
identifying differences in longitudinal change among unobserved groups. International Journal
of Behavioral Development, 33, 565-576.
Schafer JL, Graham J (2002). Missing data: Our view of the state of the art. Psychological Methods,
7, 147-177.
van Buuren S, Groothuis-Oudshoorn K (2011). mice: Multivariate Imputation by Chained Equations
in R. Journal of Statistical Software, 45(3), 1-67.
Vermunt JK, Magidson J (2016). Technical Guide for Latent GOLD 5.1: Basic, Advanced, and Syntax.
Belmont, CA: Statistical Innovations.
Appendix 1: Simulation code

# (tail of the ml_data() function used to simulate the tutorial data)
colnames(Rand.eff) = c("intercept", "slope_time")
return(dat) }
# simulate the tutorial data
set.seed(2002)
d1 <- ml_data(n_pers=100,
n_time=5,
beta_int=0,
beta_slo_time=0.3,
beta_slo_covar=0.5,
beta_slo_interact=1.5,
mean_i=10,
var_i=0.13,
mean_s=0,
var_s=0.09,
cov_is=0,
mean_r=0,
var_r=1)
# trim the number of variables
mydata <- d1[,c(1,2,3,8)]
LCGA
Data: file="mydata_wide.dat";
Variable:
! provide column names and indicate which
! variables to include in analyses
names are rownr ID c y1 y2 y3 y4 y5;
usevariables are y1 y2 y3 y4 y5;
%c#2%
[i-s];
GMM-1
Model:
%overall%
i s | y1@0 y2@1 y3@2 y4@3 y5@4;
s@0;
y1-y5 (1);
%c#1%
[i-s];
i;
%c#2%
[i-s];
i;
GMM-2
Model:
%overall%
i s | y1@0 y2@1 y3@2 y4@3 y5@4;
y1-y5 (1);
%c#1%
[i-s];
i-s;
i with s;
%c#2%
[i-s];
i-s;
i with s;