Cheung & Rensvold 2002
Roger B. Rensvold
Department of Management
City University of Hong Kong
Social science researchers are increasingly concerned with testing for measure-
ment invariance; that is, determining if items used in survey-type instruments
mean the same things to members of different groups. Measurement invariance is
critically important when comparing groups. If measurement invariance cannot be
established, then the finding of a between-group difference cannot be unambiguously interpreted. (Requests for reprints should be sent to Gordon W. Cheung, Department of Management, The Chinese University of Hong Kong, Shatin, Hong Kong. E-mail: [email protected].) One does not know if it is due to a true attitudinal difference, or
to different psychometric responses to the scale items. This is of particular concern
in cross-cultural research when the cultures speak different languages, and re-
searchers use translated versions of a survey instrument (Janssens, Brett, & Smith,
1995; Reise, Widaman, & Pugh, 1993; Riordan & Vandenberg, 1994; Steenkamp
& Baumgartner, 1998). Other examples include groups having different levels of
academic achievement (Byrne, Shavelson, & Muthén, 1989), working in different
industries (Drasgow & Kanfer, 1985), of different genders (Byrne, 1994), and in
experimental versus control groups (Pentz & Chou, 1994).
Measurement invariance is a general term that can be applied to various compo-
nents of measurement models. Little (1997) identified two types of invariance.
Category 1 invariance has to do with the psychometric properties of the measure-
ment scales, and includes configural invariance (e.g., Buss & Royce, 1975; Irvine,
1969; Suzuki & Rancer, 1994), metric invariance (Horn & McArdle, 1992; also
called weak factorial invariance by Meredith, 1993), measurement error invariance
(Mullen, 1995; Singh, 1995), and scalar invariance (Meredith, 1993; Steenkamp &
Baumgartner, 1998; Vandenberg & Lance, 2000). Category 2 invariance has to do
with between-group differences in latent means, variances, and covariances. Generally speaking, Category 1 invariance is a prerequisite for the interpretation of Category 2 differences, whereas the Category 2 differences themselves are usually of substantive research interest.
Structural equation modeling (SEM) is widely used in the social sciences. The
suitability of a single-group measurement model is usually assessed using an SEM
procedure known as confirmatory factor analysis (CFA). A model is considered
suitable if the covariance structure implied by the model is similar to the
covariance structure of the sample data, as indicated by an acceptable value of a
goodness-of-fit index (GFI).
The most commonly used GFI in SEM is the χ2 statistic, defined as

χ2 = (N – 1)F̂min (1)

where N is the sample size and F̂min is the minimum value of the empirical fit
function, estimated using an iterative procedure under the assumption that the data
have multivariate normal distribution. A nonsignificant value of χ2 indicates fail-
ure to reject the null hypothesis that the hypothesized covariance matrix is identi-
cal to the observed covariance matrix, which is usually accepted as evidence of ad-
equate fit. A problem arises because of the statistic’s functional dependence on N.
For large sample sizes, the χ2 statistic provides a highly sensitive statistical test, but
not a practical test, of model fit. Owing to this and other considerations, many GFIs
have been proposed as alternatives to χ2. Some in common use include the compar-
ative fit index (CFI; Bentler, 1990), Tucker–Lewis Index (TLI; Tucker & Lewis,
1973), Normed Fit Index (NFI; Bentler & Bonett, 1980), and root mean squared
error of approximation (RMSEA; Steiger, 1989). Because most of the
“practical” GFIs do not have known sampling distributions, researchers have pro-
posed many criterion values indicative of satisfactory model fit; examples include
.90 or above for TLI and CFI. It is common practice to use multiple GFIs when
evaluating and reporting overall model fit.
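The practical consequence of this N-dependence is easy to see numerically. The sketch below applies Equation 1 to invented values of F̂min and df (not figures from this article): the same fit-function value that is retained at N = 150 is decisively rejected at N = 1,000.

```python
from scipy.stats import chi2

# Invented values: the same minimum fit function at two sample sizes.
f_min, df = 0.20, 24

for n in (150, 1000):
    chi_sq = (n - 1) * f_min      # Equation 1: chi^2 = (N - 1) * F_min
    p = chi2.sf(chi_sq, df)       # upper-tail p value of the chi^2 test
    print(f"N = {n:5d}: chi^2 = {chi_sq:6.1f}, p = {p:.4f}")
```

With N = 150 the model is retained (p > .05); with N = 1,000 the identical F̂min yields a highly significant χ2, which illustrates why large-sample researchers turn to the practical GFIs.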
An extension of CFA, Multigroup Confirmatory Factor Analysis (MGCFA),
tests the invariance of estimated parameters of two nested models across groups.
The degree of invariance is most frequently assessed by the Likelihood Ratio Test
(differences in χ2 between two nested models), although researchers have demon-
strated that differences in χ2 are also dependent on sample size (Brannick, 1995;
Kelloway, 1995). Reliance on the Likelihood Ratio Test is probably due to the lack
of sampling distributions for GFI differences. In contrast to the CFA test for overall
fit, there are no generally accepted criteria in MGCFA for determining if changes
in the “practical” GFIs are meaningful when measurement invariance constraints
are added. For example, there is no standard against which a researcher can com-
pare changes in CFI when measurement invariance constraints are added, in order
to determine if the constrained model fits the data less well than the less-con-
strained model (Vandenberg & Lance, 2000). Hence, the objective of this article is
to assess the effects of sampling error and model characteristics on MGCFA out-
comes; that is, differences in GFIs (∆GFIs) obtained when an unconstrained model
is compared with one having measurement invariance constraints, under the null
hypothesis of invariance. We propose critical values of ∆GFIs that are independent
from model characteristics, basing our proposals on sampling distributions of
∆GFIs obtained using simulations.
MEASUREMENT INVARIANCE
The series of tests that constitutes invariance testing through MGCFA are covered
at length elsewhere (Bollen, 1989b; Byrne et al., 1989; Cheung & Rensvold, 2000;
Drasgow & Kanfer, 1985; Jöreskog & Sörbom, 1993; Little, 1997; Steenkamp &
Baumgartner, 1998; Vandenberg & Lance, 2000). For the reader’s convenience we
briefly review them. There are eight invariance hypotheses that are frequently ex-
amined, as summarized in Table 1. The sequence of the invariance tests in Table 1
is only one of the many possible sequences that have been proposed, based on the
substantive research questions at hand. The five hypotheses at the top of the table
relate to measurement level invariance, whereas the three in the lower portion re-
late to construct level invariance (Little, 1997; Steenkamp & Baumgartner, 1998;
Vandenberg & Lance, 2000).
Hypothesis Hform postulates configural invariance; that is, participants belong-
ing to different groups conceptualize the constructs in the same way (Riordan &
Vandenberg, 1994). If configural invariance exists, then data collected from each
[TABLE 1: Hypotheses of Measurement Invariance, listing for each model the hypothesis test, hypothesis name, symbolic statement, and conceptual meaning. The test statistic for each hypothesis is the fit difference between the model carrying the relevant constraints and a less constrained baseline; for example, the test statistic for HΛ is the fit difference between Model 2 (with construct-level constraints) and Model 1 (with no constraints), indicated as "2 – 1." Parenthetical superscripts indicate groups; for brevity only the two-group case is shown, but each hypothesis generalizes to K groups.]
group decompose into the same number of factors, with the same items associated
with each factor (Meredith, 1993). Configural invariance may fail when, for example, the concepts are so abstract that participants' perceptions of the constructs depend on their cultural context (Tayeb, 1994), or when participants from
different groups use different conceptual frames of reference and attach different
meanings to constructs (Millsap & Everson, 1991; Millsap & Hartog, 1988;
Riordan & Vandenberg, 1994). Configural invariance may also fail owing to a host
of other reasons, including data collection problems, translation errors, and so
forth.
Hypothesis HΛ postulates metric invariance (i.e., that all factor loading parame-
ters are equal across groups). Samples drawn from two populations may provide
data that indicate conceptual agreement in terms of the type and number of under-
lying constructs, and the items associated with each construct. Despite this, the
strengths of the relations between specific scale items and the underlying con-
structs may differ. The data may indicate disagreement concerning how the con-
structs are manifested. Metric invariance is important as a prerequisite for mean-
ingful cross-group comparison (Bollen, 1989b).
Hypothesis Hλ posits the invariance of factor loadings associated with a particu-
lar item. Because the metric invariance requirement is usually difficult
to satisfy, some researchers (Byrne et al., 1989; Marsh & Hocevar, 1985) propose
relaxing it. They suggest that if the noninvariant items constitute only a small por-
tion of the model, then cross-group comparisons can still be made because the
noninvariant items will not affect the comparisons to any meaningful degree. Be-
fore settling on this course of action, however, one must first identify the
noninvariant items. Hence, if the metric invariance hypothesis HΛ is rejected, a se-
ries of Hλ hypotheses are often tested in an attempt to locate items responsible for
overall noninvariance.
Hypothesis HΛ,Θ(δ) states that residual variance is invariant across groups. Re-
sidual variance is the portion of item variance not attributable to the variance of the
associated latent variable. Therefore, testing for the equality of between-group re-
sidual variance determines if the scale items measure the latent constructs with the
same degree of measurement error. Residual invariance may fail when participants
belonging to one group, compared with those of another, are unfamiliar with a
scale and its scoring formats, and therefore respond to it inconsistently (Mullen,
1995). In addition, differences in vocabulary, idioms, grammar, syntax, and the
common experiences of different cultures may produce residual noninvariance
(Malpass, 1977).
Hypothesis HΛ,ν proposes that in addition to metric invariance (HΛ), the vectors
of item intercepts are also invariant. The item intercepts are the values of each item
corresponding to the zero value of the underlying construct. Support for HΛ,ν indi-
cates the existence of strong factorial invariance (Meredith, 1993), also referred to
as scalar equivalence (Mullen, 1995). Strong factorial invariance is a prerequisite
for the comparison of latent means, because it implies that the measurement scales have the same operational definition across groups (i.e., the same intervals and zero points). In the absence of strong factorial invariance, the comparison of latent means is ambiguous, because the effects of a between-group difference in the latent means are confounded with differences in the scale and origin of the latent variable. Under these circumstances, Byrne et al. (1989) proposed comparing latent means under partial intercept invariance, assuming that the noninvariant items will not affect the latent means comparison to any great extent.
FIT STATISTICS
When testing measurement invariance under MGCFA, a series of models is estimated, and invariance is tested by comparing the fit of each model with that of a model having additional between-group constraints. For example, testing construct-level metric invariance involves comparing the fit of an
unconstrained model, which places no restrictions on model parameters,1 with a
constrained model in which all factor loadings associated with a particular con-
struct are constrained to be equal across groups. If the imposition of additional
constraints results in a significantly lower value of the fit statistic, then the constraint is "wrong." The parameters constrained to be equal across groups should
not be constrained, because they are noninvariant. The fit differences used to test
various invariance hypotheses under MGCFA are shown in Table 1.
1Except for the referent associated with each construct, which is set equal to unity across groups.
Model fit differences are often determined using the likelihood-ratio (LR) test,
also known as the chi-square difference test (Bollen, 1989b). The chi-square dif-
ference (∆χ2) is calculated as

∆χ2 = χ2c – χ2uc (2)

where χ2c and χ2uc are the values for the constrained model and the unconstrained
(or less constrained) model, respectively. Significance is evaluated with ∆df degrees of freedom, where

∆df = dfc – dfuc (3)
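Equations 2 and 3 combine into a complete significance test. A minimal sketch, with invented fit values (the function and its arguments are illustrative, not part of the article):

```python
from scipy.stats import chi2

def lr_test(chi2_c, df_c, chi2_uc, df_uc):
    """Likelihood ratio (chi-square difference) test for nested models.

    chi2_c, df_c: constrained model; chi2_uc, df_uc: unconstrained model.
    Returns (delta_chi2, delta_df, p).
    """
    d_chi2 = chi2_c - chi2_uc          # Equation 2
    d_df = df_c - df_uc                # Equation 3
    return d_chi2, d_df, chi2.sf(d_chi2, d_df)

# Invented values for a metric-invariance comparison:
d_chi2, d_df, p = lr_test(chi2_c=112.4, df_c=58, chi2_uc=98.1, df_uc=52)
```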
The LR test, like the usual chi-square test, is a null-hypothesis significance test
for a difference between the two groups. If there is no difference in fit, that is, if
(F̂min)c = (F̂min)uc, then ∆χ2 = 0. If the sample sizes are large, however, even a
small difference between (F̂min)c and (F̂min)uc may result in a significant value of
∆χ2, indicating that the null hypothesis of no difference should be rejected even
when the difference is trivial (Brannick, 1995; Kelloway, 1995). The question then
becomes one of statistical significance versus practical significance.
When assessing overall model fit using CFA, it is common for researchers to
use other GFIs in preference to the chi-square statistic because the models rarely
seem to fit by that criterion due to its well-known dependence on sample size. On
the other hand, the same researchers tend to adopt the LR test when testing
invariance hypotheses using MGCFA (e.g., Byrne et al., 1989; Reise et al., 1993;
Steenkamp & Baumgartner, 1998). Brannick (1995) and Kelloway (1995) warned
against this double standard, and suggested that researchers should consistently
use a single standard for all GFI tests, whether applied to tests of overall model fit,
or to differences in fit between constrained and unconstrained models.
An alternative to ∆χ2 is ∆GFI, defined as
∆GFI = GFIc – GFIuc (4)
where GFIc and GFIuc are the values of some selected GFI estimated with respect
to the constrained and unconstrained models. Unlike ∆χ2, however, there is
in general no statistically based criterion (controlling for sampling error) for deter-
mining if the invariance hypothesis ought to be rejected based on a particular value
of ∆GFI. Although many simulation studies have examined GFIs as indicators of
overall model fit for single-group data, few have examined
various ∆GFIs as indicators of measurement invariance. A partial exception is Little
(1997), who proposed four criteria for assessing the relative fit indexes of two
nested models: (a) the overall fit should be acceptable, (b) the difference in
Tucker–Lewis Index (TLI) should be less than or equal to 0.05, (c) indexes of local
misfit are uniformly and unsystematically distributed with respect to the con-
strained parameters, and (d) the constrained model is substantively more meaningful and parsimonious than the unconstrained model. However, the 0.05 criterion
has neither strong theoretical nor empirical support, and it is not widely used. In
short, researchers to date have enjoyed considerable latitude in deciding whether a
particular value of ∆GFI indicates noninvariance (e.g., Drasgow & Kanfer, 1985).
We review the properties of 20 different ∆GFIs when invariance constraints are
added to two-group measurement models and the null hypotheses of invariance
are true. The GFIs can be classified into six categories.
SIMULATION
We used a Monte Carlo procedure to generate ∆GFIs for the 20 GFIs described pre-
viously. The sampling distributions of the ∆GFIs were examined under the null hy-
pothesis of invariance. A total of 48 different models were generated by varying
the parameters shown in Table 2 (Model Parameters for Simulation): the number of Factors F (either 2 or 3), the factor variance V (.36 or .81), the correlations between Factors C (.3 or .5), the number of
items per Factor I (3, 4, or 5), and the factor loadings L (two patterns for each value
of I). This range of model parameters, which follows Anderson and Gerbing’s
(1984) simulation, was an attempt to represent a range of models encountered in
practice while keeping the size of the simulation within manageable limits.
Two different sample sizes N per group (150 or 300) were also examined. All
model parameters were estimated with the maximum likelihood (ML) method.
The simulation procedure is shown schematically in Figure 1. It was assumed
that the latent factors would not explain all the variance in the items; therefore, the
reliability of each factor was set to 0.80. Appropriate item residual variances were
calculated using Fornell and Larcker’s (1981) equation. These, together with the
model parameters F, C, V, I, and L, were used to calculate a population covariance
matrix Σ for each model with χ2 equal to zero.
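Assembling Σ from the model parameters is a short matrix product. The sketch below is a simplified stand-in: the loadings, factor correlation, and unit factor variances are invented, and the residuals simply standardize each item rather than applying Fornell and Larcker's (1981) reliability formula.

```python
import numpy as np

# Invented two-factor model with three items per factor.
lam = np.array([[.8, 0], [.7, 0], [.6, 0],     # loadings on factor 1
                [0, .8], [0, .7], [0, .6]])    # loadings on factor 2
phi = np.array([[1.0, 0.3],
                [0.3, 1.0]])                   # factor variances/covariance

common = lam @ phi @ lam.T                     # variance due to the factors
theta = np.diag(1.0 - np.diag(common))         # residual (unique) variances
sigma = common + theta                         # implied covariance matrix
```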
Instead of running the simulation using models with perfect fit, we used models
containing approximation errors (Cudeck & Henly, 1991). A residual Matrix E
consisting of random terms was generated, and the random off-diagonal elements
were added to Σ to produce a population covariance Matrix S. This matrix was
transformed using Cudeck and Browne’s (1992) procedure into a population
covariance matrix having a model χ2 equal to its degrees of freedom at a sample size
of 300. The choice of χ2 = df in the population produced
samples with unconstrained fit statistics representative of actual cases; for exam-
ple, the mean values of CFI, TLI, and RMSEA for samples generated during this
study were .97, .96 and .057, respectively. These were less than the perfect values
(e.g., CFI = 1.0), which are seldom encountered in practice. On the other hand,
they were high enough to justify invariance testing. If a test of overall fit produced
a value of CFI less than .90, for example, it is unlikely that the model would receive
further consideration.
The Cudeck and Browne (1992) procedure warrants a brief digression. For a
hypothesized Model K and a specified value of the fit function Fmin, there exists a
corresponding covariance matrix Σk(Fmin) and a corresponding covariance matrix
of measurement errors, E(Fmin), such that

Σk(Fmin) = Σ̂k + κE(Fmin) (5)

For any value of the scalar multiplier κ, there exists a specific degree of model fit.
If κ = 0, then Σk(Fmin) = Σ̂k; that is, the fit is perfect. If κ is greater than zero, then
the fit is less than perfect. In the context of a particular model, there exists a
one-to-one correspondence between the value of κ and model fit, as represented by
the value of some GFI. Cudeck and Browne (1992) demonstrated that finding κ for
any value of a GFI is straightforward when the generalized least squares function
is used to estimate the model. If the ML function is used, as in this study, then an it-
erative method for finding κ is required.
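A simplified sketch of the κE perturbation (the matrix sizes, scales, and the example Σ are invented; the real procedure searches iteratively for the κ that yields a target χ2):

```python
import numpy as np

rng = np.random.default_rng(0)

def perturb(sigma, kappa):
    """Return sigma + kappa * E, where E is a random symmetric matrix with
    a zero diagonal.  kappa = 0 reproduces sigma exactly (perfect fit);
    larger kappa values yield larger population-level misfit."""
    p = sigma.shape[0]
    e = rng.normal(scale=0.05, size=(p, p))
    e = (e + e.T) / 2.0                 # symmetrize
    np.fill_diagonal(e, 0.0)            # perturb covariances only
    out = sigma + kappa * e
    # The result must still be a valid (positive definite) covariance matrix.
    assert np.all(np.linalg.eigvalsh(out) > 0)
    return out

sigma = np.eye(4) + 0.3 * (np.ones((4, 4)) - np.eye(4))  # invented Sigma
sigma_star = perturb(sigma, kappa=0.5)
```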
Amos 3.6 was used to generate two multivariate-normal samples of size N, rep-
resenting samples from two different groups that were completely invariant except
for the effects of sampling error. These two samples were tested for the eight types
of invariance represented by Hypotheses 1 through 8, and the relevant statistics
were appended to an output file. The portion of the process beginning with the gen-
eration of the two data sets was repeated 1,000 times for each combination of F, C,
V, I, L, and N. The same population covariance matrices were used for models with
sample sizes of 150 and 300.
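The data-generation step (Amos 3.6 in the original) can be sketched with numpy; the population matrix and replication count below are invented, and the MGCFA fitting itself is omitted:

```python
import numpy as np

rng = np.random.default_rng(42)

def simulate_two_groups(sigma, n, reps):
    """Draw pairs of samples from the SAME population covariance matrix,
    so the null hypothesis of invariance holds by construction and any
    between-group difference reflects sampling error alone."""
    mean = np.zeros(sigma.shape[0])
    for _ in range(reps):
        g1 = rng.multivariate_normal(mean, sigma, size=n)
        g2 = rng.multivariate_normal(mean, sigma, size=n)
        # Each pair of sample covariance matrices would be passed to the
        # eight MGCFA invariance tests (omitted here).
        yield np.cov(g1, rowvar=False), np.cov(g2, rowvar=False)

sigma = np.eye(3) + 0.3 * (np.ones((3, 3)) - np.eye(3))  # invented Sigma
draws = list(simulate_two_groups(sigma, n=150, reps=5))
```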
SIMULATION RESULTS
The quality of the simulation was assessed by examining the distribution of the
∆χ2 statistics. As expected, these closely followed the distributions of χ2 having
∆df degrees of freedom (Steiger, Shapiro, & Browne, 1985).
[TABLE 3: Factors Affecting Goodness-of-Fit Indexes (GFIs) in Testing Configural Invariance. The table reports the effects of I (number of items, i.e., manifest variables), F (number of factors, i.e., latent variables), I × F, and N (sample size) on each GFI; interactions involving N, such as I × N, F × N, and I × F × N, also significantly affected NCP and related indexes.]

[TABLE 4: Sampling distributions (M, SD, and 1% or 99% critical values) of the GFIs under the null hypothesis, reported overall and by combination of I and F. Note. CFI = comparative fit index; Mc = McDonald's Noncentrality Index; IFI = Incremental Fit Index; RNI = Relative Noncentrality Index; NCP = noncentrality parameter; RMSEA = root mean squared error of approximation; AIC = Akaike's Information Criterion; BCC = Browne and Cudeck Criterion; ECVI = Expected Cross-Validation Index; NFI = Normed Fit Index; RFI = Relative Fit Index; TLI = Tucker–Lewis Index; PNFI = parsimony-adjusted NFI; PCFI = parsimonious CFI; CAIC = rescaled Akaike's Information Criterion; CK = cross-validation index.]
The mean RMSEA for all models was .0401, with a standard deviation of .011 and a 99th
percentile value of .0655. These values are consistent with Browne and Cudeck's
(1993) .05 criterion for RMSEA.
[TABLE 5: Sampling distributions (M, SD, and critical values) of the ∆GFIs, including ∆IFI and ∆RNI, for invariance Hypotheses H2 through H8. Note. GFI abbreviations are as defined for Table 4. H2 = metric invariance (weak factorial invariance); H3 = partial metric invariance (partial measurement invariance); H4 = metric invariance + invariance of residual variance; H5 = strong factorial invariance (metric invariance + scalar invariance); H6 = metric invariance + invariance of construct variance; H7 = metric invariance + invariance of construct covariance; H8 = strong factorial invariance + invariance of latent means.]
Shown in Table 5 are the critical values for rejecting the null hypothesis of equivalence, with an alpha of .01 and assuming multivariate normal distributions.
Although the formulas for many GFIs (e.g., CFI and TLI) involve terms that adjust
for degrees of freedom, this study shows that the number of items per factor and the
number of factors in the model affect most of the GFIs (except for RMSEA). When
overall fit is examined, models with more items and more factors can be expected
to yield smaller values of these GFIs. This is due to the omission of small, theoreti-
cally insignificant factor loadings and correlated error terms in the model (Hall,
Snell, & Foust, 1999; Hu & Bentler, 1998). In exploratory factor analysis, these
terms are usually ignored; in CFA, and other SEM, they are hypothesized to be
zero. This assumption has a negative impact on overall fit. This should serve as a
warning to researchers who judge model fit in accordance with some generally ac-
cepted criterion (e.g., CFI = .90) while ignoring the effects of model complexity.
RMSEA was not affected by any of the model parameters examined in this study,
but its standard error was affected. Models with fewer items and factors were asso-
ciated with larger standard errors in RMSEA.
Unlike the LR test in which ∆χ2 is always greater than or equal to zero, many
difference indexes (e.g., ∆RFI and ∆TLI) can assume both positive and negative
values. This is because the underlying GFIs are functions of the number of degrees
of freedom in the model. If the null hypothesis of invariance is true, and there is no
sampling error, then decreasing the number of degrees of freedom can produce a
value of ∆GFI greater than zero. If a ∆GFI is less than zero, then its value does not
represent a change from a baseline value of zero, but rather a change from its hypo-
thetical positive value. Hence, a slight reduction in the difference indexes may in-
dicate a substantial change in the minimum value of the fit function.
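A small numeric illustration of this point, using the standard TLI formula and invented fit values: the constrained model fits worse in χ2 terms, yet its TLI is slightly higher because it has more degrees of freedom.

```python
def tli(chi2_m, df_m, chi2_null, df_null):
    """Tucker-Lewis Index (Tucker & Lewis, 1973)."""
    r_null, r_m = chi2_null / df_null, chi2_m / df_m
    return (r_null - r_m) / (r_null - 1.0)

chi2_null, df_null = 1000.0, 45                 # invented null model
tli_uc = tli(50.0, 40, chi2_null, df_null)      # unconstrained model
tli_c = tli(58.0, 48, chi2_null, df_null)       # constrained: worse chi^2, more df

delta_tli = tli_c - tli_uc   # positive, despite the increase in chi^2
```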
Many ∆GFIs are superior to ∆χ2 as tests of invariance because they are not af-
fected by sample size. In many cases, however, ∆GFI is correlated with the GFI of
the overall model. This implies that less accurately specified models produce
larger values of difference statistics when measurement invariance constraints are
added. As shown previously, the only difference statistics not having this undesir-
able characteristic are ∆CFI, ∆Gamma hat, ∆McDonald’s NCI, ∆NCP, ∆IFI,
∆RNI, and ∆critical N.
The aforementioned results show that ∆CFI, ∆Gamma hat, and ∆McDonald’s
NCI are robust statistics for testing the between-group invariance of CFA models.
These results are unexpected because many simulation studies have demonstrated
that GFIs are affected by model complexity when evaluating overall model fit.
Although the standard errors and critical values differ for the different
invariance models, the between-model variations are so small that a general criterion for all hypotheses can be proposed. A value of ∆CFI smaller than or equal to –0.01 indicates that the null hypothesis of invariance should be rejected. For ∆Gamma hat and ∆McDonald's NCI, the corresponding critical values are –.001 and –.02, respectively.
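A sketch of how such a criterion might be applied in practice, using the standard CFI formula and invented fit values (the function names and cutoff handling are ours, not the article's):

```python
def cfi(chi2_m, df_m, chi2_null, df_null):
    """Comparative fit index (Bentler, 1990)."""
    d_m = max(chi2_m - df_m, 0.0)
    d_null = max(chi2_null - df_null, 0.0)
    denom = max(d_m, d_null)
    return 1.0 if denom == 0 else 1.0 - d_m / denom

def passes_delta_cfi(chi2_c, df_c, chi2_uc, df_uc, chi2_null, df_null):
    """Flag the constraints as acceptable when CFI drops by .01 or less."""
    delta = (cfi(chi2_c, df_c, chi2_null, df_null)
             - cfi(chi2_uc, df_uc, chi2_null, df_null))
    return delta >= -0.01, delta

# Invented values: the constrained model loses a little fit, but the
# drop in CFI stays inside the proposed -.01 band.
ok, delta_cfi = passes_delta_cfi(chi2_c=118.0, df_c=58, chi2_uc=98.0,
                                 df_uc=52, chi2_null=1500.0, df_null=66)
```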
REFERENCES
Bentler, P. M. (1990). Comparative fit indexes in structural models. Psychological Bulletin, 107,
238–246.
Bentler, P. M., & Bonett, D. G. (1980). Significance tests and goodness of fit in the analysis of
covariance structures. Psychological Bulletin, 88, 588–606.
Bollen, K. A. (1986). Sample size and Bentler and Bonett’s nonnormed fit index. Psychometrika, 51,
375–377.
Bollen, K. A. (1989a). A new incremental fit index for general structural equation models. Sociologi-
cal Methods and Research, 17, 303–316.
Bollen, K. A. (1989b). Structural equations with latent variables. New York: Wiley.
Boomsma, A. (1982). The robustness of LISREL against small sample sizes in factor analysis models.
In K. G. Jöreskog & H. Wold (Eds.), Systems under indirect observation: Causality, structure, pre-
diction (Part 1, pp. 149–173). Amsterdam: North-Holland.
Brannick, M. T. (1995). Critical comments on applying covariance structure modeling. Journal of Or-
ganizational Behavior, 16, 201–213.
Browne, M. W., & Cudeck, R. (1989). Single sample cross-validation indices for covariance structures.
Multivariate Behavioral Research, 24, 445–455.
Browne, M. W., & Cudeck, R. (1993). Alternative ways of assessing model fit. In K. A. Bollen & J. S.
Long (Eds.), Testing structural equations models (pp. 136–162). Newbury Park, CA: Sage.
Buss, A. R., & Royce, J. R. (1975). Detecting cross-cultural commonalties and differences: Intergroup
factor analysis. Psychological Bulletin, 82, 128–136.
Byrne, B. M. (1994). Testing for the factorial validity, replication, and invariance of a measurement in-
strument: A paradigmatic application based on the Maslach Burnout Inventory. Multivariate Behav-
ioral Research, 29, 289–311.
Byrne, B. M., Shavelson, R. J., & Muthén, B. (1989). Testing for the equivalence of factor covariance
and mean structures: The issue of partial measurement invariance. Psychological Bulletin, 105,
456–466.
Cheung, G. W., & Rensvold, R. B. (2000). Assessing extreme and acquiescence response sets in
cross-cultural research using structural equation modeling. Journal of Cross-Cultural Psychology,
31, 187–212.
Cudeck, R., & Browne, M. W. (1983). Cross-validation of covariance structures. Multivariate Behav-
ioral Research, 18, 147–167.
Cudeck, R., & Browne, M. W. (1992). Constructing a covariance matrix that yields a specified mini-
mizer and a specified minimum discrepancy function value. Psychometrika, 57, 357–369.
Cudeck, R., & Henly, S. J. (1991). Model selection in covariance structures analysis and the “problem”
of sample size: A clarification. Psychological Bulletin, 109, 512–519.
Drasgow, F., & Kanfer, R. (1985). Equivalence of psychological measurement in heterogeneous popu-
lations. Journal of Applied Psychology, 70, 662–680.
Fornell, C., & Larcker, D. F. (1981). Evaluating structural equation models with unobservable variables
and measurement error. Journal of Marketing Research, 18, 39–50.
Gerbing, D. W., & Anderson, J. C. (1993). Monte Carlo evaluations of goodness-of-fit indices for struc-
tural equation models. In K. A. Bollen & J. S. Long (Eds.), Testing structural equation models (pp.
40–65). Newbury Park, CA: Sage.
Hall, R. J., Snell, A. F., & Foust, M. S. (1999). Item parcelling strategies in SEM: Investigating the sub-
tle effects of unmodeled secondary constructs. Organizational Research Methods, 2, 233–256.
Hoelter, J. W. (1983). The analysis of covariance structures: Goodness-of-fit indices. Sociological
Methods and Research, 11, 325–344.
Horn, J. L., & McArdle, J. J. (1992). A practical and theoretical guide to measurement invariance in ag-
ing research. Experimental Aging Research, 18, 117–144.
Hu, L. T., & Bentler, P. M. (1998). Fit indices in covariance structure modeling: Sensitivity to
underparameterized model misspecification. Psychological Methods, 3, 424–453.
Irvine, S. H. (1969). Contributions of ability and attainment testing in Africa to a general theory of in-
tellect. Journal of Biosocial Science, 1, 91–102.
Jackson, P., Wall, T., Martin, R., & Davids, K. (1993). New measures of job control, cognitive demand,
and production responsibility. Journal of Applied Psychology, 78, 753–762.
James, L. R., Mulaik, S. A., & Brett, J. M. (1982). Causal analysis: Assumptions, models and data.
Beverly Hills: Sage.
Janssens, M., Brett, J. M., & Smith, F. J. (1995). Confirmatory cross-cultural research: Testing the via-
bility of a corporation-wide safety policy. Academy of Management Journal, 38, 364–382.
Jöreskog, K., & Sörbom, D. (1993). LISREL 8: User’s reference guide. Chicago: Scientific Software
International.
Kelloway, E. K. (1995). Structural equation modelling in perspective. Journal of Organizational Be-
havior, 16, 215–224.
La Du, T. J., & Tanaka, J. S. (1989). The influence of sample size, estimation method, and model speci-
fication on goodness-of-fit assessment in structural equation models. Journal of Applied Psychology,
74, 625–635.
Little, T. D. (1997). Mean and covariance structures (MACS) analyses of cross-cultural data: Practical
and theoretical issues. Multivariate Behavioral Research, 32, 53–76.
Malpass, R. S. (1977). Theory and method in cross-cultural psychology. American Psychologist, 32,
1069–1079.
Marsh, H. W. (1993). The multidimensional structure of academic self-concept: Invariance over gender
and age. American Educational Research Journal, 30, 841–860.
Marsh, H. W., Balla, J. R., & McDonald, R. P. (1988). Goodness-of-fit indexes in confirmatory factor
analysis: The effect of sample size. Psychological Bulletin, 103, 391–410.
Marsh, H. W., & Hocevar, D. (1985). Application of confirmatory factor analysis to the study of
self-concept: First- and higher order factor models and their invariance across groups. Psychological
Bulletin, 97, 562–582.
McDonald, R. P. (1989). An index of goodness-of-fit based on noncentrality. Journal of Classification,
6, 97–103.
McDonald, R. P., & Marsh, H. W. (1990). Choosing a multivariate model: Noncentrality and goodness
of fit. Psychological Bulletin, 107, 247–255.
Meredith, W. (1993). Measurement invariance, factor analysis and factorial invariance. Psychometrika,
58, 525–543.
Millsap, R. E., & Everson, H. (1991). Confirmatory measurement model comparisons using latent
means. Multivariate Behavioral Research, 26, 479–497.
Millsap, R. E., & Hartog, S. B. (1988). Alpha, beta, and gamma changes in evaluation research: A struc-
tural equation approach. Journal of Applied Psychology, 73, 574–584.
Mullen, M. (1995). Diagnosing measurement equivalence in cross-national research. Journal of Inter-
national Business Studies, 26, 573–596.
Pentz, M. A., & Chou, C. (1994). Measurement invariance in longitudinal clinical research assuming
change from development and intervention. Journal of Consulting and Clinical Psychology, 62,
450–462.
Reise, S. P., Widaman, K. F., & Pugh, R. H. (1993). Confirmatory factor analysis and item response the-
ory: Two approaches for exploring measurement invariance. Psychological Bulletin, 114, 552–566.
Riordan, C. M., & Vandenberg, R. J. (1994). A central question in cross-cultural research: Do employ-
ees of different cultures interpret work-related measures in an equivalent manner? Journal of Man-
agement, 20, 643–671.
Singh, J. (1995). Measurement issues in cross-national research. Journal of International Business
Studies, 26, 597–619.
Steenkamp, J. E. M., & Baumgartner, H. (1998). Assessing measurement invariance in cross-national
consumer research. Journal of Consumer Research, 25, 78–90.