KWW 098
© The Author 2016. Published by Oxford University Press on behalf of the Johns Hopkins Bloomberg School of Public Health. All rights reserved. For permissions, please e-mail: [email protected]
DOI: 10.1093/aje/kww098
Advance Access publication: October 21, 2016
Practice of Epidemiology
Lauren E. Griffith*, Edwin van den Heuvel, Parminder Raina, Isabel Fortier, Nazmul Sohel,
Standardization procedures are commonly used to combine phenotype data that were measured using different instruments, but there is little information on how the choice of standardization method influences pooled estimates and heterogeneity. Heterogeneity is of key importance in meta-analyses of observational studies because it affects the statistical models used and the decision of whether or not it is appropriate to calculate a pooled estimate of effect. Using 2-stage individual participant data analyses, we compared 2 common methods of standardization, T-scores and category-centered scores, to create combinable memory scores using cross-sectional data from 3 Canadian population-based studies (the Canadian Study on Health and Aging (1991–1992), the Canadian Community Health Survey on Healthy Aging (2008–2009), and the Quebec Longitudinal Study on Nutrition and Aging (2004–2005)). A simulation was then conducted to assess the influence of varying the following items across population-based studies: 1) effect size, 2) distribution of confounders, and 3) the relationship between confounders and the outcome. We found that pooled estimates based on the unadjusted category-centered scores tended to be larger than those based on the T-scores, although the differences were negligible when adjusted scores were used, and that most individual participant data meta-analyses identified significant heterogeneity. The results of the simulation suggested that in terms of heterogeneity, the method of standardization played a smaller role than did different effect sizes across populations and differential confounding of the outcome measure across studies. Although there was general consistency between the 2 types of standardization methods, the simulations identified a number of sources of heterogeneity, some of which are not the usual sources considered by researchers.
Abbreviations: CCHS, Canadian Community Health Survey; CSHA, Canadian Study on Health and Aging; IPD, individual
participant data; NuAge, Quebec Longitudinal Study on Nutrition and Aging.
To explore many important scientific questions (e.g., understanding the influence of lifestyle, psychological, social, nutritional, or genetic factors on disease or phenotypic outcomes), researchers need to link both genotype and phenotype data. Currently, investigators for large national and international cohorts, such as those from the Canadian Longitudinal Study on Aging (CLSA) (1), UK Biobank (2), and LifeLines (3) from the Netherlands, are collecting a wide range of information, including biological, social, psychological, lifestyle, and health status data, on hundreds of thousands of participants. Although many of these individual cohorts and data sources are large, multiple data sets are sometimes required when studying rare outcomes or gene-environment interactions or when exploring the influence of geographical and cultural variations in exposure-outcome relationships. To maximize the utility of publicly funded projects and increase the speed of scientific discovery, there has been a worldwide push to combine multiple data sources in order to explore important research questions (4).
The current gold standard for analyzing multiple data sources as part of a systematic review is individual participant data (IPD) meta-analysis, because it provides flexibility with regard to the types of analyses that can be done and thus provides reliable results (5). It also increases the power to explore differential treatment effects in randomized controlled trials and allows for adjustments of confounding factors in meta-analyses of observational studies; however, it is time consuming and costly to conduct (6). Combining IPD is also scientifically and technically very challenging. Ensuring data compatibility and content equivalence through harmonization allows integration of information from different studies/data-
The Rey Auditory Verbal Learning Test (13), a 15-item word-learning test, was used to measure short-term memory in the CSHA and the CCHS. The test is one of the most widely used neuropsychological tests (14) and generally has good test-retest reliability (0.51 ≤ r ≤ 0.86) (15). The Buschke Cued Recall Procedure tests memory under conditions of free recall (hereafter referred to as the Free Buschke test) and cued recall (hereafter referred to as the Total Buschke test). The CSHA used English and French versions of the 12-item Buschke memory test (16), and NuAge used a French version of the 16-item Free and Cued Selective Reminding Test adapted from Grober and Buschke (17). Free recall and cued recall have acceptable sensitivity (62%–100%) and specificity (94%–100%) when comparing
Am J Epidemiol. 2016;184(10):770–778
772 Griffith et al.
memory scores across studies in order to examine whether or not these approaches provided similar results in terms of overall effect estimates and measures of heterogeneity in a 2-stage IPD meta-analysis. T-scores are dependent on the full underlying distribution of cognitive measures in each study and have been used to create norms and compare different cognitive measures on a common scale (26). Category-centered scores use the mean and standard deviation for a common demographically determined group (within studies) that is presumed to be homogeneous with respect to the cognitive measures to standardize or “center” the individual cognitive measures. More details about the standardization methods are provided in Web Appendix 1. We applied the scores to our case study and also separately undertook a simulation study to
educational level. The relationships between the confounders and physical activity level were selected consistently across cohort studies (homogeneous associations). Memory scores were generated with latent variables that were generated per cohort study, indicating the true memory ability of individuals. Conditionally on the latent construct, we applied a binomial distribution to simulate a sum score on memory. The latent variable was affected by age, sex, educational level, and physical activity level. We simulated homogeneous or heterogeneous associations between the latent variable memory and the 3 potential confounders (confounder association with memory = homogeneous or heterogeneous), and we simulated a homogeneous or heterogeneous associa-
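The two standardization approaches described above can be illustrated with a minimal sketch. The exact definitions used in the article are in Web Appendix 1, which is not reproduced here, so the formulas below are assumptions: the T-score is taken to be the conventional rescaling to mean 50 and standard deviation 10 over the full study distribution, and the category-centered score is taken to be a z-score against the mean and standard deviation of one reference demographic category. The article's T-scores appear to include a demographic correction that this sketch omits. The function names, the raw scores, and the reference-category flags are all hypothetical.

```python
from statistics import mean, stdev


def t_scores(values):
    """Rescale raw scores to mean 50, SD 10 using the full study distribution.

    Assumption: the conventional T = 50 + 10 * z, with no demographic correction.
    """
    m, s = mean(values), stdev(values)
    return [50 + 10 * (v - m) / s for v in values]


def category_centered_scores(values, in_reference):
    """Standardize every score against the mean and SD of a reference category.

    Assumption: a plain z-score relative to the reference group within one study.
    """
    ref = [v for v, flag in zip(values, in_reference) if flag]
    m, s = mean(ref), stdev(ref)
    return [(v - m) / s for v in values]


# Hypothetical raw memory scores from one study (not data from the article).
raw = [4, 6, 7, 9, 10, 11, 12, 14]
# Hypothetical flags marking membership in the reference demographic category.
in_reference = [False, False, True, True, True, True, False, False]

t = t_scores(raw)
c = category_centered_scores(raw, in_reference)
```

Under these assumptions the T-scores are anchored to the whole study sample, whereas the category-centered scores are anchored only to the reference category, so the two scales can shift relative to each other when study populations differ in composition.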
Standardization Approaches to Harmonization 773
Table 1. Baseline Demographic and Health-Related Characteristics of Participants With Cognition Dataa, Canadian Community Health
Survey-Canadian Longitudinal Study on Aging (2008–2009), Canadian Study of Health and Aging (1991–1992), and Quebec Longitudinal
Study on Nutrition and Aging (2004–2005)
Study
Characteristic CCHS-CLSA (n = 7,107) CSHA (n = 1,730) NuAge (n = 432)
No. % Mean SD No. % Mean SD No. % Mean SD
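Table 1. Continued

| Characteristic | CCHS-CLSA (n = 7,107): No. | % | Mean | SD | CSHA (n = 1,730): No. | % | Mean | SD | NuAge (n = 432): No. | % | Mean | SD |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Chronic conditions^b | | | | | | | | | | | | |
| High blood pressure | 3,993 | 56.2 | | | 614 | 35.5 | | | 206^c | 47.7 | | |
| Stroke | 283 | 4.0 | | | 226 | 13.5 | | | 0^c | | | |
| Diabetes | 1,258 | 17.7 | | | 228 | 13.2 | | | 40^c | 9.3 | | |
| Myocardial infarction | 876 | 12.4 | | | 263 | 15.2 | | | 57^c | 13.4 | | |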
Adjustment for age, sex, and educational level increased the effect sizes on average by approximately 4%–13%, despite the fact that the category-centered scores and T-scores are essentially corrected for these 3 variables (Table 3). The increase was greater for the category-centered score than for the T-score; however, the 2 scores gave identical results when the effect sizes were adjusted for the covariates age, sex, and educational level.
The effect sizes for all adjusted and unadjusted scores were affected by the different simulation settings (i.e., homogeneous or heterogeneous: 1) effect size of the association between physical activity and memory, 2) population distribution of confounders, and 3) relationship between confounders and memory). Homogeneous and heterogeneous associations of physical activity and memory had the largest influence in terms of change in the overall effect size. The reason is that the pooled estimates were not identical for these 2 settings. However, the association between physical activity and memory for the different settings of age, sex, and educational level should have been identical, because that was consistent across all settings. Whether we pooled homogeneous or heterogeneous populations with respect to the distribution of confounders (age, sex, and educational level) had the least influence on the pooled estimates. Different relationships between the confounders and memory across studies also influenced the pooled results because the heterogeneous setting was 6%–17% larger than the homogeneous setting.
The I² clearly detects when the association between physical activity and memory is different across studies. However, a large I² is also observed when the association between physical activity and memory is consistent across studies. This occurs when the influences of the confounders on memory are different across studies. Populations that are heterogeneous with respect to the distribution of confounders have only a limited influence on the I² compared with homogeneous populations. Power was also most influenced when the influence of confounders on memory was heterogeneous across populations.
In the present study, we explored whether 2 standardization methods, T- and category-centered scores, can influence estimates of effect and heterogeneity when outcomes are measured using different scales or instruments. Researchers conducting meta-analyses often use measures of heterogeneity, which may be defined as the proportion of total variation in measured pooled risk estimates that is due to between-study heterogeneity rather than to chance, as an indication that findings across studies are consistent and thus can be pooled. In the case study, there is a suggestion that important heterogeneity may be masked by one’s choice of standardization procedure. When using a criterion of I² > 50%, all analyses indicated there was important heterogeneity. When the criterion of P_Q < 0.05 was used, however, 6 of the 18 analyses indicated there was not statistically significant heterogeneity; 5 of the 6 analyses involved the T-score. Because the T-scores are standardized to the same mean across studies, it was expected that the T-scores would reduce between-study heterogeneity when compared with the category-centered score, especially in the unadjusted analyses. In fact, in the adjusted analysis, the same results were found regardless of the method of standardization.
In the case study, we also found that the effect estimates of physical activity on memory based on the unadjusted T-score and category-centered score were similar, but the magnitudes of those using the category-centered scores tended to be larger. In the adjusted analysis, these effect estimates based on the category-centered scores and T-scores were nearly identical, and they were closer to the unadjusted T-scores than were the unadjusted category-centered scores. This is supported by the simulation analysis and implies that the method of standardization may be less important if standardized measures are adjusted for a common set of important confounders. If only unadjusted analyses are available, the T-scores may be preferable in terms of bias, because they are already adjusted for important confounders. It was interesting, however, that there was still residual confounding, because the effect estimates based on
Table 2. Summary Hedges’ g Values for the Weighted Mean Difference of Combinations of Memory Tests in People Who Reported No or Low Physical Activity Compared With People Who Reported Moderate or High Levels of Physical Activity^a, Canadian Community Health Survey-Canadian Longitudinal Study on Aging (2008–2009), Canadian Study of Health and Aging (1991–1992), and Quebec Longitudinal Study on Nutrition and Aging (2004–2005)

| Study/Memory Test Given and Type of Outcome | Hedges’ g | 95% CI | I² | Q Statistic for Heterogeneity | P for Heterogeneity |
|---|---|---|---|---|---|
| Unadjusted | | | | | |
| CCHS/RAVLT; CSHA/RAVLT; and NuAge/Free Buschke | | | | | |
| T-score | 0.12 | 0.01, 0.23 | 0.64 | 5.5 | 0.06 |
| Category-centered score | 0.16 | 0.01, 0.30 | 0.78 | 8.96 | 0.01 |
| CCHS/RAVLT; CSHA/Free Buschke; and NuAge/Free Buschke | | | | | |
| T-score | 0.14 | −0.03, 0.31 | 0.85 | 12.92 | 0.002 |
Abbreviations: CCHS, Canadian Community Health Survey; CI, confidence interval; CSHA, Canadian Study of Health and Aging; Free
Buschke, Buschke Cued Recall Procedure under conditions of free recall; HUI, Health Utility Index; NuAge, Quebec Longitudinal Study on
Nutrition and Aging; RAVLT, Rey Auditory Verbal Learning Test; Total Buschke, Buschke Cued Recall Procedure under conditions of cued
recall.
^a Shown are results from separate meta-analyses for combinations of compatible memory tests for each study. CCHS included the RAVLT and HUI; CSHA included the RAVLT, Free Buschke, and Total Buschke; and NuAge included the Free Buschke and Total Buschke.
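The columns of Table 2 are linked by standard formulas, which the following minimal sketch illustrates. It is not the authors' code. It assumes the usual definitions: Hedges' g as a bias-corrected standardized mean difference, Cochran's Q from inverse-variance weights, and I² = (Q − df)/Q truncated at zero, expressed as a proportion as in Table 2. With 3 studies per meta-analysis, Q has 2 degrees of freedom, for which the chi-square upper-tail probability reduces to exp(−Q/2). The function names and the two small groups used to exercise `hedges_g` are hypothetical.

```python
import math
from statistics import mean, variance


def hedges_g(group_a, group_b):
    """Return (g, var_g) for the standardized mean difference a minus b."""
    n_a, n_b = len(group_a), len(group_b)
    df = n_a + n_b - 2
    pooled_sd = math.sqrt(
        ((n_a - 1) * variance(group_a) + (n_b - 1) * variance(group_b)) / df
    )
    d = (mean(group_a) - mean(group_b)) / pooled_sd
    var_d = (n_a + n_b) / (n_a * n_b) + d ** 2 / (2 * (n_a + n_b))
    j = 1 - 3 / (4 * df - 1)  # small-sample correction factor
    return j * d, j ** 2 * var_d


def cochran_q(effects, variances):
    """Return (Q, df) using inverse-variance weights around the pooled effect."""
    weights = [1 / v for v in variances]
    pooled = sum(w * e for w, e in zip(weights, effects)) / sum(weights)
    q = sum(w * (e - pooled) ** 2 for w, e in zip(weights, effects))
    return q, len(effects) - 1


def i_squared(q, df):
    """Proportion of total variation attributed to between-study heterogeneity."""
    return max(0.0, (q - df) / q) if q > 0 else 0.0


def q_p_value_df2(q):
    """Upper-tail chi-square probability; this closed form holds only for df = 2."""
    return math.exp(-q / 2)


# Hypothetical toy groups, used only to exercise hedges_g.
g, var_g = hedges_g([2, 3, 4], [1, 2, 3])

# Q statistics as reported in Table 2 (3 studies each, so df = 2).
for q in (5.5, 8.96, 12.92):
    print(q, round(i_squared(q, 2), 2), round(q_p_value_df2(q), 3))
```

Applied to the Q statistics in Table 2, these formulas reproduce the reported I² values (0.64, 0.78, 0.85) and, after rounding, the reported P values for heterogeneity.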
the T-scores in the simulation still increased by approximately 4%.
In the simulation study, we compared the 2 standardization methods across a number of scenarios to examine the types of heterogeneity that researchers generally explore in a meta-analysis. We found the method of standardization and the population characteristics had only a small influence on heterogeneity. As one would expect, heterogeneity was evident when we varied the effect size of physical activity on memory across populations. Interestingly, substantial heterogeneity was also evident when the relationship between the confounding variables and the outcome differed across the studies, even when the population distributions of the confounders and the effect sizes of physical activity on memory were consistent across cohorts and regardless of whether or not the effect estimates were adjusted. This implies that in terms of sources of heterogeneity, the method of standardization plays a much smaller
Table 3. Summary Hedges’ g for the Weighted Mean Difference for Simulated Memory Tests in People Who Reported No or Low Physical Activity Compared With People Who Reported Moderate or High Levels of Physical Activity^a, Canadian Community Health Survey-Canadian Longitudinal Study on Aging (2008–2009), Canadian Study of Health and Aging (1991–1992), and Quebec Longitudinal Study on Nutrition and Aging (2004–2005)

| Effect of Physical Activity | Population | Confounder Effect on Memory | Type of Outcome | Effect Size | Power | Average I² |
|---|---|---|---|---|---|---|
| Unadjusted | | | | | | |
| Homogeneous | Homogeneous | Homogeneous | T-score | 0.57 | 100 | 16.6 |
| Homogeneous | Homogeneous | Homogeneous | C-score | 0.53 | 100 | 17.1 |
| Homogeneous | Homogeneous | Heterogeneous | T-score | 0.62 | 100 | 87.6 |
| Homogeneous | Homogeneous | Heterogeneous | C-score | 0.57 | 100 | 82.7 |
| Homogeneous | Heterogeneous | Homogeneous | T-score | 0.58 | 100 | 27.7 |
| Homogeneous | Heterogeneous | Homogeneous | C-score | 0.54 | 100 | 30.0 |
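The simulation summarized in Table 3 is described in the text as drawing a latent memory ability that depends on age, sex, educational level, and physical activity, and then applying a binomial distribution to obtain a sum score. The sketch below shows one way such a cohort could be generated. It is an illustration under assumptions rather than the authors' procedure: the logistic link, the covariate distributions, every coefficient value, the 15-item test length, and the function names are hypothetical.

```python
import math
import random


def logistic(x):
    """Map a real-valued latent score to a probability."""
    return 1 / (1 + math.exp(-x))


def simulate_cohort(n, activity_effect, confounder_effects, n_items=15, seed=0):
    """Simulate one cohort of binomial memory sum scores from a latent ability.

    All distributions and coefficients here are illustrative assumptions.
    """
    rng = random.Random(seed)
    cohort = []
    for _ in range(n):
        age_c = rng.uniform(-1, 1)        # centered, rescaled age
        female = int(rng.random() < 0.5)
        educated = int(rng.random() < 0.5)
        # Confounders also influence activity; this relation is kept the same
        # in every cohort (the "homogeneous associations" in the text).
        p_active = logistic(-0.5 * age_c + 0.2 * female + 0.4 * educated)
        active = int(rng.random() < p_active)
        latent = (
            activity_effect * active
            + confounder_effects["age"] * age_c
            + confounder_effects["sex"] * female
            + confounder_effects["education"] * educated
            + rng.gauss(0, 1)
        )
        # Binomial sum score: n_items independent trials with success
        # probability determined by the latent ability.
        p_item = logistic(latent)
        score = sum(rng.random() < p_item for _ in range(n_items))
        cohort.append({"active": active, "score": score})
    return cohort


# Heterogeneous confounder effects on memory across 3 hypothetical cohorts,
# with a homogeneous physical activity effect.
confounder_settings = [
    {"age": -0.2, "sex": 0.1, "education": 0.2},
    {"age": -0.6, "sex": 0.3, "education": 0.5},
    {"age": -1.0, "sex": 0.5, "education": 0.8},
]
cohorts = [
    simulate_cohort(500, 0.5, effects, seed=i)
    for i, effects in enumerate(confounder_settings)
]
```

Setting all three entries of `confounder_settings` to the same values would correspond to the homogeneous confounder setting, and varying `activity_effect` by cohort would correspond to a heterogeneous physical activity effect.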
role than does differential confounding of the outcome measure across studies and that a significant I² can be obtained even when the “standard” sources of heterogeneity are not present across studies.
This also has implications for conducting aggregate data meta-analyses. To fully explore the contribution of these factors to heterogeneity, one requires exploration of study-specific data and IPD meta-analysis. In our analyses, we conducted 2-stage meta-analysis. Although the results are often similar, there are occasions when 1-stage and 2-stage meta-analyses can provide different parameter estimates and different conclusions (37); however, it is not clear whether the use of a 1-stage rather than a 2-stage model affects measures of heterogeneity. We expect that a 1-stage IPD analysis would be able to better address the heterogeneity in each of the effect sizes. Indeed, using random coefficient models
makes it possible to study heterogeneity for each effect size. Furthermore, exploring whether or not the outcome being measured is unidimensional and consistent across studies would also require more complex modeling. For example, latent variable modeling allows for simultaneous use of information on all measures of a construct, testing of the goodness of fit of the proposed model, and testing of whether or not there is consistency of the measures across data sets. When using the other methods of standardization, researchers implicitly assume that all instruments are measuring the same construct, and this assumption is generally not verified.
Data from observational studies are presented in this article; methods to retrospectively harmonize outcome, exposure, and covariate data were used. If one were applying
(Hélène Payette); Research Institute of the McGill University Health Centre, McGill University, Montreal, Quebec, Canada (Christina Wolfson); Department of Epidemiology, Biostatistics and Occupational Health, McGill University, Montreal, Quebec, Canada (Christina Wolfson); and Research Center, Institut Universitaire de Gériatrie de Montréal and Psychology Department, Université de Montréal, Montreal, Quebec, Canada (Sylvie Belleville).
The groundwork for this manuscript is based on the methods research report Harmonization of Cognitive Measures in Individual Participant Data and Aggregate Data Meta-Analysis, funded by the Agency for Healthcare Research and Quality, United States Department of Health
11. Statistics Canada. Canadian Community Health Survey – Healthy aging (CCHS). http://www23.statcan.gc.ca/imdb/p2SV.pl?Function=getSurvey&SDDS=5146. Published March 26, 2008. Updated November 27, 2008. Accessed January 5, 2016.
12. Gaudreau P, Morais JA, Shatenstein B, et al. Nutrition as a determinant of successful aging: description of the Quebec longitudinal study Nuage and results from cross-sectional pilot studies. Rejuvenation Res. 2007;10(3):377–386.
13. Taylor EM. The Appraisal of Children With Cerebral Deficits. Cambridge, MA: Harvard University Press; 1959.
14. Butler M, Retzlaff P, Vanderploeg R. Neuropsychological test usage. Prof Psychol Res Pr. 1991;22(6):510–512.
15. Lezak MD, Howieson DB, Loring DW. Neuropsychological Assessment. 4th ed. New York, NY: Oxford University Press;
across bioclinical studies. Int J Epidemiol. 2010;39(5):1383–1393.
25. Fortier I, Doiron D, Little J, et al. Is rigorous retrospective harmonization possible? Application of the DataSHaPER approach across 53 large studies. Int J Epidemiol. 2011;40(5):1314–1328.
26. Tuokko H, Woodward TS. Development and validation of a demographic correction system for neuropsychological measures used in the Canadian Study of Health and Aging. J Clin Exp Neuropsychol. 1996;18(4):479–616.
27. Riley RD, Simmonds MC, Look MP. Evidence synthesis combining individual patient data and aggregate data: a systematic review identified current practice and possible methods. J Clin Epidemiol. 2007;60(5):431–439.
28. Dion M, Potvin O, Belleville S, et al. Normative data for the