Abstract
Chronic obstructive pulmonary disease (COPD), the third leading cause of death worldwide, is highly heritable. While COPD is clinically defined by applying thresholds to summary measures of lung function, a quantitative liability score has more power to identify genetic signals. Here we train a deep convolutional neural network on noisy self-reported and International Classification of Diseases labels to predict COPD case–control status from high-dimensional raw spirograms and use the model’s predictions as a liability score. The machine-learning-based (ML-based) liability score accurately discriminates COPD cases and controls, and predicts COPD-related hospitalization without any domain-specific knowledge. Moreover, the ML-based liability score is associated with overall survival and exacerbation events. A genome-wide association study on the ML-based liability score replicates existing COPD and lung function loci and also identifies 67 new loci. Lastly, our method provides a general framework to use ML methods and medical-record-based labels that does not require domain knowledge or expert curation to improve disease prediction and genomic discovery for drug design.
Data availability
Genotypes and phenotypes are available for approved projects through the UKB study (https://www.ukbiobank.ac.uk). The full ML-based COPD GWAS summary statistics are currently available on our GitHub repository page (https://github.com/Google-Health/genomics-research/tree/main/ml-based-copd) and in the GWAS catalog (accession number GCST90244098). The raw ML-based COPD liability scores will be returned to UKB. This research has been conducted under Application Number 65275. We used the GWAS Catalog (https://www.ebi.ac.uk/gwas/) for replication analysis. This research used data generated by the COPDGene study (dbGaP accession phs000179.v6.p2), which was supported by NIH grants U01 HL089856 and U01 HL089897. The COPDGene project is also supported by the COPD Foundation through contributions made by an Industry Advisory Board comprised of Pfizer, AstraZeneca, Boehringer-Ingelheim, Novartis and Sunovion. ICGC genome-wide association summary statistics were obtained from dbGaP under accession phs000179.v5.p2. SpiroMeta summary statistics were obtained from LDHub (https://ldsc.broadinstitute.org/ldhub).
Code availability
Code and detailed instructions for model training, prediction and analysis, as well as instructions for evaluating the trained model on spirograms, are available at https://github.com/Google-Health/genomics-research/tree/main/ml-based-copd (ref. 47). We used the following tools: Baseline and BaselineLD annotations (https://data.broadinstitute.org/alkesgroup/ldscore), BOLT-LMM v2.3.4 (https://data.broadinstitute.org/alkesgroup/bolt-lmm), DeepNull v0.2.2 (https://github.com/Google-Health/genomics-research/tree/main/nonlinear-covariate-gwas), GWAS Catalog (https://www.ebi.ac.uk/gwas/), GARFIELD v2 (https://www.ebi.ac.uk/birney-srv/GARFIELD/), GREAT v4.0.4 (http://great.stanford.edu), S-LDSC v1.0.1 (https://data.broadinstitute.org/alkesgroup/ldscore), PLINK v1.9 (https://www.cog-genomics.org/plink1.9), scikit-learn v1.0.2 (https://scikit-learn.org/stable/), TensorFlow v2.9.0 (https://www.tensorflow.org), UCSC LiftOver (https://genome.ucsc.edu/cgi-bin/hgLiftOver), LDHub (https://ldsc.broadinstitute.org/ldhub) and Vizier v0.1.1 (https://github.com/google/vizier).
References
MacNee, W. ABC of chronic obstructive pulmonary disease: pathology, pathogenesis, and pathophysiology. BMJ 332, 1202–1204 (2006).
Ingebrigtsen, T. Genetic influences on chronic obstructive pulmonary disease—a twin study. Respir. Med. 104, 1890–1895 (2010).
Zhou, J. J. et al. Heritability of chronic obstructive pulmonary disease and related phenotypes in smokers. Am. J. Respir. Crit. Care Med. 188, 941–947 (2013).
Vestbo, J. et al. Global strategy for the diagnosis, management, and prevention of chronic obstructive pulmonary disease. Am. J. Respir. Crit. Care Med. 187, 347–365 (2013).
Graham, B. L. et al. Standardization of spirometry 2019 update. An official American Thoracic Society and European Respiratory Society technical statement. Am. J. Respir. Crit. Care Med. 200, e70–e88 (2019).
Mannino, D. M. & Buist, A. S. Global burden of COPD: risk factors, prevalence, and future trends. Lancet 370, 765–773 (2007).
Hobbs, B. D. et al. Genetic loci associated with chronic obstructive pulmonary disease overlap with loci for lung function and pulmonary fibrosis. Nat. Genet. 49, 426–432 (2017).
Sakornsakolpat, P. et al. Genetic landscape of chronic obstructive pulmonary disease identifies heterogeneous cell-type and phenotype associations. Nat. Genet. 51, 494–505 (2019).
Wain, L. V. et al. Novel insights into the genetics of smoking behaviour, lung function, and chronic obstructive pulmonary disease (UK BiLEVE): a genetic association study in UK Biobank. Lancet Respir. Med. 3, 769–781 (2015).
Shrine, N. et al. New genetic signals for lung function highlight pathways and chronic obstructive pulmonary disease associations across multiple ancestries. Nat. Genet. 51, 481–493 (2019).
Regan, E. A. et al. Clinical and radiologic disease in smokers with normal spirometry. JAMA Intern. Med. 175, 1539–1549 (2015).
Woodruff, P. G. et al. Clinical significance of symptoms in smokers with preserved pulmonary function. N. Engl. J. Med. 374, 1811–1821 (2016).
Anzueto, A. et al. COPDGene® 2019: redefining the diagnosis of chronic obstructive pulmonary disease. Chronic Obstr. Pulm. Dis. 6, 384–399 (2019).
Han, M. K. et al. From GOLD 0 to pre-COPD. Am. J. Respir. Crit. Care Med. 203, 414–423 (2021).
Silverman, E. K. Genetics of COPD. Annu. Rev. Physiol. 82, 413–431 (2020).
Alipanahi, B. et al. Large-scale machine-learning-based phenotyping significantly improves genomic discovery for optic nerve head morphology. Am. J. Hum. Genet. 108, 1217–1230 (2021).
Han, X. et al. Automated AI labeling of optic nerve head enables insights into cross-ancestry glaucoma risk and genetic discovery in >280,000 images from UKB and CLSA. Am. J. Hum. Genet. 108, 1204–1216 (2021).
LeCun, Y. et al. Backpropagation applied to handwritten zip code recognition. Neural Comput. 1, 541–551 (1989).
He, K., Zhang, X., Ren, S. & Sun, J. Deep residual learning for image recognition. In Proc. IEEE Conference on Computer Vision and Pattern Recognition 770–778 (IEEE, 2016).
He, T. et al. Bag of tricks for image classification with convolutional neural networks. In Proc. IEEE/CVF Conference on Computer Vision and Pattern Recognition 558–567 (IEEE, 2019).
Aung, N. et al. Genome-wide association analysis reveals insights into the genetic architecture of right ventricular structure and function. Nat. Genet. 54, 783–791 (2022).
Joo, J., Hobbs, B., Cho, M. & Himes, B. Trait insights gained by comparing genome-wide association study results using different chronic obstructive pulmonary disease definitions. AMIA Jt. Summits Transl. Sci. Proc. 30, 278–287 (2020).
Loh, P.-R. et al. Efficient Bayesian mixed-model analysis increases association power in large cohorts. Nat. Genet. 47, 284–290 (2015).
McCaw, Z. R., Lane, J. M., Saxena, R., Redline, S. & Lin, X. Operating characteristics of the rank-based inverse normal transformation for quantitative trait analysis in genome-wide association studies. Biometrics 76, 1262–1272 (2020).
Bulik-Sullivan, B. K. et al. LD score regression distinguishes confounding from polygenicity in genome-wide association studies. Nat. Genet. 47, 291–295 (2015).
Regan, E. A. et al. Genetic epidemiology of COPD (COPDGene) study design. COPD 7, 32–43 (2011).
Artigas, M. S. et al. Sixteen new lung function signals identified through 1000 genomes project reference panel imputation. Nat. Commun. 6, 8658 (2015).
Zhou, W. et al. Global Biobank Meta-analysis Initiative: powering genetic discovery across human disease. Cell Genom. 2, 100192 (2022).
McCaw, Z. R. et al. DeepNull models non-linear covariate effects to improve phenotypic prediction and association power. Nat. Commun. 13, 241 (2022).
Rabe, K. F. et al. Safety and efficacy of itepekimab in patients with moderate-to-severe COPD: a genetic association study and randomised, double-blind, phase 2a trial. Lancet Respir. Med. 9, 1288–1298 (2021).
Kichaev, G. et al. Leveraging polygenic functional enrichment to improve GWAS power. Am. J. Hum. Genet. 104, 65–75 (2019).
Amirav, I. et al. Systematic analysis of CCNO variants in a defined population: implications for clinical phenotype and differential diagnosis. Hum. Mutat. 37, 396–405 (2016).
Wallmeier, J. et al. Mutations in CCNO result in congenital mucociliary clearance disorder with reduced generation of multiple motile cilia. Nat. Genet. 46, 646–651 (2014).
Tilley, A. E., Walters, M. S., Shaykhiev, R. & Crystal, R. G. Cilia dysfunction in lung disease. Annu. Rev. Physiol. 77, 379–406 (2015).
Qiao, D. et al. Whole exome sequencing analysis in severe chronic obstructive pulmonary disease. Hum. Mol. Genet. 27, 3801–3812 (2018).
Wootton, R. E. et al. Evidence for causal effects of lifetime smoking on risk for depression and schizophrenia: a Mendelian randomisation study. Psychol. Med. 50, 2435–2443 (2019).
Lehmann, M., Baarsma, H. A. & Königshoff, M. WNT signaling in lung aging and disease. Ann. Am. Thorac. Soc. 13, S411–S416 (2016).
Morrow, J. D. et al. Functional interactors of three genome-wide association study genes are differentially expressed in severe chronic obstructive pulmonary disease lung tissue. Sci. Rep. 7, 44232 (2017).
Conlon, T. M. et al. Inhibition of LTβR signalling activates WNT-induced regeneration in lung. Nature 588, 151–156 (2020).
Shrine, N. et al. Multi-ancestry genome-wide association study improves resolution of genes, pathways and pleiotropy for lung function and chronic obstructive pulmonary disease. Nat. Genet. 55, 410–422 (2022).
Cloonan, S. M. et al. Mitochondrial iron chelation ameliorates cigarette smoke–induced bronchitis and emphysema in mice. Nat. Med. 22, 163–174 (2016).
Routhier, J. et al. An innate contribution of human nicotinic receptor polymorphisms to COPD-like lesions. Nat. Commun. 12, 6384 (2021).
Golovin, D. et al. Google vizier. In Proc. 23rd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining 1487–1495 (ACM, 2017).
Frazier, P. I. A tutorial on Bayesian optimization. Preprint at https://arxiv.org/abs/1807.02811 (2018).
Lakshminarayanan, B., Pritzel, A. & Blundell, C. Simple and scalable predictive uncertainty estimation using deep ensembles. Adv. Neural Inf. Process Syst. 30, 6405–6416 (2017).
Mbatchou, J. et al. Computationally efficient whole-genome regression for quantitative and binary traits. Nat. Genet. 53, 1097–1103 (2021).
Cosentino, J. et al. Google-Health/genomics-research: ML-based COPD v0.2.0. Zenodo https://doi.org/10.5281/zenodo.7718510 (2023).
Acknowledgements
We thank T. Yun (Google Health AI) for helpful discussions and H. Yang (Google Health AI) for project management assistance. B.D.H. is supported by NIH K08 HL136928, U01 HL089856, R01 HL155749 and a Research Grant from the Alpha-1 Foundation. M.H.C. is supported by R01HL153248, R01HL149861, R01HL147148 and R01HL089856. D.H. was supported by NIH 2T32HL007427-41. This study was funded by Google. The funder had no role in study design, data collection and analysis, decision to publish or preparation of the manuscript.
Author information
Contributions
J.C., B.B., B.A. and F.H. conceived the study. J.C., B.B., B.A., Z.R.M., B.D.H., M.H.C., C.Y.M. and F.H. designed the study. J.C., B.B., B.A., Z.R.M., D.H., T-H.S-A., D.L., C.Y.M. and F.H. performed experiments. J.C., B.B., B.A., Z.R.M., D.H., T-H.S-A., D.L., A.C., B.D.H., M.H.C., C.Y.M. and F.H. analyzed results. J.C., B.B., Z.R.M., B.D.H., M.H.C., C.Y.M. and F.H. wrote the paper. All authors reviewed and contributed to the final version of the paper.
Ethics declarations
Competing interests
J.C., B.B., B.A., Z.R.M., A.W.C., C.Y.M. and F.H. are current or former employees of Google and own Alphabet stock. This study was funded by Google. B.D.H. receives grant support from Bayer. M.H.C. has received grant support from GSK and Bayer, and consulting or speaking fees from Genentech, AstraZeneca and Illumina. The other authors declare no competing interests.
Peer review
Peer review information
Nature Genetics thanks the anonymous reviewers for their contribution to the peer review of this work.
Additional information
Publisher’s note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Extended data
Extended Data Fig. 1 An overview of the optimized one-dimensional ResNet18-D model architecture.
Sublabels denote layer parameters, with convolutional, pooling, residual and downsample layers adhering to the following convention: kernel size, number of filters and stride length. Supplementary Fig. 1 details the residual and downsample layers. Subfigures a–f comprise the ResNet18-D model’s ‘backbone’, which produces 128-dimensional embeddings from flow-volume spirograms. a) The input stem. We zero-pad the flow-volume curve to size 1024. All 1D convolutions in the input stem are followed by a batch normalization layer and a ReLU activation function. b–e) Residual and downsampling layers (Supplementary Fig. 1). f) The backbone’s final layer, followed by a ReLU activation function, produces a 128-dimensional embedding that is passed to each outcome head. g) The model’s disease head subarchitecture. We use swish activation functions in the first three layers and a sigmoid activation function in the final layer.
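As a rough illustration of two details in this caption, the zero-padding of the input curve and the activation pattern of the disease head, the following standard-library Python sketch may help. It is not the authors' TensorFlow implementation: right-padding (rather than symmetric padding), the helper names and the toy layer widths are all assumptions made for illustration.

```python
import math

def zero_pad(curve, target_len=1024):
    """Pad a 1D flow-volume curve with trailing zeros up to target_len.

    Assumption: zeros are appended on the right; the caption only states
    that the curve is zero-padded to size 1024.
    """
    if len(curve) > target_len:
        raise ValueError("curve is longer than the target length")
    return list(curve) + [0.0] * (target_len - len(curve))

def sigmoid(x):
    """Numerically stable logistic function."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    e = math.exp(x)
    return e / (1.0 + e)

def swish(x):
    """Swish activation: x * sigmoid(x)."""
    return x * sigmoid(x)

def disease_head(embedding, layers):
    """Apply dense layers to an embedding (hypothetical helper).

    `layers` is a list of (weights, biases) pairs, where `weights` holds one
    row per output unit. Swish is applied after every layer except the last,
    which uses a sigmoid, mirroring the activation pattern in the caption.
    """
    hidden = list(embedding)
    for index, (weights, biases) in enumerate(layers):
        pre_activation = [
            sum(w * x for w, x in zip(row, hidden)) + bias
            for row, bias in zip(weights, biases)
        ]
        activation = sigmoid if index == len(layers) - 1 else swish
        hidden = [activation(value) for value in pre_activation]
    return hidden
```

With four `(weights, biases)` pairs, the first three layers use swish and the last uses a sigmoid, so the output is a value between 0 and 1 that can be read as a liability score.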
Extended Data Fig. 2 An overview of UK Biobank dataset used in this study.
Our initial dataset consists of all European-ancestry individuals in UK Biobank (n = 435,766). We used all individuals with valid spirograms as the modeling dataset (n = 325,027). Individuals with invalid spirograms, who were used in neither the ML modeling nor the GWASs, formed the PRS holdout set (n = 110,739). We split the modeling dataset into training and validation sets containing 80% and 20% of samples, respectively. The modeling dataset was used to select model architectures, tune hyperparameters and evaluate ML model performance across tasks, whereas a two-fold cross-fold dataset was used during the final model application process to generate phenotypes. The combined sample size of train fold 1 and train fold 2 does not equal the size of the whole training dataset because we removed genetically close samples that fall across the two folds. Folds were constructed to keep genetically related individuals together, preventing the same individual or a close relative from being used for both training and prediction.
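The idea of keeping related individuals in the same fold can be sketched as follows. This is a minimal illustration, not the procedure used in the study: the `sample_to_family` mapping, the hash-based assignment and the helper names are assumptions, and the caption additionally describes removing genetically close samples that straddle folds, which this sketch avoids by construction instead.

```python
import hashlib

def assign_folds(sample_to_family, n_folds=2):
    """Assign each sample to a fold so that family members share a fold.

    `sample_to_family` maps a sample ID to a (hypothetical) family or
    relatedness-cluster ID. Hashing the family ID makes the assignment
    deterministic and independent of sample order.
    """
    folds = {}
    for sample, family in sample_to_family.items():
        digest = hashlib.sha256(str(family).encode("utf-8")).hexdigest()
        folds[sample] = int(digest, 16) % n_folds
    return folds

def cross_fold_plan(folds, n_folds=2):
    """For each fold, train on the other folds and predict on this one."""
    plan = {}
    for k in range(n_folds):
        predict = sorted(s for s, f in folds.items() if f == k)
        train = sorted(s for s, f in folds.items() if f != k)
        plan[k] = {"train": train, "predict": predict}
    return plan

# The sample sizes quoted in the caption are internally consistent:
# modeling dataset + PRS holdout set = all European-ancestry individuals.
assert 325_027 + 110_739 == 435_766
```

Because every prediction for a sample comes from a model that never saw that sample or its relatives, the resulting phenotypes are out-of-fold predictions for the whole modeling dataset.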
Extended Data Fig. 3 GWS loci replication pipeline.
A GWAS on the ML-based liability score identifies 265 novel COPD risk loci in addition to 91 previously known COPD loci, where known loci are defined with respect to ref. 8 and GWAS Catalog entries (as of 2022-07-09) for COPD, emphysema and chronic bronchitis. Of the 265 additional COPD loci, 221 independently replicate as associated with COPD or COPD-related lung function, as follows. We observed that 101 of the 265 were detected in a previous COPD GWAS (ref. 8) after Bonferroni correction. Also, 198 of the 265 are previously known FEV1 or FEV1/FVC loci with respect to ref. 10 and GWAS Catalog entries. The three replication datasets are GBMI (Global Biobank Meta-analysis Initiative; ref. 28), SpiroMeta (ref. 27) and ICGC (International COPD Genetics Consortium; ref. 7), all of which exclude samples from UK Biobank. We defined two replication strategies. First, we defined supportive replication as a consistent effect size direction across all studies relative to our ML-based COPD. The ICGC and GBMI GWASs are based on a COPD phenotype; thus, we expect their effect size signs to match those of our ML-based COPD. SpiroMeta phenotypes, on the other hand, capture lung function, so we expect their effect size signs to be the opposite of our ML-based COPD signs. Second, we defined strict replication as a consistent effect size direction in at least one study together with a Bonferroni-corrected one-sided P < 0.1 in that study.
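The two replication strategies described in this caption can be sketched as simple predicates. This is an illustrative reading, not the authors' pipeline: the dictionary layout, the function names and the interpretation of the Bonferroni correction as `alpha / n_tests` (with `n_tests` supplied by the caller) are assumptions.

```python
def expected_sign(ml_beta, study_kind):
    """Expected effect sign in a replication study.

    COPD-phenotype studies (e.g. ICGC, GBMI) should match the ML-based COPD
    sign; lung-function studies (e.g. SpiroMeta) should have the opposite sign.
    A zero effect is treated as negative, which is an arbitrary convention.
    """
    sign = 1 if ml_beta > 0 else -1
    return sign if study_kind == "copd" else -sign

def is_consistent(ml_beta, study):
    """True if the study's effect direction matches the expected direction."""
    return (study["beta"] > 0) == (expected_sign(ml_beta, study["kind"]) > 0)

def supportive_replication(ml_beta, studies):
    """Consistent effect direction in every replication study."""
    return all(is_consistent(ml_beta, study) for study in studies)

def strict_replication(ml_beta, studies, alpha=0.1, n_tests=1):
    """Consistent direction in at least one study with a one-sided P value
    below the Bonferroni threshold alpha / n_tests (assumed form)."""
    threshold = alpha / n_tests
    return any(
        is_consistent(ml_beta, study) and study["p_one_sided"] < threshold
        for study in studies
    )
```

For example, a locus with a positive ML-based COPD effect is supportively replicated only if both COPD studies report positive effects and the lung-function study reports a negative one.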
Extended Data Fig. 4 ML-based COPD GWAS Manhattan plot via DeepNull.
We performed an ML-based COPD GWAS using the same set of covariates as in Fig. 4 plus one additional covariate provided by DeepNull. The DeepNull model predicts ML-based COPD using age, sex, genotype array and FEV1/FVC as inputs, and its prediction serves as the additional covariate. Because DeepNull learns a function (linear or non-linear) of these inputs, this analysis is similar to the ML-based COPD GWAS conditional on FEV1/FVC, except that, instead of assuming that FEV1/FVC has a linear relationship with ML-based COPD, DeepNull handles cases where age, sex and FEV1/FVC have non-linear relationships with ML-based COPD. We obtained p-values from BOLT-LMM using a two-sided test. The green dashed line is the genome-wide significance level (P < 5 × 10⁻⁸).
Extended Data Fig. 5 Statistical power comparison of ML-based COPD with the Hobbs et al. (ref. 7) COPD GWAS.
a) The X-axis is the -log p-value of the Hobbs et al. (ref. 7) COPD GWAS. The Y-axis is the -log p-value of the ML-based COPD. Both p-values are computed using two-sided tests with no adjustment for multiple hypothesis tests. The vertical and horizontal red lines indicate the genome-wide significance level. The diagonal red line indicates y = x. The orange dots indicate variants-in-hits that are significant for the Hobbs et al. (ref. 7) COPD GWAS but not for our ML-based COPD, and the green dots indicate variants-in-hits that are significant for our ML-based COPD but not for the Hobbs et al. (ref. 7) COPD GWAS. b) Effect size correlation of ML-based COPD and the Hobbs et al. (ref. 7) COPD GWAS. The X-axis is the effect size of the Hobbs et al. (ref. 7) COPD GWAS for all GWS variants-in-hits and the Y-axis is the effect size of our ML-based COPD. The light red band is the 95% confidence interval of the effect size correlation.
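The colour coding in panel a can be expressed as a small classification rule. The sketch below is only an illustration: the function names are hypothetical, the base-10 logarithm is an assumption (the caption says only "-log p-value"), and the threshold is the genome-wide significance level of P < 5 × 10⁻⁸ stated in the Extended Data Fig. 4 caption.

```python
import math

GENOME_WIDE_SIGNIFICANCE = 5e-8  # threshold from the Extended Data Fig. 4 caption

def neg_log10(p_value):
    """Axis value for a p-value, assuming a base-10 logarithm."""
    return -math.log10(p_value)

def classify_variant(p_reference, p_ml, threshold=GENOME_WIDE_SIGNIFICANCE):
    """Classify a variant by which GWAS reaches genome-wide significance.

    "orange": significant only in the reference GWAS.
    "green":  significant only in the ML-based COPD GWAS.
    "both" / "neither": significant in both or in neither study.
    """
    reference_significant = p_reference < threshold
    ml_significant = p_ml < threshold
    if reference_significant and not ml_significant:
        return "orange"
    if ml_significant and not reference_significant:
        return "green"
    return "both" if reference_significant else "neither"
```

Points above the diagonal y = x are variants for which the ML-based COPD GWAS yields the smaller p-value, which is how the plot conveys relative statistical power.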
Extended Data Fig. 6 Statistical power comparison of ML-based COPD without MRB COPD cases with the Hobbs et al. (ref. 7) COPD GWAS.
a) The X-axis is the -log p-value of the Hobbs et al. (ref. 7) COPD GWAS. The Y-axis is the -log p-value of the ML-based COPD restricted to control samples (that is, no cases) based on the MRB label. Both p-values are computed using two-sided tests with no adjustment for multiple hypothesis tests. The vertical and horizontal red lines indicate the genome-wide significance level. The diagonal red line indicates y = x. The orange dots indicate variants-in-hits that are significant for the Hobbs et al. (ref. 7) COPD GWAS but not for our ML-based COPD, and the green dots indicate variants-in-hits that are significant for our ML-based COPD but not for the Hobbs et al. (ref. 7) COPD GWAS. b) Effect size correlation of ML-based COPD and the Hobbs et al. (ref. 7) COPD GWAS. The X-axis is the effect size of the Hobbs et al. (ref. 7) COPD GWAS for all GWS variants-in-hits and the Y-axis is the effect size of our ML-based COPD. The effect sizes from this analysis, which conditions on control status, are potentially subject to selection bias. The light red band is the 95% confidence interval of the effect size correlation.
Extended Data Fig. 7 Statistical power comparison of binarized ML-based COPD with Sakornsakolpat et al. (ref. 8).
a) The X-axis is the -log p-value of Sakornsakolpat et al. (ref. 8). The Y-axis is the -log p-value of the binarized ML-based COPD. Both p-values are computed using two-sided tests with no adjustment for multiple hypothesis tests. The vertical and horizontal red lines indicate the genome-wide significance level. The diagonal red line indicates y = x. The orange dots indicate variants-in-hits that are significant for Sakornsakolpat et al. (ref. 8) but not for our binarized ML-based COPD, and the green dots indicate variants-in-hits that are significant for our binarized ML-based COPD but not for the Sakornsakolpat et al. (ref. 8) GWAS. b) Effect size correlation of binarized ML-based COPD and the Sakornsakolpat et al. (ref. 8) GWAS. The X-axis is the effect size of Sakornsakolpat et al. (ref. 8) for all GWS variants-in-hits and the Y-axis is the effect size of our binarized ML-based COPD. The light red band is the 95% confidence interval of the effect size correlation.
Extended Data Fig. 8 Statistical power comparison of binarized ML-based COPD matching GOLD prevalence with Sakornsakolpat et al. (ref. 8).
a) The X-axis is the -log p-value of proxy-GOLD. The Y-axis is the -log p-value of the binarized ML-based COPD with prevalence matched to proxy-GOLD. Both p-values are computed using two-sided tests with no adjustment for multiple hypothesis tests. The vertical and horizontal red lines indicate the genome-wide significance level. The diagonal red line indicates y = x. The orange dots indicate variants-in-hits that are significant for Sakornsakolpat et al. but not for our binarized ML-based COPD, and the green dots indicate variants-in-hits that are significant for our binarized ML-based COPD but not for the Sakornsakolpat et al. GWAS. b) Effect size correlation of binarized ML-based COPD and the Sakornsakolpat et al. GWAS. The X-axis is the effect size of Sakornsakolpat et al. for all GWS variants-in-hits and the Y-axis is the effect size of our binarized ML-based COPD. The light red band is the 95% confidence interval of the effect size correlation.
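One plausible way to binarize a continuous liability score so that its case fraction matches a target prevalence is to label the highest-scoring individuals as cases. The sketch below assumes that interpretation; the function name, the rounding rule and the tie-breaking behaviour are illustrative choices, not details taken from the paper.

```python
def binarize_to_prevalence(scores, prevalence):
    """Label the top `prevalence` fraction of scores as cases (1), the rest as controls (0).

    Assumption: matching prevalence means thresholding the liability score at
    the quantile that yields the target case fraction. Ties are broken in
    favour of earlier samples because Python's sort is stable.
    """
    if not 0.0 <= prevalence <= 1.0:
        raise ValueError("prevalence must lie in [0, 1]")
    n_cases = round(prevalence * len(scores))
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    labels = [0] * len(scores)
    for index in ranked[:n_cases]:
        labels[index] = 1
    return labels
```

Matching the prevalence of the comparison phenotype keeps the case and control counts comparable, so differences in GWAS power reflect how cases are chosen rather than how many there are.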
Supplementary information
Supplementary Information
Supplementary Notes, Figs. 1–17 and Tables 1–41.
Supplementary Tables
Supplementary Tables 2, 7, 9–12, 15–31, 39 and 40.
Rights and permissions
Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.
About this article
Cite this article
Cosentino, J., Behsaz, B., Alipanahi, B. et al. Inference of chronic obstructive pulmonary disease with deep learning on raw spirograms identifies new genetic loci and improves risk models. Nat Genet 55, 787–795 (2023). https://doi.org/10.1038/s41588-023-01372-4
DOI: https://doi.org/10.1038/s41588-023-01372-4