Inference of chronic obstructive pulmonary disease with deep learning on raw spirograms identifies new genetic loci and improves risk models

Cosentino, Justin; Behsaz, Babak; Alipanahi, Babak; McCaw, Zachary R.; Hill, Davin; Schwantes-An, Tae-Hwi; Lai, Dongbing; Carroll, Andrew; Hobbs, Brian D.; Cho, Michael H.; McLean, Cory Y.; Hormozdiari, Farhad

doi:10.1038/s41588-023-01372-4

Published: 17 April 2023

Nature Genetics volume 55, pages 787–795 (2023)

Abstract

Chronic obstructive pulmonary disease (COPD), the third leading cause of death worldwide, is highly heritable. While COPD is clinically defined by applying thresholds to summary measures of lung function, a quantitative liability score has more power to identify genetic signals. Here we train a deep convolutional neural network on noisy self-reported and International Classification of Diseases labels to predict COPD case–control status from high-dimensional raw spirograms and use the model’s predictions as a liability score. The machine-learning-based (ML-based) liability score accurately discriminates COPD cases and controls, and predicts COPD-related hospitalization without any domain-specific knowledge. Moreover, the ML-based liability score is associated with overall survival and exacerbation events. A genome-wide association study on the ML-based liability score replicates existing COPD and lung function loci and also identifies 67 new loci. Lastly, our method provides a general framework to use ML methods and medical-record-based labels that does not require domain knowledge or expert curation to improve disease prediction and genomic discovery for drug design.

**Fig. 1: ML-based COPD phenotyping overview.**

**Fig. 2: Spirometry and COPD status overview.**

**Fig. 3: ML methods improve COPD detection relative to spirometry metrics in the UKB modeling validation set.**

**Fig. 4: ML-based COPD discovers 67 new association loci.**

Data availability

Genotypes and phenotypes are available for approved projects through the UKB study (https://www.ukbiobank.ac.uk). The full ML-based COPD GWAS summary statistics are currently available on our GitHub repository page (https://github.com/Google-Health/genomics-research/tree/main/ml-based-copd) and in the GWAS catalog (accession number GCST90244098). The raw ML-based COPD liability scores will be returned to UKB. This research has been conducted under Application Number 65275. We used the GWAS Catalog (https://www.ebi.ac.uk/gwas/) for replication analysis. This research used data generated by the COPDGene study (dbGaP accession phs000179.v6.p2), which was supported by NIH grants U01 HL089856 and U01 HL089897. The COPDGene project is also supported by the COPD Foundation through contributions made by an Industry Advisory Board comprised of Pfizer, AstraZeneca, Boehringer-Ingelheim, Novartis and Sunovion. ICGC genome-wide association summary statistics were obtained from dbGaP under accession phs000179.v5.p2. SpiroMeta summary statistics were obtained from LDHub (https://ldsc.broadinstitute.org/ldhub).

Code availability

Code and detailed instructions for model training, prediction and analysis, as well as instructions for evaluating the trained model on spirograms, are available at https://github.com/Google-Health/genomics-research/tree/main/ml-based-copd⁴⁷. We used the following tools: Baseline and BaselineLD annotations (https://data.broadinstitute.org/alkesgroup/ldscore), BOLT-LMM v2.3.4 (https://data.broadinstitute.org/alkesgroup/bolt-lmm), DeepNull v0.2.2 (https://github.com/Google-Health/genomics-research/tree/main/nonlinear-covariate-gwas), GWAS Catalog (https://www.ebi.ac.uk/gwas/), GARFIELD v2 (https://www.ebi.ac.uk/birney-srv/GARFIELD/), GREAT v4.0.4 (http://great.stanford.edu), S-LDSC v1.0.1 (https://data.broadinstitute.org/alkesgroup/ldscore), PLINK v1.9 (https://www.cog-genomics.org/plink1.9), scikit-learn v1.0.2 (https://scikit-learn.org/stable/), TensorFlow v2.9.0 (https://www.tensorflow.org), UCSC LiftOver (https://genome.ucsc.edu/cgi-bin/hgLiftOver), LDHub (https://ldsc.broadinstitute.org/ldhub) and Vizier v0.1.1 (https://github.com/google/vizier).

References

1. MacNee, W. ABC of chronic obstructive pulmonary disease: pathology, pathogenesis, and pathophysiology. BMJ 332, 1202–1204 (2006).
2. Ingebrigtsen, T. Genetic influences on chronic obstructive pulmonary disease—a twin study. Respir. Med. 104, 1890–1895 (2010).
3. Zhou, J. J. et al. Heritability of chronic obstructive pulmonary disease and related phenotypes in smokers. Am. J. Respir. Crit. Care Med. 188, 941–947 (2013).
4. Vestbo, J. et al. Global strategy for the diagnosis, management, and prevention of chronic obstructive pulmonary disease. Am. J. Respir. Crit. Care Med. 187, 347–365 (2013).
5. Graham, B. L. et al. Standardization of spirometry 2019 update. An official American Thoracic Society and European Respiratory Society technical statement. Am. J. Respir. Crit. Care Med. 200, e70–e88 (2019).
6. Mannino, D. M. & Buist, A. S. Global burden of COPD: risk factors, prevalence, and future trends. Lancet 370, 765–773 (2007).
7. Hobbs, B. D. et al. Genetic loci associated with chronic obstructive pulmonary disease overlap with loci for lung function and pulmonary fibrosis. Nat. Genet. 49, 426–432 (2017).
8. Sakornsakolpat, P. et al. Genetic landscape of chronic obstructive pulmonary disease identifies heterogeneous cell-type and phenotype associations. Nat. Genet. 51, 494–505 (2019).
9. Wain, L. V. et al. Novel insights into the genetics of smoking behaviour, lung function, and chronic obstructive pulmonary disease (UK BiLEVE): a genetic association study in UK Biobank. Lancet Respir. Med. 3, 769–781 (2015).
10. Shrine, N. et al. New genetic signals for lung function highlight pathways and chronic obstructive pulmonary disease associations across multiple ancestries. Nat. Genet. 51, 481–493 (2019).
11. Regan, E. A. et al. Clinical and radiologic disease in smokers with normal spirometry. JAMA Intern. Med. 175, 1539–1549 (2015).
12. Woodruff, P. G. et al. Clinical significance of symptoms in smokers with preserved pulmonary function. N. Engl. J. Med. 374, 1811–1821 (2016).
13. Anzueto, A. et al. COPDGene® 2019: redefining the diagnosis of chronic obstructive pulmonary disease. Chronic Obstr. Pulm. Dis. 6, 384–399 (2019).
14. Han, M. K. et al. From GOLD 0 to pre-COPD. Am. J. Respir. Crit. Care Med. 203, 414–423 (2021).
15. Silverman, E. K. Genetics of COPD. Annu. Rev. Physiol. 82, 413–431 (2020).
16. Alipanahi, B. et al. Large-scale machine-learning-based phenotyping significantly improves genomic discovery for optic nerve head morphology. Am. J. Hum. Genet. 108, 1217–1230 (2021).
17. Han, X. et al. Automated AI labeling of optic nerve head enables insights into cross-ancestry glaucoma risk and genetic discovery in >280,000 images from UKB and CLSA. Am. J. Hum. Genet. 108, 1204–1216 (2021).
18. LeCun, Y. et al. Backpropagation applied to handwritten zip code recognition. Neural Comput. 1, 541–551 (1989).
19. He, K., Zhang, X., Ren, S. & Sun, J. Deep residual learning for image recognition. In Proc. IEEE Conference on Computer Vision and Pattern Recognition 770–778 (IEEE, 2016).
20. He, T. et al. Bag of tricks for image classification with convolutional neural networks. In Proc. IEEE/CVF Conference on Computer Vision and Pattern Recognition 558–567 (IEEE, 2019).
21. Aung, N. et al. Genome-wide association analysis reveals insights into the genetic architecture of right ventricular structure and function. Nat. Genet. 54, 783–791 (2022).
22. Joo, J., Hobbs, B., Cho, M. & Himes, B. Trait insights gained by comparing genome-wide association study results using different chronic obstructive pulmonary disease definitions. AMIA Jt. Summits Transl. Sci. Proc. 30, 278–287 (2020).
23. Loh, P.-R. et al. Efficient Bayesian mixed-model analysis increases association power in large cohorts. Nat. Genet. 47, 284–290 (2015).
24. McCaw, Z. R., Lane, J. M., Saxena, R., Redline, S. & Lin, X. Operating characteristics of the rank-based inverse normal transformation for quantitative trait analysis in genome-wide association studies. Biometrics 76, 1262–1272 (2020).
25. Bulik-Sullivan, B. K. et al. LD score regression distinguishes confounding from polygenicity in genome-wide association studies. Nat. Genet. 47, 291–295 (2015).
26. Regan, E. A. et al. Genetic epidemiology of COPD (COPDGene) study design. COPD 7, 32–43 (2011).
27. Artigas, M. S. et al. Sixteen new lung function signals identified through 1000 genomes project reference panel imputation. Nat. Commun. 6, 8658 (2015).
28. Zhou, W. et al. Global Biobank Meta-analysis Initiative: powering genetic discovery across human disease. Cell Genom. 2, 100192 (2022).
29. McCaw, Z. R. et al. DeepNull models non-linear covariate effects to improve phenotypic prediction and association power. Nat. Commun. 13, 241 (2022).
30. Rabe, K. F. et al. Safety and efficacy of itepekimab in patients with moderate-to-severe COPD: a genetic association study and randomised, double-blind, phase 2a trial. Lancet Respir. Med. 9, 1288–1298 (2021).
31. Kichaev, G. et al. Leveraging polygenic functional enrichment to improve GWAS power. Am. J. Hum. Genet. 104, 65–75 (2019).
32. Amirav, I. et al. Systematic analysis of CCNO variants in a defined population: implications for clinical phenotype and differential diagnosis. Hum. Mutat. 37, 396–405 (2016).
33. Wallmeier, J. et al. Mutations in CCNO result in congenital mucociliary clearance disorder with reduced generation of multiple motile cilia. Nat. Genet. 46, 646–651 (2014).
34. Tilley, A. E., Walters, M. S., Shaykhiev, R. & Crystal, R. G. Cilia dysfunction in lung disease. Annu. Rev. Physiol. 77, 379–406 (2015).
35. Qiao, D. et al. Whole exome sequencing analysis in severe chronic obstructive pulmonary disease. Hum. Mol. Genet. 27, 3801–3812 (2018).
36. Wootton, R. E. et al. Evidence for causal effects of lifetime smoking on risk for depression and schizophrenia: a Mendelian randomisation study. Psychol. Med. 50, 2435–2443 (2019).
37. Lehmann, M., Baarsma, H. A. & Königshoff, M. WNT signaling in lung aging and disease. Ann. Am. Thorac. Soc. 13, S411–S416 (2016).
38. Morrow, J. D. et al. Functional interactors of three genome-wide association study genes are differentially expressed in severe chronic obstructive pulmonary disease lung tissue. Sci. Rep. 7, 44232 (2017).
39. Conlon, T. M. et al. Inhibition of LTβR signalling activates WNT-induced regeneration in lung. Nature 588, 151–156 (2020).
40. Shrine, N. et al. Multi-ancestry genome-wide association study improves resolution of genes, pathways and pleiotropy for lung function and chronic obstructive pulmonary disease. Nat. Genet. 55, 410–422 (2022).
41. Cloonan, S. M. et al. Mitochondrial iron chelation ameliorates cigarette smoke–induced bronchitis and emphysema in mice. Nat. Med. 22, 163–174 (2016).
42. Routhier, J. et al. An innate contribution of human nicotinic receptor polymorphisms to COPD-like lesions. Nat. Commun. 12, 6384 (2021).
43. Golovin, D. et al. Google vizier. In Proc. 23rd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining 1487–1495 (ACM, 2017).
44. Frazier, P. I. A tutorial on Bayesian optimization. Preprint at https://arxiv.org/abs/1807.02811 (2018).
45. Lakshminarayanan, B., Pritzel, A. & Blundell, C. Simple and scalable predictive uncertainty estimation using deep ensembles. Adv. Neural Inf. Process Syst. 30, 6405–6416 (2017).
46. Mbatchou, J. et al. Computationally efficient whole-genome regression for quantitative and binary traits. Nat. Genet. 53, 1097–1103 (2021).
47. Cosentino, J. et al. Google-Health/genomics-research: ML-based COPD v0.2.0. Zenodo https://doi.org/10.5281/zenodo.7718510 (2023).

Acknowledgements

We thank T. Yun (Google Health AI) for helpful discussions and H. Yang (Google Health AI) for project management assistance. B.D.H. is supported by NIH K08 HL136928, U01 HL089856, R01 HL155749 and a Research Grant from the Alpha-1 Foundation. M.H.C. is supported by R01HL153248, R01HL149861, R01HL147148 and R01HL089856. D.H. was supported by NIH 2T32HL007427-41. This study was funded by Google. The funder had no role in study design, data collection and analysis, decision to publish or preparation of the manuscript.

Author information

These authors contributed equally: Justin Cosentino, Babak Behsaz, Babak Alipanahi, Zachary R. McCaw.

Authors and Affiliations

Google Health AI, Palo Alto, CA, USA
Justin Cosentino, Babak Alipanahi, Zachary R. McCaw & Andrew Carroll
Google Health AI, Cambridge, MA, USA
Babak Behsaz, Cory Y. McLean & Farhad Hormozdiari
Department of Electrical and Computer Engineering, Northeastern University, Boston, MA, USA
Davin Hill
Channing Division of Network Medicine, Brigham and Women’s Hospital, Boston, MA, USA
Davin Hill, Brian D. Hobbs & Michael H. Cho
Department of Medical and Molecular Genetics, Indiana University School of Medicine, Indianapolis, IN, USA
Tae-Hwi Schwantes-An & Dongbing Lai
Division of Cardiology, Department of Medicine, Indiana University School of Medicine, Indianapolis, IN, USA
Tae-Hwi Schwantes-An
Division of Pulmonary and Critical Care Medicine, Brigham and Women’s Hospital, Boston, MA, USA
Brian D. Hobbs & Michael H. Cho
Harvard Medical School, Boston, MA, USA
Brian D. Hobbs & Michael H. Cho

Contributions

J.C., B.B., B.A. and F.H. conceived the study. J.C., B.B., B.A., Z.R.M., B.D.H., M.H.C., C.Y.M. and F.H. designed the study. J.C., B.B., B.A., Z.R.M., D.H., T-H.S-A., D.L., C.Y.M. and F.H. performed experiments. J.C., B.B., B.A., Z.R.M., D.H., T-H.S-A., D.L., A.C., B.D.H., M.H.C., C.Y.M. and F.H. analyzed results. J.C., B.B., Z.R.M., B.D.H., M.H.C., C.Y.M. and F.H. wrote the paper. All authors reviewed and contributed to the final version of the paper.

Corresponding authors

Correspondence to Justin Cosentino or Farhad Hormozdiari.

Ethics declarations

Competing interests

J.C., B.B., B.A., Z.R.M., A.W.C., C.Y.M. and F.H. are current or former employees of Google and own Alphabet stock. This study was funded by Google. B.D.H. receives grant support from Bayer. M.H.C. has received grant support from GSK and Bayer, and consulting or speaking fees from Genentech, AstraZeneca and Illumina. The other authors declare no competing interests.

Peer review

Peer review information

Nature Genetics thanks the anonymous reviewers for their contribution to the peer review of this work.

Additional information

Publisher’s note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Extended data

Extended Data Fig. 1 An overview of the optimized one-dimensional ResNet18-D model architecture.

Sublabels denote layer parameters, with convolutional, pooling, residual and downsample layers adhering to the following convention: kernel size, number of filters and stride length. Supplementary Fig. 1 details the residual and downsample layers. Subfigures a–f comprise the ResNet18-D model’s ‘backbone’, which produces 128-dimensional embeddings from flow-volume spirograms. a) The input stem. We zero-pad the flow-volume curve to size 1024. All 1D convolutions in the input stem are followed by a batch normalization layer and a ReLU activation function. b–e) Residual and downsampling layers (Supplementary Fig. 1). f) The backbone’s final layer, followed by a ReLU activation function, produces a 128-dimensional embedding that is passed to each outcome head. g) The model’s disease head subarchitecture. We use swish activation functions in the first three layers and a sigmoid activation function in the final layer.
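The input padding and the two activation functions named in this caption can be illustrated with a short sketch. It is not the authors' implementation: the function names are hypothetical, and padding on the right is an assumption, since the caption does not say which side is padded or how curves longer than 1024 points are handled.

```python
import math

TARGET_LENGTH = 1024  # padded input size stated in the caption


def zero_pad(curve, target_length=TARGET_LENGTH):
    """Right-pad a 1D flow-volume curve with zeros to a fixed length.

    Assumption: zeros are appended at the end of the curve.
    """
    if len(curve) > target_length:
        raise ValueError("curve is longer than the target length")
    return list(curve) + [0.0] * (target_length - len(curve))


def sigmoid(x):
    """Numerically stable logistic sigmoid, mapping a logit to (0, 1)."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    z = math.exp(x)
    return z / (1.0 + z)


def swish(x):
    """Swish activation: x * sigmoid(x)."""
    return x * sigmoid(x)


if __name__ == "__main__":
    padded = zero_pad([0.5, 1.2, 0.9])  # toy curve, not real spirometry
    print(len(padded), padded[:4])
    print(sigmoid(0.0), swish(1.0))
```

In the described model the sigmoid sits on the final layer of the disease head, so its output can be read as a score between 0 and 1.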

Extended Data Fig. 2 An overview of UK Biobank dataset used in this study.

Our initial dataset consists of all European-ancestry individuals in UK Biobank (n = 435,766). Individuals with valid spirograms form the modeling dataset (n = 325,027), and individuals with invalid spirograms form the PRS holdout set (n = 110,739); these holdout individuals are not used in ML modeling or in the GWASs. We split the modeling dataset into training and validation sets containing 80% and 20% of samples, respectively. The modeling dataset was used to select model architectures, tune hyperparameters and evaluate ML model performance across tasks, while a two-fold cross-fold dataset was used during the final model application process to generate phenotypes. The combined sample size of train fold 1 and train fold 2 is not equal to the size of the whole training dataset because we removed genetically close samples that fall across two different folds. Folds were constructed to keep genetically related individuals together, which prevents the same individual or a close relative from being used for both training and prediction.
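Keeping related individuals in the same fold can be sketched as a group-aware split. The sketch below is one possible approach, not the procedure used in the study: the relatedness-group IDs, the function names and the hash-based assignment are all assumptions made for illustration.

```python
import hashlib


def assign_fold(group_id, n_folds=2):
    """Map a relatedness-group ID to a fold index deterministically.

    Assumption: every individual already carries a group ID shared with
    all of their close relatives (hypothetical input).
    """
    digest = hashlib.sha256(str(group_id).encode("utf-8")).hexdigest()
    return int(digest, 16) % n_folds


def split_by_group(sample_to_group, n_folds=2):
    """Return a list of folds in which all members of a group share a fold."""
    folds = [[] for _ in range(n_folds)]
    for sample, group in sorted(sample_to_group.items()):
        folds[assign_fold(group, n_folds)].append(sample)
    return folds


if __name__ == "__main__":
    # Toy sample IDs and family IDs, invented for this example.
    mapping = {"s1": "famA", "s2": "famA", "s3": "famB", "s4": "famC"}
    for index, fold in enumerate(split_by_group(mapping)):
        print(index, fold)
```

With two folds, a model trained on one fold predicts phenotypes for the other, so no individual is scored by a model that saw them or a close relative during training.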

Extended Data Fig. 3 GWS loci replication pipeline.

A GWAS on the ML-based liability score identifies 265 novel COPD risk loci in addition to 91 previously known COPD loci, where known loci are defined with respect to ref. 8 and GWAS Catalog entries (as of 2022-07-09) for COPD, emphysema and chronic bronchitis. Of the 265 additional COPD loci, 221 independently replicate as associated with COPD or COPD-related lung function, as follows. We observed that 101 of the 265 were detected in a previous COPD GWAS⁸ after Bonferroni correction. In addition, 198 of the 265 are previously known FEV₁ or FEV₁/FVC loci with respect to ref. 10 and GWAS Catalog entries. The three replication datasets are GBMI (Global Biobank Meta-analysis Initiative)²⁸, SpiroMeta²⁷ and ICGC (International COPD Genetics Consortium)⁷, all of which exclude samples from UK Biobank. We defined two replication strategies. First, we defined supportive replication as an effect size direction that is consistent with our ML-based COPD across all studies. The ICGC and GBMI GWASs are based on a COPD phenotype; thus, we expect their effect size signs to match those of our ML-based COPD. SpiroMeta phenotypes, on the other hand, capture lung function, so we expect their effect size signs to be the opposite of our ML-based COPD signs. Second, we defined strict replication as a consistent effect size direction in any study together with a Bonferroni-corrected P < 0.1 (one-sided) for that study.
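The two replication rules can be written as a small sketch. It reflects one reading of the caption rather than the authors' code: the function names, the dictionary layout, the `n_tests` argument and the toy numbers are assumptions, and the caption does not state how many tests enter the Bonferroni correction for each study.

```python
def direction_consistent(ml_beta, study_beta, study_kind):
    """Check whether a study's effect direction agrees with ML-based COPD.

    study_kind is "copd" (same sign expected) or "lung_function"
    (opposite sign expected, as for SpiroMeta).
    """
    if study_kind == "copd":
        return ml_beta * study_beta > 0
    if study_kind == "lung_function":
        return ml_beta * study_beta < 0
    raise ValueError("unknown study kind: " + str(study_kind))


def supportive_replication(ml_beta, studies):
    """Consistent effect direction across all studies."""
    return all(
        direction_consistent(ml_beta, s["beta"], s["kind"]) for s in studies
    )


def strict_replication(ml_beta, studies, n_tests, alpha=0.1):
    """Consistent direction in any study with Bonferroni-significant P.

    Assumption: the one-sided P value is compared with alpha / n_tests.
    """
    threshold = alpha / n_tests
    return any(
        direction_consistent(ml_beta, s["beta"], s["kind"])
        and s["p_one_sided"] < threshold
        for s in studies
    )


if __name__ == "__main__":
    # Toy values for a single locus, invented for illustration.
    studies = [
        {"kind": "copd", "beta": 0.02, "p_one_sided": 0.2},
        {"kind": "lung_function", "beta": -0.03, "p_one_sided": 1e-6},
    ]
    print(supportive_replication(0.05, studies))
    print(strict_replication(0.05, studies, n_tests=265))
```

Supportive replication requires agreement in every study but no significance, whereas strict replication requires significance in at least one study.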

Extended Data Fig. 4 ML-based COPD GWAS Manhattan plot via DeepNull.

We performed an ML-based COPD GWAS using the same set of covariates as in Fig. 4, with one additional covariate provided by DeepNull. The DeepNull model predicts ML-based COPD using age, sex, genotype array and FEV₁/FVC as inputs, and the additional covariate is this DeepNull prediction. DeepNull learns a function of these inputs that may be linear or non-linear. Thus, this analysis is similar to the ML-based COPD GWAS conditional on FEV₁/FVC, except that instead of assuming that FEV₁/FVC has a linear relationship with ML-based COPD, DeepNull handles cases where age, sex and FEV₁/FVC have non-linear relationships with ML-based COPD. We obtained p-values from BOLT-LMM using a two-sided test. The green dashed line is the genome-wide significance level (P < 5 × 10⁻⁸).

Extended Data Fig. 5 Statistical power comparison of ML-based COPD with Hobbs et al.⁷ COPD GWAS.

a) The X-axis is the -log p-value of the Hobbs et al.⁷ COPD GWAS. The Y-axis is the -log p-value of the ML-based COPD. Both p-values are computed using two-sided tests with no adjustment for multiple hypothesis tests. The vertical and horizontal red lines indicate the genome-wide significance level. The diagonal red line indicates y = x. The orange dots indicate variants-in-hits that are significant for the Hobbs et al.⁷ COPD GWAS but not for our ML-based COPD, and the green dots indicate variants-in-hits that are significant for our ML-based COPD but not for the Hobbs et al.⁷ COPD GWAS. b) Effect size correlation of ML-based COPD and the Hobbs et al.⁷ COPD GWAS. The X-axis is the effect size of the Hobbs et al.⁷ COPD GWAS for all GWS variants-in-hits and the Y-axis is the effect size of our ML-based COPD. The light red band is the 95% confidence interval of the effect size correlation.
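The coloring rule in panel a can be expressed as a short sketch. It assumes the genome-wide significance level of 5 × 10⁻⁸ given in the Extended Data Fig. 4 caption and a base-10 logarithm for the axes; the function names and the "unhighlighted" label are hypothetical, because the caption does not say how the remaining variants are drawn.

```python
import math

GWS_THRESHOLD = 5e-8  # genome-wide significance level (P < 5 x 10^-8)


def neg_log10(p_value):
    """Axis transform for a p-value (base 10 is an assumption)."""
    return -math.log10(p_value)


def classify_variant(p_reference, p_ml):
    """Label a variant by which GWAS reaches genome-wide significance.

    p_reference: p-value in the reference COPD GWAS (X-axis).
    p_ml: p-value in the ML-based COPD GWAS (Y-axis).
    """
    reference_significant = p_reference < GWS_THRESHOLD
    ml_significant = p_ml < GWS_THRESHOLD
    if reference_significant and not ml_significant:
        return "orange"
    if ml_significant and not reference_significant:
        return "green"
    return "unhighlighted"  # significant in both or in neither


if __name__ == "__main__":
    # Toy p-values, invented for illustration.
    for p_ref, p_ml in [(1e-9, 1e-3), (1e-3, 1e-12), (1e-9, 1e-9)]:
        print(neg_log10(p_ref), neg_log10(p_ml), classify_variant(p_ref, p_ml))
```

The same rule applies to the later comparison figures once the reference p-values are swapped for those of the corresponding GWAS.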

Extended Data Fig. 6 Statistical power comparison of ML-based COPD without MRB COPD cases with Hobbs et al.⁷ COPD GWAS.

a) The X-axis is the -log p-value of the Hobbs et al.⁷ COPD GWAS. The Y-axis is the -log p-value of the ML-based COPD restricted to control samples (that is, no cases) based on the MRB label. Both p-values are computed using two-sided tests with no adjustment for multiple hypothesis tests. The vertical and horizontal red lines indicate the genome-wide significance level. The diagonal red line indicates y = x. The orange dots indicate variants-in-hits that are significant for the Hobbs et al.⁷ COPD GWAS but not for our ML-based COPD, and the green dots indicate variants-in-hits that are significant for our ML-based COPD but not for the Hobbs et al.⁷ COPD GWAS. b) Effect size correlation of ML-based COPD and the Hobbs et al.⁷ COPD GWAS. The X-axis is the effect size of the Hobbs et al.⁷ COPD GWAS for all GWS variants-in-hits and the Y-axis is the effect size of our ML-based COPD. Note that the effect sizes from this analysis, which conditions on control status, are potentially subject to selection bias. The light red band is the 95% confidence interval of the effect size correlation.

Extended Data Fig. 7 Statistical power comparison of binarized ML-based COPD with Sakornsakolpat et al.⁸

a) The X-axis is the -log p-value of the Sakornsakolpat et al.⁸ GWAS. The Y-axis is the -log p-value of the binarized ML-based COPD. Both p-values are computed using two-sided tests with no adjustment for multiple hypothesis tests. The vertical and horizontal red lines indicate the genome-wide significance level. The diagonal red line indicates y = x. The orange dots indicate variants-in-hits that are significant for the Sakornsakolpat et al.⁸ GWAS but not for our binarized ML-based COPD, and the green dots indicate variants-in-hits that are significant for our binarized ML-based COPD but not for the Sakornsakolpat et al.⁸ GWAS. b) Effect size correlation of binarized ML-based COPD and the Sakornsakolpat et al.⁸ GWAS. The X-axis is the effect size of the Sakornsakolpat et al.⁸ GWAS for all GWS variants-in-hits and the Y-axis is the effect size of our binarized ML-based COPD. The light red band is the 95% confidence interval of the effect size correlation.

Extended Data Fig. 8 Statistical power comparison of binarized ML-based COPD matching GOLD prevalence with Sakornsakolpat et al.⁸

a) The X-axis is the -log p-value of proxy-GOLD. The Y-axis is the -log p-value of the binarized ML-based COPD, with prevalence matched to proxy-GOLD. Both p-values are computed using two-sided tests with no adjustment for multiple hypothesis tests. The vertical and horizontal red lines indicate the genome-wide significance level. The diagonal red line indicates y = x. The orange dots indicate variants-in-hits that are significant for the Sakornsakolpat et al.⁸ GWAS but not for our binarized ML-based COPD, and the green dots indicate variants-in-hits that are significant for our binarized ML-based COPD but not for the Sakornsakolpat et al.⁸ GWAS. b) Effect size correlation of binarized ML-based COPD and the Sakornsakolpat et al.⁸ GWAS. The X-axis is the effect size of the Sakornsakolpat et al.⁸ GWAS for all GWS variants-in-hits and the Y-axis is the effect size of our binarized ML-based COPD. The light red band is the 95% confidence interval of the effect size correlation.
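Binarizing a continuous liability score so that its case fraction matches a target prevalence can be sketched as follows. This is one plausible reading of "prevalence matching", not the authors' documented procedure: the function name, the rounding of the case count and the tie handling are assumptions.

```python
def binarize_to_prevalence(scores, prevalence):
    """Label the highest-scoring samples as cases to match a prevalence.

    Assumption: the number of cases is round(prevalence * n), and ties
    are broken by input order.
    """
    if not 0.0 <= prevalence <= 1.0:
        raise ValueError("prevalence must be between 0 and 1")
    n_cases = round(prevalence * len(scores))
    # Indices sorted from highest to lowest score.
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    labels = [0] * len(scores)
    for index in order[:n_cases]:
        labels[index] = 1
    return labels


if __name__ == "__main__":
    # Toy liability scores, invented for illustration.
    toy_scores = [0.1, 0.9, 0.4, 0.7, 0.2]
    print(binarize_to_prevalence(toy_scores, prevalence=0.4))
```

Matching prevalence in this way keeps the number of cases equal between the binarized ML-based phenotype and the comparison phenotype, so a difference in power is not driven by case count alone.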

Supplementary information

Supplementary Information

Supplementary Notes, Figs. 1–17 and Tables 1–41.

Reporting Summary

Supplementary Tables

Supplementary Tables 2, 7, 9–12, 15–31, 39 and 40.

Rights and permissions

Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.

About this article

Cite this article

Cosentino, J., Behsaz, B., Alipanahi, B. et al. Inference of chronic obstructive pulmonary disease with deep learning on raw spirograms identifies new genetic loci and improves risk models. Nat Genet 55, 787–795 (2023). https://doi.org/10.1038/s41588-023-01372-4

Received: 02 September 2022
Accepted: 14 March 2023
Published: 17 April 2023
Issue Date: May 2023
DOI: https://doi.org/10.1038/s41588-023-01372-4
