Abstract
While respiratory diseases such as COPD and asthma share many risk factors, most studies investigate them in isolation and in predominantly European ancestry populations. Here, we conducted the most powerful multi-trait and -ancestry genetic analysis of respiratory diseases and auxiliary traits to date. Our approach improves the power of genetic discovery across traits and ancestries, identifying 44 novel loci associated with lung function in individuals of East Asian ancestry. Using these results, we developed PRSxtra (cross TRait and Ancestry), a multi-trait and -ancestry polygenic risk score approach that leverages shared components of heritable risk via pleiotropic effects. PRSxtra significantly improved the prediction of asthma, COPD, and lung cancer compared to trait- and ancestry-matched PRS in a multi-ancestry cohort from the All of Us Research Program, especially in diverse populations. PRSxtra identified individuals in the top decile with over four-fold odds of asthma and COPD compared to the first decile. Our results present a new framework for multi-trait and -ancestry studies of respiratory diseases to improve genetic discovery and polygenic prediction.
Introduction
Respiratory diseases are leading causes of morbidity, mortality, and disability-adjusted life-years globally. Chronic obstructive pulmonary disease (COPD) is the third leading cause of death globally, asthma is the most common chronic disease of childhood, and lung cancer is the leading cause of cancer deaths worldwide1–3.
Existing models for predicting the risk of respiratory disease are limited and often do not consider environmental or genetic risk factors beyond smoking. For example, the Global Initiative for Chronic Obstructive Lung Disease (GOLD) classification system uses spirometry results to guide treatment and prognosis of COPD4; the asthma predictive index (API) determines the likelihood of pediatric asthma mainly based on family history information5; and lung cancer risk is primarily assessed by age and smoking history.
While respiratory diseases such as COPD and lung cancer are strongly influenced by smoking, they are complex diseases that develop due to the influence of many environmental and genetic risk factors. For instance, cumulative non-smoking factors can better predict and stratify COPD risk compared to smoking alone6. Genetic factors also play an important role. Family-based studies estimate heritability at 40% to 60% for COPD and asthma7–9, respectively, and around 18% for lung cancer10, indicating a substantial genetic component.
Previous studies have shown that single-trait polygenic risk scores (PRS), which model the cumulative genetic risk using genome-wide association study (GWAS) summary statistics, can identify and stratify individuals with risk of respiratory diseases 11,12. While respiratory diseases share many comorbidities as well as genetic, clinical, and lifestyle risk factors, such as smoking exposure, most PRS to date have been constructed in the context of a single trait and ancestry group without modeling dense genetic correlations between traits and linkage disequilibrium (LD) and allele frequency patterns between ancestries.
To address these limitations, we expanded genetic studies of spirometry and developed a novel multi-trait and multi-ancestry PRS approach that integrates genetic correlations between respiratory disease and auxiliary traits and LD patterns across diverse populations. This approach improves the power for genetic discovery across traits and the predictive accuracy of PRS for respiratory diseases, thereby providing a more comprehensive risk assessment tool that can be used in research contexts or potentially integrated into existing clinical models.
Previous efforts have used multi-trait approaches to improve genetic discovery and prediction13. For example, multi-trait analysis of GWAS (MTAG) has demonstrated significant improvements in the power of detecting signals for psychiatric disorders, cardiomyopathies, and tobacco and alcohol use, among others 14–16. Multi-trait PRS constructed from a weighted sum of multiple cardiometabolic trait scores also predicts heart disease better than the single trait PRS17,18. Similarly, lung function PRS from two spirometry measurements–Forced Expiratory Volume in one second (FEV1) and the ratio of FEV1 to Forced Vital Capacity (FVC)–can improve the identification of heterogeneity when predicting asthma and COPD11,19. However, the majority of studies still report a single-trait PRS derived in cohorts of primarily European genetic ancestry. This lack of diversity in genomic studies has been shown to greatly reduce the generalizability of prediction models and exacerbate healthcare disparities20.
We hypothesize that the joint modeling of multi-trait and -ancestry information at the SNP level will enhance genomic discovery and prediction in respiratory diseases with the largest global disease burden and disparities. Specifically, we jointly model the genetic information of four ancestry groups with population labels and definitions based on training with genetic reference panels, including African (AFR), Admixed American (AMR), East Asian (EAS), and European (EUR)21,22. We also model eight strongly correlated traits–COPD, asthma, lung cancer, forced expiratory volume (FEV1), forced vital capacity (FVC), FEV1/FVC, smoking status, and smoking heaviness.
To evaluate the utility of this approach for predicting respiratory diseases and interpreting genetic variant effects across traits and ancestries, we: 1) conducted the largest meta-analysis of lung function in East Asian populations and a comprehensive multi-trait meta-analysis of respiratory disease and related traits to significantly improve genetic discovery; 2) compared shared and distinct genetic architectures and effects by modeling pleiotropy across these traits; 3) developed and validated the PRSxtra (cross TRait and Ancestry) method in the All of Us Research Program, modeling genetic correlations between traits and ancestry-specific LD and allele frequency patterns between populations; and 4) quantified PRSxtra prediction accuracy and case stratification of asthma, COPD, and lung cancer risk across multiethnic populations, especially those traditionally underrepresented in genetic studies, and compared it to single-trait and -ancestry PRS and clinical risk factors.
Results
We aim to improve the power of genetic discovery and prediction of respiratory diseases through multi-ancestry and multi-trait analyses of eight correlated traits - COPD, asthma, lung cancer, spirometry (FEV1, FVC, FEV1/FVC), smoking status, and smoking heaviness - in African (AFR), Admixed American (AMR), East Asian (EAS), and European (EUR) ancestry populations. We conducted the largest meta-analysis GWAS of lung function in EAS to date and meta-analyzed additional GWAS from the Global Biobank Meta-analysis Initiative, GWAS & Sequencing Consortium of Alcohol and Nicotine (GSCAN), and multi-population genome-wide meta-analyses of lung function and lung cancer. We then developed the most predictive PRS of asthma, COPD, and lung cancer across populations using the strategy and datasets outlined in Figure 1.
Largest genome-wide association study of lung function in the East Asian population identifies novel loci
Spirometry is a pulmonary function test that evaluates lung function. Because spirometry measurements are continuous measures of lung function, they are useful traits for studying the heritable basis of respiratory diseases. However, the largest genetic studies of spirometry to date have been Eurocentric. To identify novel loci associated with lung function in understudied populations, we performed the largest GWAS to date of FEV1, FVC, and FEV1/FVC in individuals of East Asian (EAS) ancestry, combining GWAS summary statistics from the Korean Cancer Prevention Study-II (KCPS2) and Taiwanese Biobank (TWB) (Supplementary Table 1, Methods)23–25. We conducted a fixed-effects inverse-variance weighted meta-analysis for the three continuous spirometry measures of lung function across 132,200 total individuals. We applied quality control filters (Methods) that resulted in 8 million unique single-nucleotide polymorphisms (SNPs) for meta-analysis.
The meta-analysis identified 44, 73, and 31 independent loci for FEV1, FVC, and FEV1/FVC, respectively (Figure 2, Supplementary Figure 1-3). Of these, 44 are novel and have not been previously associated with each trait (Supplementary Table 2), including 25 novel loci for FVC, 17 for FEV1, and 2 for FEV1/FVC. Among these novel loci, several were previously associated with height, including rs3782886 in BRAP (P=4.20×10⁻⁸, β=0.0219 with FEV1), rs7290267 near FLJ27365 (P=1.49×10⁻⁹, β=-0.0272 with FEV1), rs724016 in ZBTB38 (P=1.35×10⁻¹⁰, β=-0.0178 with FVC), rs58744877 near GNAS (P=2.71×10⁻¹⁴, β=-0.033 with FVC), rs3176466 near CDKN2C (P=4.59×10⁻⁸, β=0.0236 with FVC), rs28839214 near RP11-361D14.2 (P=1.91×10⁻¹³, β=0.0197 with FVC), and rs149580940 in FIBIN (P=7.89×10⁻¹⁰, β=0.0449 with FVC)26–28. rs3782886 has also shown previous associations with other traits, including alcohol use disorder, high-density lipoprotein, and type 2 diabetes29–31. rs724016 has also been shown to be associated with Crohn’s disease32, and rs150971595 near GABBR1 (P=1.49×10⁻⁸, β=-0.0307 with FVC) is associated with total cholesterol and triglyceride levels33. Previously reported associations for each locus can be found in Supplementary Table 3.
We combined the results of our EAS meta-analysis of lung function with the largest multi-ancestry GWAS of lung function to date34. Incorporating results from KCPS2 and TWB resulted in 519, 526, and 476 total loci discovered for FEV1, FVC, and FEV1/FVC, respectively, of which 131, 163, and 149 were not originally reported to be associated with each trait (Figure 2, Supplementary Figure 4-5). Significantly associated loci are reported in Supplementary Table 4-6.
Pervasive pleiotropic effects across respiratory traits and diseases inform shared components of heritable risk
Measures of lung function are known clinical risk factors for respiratory diseases such as chronic obstructive pulmonary disease (COPD), asthma, and lung cancer. Using the largest multi-ancestry GWAS summary statistics to date, we found that lung function (FEV1, FVC, FEV1/FVC), smoking behavior (smoking status and cigarettes/day), and respiratory diseases (asthma, COPD, and lung cancer) were all significantly genetically correlated with each other (P<0.05), with the exception of FVC and lung cancer (Figure 3a).
To examine the distinct and shared variants among respiratory diseases and auxiliary traits, we first evaluated the consistency of lead SNPs between the traits. Many variants demonstrated pleiotropic effects associated with multiple traits. In particular, we explored the shared and distinct genetic etiology among the three respiratory disease traits (asthma, COPD, and lung cancer), risk factors (smoking status, smoking heaviness), and a measure of lung function (FEV1/FVC). For each trait pair, we selected variants significantly associated with either or both of the traits. We computed the linear slopes of variant effect sizes within each category using an expectation-maximization (EM) algorithm, which typically revealed more than one linear trend and divergent slopes (Methods, Figure 3, Supplementary Figure 6-8). For example, genetic variants associated with the amount and frequency of smoking have mostly distinct effects from asthma. This suggests that there is no uniform explanation of the genetic mechanisms of the two traits across those variants.
We then implemented a Bayesian algorithm, linemodels35, to classify the variant effect sizes into confident associations with trait 1, trait 2, or both traits. We used ancestry-specific (AFR, EAS, EUR) as well as meta-analyzed multi-ancestry GWAS results to identify variants having confident associations with a posterior probability above 0.95. We found 31 variants that were confidently associated with COPD and 108 with lung cancer, which are distinct from those associated with smoking heaviness, as measured by Cigarettes/Day, in EUR. No variants were confidently associated primarily with both traits or with smoking heaviness only (Supplementary Table 7-8). Two variants on chromosome 8 near the GULOP pseudogene, rs113623975 and rs56223946, were predominantly associated with both COPD and lung cancer.
When we further investigated variants with predominant effects on lung cancer across EAS, EUR, and the multi-ancestry meta-analysis GWAS results, we identified 7 variants from TERT appearing in all three groups and 6 variants significant only in the meta-analysis results (Supplementary Figure 9-10). These include rs34517439 from DNAJB4, rs11778371 from CHRNA2, and 4 other variants within or near TERT.
PRSxtra jointly models multi-ancestry and multi-trait effects to predict diseases and exacerbations
To develop PRSxtra, we combined the results of our new GWAS meta-analysis of lung function in EAS with the largest and most diverse GWAS of COPD, asthma, lung cancer, and smoking (Figure 1) to conduct ancestry-specific multi-trait GWAS (MTAG)13. We then used PRS-CSx to jointly model trait-specific MTAG results across ancestry groups to derive candidate scores36,37 to include in the final PRSxtra.
First, MTAG resulted in a greater number of significant associations and loci discovered across traits (Supplementary Table 9). AMR and EUR ancestry-specific MTAG analysis resulted in the largest gains. For example, FVC in AMR and COPD in EUR cohorts had the greatest proportional increases in the number of lead SNPs discovered, from 2 to 34 (17-fold) and 27 to 441 (16-fold), respectively. In total, MTAG identified 609 more lead SNPs across all traits and ancestry groups. We assessed the gain in power for each run of MTAG by the increase in mean χ2 statistic, where all but five GWAS (all of which were ancestry-specific spirometry GWAS) increased in power (Supplementary Table 9).
Next, we aggregated the summary statistics from MTAG to derive multi-trait and multi-ancestry polygenic risk scores. For each trait, we used PRS-CSx to jointly model trait-specific MTAG summary statistics across ancestry groups to derive a total of 39 candidate scores for individuals in the All of Us Research Program (AoU), a longitudinal cohort study continuously enrolling adults in the US with genotype, self-reported, and linked health record data. The program places a strong emphasis on including diverse populations that have traditionally been underrepresented in biomedical research. The individuals from AoU are independent from the cohorts in which the GWAS was obtained. For each trait, we randomly split participants who passed quality control into 70% for training and 30% for validation (Figure 1, Methods, Supplementary Table 10-13). We then used ridge regression to calculate linear combinations of the scores in the training set (Supplementary Table 14-16, Supplementary Figure 11-13).
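The score-combination step can be illustrated with a minimal sketch. It assumes a closed-form linear ridge fit on standardized candidate scores; the penalty value, the outcome coding, and the function names (fit_ridge, combine_scores) are hypothetical and are not taken from the study's pipeline, which may have used a different ridge implementation and tuning procedure.

```python
import numpy as np

def fit_ridge(X, y, alpha=1.0):
    """Fit ridge weights for candidate scores X (n samples x k scores).

    Scores are standardized with training-set means and SDs so that the
    penalty treats every candidate score on the same scale.
    """
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    mu, sd = X.mean(axis=0), X.std(axis=0)
    Xs = (X - mu) / sd
    intercept = y.mean()
    k = Xs.shape[1]
    # Closed-form ridge solution: (X'X + alpha*I)^-1 X'(y - mean(y))
    w = np.linalg.solve(Xs.T @ Xs + alpha * np.eye(k), Xs.T @ (y - intercept))
    return w, mu, sd, intercept

def combine_scores(X, w, mu, sd, intercept):
    """Apply training-set weights to new individuals' candidate scores."""
    return ((np.asarray(X, dtype=float) - mu) / sd) @ w + intercept
```

In this sketch the weights, means, and standard deviations are all estimated in the training split and then reused unchanged in the validation split, which keeps the held-out evaluation free of information from validation participants.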
We evaluated the performance of PRSxtra in the held-out validation cohort for predicting asthma (Ncases=9450, Ncontrols=64070), COPD (Ncases=4560, Ncontrols=66605), and lung cancer (Ncases=578, Ncontrols=70600) compared to a single ancestry- and trait-matched PRS (Figure 1, Supplementary Table 10). The validation cohort includes individuals of AFR, AMR, EAS, EUR, MID, and SAS genetic ancestries. In this multi-ancestry validation cohort, ancestry- and trait-matched PRS and PRSxtra were significantly correlated with each other (P<0.0001), with r=0.444, r=0.185, and r=0.120 for asthma, COPD, and lung cancer, respectively.
PRSxtra alone predicted asthma, COPD, and lung cancer more accurately than the trait- and ancestry-matched PRS alone (P<0.0001) in the multi-ancestry validation cohort (Supplementary Table 17-19). For asthma, the AUC improved from 0.543 (95% CI=[0.537, 0.549]) to 0.563 (95% CI=[0.557, 0.569]) using PRSxtra versus PRS. For COPD, AUC improved from 0.540 (95% CI=[0.532, 0.549]) to 0.589 (95% CI=[0.581, 0.597]). For lung cancer, AUC improved from 0.539 (95% CI=[0.516, 0.561]) to 0.592 (95% CI=[0.569, 0.616]) (Figure 3).
Among individuals with each disease, the median percentile of PRSxtra was much higher than PRS (Supplementary Table 20). We observed the largest difference between scores in lung cancer, where cases had a median PRSxtra in the 64th percentile compared to the 57th percentile for PRS. Controls had a median PRSxtra and PRS both in the 50th percentile. The AMR population had the largest improvement in disease prediction between PRS and PRSxtra. For asthma, AUC improved from 0.509 (95% CI=[0.493, 0.524]) to 0.630 (95% CI=[0.616, 0.645]). For COPD, AUC improved from 0.540 (95% CI=[0.532, 0.549]) to 0.589 (95% CI=[0.581, 0.597]). These differences could reflect the increased EAS GWAS power for spirometry and the relatively low genetic divergence between EAS and AMR groups. However, these new ancestry-specific GWAS do not explain the most phenotypic variation when predicting these traits. Specifically, we further assessed the robustness of PRSxtra by conducting leave-one-out analyses for predicting COPD and asthma in the AMR cohort, where there were the largest improvements in prediction. Removing any single score did not significantly change the prediction ability of PRSxtra. For example, removing the EUR lung cancer candidate PRS resulted in the largest decrease in AUC of -0.0175 for predicting COPD, and removing EUR FVC candidate PRS resulted in the largest decrease in AUC of -0.000421 for predicting asthma.
PRSxtra improved overall stratification in the multi-ancestry validation cohort compared to PRS. In the bottom and top deciles of PRSxtra, asthma prevalence was 8.4% and 16.8% (+8.4%), respectively, versus 9.0% and 14.1% (+5.1%) for PRS. For COPD, the prevalence in the bottom and top deciles of PRSxtra was 2.8% and 9.1% (+6.3%), respectively, compared to 4.6% and 7.6% for PRS (+3.0%). Lung cancer prevalence in the bottom and top deciles of PRSxtra was 0.5% and 1.4% (+0.9%), respectively, compared to 0.46% and 0.90% (+0.44%) for PRS (Supplementary Figure 14, Supplementary Table 21-23). PRSxtra was also better at identifying individuals with the highest risk of disease. For example, individuals in the top decile of PRSxtra had 3.5 times the odds of COPD compared to those in the first decile, while those in the top decile of PRS had 1.7 times the odds. The improvement was most apparent in the AMR population for asthma and COPD. In this population, individuals in the top decile of PRSxtra had 4-fold odds of asthma and COPD compared to the first decile, whereas the top decile of PRS had almost the same odds of asthma and COPD as the first decile (Supplementary Figure 14, Supplementary Table 24-26).
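The relationship between decile prevalence and the reported odds can be checked with a short calculation. This is an unadjusted approximation from the rounded COPD prevalences given above; the reported odds ratios may come from covariate-adjusted models, so exact agreement is not guaranteed. The function names are illustrative.

```python
def odds(prevalence):
    """Convert a prevalence (proportion of cases) into odds."""
    return prevalence / (1.0 - prevalence)

def decile_odds_ratio(prev_top, prev_bottom):
    """Unadjusted odds ratio comparing the top decile with the bottom decile."""
    return odds(prev_top) / odds(prev_bottom)

# COPD prevalences in the top and bottom deciles, as reported in the text
copd_prsxtra = decile_odds_ratio(0.091, 0.028)  # PRSxtra: 9.1% vs 2.8%
copd_prs = decile_odds_ratio(0.076, 0.046)      # PRS: 7.6% vs 4.6%
print(round(copd_prsxtra, 1), round(copd_prs, 1))  # prints 3.5 1.7
```

The unadjusted values are consistent with the 3.5-fold and 1.7-fold odds stated for COPD.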
We compared the performance of PRS and PRSxtra alone to a joint model of the strongest clinical risk factors for each disease, i.e. family history of asthma, and smoking for COPD and lung cancer. Clinical risk factors generally provided the largest improvement in performance from sex and age. In the multivariable model, PRSxtra remained significantly better than PRS for predicting asthma in the multi-ancestry validation cohort and for predicting asthma and COPD in the AMR subgroup. For lung cancer, the risk scores do not add additional information to the multivariable model of sex, age, and smoking status (Figure 4, Supplementary Table 17-19).
To investigate performance in clinical subgroups, we tested the association of PRSxtra with COPD and lung cancer in subgroups of never and ever smokers, and with asthma in individuals with and without family history. We found similar performance between subgroups (Supplementary Table 17-19, Supplementary Figure 15-17). PRSxtra was positively associated with COPD and asthma exacerbations. PRSxtra predicted COPD exacerbation significantly better than PRS in the validation cohort, with AUC increasing from 0.572 (95% CI=[0.558, 0.586]) to 0.600 (95% CI=[0.587, 0.614]). However, PRSxtra did not significantly improve prediction of asthma exacerbation compared to PRS (Supplementary Table 27).
Discussion
In this study, we conducted the most powerful multi-trait and multi-ancestry genetic analysis of respiratory diseases and comorbid traits to date. Our novel study framework contrasts with traditional genetic studies which typically analyze a single trait within a single population, thereby overlooking the genetic correlations between traits and LD patterns across diverse populations.
Our innovative approach, which iteratively models genetic correlations across traits and ancestry groups, significantly increases the power to detect novel genetic associations and enhances disease prediction and risk stratification, especially in non-European ancestry populations. Applying MTAG resulted in a greater number of significant associations and loci discovered across traits, with the largest gains in AMR and EUR ancestry-specific analyses. We also conducted the largest meta-analysis of lung function to date in EAS to identify 44 novel loci associated with lung function. Of these, several are associated with lipid levels such as high-density lipoprotein, total cholesterol level and triglyceride levels33, as well as with diseases such as type 2 diabetes and Crohn’s Disease, which supports previously documented relationships between lung dysfunction and disease risk and progression 38–40.
Given that asthma, COPD, lung cancer, lung function, and smoking traits are genetically correlated with each other, we further investigated the distinct and shared variants among the respiratory diseases and auxiliary traits. We found variants with predominant effects on lung cancer across EAS, EUR, and the multi-ancestry meta-analysis GWAS results, with 7 variants from TERT appearing in all three groups and 6 variants significant only in the meta-analysis results, including variants from DNAJB4 and CHRNA2. All three genes have been shown to have a role in respiratory function: 1) TERT mitigates oxidative stress and chronic inflammation, which are essential factors in the disease progression of COPD. It also prevents premature cellular aging and apoptosis in lung epithelial cells to sustain respiratory function. Polymorphisms in TERT have previously been associated with pulmonary fibrosis41 as well as COPD risk in an EAS population42. There is also evidence of strong gene-environment interaction between smoking and telomerase mutations, where carriers who do not smoke predominantly develop pulmonary fibrosis, while smokers are also at risk for developing emphysema (alone or in combination with pulmonary fibrosis)41. 2) CHRNA2, which encodes a subunit of nicotinic acetylcholine receptors, plays a role in respiratory behavior and function by influencing nicotine addiction and smoking habits43,44, but has a smaller effect than the CHRNA3-CHRNA5 cluster. While CHRNA2 affects respiratory function by modulating the neural pathways involved in nicotine response, the CHRNA3-CHRNA5 cluster has a stronger and more consistently observed association with Cig/day45,46. 3) DNAJB4, a part of the Hsp40 heat shock protein family, plays a crucial role in protein folding and cellular stress response. This protein helps prevent misfolded protein aggregation and is upregulated under oxidative stress, which is common in respiratory failure. DNAJB4 enhances protein quality control and stress response, potentially reducing disease severity and progression by maintaining lung function under chronic stress47.
In our investigation of pleiotropic effects across pairs of traits, we applied linemodels to dissect the pleiotropic effect of variants. Using this method, we identified unique genetic features of lung cancer and COPD not mediated by one of their leading risk factors - smoking. The ‘linemodels’ algorithm, tailored for pleiotropy dissection, uses a Bayesian framework to probabilistically cluster genetic variants based on their linear relationships in effect sizes across two outcomes. This method optimizes model parameters using the EM algorithm and Gibbs sampling, accommodating correlated estimators due to sample overlaps. In contrast, techniques like Non-negative Matrix Factorization (NMF) decompose data into non-negative factors to capture latent structures without modeling direct linear relationships, highlighting the unique focus of linemodels on linear effect size clustering in genetic studies.
We then developed a cross-trait and -ancestry score, PRSxtra, to model the genetic correlations between related traits and ancestry-specific LD and allele frequency patterns between populations. While PRSxtra and the trait- and ancestry-matched PRS are significantly correlated, PRSxtra demonstrates significantly better prediction and stratification for respiratory diseases and disease exacerbations. Previous studies have shown the utility of integrating large and diverse sources of data to construct PRS. For example, GPSmult, calculated as a weighted sum of multiple cardiometabolic trait components, predicts heart disease better than the single-trait PRS17. Similarly, PRS derived from two spirometry measurements predict asthma and COPD11. Recent developments such as PRSmix and BridgePRS jointly model summary information from multiple traits and ancestry groups at the score level but do not account for genetic correlations across traits or model differences in LD across ancestries. PRSxtra, however, shares information across traits and ancestries at the SNP level, refining putatively causal loci and improving PRS accuracy. We demonstrated the robustness of PRSxtra in a leave-one-out analysis, which showed minimal changes in predictive ability when candidate PRS were excluded. The enhanced predictive capacity of PRSxtra was particularly pronounced in the AMR population, and this improvement is not primarily explained by the candidate PRS from the new EAS spirometry GWAS. One potential explanation is the phenotypic heterogeneity of asthma and COPD among Hispanic populations48–50. For example, asthma and COPD are more prevalent in individuals of Puerto Rican heritage than among other Hispanics.
Our study does have limitations. We relied on phenotype definitions based on ICD codes and self-reported data, which can be imprecise. For example, definitions of COPD using common ICD-9 codes have been previously shown to misclassify patients compared to combining them with pharmacy data51. While we aimed to mitigate this effect by combining multiple ICD-9 and ICD-10 codes to define disease cases, it is unclear how adding spirometry would change disease definitions in AoU because spirometry measures are not available for all participants. We were also limited by the study design of existing GWAS. While traditional GWAS of FEV1, FVC, and FEV1/FVC are adjusted for smoking status, our meta-analysis of lung function in EAS was not, owing to limited data availability. Our measures were also not explicitly post-bronchodilator lung function measures, although previous work demonstrated little impact of pre- versus post-bronchodilator definitions when predicting COPD52,53. Additionally, we used ridge regression, which assumes a linear association of the candidate PRS. It is unclear how non-linear methods of combining PRS would compare. Our study sample is shaped by the All of Us Research Program’s cohort creation process, which relies on partnerships with universities, research centers, volunteers, and community engagement54. While All of Us is diverse, there are demographic sampling biases, and the cohort may not be representative of the general population55. Our results demonstrate significant improvements in a held-out cohort of diverse AoU participants, but they should be validated in an independent cohort.
In summary, we conducted the largest multi-trait and multi-ancestry genetic analysis of respiratory diseases and auxiliary traits to discover numerous novel genetic signals. We propose PRSxtra as a method to model genetic correlation across traits and LD differences between ancestry groups, significantly improving disease prediction and stratification for asthma, COPD, and lung cancer. PRSxtra has the potential to reduce the disparities in risk stratification between populations for survival and outcome, as well as to advance more equitable and generalizable prediction models for respiratory diseases.
Methods
Meta-analysis of lung function across East Asian populations
We performed a fixed-effects meta-analysis with inverse variance weighting, as implemented in METAL v2001-03-25 software, for FEV1, FVC, and FEV1/FVC across two East Asian cohorts using published summary statistics from the Korean Cancer Prevention Study-II (KCPS2)24,25 and the Taiwanese Biobank (TWB)23,33. The details of their study design have been described previously. In summary, the Korean Cancer Prevention Study-II Biobank (KCPS2) is a prospective cohort study based in Korea of 153,950 subjects with genotype data and phenotype measurements between 2004 and 2013. The Taiwanese Biobank is a prospective cohort study of the Taiwanese population with 149,894 participants between the ages of 30 and 70 years at recruitment (as of April 2021). FEV1 was defined as the total air blown between zero and 1 second (measured in liters). FVC was defined as the vital capacity during forced expiration (measured in liters). In KCPS2, FEV1 and FVC were measured using pulmonary/metabolic systems Vmax 20, Carefusion, USA. For each trait, samples with measurements that were more than 6 standard deviations away from the sample average were excluded.
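For a single variant, the fixed-effects inverse-variance weighted estimate described above can be sketched as follows. This illustrates the standard formula only and is not METAL's code; the function name and the input values in the usage line are hypothetical.

```python
import math

def ivw_meta(betas, ses):
    """Fixed-effects inverse-variance weighted meta-analysis for one variant.

    betas: per-cohort effect estimates; ses: their standard errors.
    Returns the pooled effect, its standard error, z-score, and two-sided P.
    """
    weights = [1.0 / se ** 2 for se in ses]       # weight = 1 / variance
    total_weight = sum(weights)
    beta = sum(w * b for w, b in zip(weights, betas)) / total_weight
    se = math.sqrt(1.0 / total_weight)
    z = beta / se
    p = math.erfc(abs(z) / math.sqrt(2.0))        # two-sided normal P-value
    return beta, se, z, p

# Hypothetical effect estimates for one SNP in two cohorts
beta, se, z, p = ivw_meta([0.02, 0.03], [0.005, 0.01])
```

Cohorts with smaller standard errors receive larger weights, so the pooled estimate sits closer to the more precisely estimated cohort effect.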
Altogether, these cohorts had a total sample size of 132,200. In conducting our meta-analysis, we excluded genetic variants with minor allele frequency < 0.01. We used FUMA v. 1.5.2 to annotate and functionally map variants in the meta-analysis. Genome-wide significance was defined using a threshold of P<5×10⁻⁸. Genomic risk loci included variants correlated with the most significant variant at R2>0.6 using the 1000G Phase 3 EAS reference panel. Genome positions are reported in build GRCh37 (hg19) for index variants. A locus was designated as potentially novel if its index variant (the most significant variant in the locus) was at least 1 Mb from any previously discovered genome-wide significant variant associated with the trait; otherwise, it was designated as previously known. Previously discovered variants were compiled from Shrine et al.34.
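The distance-based novelty rule can be sketched as a simple check. The function name and the coordinates in the usage lines are hypothetical; the sketch assumes that variants on different chromosomes never count as nearby and that a distance of exactly 1 Mb satisfies "at least 1 Mb".

```python
def is_potentially_novel(index_variant, known_variants, window_bp=1_000_000):
    """Return True if no known variant lies within window_bp of the index variant.

    index_variant: (chromosome, position) of a locus's most significant variant.
    known_variants: iterable of (chromosome, position) for previously reported
    genome-wide significant variants for the same trait.
    """
    chrom, pos = index_variant
    return all(
        known_chrom != chrom or abs(known_pos - pos) >= window_bp
        for known_chrom, known_pos in known_variants
    )

# Hypothetical coordinates for illustration only
known = [("12", 5_000_000), ("3", 141_000_000)]
print(is_potentially_novel(("12", 5_400_000), known))  # prints False
print(is_potentially_novel(("12", 9_000_000), known))  # prints True
```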
Comparison of pleiotropic effect analysis
We used S-LDSC to estimate the heritability of, and genetic correlation between, phenotypes using summary statistics of EUR populations. We then applied the linemodels package (https://github.com/mjpirinen/linemodels) to the GWAS summary statistics of three respiratory diseases (asthma, COPD, and lung cancer) and the three featured environmental and genetic risk factors (smoking status, smoking heaviness (Cigday), and FEV1/FVC) across three ancestry groups (AFR, EAS, and EUR) plus the corresponding meta-analysis results. We focused on comparisons for 12 pairs of traits selected from the above. Three of these were between disease phenotypes (asthma and COPD, asthma and lung cancer, and COPD and lung cancer), and nine pairs were between a disease phenotype and smoking or lung function. We considered variants significantly associated with either one of the traits being compared (Supplementary Table 28). We classified the variants into three classes (two when there was no variant associated with both traits) based on their association patterns: associated with trait 1 only, trait 2 only, or both. The classes were represented by linear trends whose slopes were estimated using an EM algorithm. Conditioning on these classes, we ran the linemodels package on the GWAS effect sizes and standard errors of overlapping variants of the two traits, where we set the scale parameters determining the magnitude of effect sizes to 0.2, the correlation parameters determining the allowed deviation from the lines to 0.99 as default, and the slope parameters to the estimates from the previous EM step. The membership probabilities in the three classes were computed separately for each variant by assuming that the classes were equally probable a priori. We assumed no overlapping samples between the two GWASs being compared and set the correlation of their effect estimators to 0. Confident associations are defined as having a posterior probability above 0.95.
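The final classification step, with equal prior class probabilities and a 0.95 posterior threshold, can be sketched generically. This does not reproduce the linemodels likelihood: the per-class likelihood values are treated as given, and the function names, labels, and example numbers are hypothetical.

```python
def membership_probabilities(likelihoods, priors=None):
    """Posterior class probabilities for one variant via Bayes' rule.

    likelihoods: likelihood of the variant's effect estimates under each class.
    priors: prior class probabilities; defaults to equal priors.
    """
    n = len(likelihoods)
    if priors is None:
        priors = [1.0 / n] * n
    joint = [lik * prior for lik, prior in zip(likelihoods, priors)]
    total = sum(joint)
    return [j / total for j in joint]

def confident_class(posteriors, labels, threshold=0.95):
    """Return the label whose posterior exceeds the threshold, else None."""
    best = max(range(len(posteriors)), key=posteriors.__getitem__)
    return labels[best] if posteriors[best] > threshold else None

labels = ["trait 1 only", "trait 2 only", "both"]
post = membership_probabilities([0.98, 0.01, 0.01])  # hypothetical likelihoods
print(confident_class(post, labels))                 # prints trait 1 only
```

With equal priors, the posterior for each class is simply its likelihood divided by the sum of likelihoods across classes, so a variant is called confident only when one class dominates the others.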
Multi-trait analysis
We conducted multi-trait genome-wide association studies as implemented in MTAG v. 1.0.7 for each ancestry group (AFR, AMR, EAS, EUR) by combining the ancestry-specific GWAS summary statistics for GBMI asthma, GBMI COPD, GWMA lung cancer, the spirometry meta-analysis, and GSCAN smoking behaviors. MTAG performs a joint analysis of GWAS results from related traits to increase the number of genetic loci identified and the predictive power of polygenic scores. For each population, we used the ancestry-specific LD reference panel from gnomAD v2.1.1. For each ancestry-specific MTAG, we included traits with a mean χ² > 1.02.
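The trait inclusion rule can be sketched as follows, assuming the threshold refers to the mean χ² statistic across variants (the squared z-score averaged over the summary statistics). The function names and z-scores are hypothetical, and the sketch does not call MTAG itself.

```python
def mean_chi2(z_scores):
    """Mean chi-square statistic: average of squared z-scores across variants."""
    return sum(z * z for z in z_scores) / len(z_scores)

def traits_to_include(z_by_trait, threshold=1.02):
    """Return the names of traits whose mean chi-square exceeds the threshold."""
    return [name for name, z in z_by_trait.items() if mean_chi2(z) > threshold]

# Hypothetical z-scores for two traits in one ancestry group.
toy = {
    "trait_with_signal": [1.0, -1.0, 1.5],
    "trait_without_signal": [0.5, -1.0, 1.2],
}
included = traits_to_include(toy)
```

Only `trait_with_signal` is retained here, since its mean χ² is above 1.02 while the other falls below it.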
Study population
The All of Us Research Program is a longitudinal cohort study that has continuously enrolled US adults 18 years or older since May 2017. The program aims to enroll one million or more US participants and places a strong emphasis on including diverse populations that have traditionally been underrepresented in biomedical research. Details of the All of Us cohort have been previously described54. In summary, participants of the program opt to provide self-reported data, linked health record data, and biospecimen data to be made available for research uses. The program’s primary objective is to build a resource to help researchers understand individual differences in biological, clinical, social, and environmental determinants of health and disease to advance precision health care.
Informed consent for all participants in the All of Us Research Program is obtained in person or through an eConsent platform. The protocol was reviewed by the Institutional Review Board (IRB) of the All of Us Research Program. Data can be accessed through the All of Us Researcher Workbench, a secure cloud-based analytic platform. Whole genome sequencing, genotyping array variant data, variant annotations, computed ancestry, and quality reports are accessible through the Controlled Tier of All of Us (AoU). This project is registered in the All of Us program under the workspace name “PRSxtra AoU”. In our analysis, we included individuals with whole genome data in the v7 Data Release, self-reported sex, and date of birth, along with additional disease-dependent filtering criteria. For COPD and lung cancer, we excluded individuals who did not self-report smoking status. For COPD, we additionally excluded individuals with a homozygous polymorphism of SERPINA1, which encodes the serine protease inhibitor alpha-1 antitrypsin, as this is a known risk allele associated with COPD56. Participants were randomly split into a training set (70%) and a validation set (30%).
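A random 70/30 split of this kind can be sketched in a few lines. The seed, the function name, and the integer participant identifiers are assumptions for illustration; the text does not state how the split was seeded or implemented.

```python
import random

def split_train_validation(ids, train_fraction=0.7, seed=0):
    """Shuffle participant IDs reproducibly and split into training and validation sets."""
    shuffled = list(ids)
    random.Random(seed).shuffle(shuffled)
    cut = round(train_fraction * len(shuffled))
    return shuffled[:cut], shuffled[cut:]

participant_ids = range(1000)  # hypothetical identifiers
train_ids, validation_ids = split_train_validation(participant_ids)
```

Fixing the seed keeps the held-out set identical across reruns, which matters when the validation cohort is reused to compare several scores.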
Phenotype ascertainment
We curated clinical phenotypes from All of Us using a combination of electronic health record data and/or self-reported personal history data from the All of Us v7 Data Release. ICD codes for each phenotype and exacerbation are detailed in Supplementary Tables 29-33.
We defined smoking status and family history based on self-reported data. Previous smokers were individuals who had smoked more than 100 cigarettes in their lifetime but did not currently smoke. Individuals who had never smoked more than 100 cigarettes were considered never smokers. A positive family history was defined as a mother, father, or sibling with a record of the same disease as the participant. Participants who did not explicitly self-report a family history were classified as having no family history.
PRSxtra construction
We constructed PRSxtra in a three-layer process. In layer 1, we performed multi-trait meta-analysis across related traits for each ancestry population, as described above. In layer 2, we used PRS-CSx, which leverages linkage disequilibrium across discovery samples to jointly model the genetic effects across populations via a shared continuous shrinkage prior. We ran PRS-CSx with default parameters on the AFR, AMR, EAS, and EUR ancestry-specific meta-analysis results. Only HapMap3 variants (a set of 3 million variants compiled by the International HapMap Project that captures common patterns of variation in a variety of human populations) were included in calculating scores. In layer 3, we used ridge regression, as implemented in the “glmnet” R package57, to jointly model the 39 standardized PRS (with mean 0 and variance 1) generated in layer 2 and construct an ancestry-specific PRSxtra for each of the three disease phenotypes: COPD, asthma, and lung cancer. For the ridge regression, we used 10-fold cross-validation and the lambda value that minimized the cross-validated error to estimate the weight of each PRS. PRSxtra was validated in the held-out multi-ancestry cohort from AoU.
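Layer 3 can be illustrated with a small sketch: standardize each component PRS, fit a ridge-penalized model of disease status on the standardized scores, and take the weighted sum as the combined score. The sketch below is written in Python with plain gradient descent rather than glmnet, assumes a logistic model for the binary outcome, uses three simulated scores instead of 39, and fixes lambda by hand instead of selecting it by 10-fold cross-validation. All names and data are hypothetical.

```python
import math
import random

def standardize(column):
    """Scale a list of scores to mean 0 and variance 1."""
    n = len(column)
    mean = sum(column) / n
    sd = math.sqrt(sum((x - mean) ** 2 for x in column) / n)
    return [(x - mean) / sd for x in column]

def fit_ridge_logistic(X, y, lam, lr=0.1, n_iter=500):
    """Minimize mean log-loss + (lam / 2) * ||w||^2; the intercept is not penalized."""
    n, p = len(X), len(X[0])
    w, b = [0.0] * p, 0.0
    for _ in range(n_iter):
        grad_w, grad_b = [0.0] * p, 0.0
        for xi, yi in zip(X, y):
            z = b + sum(wj * xj for wj, xj in zip(w, xi))
            err = 1.0 / (1.0 + math.exp(-z)) - yi
            grad_b += err / n
            for j in range(p):
                grad_w[j] += err * xi[j] / n
        w = [wj - lr * (gj + lam * wj) for wj, gj in zip(w, grad_w)]
        b -= lr * grad_b
    return w, b

rng = random.Random(0)
raw = [[rng.gauss(0, 1) for _ in range(200)] for _ in range(3)]  # 3 toy PRS
cols = [standardize(c) for c in raw]
X = [list(row) for row in zip(*cols)]
# Simulated disease status driven by the first two scores.
y = [1 if rng.random() < 1 / (1 + math.exp(-(0.8 * r[0] + 0.4 * r[1]))) else 0 for r in X]

w_weak, _ = fit_ridge_logistic(X, y, lam=0.01)
w_strong, _ = fit_ridge_logistic(X, y, lam=1.0)
combined_score = [sum(wj * xj for wj, xj in zip(w_weak, xi)) for xi in X]
```

A larger lambda shrinks all weights toward zero without dropping any component score, which is why ridge suits a panel of correlated PRS.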
PRS construction
As a baseline comparison to PRSxtra, we derived trait- and ancestry-matched PRS using PRS-CS, which uses a Bayesian regression framework to infer posterior effect sizes of SNPs. We ran PRS-CS with default parameters on the trait- and ancestry-specific GWAS summary statistics (Supplementary Table).
Statistical analysis
For COPD, asthma, and lung cancer, we placed individuals into bins by their PRS and PRSxtra deciles. In each decile, we calculated the prevalence of disease and disease exacerbation. We calculated the risk of disease and exacerbation for each decile compared to the lowest decile of PRS and PRSxtra using logistic regression models. We evaluated the performance of predicting diseases based on PRS and PRSxtra alone, as well as in a joint multivariable model with covariates. The baseline model included age and sex. We then added clinical risk factors (smoking, family history) and PRS or PRSxtra. We evaluated the predictive performance of each model using the area under the receiver operating characteristic curve (AUC). In the full population, we used the trait- and EUR-specific PRS as the baseline for comparison. All statistical analyses were two-sided and performed with the use of R software, version 3.5 (R Project for Statistical Computing).
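The decile comparison and the AUC can be sketched with a toy cohort. The odds ratio below is unadjusted and computed from counts, which matches a logistic regression with a single decile indicator and no covariates; the covariate-adjusted models described above are not reproduced. The data and function names are hypothetical.

```python
def decile_bins(scores):
    """Assign each score a decile from 1 (lowest) to 10 (highest) by rank."""
    order = sorted(range(len(scores)), key=lambda i: scores[i])
    bins = [0] * len(scores)
    for rank, i in enumerate(order):
        bins[i] = rank * 10 // len(scores) + 1
    return bins

def odds_ratio_vs_lowest(bins, cases, decile):
    """Unadjusted odds ratio of disease in `decile` relative to decile 1."""
    def odds(d):
        n_case = sum(c for b, c in zip(bins, cases) if b == d)
        n_ctrl = sum(1 - c for b, c in zip(bins, cases) if b == d)
        return n_case / n_ctrl
    return odds(decile) / odds(1)

def auc(scores, cases):
    """Probability that a random case outscores a random control (ties count 0.5)."""
    pos = [s for s, c in zip(scores, cases) if c == 1]
    neg = [s for s, c in zip(scores, cases) if c == 0]
    wins = sum((p > q) + 0.5 * (p == q) for p in pos for q in neg)
    return wins / (len(pos) * len(neg))

# Toy cohort of 100 people: 2 cases per decile in the lower half, 6 in the upper half.
scores = list(range(100))
cases = [1 if (i % 10) < (2 if i < 50 else 6) else 0 for i in scores]
bins = decile_bins(scores)
top_vs_bottom = odds_ratio_vs_lowest(bins, cases, 10)
toy_auc = auc(scores, cases)
```

In this toy cohort the top decile has odds of 6/4 and the bottom decile 2/8, so `top_vs_bottom` is 6.0. A decile with no controls would raise a division error, so real data would need a guard for that case.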
Data Availability
All data produced in the present study are available upon reasonable request to the authors.
Contributions
Y.H., M.H.C., and A.R.M. designed the study. Y.H. and W.L. processed, analyzed, and conducted statistical analysis of the data. Y.H.J., Y.W., and K.T. provided methodological and statistical advice. Y.H., W.L., D.C.Q., J.A.D., M.M., M.H.C., and A.R.M. interpreted the data. A.R.M. and Y.H. obtained funding. All authors provided critical feedback and revisions for the manuscript.
Funding
This work was supported by the National Institutes of Health under award numbers T32HG010464, K99/R00MH117229, and U01HG011719. We are grateful to the volunteers who participated in the All of Us Research Program.
Declaration of Interest
MHC has received grant support from GSK, consulting fees from Genentech and AstraZeneca, and speaking fees from Illumina. ARM has received speaker fees from Novartis. MM has received consulting fees from TheaHealth, 2ndMD, Axon Advisors, Verona Pharma, and Sanofi. All other authors declare no competing interests.