Abstract
We present a systematic assessment of polygenic risk score (PRS) prediction across more than 1,500 traits using genetic and phenotype data in the UK Biobank. We report 813 sparse PRS models with significant (p < 2.5 × 10⁻⁵) incremental predictive performance when compared against the covariate-only model that considers age, sex, types of genotyping arrays, and the principal component loadings of genotypes. We report a significant correlation between the number of genetic variants selected in the sparse PRS model and the incremental predictive performance (Spearman’s ρ = 0.61, p = 2.2 × 10⁻⁵⁹ for quantitative traits, ρ = 0.21, p = 9.6 × 10⁻⁴ for binary traits). The sparse PRS model trained on European individuals showed limited transferability when evaluated on non-European individuals in the UK Biobank. We provide the PRS model weights on the Global Biobank Engine (https://biobankengine.stanford.edu/prs).
Author summary
Polygenic risk scores (PRSs), an approach to estimating genetic predisposition to disease by aggregating the effects of multiple genetic variants, have attracted increasing research interest. While the predictive performance of PRSs has improved for some traits, their applicability across a wide range of human traits has remained unclear. Here, applying penalized regression with the Batch Screening Iterative Lasso (BASIL) algorithm to more than 269,000 individuals of white British ancestry in the UK Biobank, we systematically characterize PRS models across more than 1,500 traits. We report 813 traits with PRS models of statistically significant predictive performance. While statistical significance does not necessarily translate into clinical relevance, we investigate the properties of the 813 significant PRS models and report a significant correlation between predictive performance and estimated SNP-based heritability. We find that the number of genetic variants selected in our sparse PRS models is significantly correlated with the incremental predictive performance for both quantitative and binary traits. Our transferability assessment in the UK Biobank revealed that the sparse PRS models trained on individuals of European ancestry had lower predictive performance for individuals in the African and Asian ancestry groups.
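As a rough illustration of the incremental predictive performance described above, the following Python sketch compares a covariate-only linear model with a model that adds a PRS and reports the difference in R². It is not the authors' pipeline: the simulated data, the function names `r_squared` and `incremental_r2`, and the in-sample ordinary least-squares fit are all assumptions made for illustration, with random columns standing in for covariates such as age, sex, and genotype principal components.

```python
import numpy as np


def r_squared(X, y):
    """R^2 of an ordinary least-squares fit of y on X (intercept added)."""
    design = np.column_stack([np.ones(len(y)), X])
    beta, *_ = np.linalg.lstsq(design, y, rcond=None)
    resid = y - design @ beta
    return 1.0 - resid.var() / y.var()


def incremental_r2(covariates, prs, y):
    """R^2 of (covariates + PRS) minus R^2 of the covariate-only model."""
    r2_cov = r_squared(covariates, y)
    r2_full = r_squared(np.column_stack([covariates, prs]), y)
    return r2_full - r2_cov


# Simulated toy data (hypothetical; not UK Biobank data).
rng = np.random.default_rng(0)
n = 2000
covariates = rng.normal(size=(n, 3))  # stand-ins for age, sex, PCs
prs = rng.normal(size=n)              # stand-in for a polygenic score
y = covariates @ np.array([0.5, -0.3, 0.2]) + 0.4 * prs + rng.normal(size=n)

delta = incremental_r2(covariates, prs, y)
print(f"incremental R^2: {delta:.3f}")
```

In practice the two models would be compared on individuals held out from model fitting; the in-sample comparison here only keeps the sketch short.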
Competing Interest Statement
M.A.R. is a consultant at MazeTx and is currently on leave at HiBio. The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript.
Funding Statement
This work has been supported by the Funai Foundation for Information Technology [to Y.T.]; Stanford University School of Medicine [to Y.T., R.L., and M.A.R.]; National Human Genome Research Institute (NHGRI) of the National Institutes of Health (NIH) [R01HG010140 to M.A.R.]; NIH Center for Multi and Trans-ethnic Mapping of Mendelian and Complex Diseases [5U01 HG009080 to M.A.R.]; NIH [5R01 EB 001988-21 to T.H., 5R01 EB001988-16 to R.T.]; and National Science Foundation (NSF) [DMS-1407548 to T.H., DMS1208164 to R.T.]. The authors of this manuscript have received the following salary support: NHGRI of NIH [R01HG010140 to Y.T. and M.A.R., R01HG008155 to Y.T.], NIH [5U01 HG009080 to M.A.R.], and the National Institute on Aging of NIH [R01AG067151 to Y.T.]. The content is solely the responsibility of the authors and does not necessarily represent the official views of the funding agencies; funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript.
Author Declarations
I confirm all relevant ethical guidelines have been followed, and any necessary IRB and/or ethics committee approvals have been obtained.
Yes
The details of the IRB/oversight body that provided approval or exemption for the research described are given below:
Based on the information provided in Protocol 44532, the Stanford IRB has determined that the research does not involve human subjects as defined in 45 CFR 46.102(f) or 21 CFR 50.3(g).
I confirm that all necessary patient/participant consent has been obtained and the appropriate institutional forms have been archived, and that any patient/participant/sample identifiers included were not known to anyone (e.g., hospital staff, patients or participants themselves) outside the research group so cannot be used to identify individuals.
Yes
I understand that all clinical trials and any other prospective interventional studies must be registered with an ICMJE-approved registry, such as ClinicalTrials.gov. I confirm that any such study reported in the manuscript has been registered and the trial registration ID is provided (note: if posting a prospective study registered retrospectively, please provide a statement in the trial ID field explaining why the study was not registered in advance).
Yes
I have followed all appropriate research reporting guidelines and uploaded the relevant EQUATOR Network research reporting checklist(s) and other pertinent material as supplementary files, if applicable.
Yes
Footnotes
We revised the manuscript based on feedback from colleagues. The three major changes in this revision are as follows. 1) We removed the sentences that inappropriately mentioned genetic architecture in binary traits and instead clarified the difference in power between quantitative and binary traits. 2) Given concerns about the lack of a theoretical basis for using incremental ROC-AUC to assess linear relationships (in the analyses of estimated SNP-based heritability and of transferability), we now use Nagelkerke's pseudo-R² as the primary evaluation metric of predictive performance for binary traits. 3) Because we changed the evaluation metric for binary traits, we now observe a significant rank-based correlation between the effect size (incremental Nagelkerke's pseudo-R²) and the model size (number of genetic variants with non-zero coefficients) of the sparse PRS models.
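For reference, Nagelkerke's pseudo-R² rescales the Cox and Snell statistic by its maximum attainable value, and an incremental value can be taken as the difference between a full model and a covariate-only model. The sketch below assumes that definition; the helper names and the toy probabilities are hypothetical and are not taken from the study.

```python
import math


def bernoulli_loglik(y, p):
    """Log-likelihood of binary outcomes y under predicted probabilities p."""
    return sum(math.log(pi) if yi == 1 else math.log(1.0 - pi)
               for yi, pi in zip(y, p))


def nagelkerke_r2(y, p):
    """Nagelkerke's pseudo-R^2 relative to an intercept-only (prevalence) model."""
    n = len(y)
    prevalence = sum(y) / n
    ll_null = bernoulli_loglik(y, [prevalence] * n)
    ll_model = bernoulli_loglik(y, p)
    cox_snell = 1.0 - math.exp(2.0 * (ll_null - ll_model) / n)
    max_cox_snell = 1.0 - math.exp(2.0 * ll_null / n)
    return cox_snell / max_cox_snell


# Toy case/control labels and made-up predicted probabilities (illustrative only).
y = [0, 0, 0, 1, 1, 1, 0, 1]
p_cov = [0.4, 0.4, 0.4, 0.6, 0.6, 0.6, 0.4, 0.6]   # covariate-only model
p_full = [0.2, 0.2, 0.2, 0.8, 0.8, 0.8, 0.2, 0.8]  # covariates + PRS

r2_cov = nagelkerke_r2(y, p_cov)
r2_full = nagelkerke_r2(y, p_full)
incremental = r2_full - r2_cov
print(f"incremental Nagelkerke pseudo-R^2: {incremental:.3f}")
```

The probabilities here are chosen by hand; with real data they would come from fitted logistic regression models evaluated on the same individuals.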
Data Availability
The sparse PRS model weights generated from this study are available on the Global Biobank Engine (https://biobankengine.stanford.edu/prs). The significant PRS models are also available at the PGS Catalog (https://www.pgscatalog.org/publication/PGP000244/ and https://www.pgscatalog.org/publication/PGP000128/; score IDs are listed in S1 Table). The BASIL algorithm, as implemented in the R snpnet package (available at https://github.com/rivas-lab/snpnet), was used in the PRS analysis. The analyses presented in this study were based on data accessed through the UK Biobank: https://www.ukbiobank.ac.uk.