ABSTRACT
The utility of polygenic risk score (PRS) models has not been comprehensively evaluated for childhood acute lymphoblastic leukemia (ALL), the most common type of cancer in children. Previous PRS models for ALL were based on significant loci observed in genome-wide association studies (GWAS), even though genomic PRS models have been shown to improve prediction performance for a number of complex diseases. In the United States, Latino (LAT) children have the highest risk of ALL, but the transferability of PRS models to LAT children has not been studied. In this study, we constructed and evaluated genomic PRS models based on either a non-Latino white (NLW) GWAS or a multi-ancestry GWAS. We found that the best PRS models performed similarly between held-out NLW and LAT samples (PseudoR2 = 0.086 ± 0.023 in NLW vs. 0.060 ± 0.020 in LAT), and that performance in LAT could be improved by performing GWAS in LAT-only (PseudoR2 = 0.116 ± 0.026) or multi-ancestry samples (PseudoR2 = 0.131 ± 0.025). However, the best genomic models currently do not have better prediction accuracy than a conventional model using all known ALL-associated loci in the literature (PseudoR2 = 0.166 ± 0.025), which includes loci from GWAS populations that we could not access to train genomic PRS models. Our results suggest that larger and more inclusive GWAS may be needed for genomic PRS to be useful for ALL. Moreover, the comparable performance between populations may suggest a more oligogenic architecture for ALL, where some large-effect loci may be shared between populations. Future PRS models that move away from the infinite causal loci assumption may further improve PRS for ALL.
Competing Interest Statement
The authors have declared no competing interest.
Funding Statement
This work was supported by research grants from the National Institutes of Health (R01CA155461, R01CA175737, R01ES009137, P42ES004705, P01ES018172, P42ES0470518, R24ES028524, and R01CA262263) and the Environmental Protection Agency (RD83451101), United States. C.W.K.C. is supported by R35GM142783 from the National Institute of General Medical Sciences (NIGMS).
Author Declarations
I confirm all relevant ethical guidelines have been followed, and any necessary IRB and/or ethics committee approvals have been obtained.
Yes
The details of the IRB/oversight body that provided approval or exemption for the research described are given below:
The study protocol was approved by the Institutional Review Boards at the California Health and Human Services Agency, University of California Berkeley, Yale University, and University of Southern California.
I confirm that all necessary patient/participant consent has been obtained and the appropriate institutional forms have been archived, and that any patient/participant/sample identifiers included were not known to anyone (e.g., hospital staff, patients or participants themselves) outside the research group so cannot be used to identify individuals.
Yes
I understand that all clinical trials and any other prospective interventional studies must be registered with an ICMJE-approved registry, such as ClinicalTrials.gov. I confirm that any such study reported in the manuscript has been registered and the trial registration ID is provided (note: if posting a prospective study registered retrospectively, please provide a statement in the trial ID field explaining why the study was not registered in advance).
Yes
I have followed all appropriate research reporting guidelines, such as any relevant EQUATOR Network research reporting checklist(s) and other pertinent material, if applicable.
Yes
Data Availability
Our data are derived from the California Biobank. Respectfully, we are unable to share raw, individual genetic data freely with other investigators, since the samples and the data are the property of the State of California. Should we be contacted by other investigators who would like to use the data, we will direct them to the California Department of Public Health Institutional Review Board to establish their own approved protocol to utilize the data, which can then be shared peer-to-peer. The State has provided guidance on data sharing, noted in the statement below: "California has determined that researchers requesting the use of California Biobank biospecimens for their studies will need to seek an exemption from NIH or other granting or funder requirements regarding the uploading of study results into an external bank or repository (including into the NIH dbGaP or other bank or repository). This applies to any uploading of genomic data and/or sharing of these biospecimens or individual data derived from these biospecimens. Such activities have been determined to violate the statutory scheme at California Health and Safety Code Section 124980 (j), 124991 (b), (g), (h) and 103850 (a) and (d), which protect the confidential nature of biospecimens and individual data derived from biospecimens. Investigators may agree to share aggregate data on SNP frequency and their associated p-values with other investigators and may upload such frequencies into repositories including the NIH dbGaP repository providing: a) the denominator from which the data is derived includes no fewer than 20,000 individuals; b) no cell count is for < 5 individuals; and c) no correlations or linkage probabilities between SNPs are provided." Since our dataset is derived from fewer than 20,000 subjects, we are not able to upload the data to dbGaP or another repository. All underlying numerical data used to create figures are available at https://doi.org/10.7910/DVN/FFBYRT.
All other datasets not derived from the California Biobank are available on dbGaP.