ABSTRACT
Innovative and easy-to-implement strategies are needed to improve the pathogenicity assessment of rare germline missense variants. Somatic cancer driver mutations identified through large-scale tumor sequencing studies often impact genes that are also associated with rare Mendelian disorders. The use of cancer mutation data to aid in the interpretation of germline missense variants, regardless of whether the gene is associated with a hereditary cancer predisposition syndrome or a non-cancer-related developmental disorder, has not been systematically assessed. We extracted putative cancer driver missense mutations from the Cancer Hotspots database and annotated them as germline variants, including their presence or absence and classification in ClinVar. We trained two supervised learning models (logistic regression and random forest) to predict the ClinVar classifications of germline missense variants using Cancer Hotspots data (training dataset). The performance of each model was evaluated with an independent test dataset generated in part by searching public and private genome-wide sequencing datasets from ∼1.5 million individuals. Of the 2,447 cancer mutations, 691 corresponding germline variants had been previously classified in ClinVar: 426 (61.6%) as likely pathogenic/pathogenic, 261 (37.8%) as uncertain significance, and 4 (0.6%) as likely benign/benign. Compared with all other germline missense variants in the same 216 genes, the odds ratio for a likely pathogenic/pathogenic classification in ClinVar was 28.3 (95% confidence interval: 24.2-33.1, p < 0.001). Both supervised learning models showed high correlation with pathogenicity assessments in the training dataset. When applied to the test dataset, the logistic regression and random forest models achieved high areas under the precision-recall curve of 0.847 and 0.829, respectively.
With the use of cancer and germline datasets and supervised learning techniques, our study shows that cancer mutation data can be leveraged to improve the interpretation of germline missense variation potentially causing rare Mendelian disorders.
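To make the odds ratio reported above concrete, the following is a minimal sketch of how an odds ratio and a Wald-type 95% confidence interval can be computed from a 2x2 table. The abstract does not state which interval method was used, and the comparison-group counts are not given here, so the function name `odds_ratio_wald_ci` and the counts in the usage lines are illustrative assumptions rather than the study's method or data.

```python
import math


def odds_ratio_wald_ci(a, b, c, d, z=1.96):
    """Odds ratio and Wald confidence interval for a 2x2 table.

    a: group of interest, outcome present (e.g. hotspot variants classified LP/P)
    b: group of interest, outcome absent
    c: comparison group, outcome present
    d: comparison group, outcome absent
    z: normal quantile for the interval (1.96 gives roughly 95%)
    """
    if min(a, b, c, d) <= 0:
        raise ValueError("all four cells must be positive for the Wald interval")
    odds_ratio = (a * d) / (b * c)
    # Standard error of log(OR) under the usual large-sample approximation.
    se_log_or = math.sqrt(1 / a + 1 / b + 1 / c + 1 / d)
    log_or = math.log(odds_ratio)
    lower = math.exp(log_or - z * se_log_or)
    upper = math.exp(log_or + z * se_log_or)
    return odds_ratio, lower, upper


# Toy counts chosen for illustration only; these are NOT the study's data.
or_, lo, hi = odds_ratio_wald_ci(20, 10, 5, 50)
print(f"OR = {or_:.1f} (95% CI: {lo:.1f}-{hi:.1f})")
```

The interval is built on the log scale and then exponentiated, which is why it is asymmetric around the odds ratio itself.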
AUTHOR SUMMARY
Our study introduces an approach to improve the interpretation of rare genetic variation, specifically missense variants that can alter proteins and cause disease. We found that published evidence from somatic cancer sequencing studies may be relevant to understanding the impact of the same variant in the context of rare inherited (Mendelian) disorders. Using widely available datasets, we noted that many cancer driver mutations have also been observed as rare germline variants associated with inherited disorders. This overlap led us to use machine learning techniques to assess how well cancer mutation data can predict the pathogenicity of germline variants. We trained machine learning models and tested them on a separate dataset curated by searching public and private genome-wide sequencing data from over a million participants. The models showed strong performance in identifying disease-causing variants. This study highlights that cancer mutation data can enhance the interpretation of rare missense variants, aiding in the diagnosis and understanding of rare diseases. Integrating this approach into current genetic classification frameworks could be beneficial and could open new avenues for leveraging existing cancer research to benefit broader genetic research and diagnostics for rare genetic conditions.
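The model performance described above is summarized in the abstract as area under the precision-recall curve. As a rough illustration of that metric, the sketch below computes average precision, one common estimator of that area. Whether the authors used this estimator is not stated, and the function name `average_precision` and the toy labels and scores are assumptions for illustration only.

```python
def average_precision(labels, scores):
    """Average precision for binary labels (1 = pathogenic, 0 = not).

    Variants are ranked by descending score; precision is recorded at the
    rank of each positive and averaged over all positives. Tied scores are
    kept in input order, which is a simplification.
    """
    if len(labels) != len(scores):
        raise ValueError("labels and scores must have the same length")
    ranked = sorted(zip(scores, labels), key=lambda pair: pair[0], reverse=True)
    n_pos = sum(1 for _, y in ranked if y == 1)
    if n_pos == 0:
        raise ValueError("at least one positive label is required")
    hits = 0
    total = 0.0
    for rank, (_, y) in enumerate(ranked, start=1):
        if y == 1:
            hits += 1
            total += hits / rank  # precision at this positive's rank
    return total / n_pos


# Toy example, not study data: three positives among five scored variants.
toy_labels = [1, 0, 1, 1, 0]
toy_scores = [0.9, 0.8, 0.7, 0.6, 0.2]
ap = average_precision(toy_labels, toy_scores)
print(f"average precision = {ap:.3f}")
```

Unlike ROC-based summaries, this metric depends on the proportion of positives in the dataset, so values are best compared between models evaluated on the same test set.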
Competing Interest Statement
SW is an employee of Genomics England Limited. MMM is an employee of GeneDx, LLC. The remaining authors have no potential conflicts of interest to declare.
Funding Statement
The study was funded by SickKids Research Institute, Canadian Institutes of Health Research, and the University of Toronto McLaughlin Centre. The funders had no role in the design and conduct of the study.
Author Declarations
I confirm all relevant ethical guidelines have been followed, and any necessary IRB and/or ethics committee approvals have been obtained.
Yes
The details of the IRB/oversight body that provided approval or exemption for the research described are given below:
The Research Ethics Board at the Hospital for Sick Children gave ethical approval for this secondary use data study. The Institutional Review Board of GeneDx gave ethical approval for the use of de-identified data from GeneDx, which was assessed in accordance with an IRB-approved protocol (WIRB #20171030).
I confirm that all necessary patient/participant consent has been obtained and the appropriate institutional forms have been archived, and that any patient/participant/sample identifiers included were not known to anyone (e.g., hospital staff, patients or participants themselves) outside the research group so cannot be used to identify individuals.
Yes
I understand that all clinical trials and any other prospective interventional studies must be registered with an ICMJE-approved registry, such as ClinicalTrials.gov. I confirm that any such study reported in the manuscript has been registered and the trial registration ID is provided (note: if posting a prospective study registered retrospectively, please provide a statement in the trial ID field explaining why the study was not registered in advance).
Yes
I have followed all appropriate research reporting guidelines, such as any relevant EQUATOR Network research reporting checklist(s) and other pertinent material, if applicable.
Yes
Footnotes
Revised background with added definitions; expanded Results section to include additional analyses on loss-of-function versus gain-of-function, and performance analysis with VEST4 and MutPred2; Results now include both ROC and precision-recall curves; Discussion section updated; New supplemental figures and tables.
AVAILABILITY OF DATA AND MATERIALS
The cancer mutation data from Cancer Hotspots that support the findings of this study are available through a public database at the following URL: https://www.cancerhotspots.org/ (DOI: 10.1038/nbt.3391). Germline variants and their classifications are available in the ClinVar public archive: https://www.ncbi.nlm.nih.gov/clinvar/ (DOI: https://doi.org/10.1093/nar/gkx1153). The Python script used to transform the Cancer Hotspots cancer mutation data is openly available in a GitHub repository: https://github.com/haqueb2/Cancer-Hotspots-Reformat. All variants in the training and test datasets, the LRM and RFM pathogenicity scores assigned to them, and the prediction scores generated by other in silico tools for the test dataset are available in Supplemental Table 6. R scripts used to train the supervised learning models can be found in Supplemental Appendices 1 and 2. Datasets from Genomics England (DOI: https://doi.org/10.6084/m9.figshare.4530893.v7), MSSNG (DOI: 10.1016/j.cell.2022.10.009), Care4Rare (DOI: 10.1016/j.ajhg.2022.10.002), and GeneDx are not openly available due to controlled access requirements. Access to these datasets can be requested from the respective organizations.