Leveraging cancer mutation data to predict the pathogenicity of germline missense variants

Bushra Haque; David Cheerie; Amy Pan; Meredith Curtis; Thomas Nalpathamkalam; Jimmy Nguyen; Celine Salhab; Bhooma Thiruvahindrapura; Jade Zhang; Madeline Couse; Taila Hartley; Michelle M. Morrow; E Magda Price; Susan Walker; David Malkin; Frederick P. Roth; Gregory Costain

doi:10.1101/2024.03.11.24304106

ABSTRACT

Innovative and easy-to-implement strategies are needed to improve the pathogenicity assessment of rare germline missense variants. Somatic cancer driver mutations identified through large-scale tumor sequencing studies often impact genes that are also associated with rare Mendelian disorders. The use of cancer mutation data to aid in the interpretation of germline missense variants, regardless of whether the gene is associated with a hereditary cancer predisposition syndrome or a non-cancer-related developmental disorder, has not been systematically assessed. We extracted putative cancer driver missense mutations from the Cancer Hotspots database and annotated them as germline variants, including presence/absence and classification in ClinVar. We trained two supervised learning models (logistic regression and random forest) to predict variant classifications of germline missense variants in ClinVar using Cancer Hotspot data (training dataset). The performance of each model was evaluated with an independent test dataset generated in part from searching public and private genome-wide sequencing datasets from ∼1.5 million individuals. Of the 2,447 cancer mutations, 691 corresponding germline variants had been previously classified in ClinVar: 426 (61.6%) as likely pathogenic/pathogenic, 261 (37.8%) as uncertain significance, and 4 (0.6%) as likely benign/benign. The odds ratio for a likely pathogenic/pathogenic classification in ClinVar was 28.3 (95% confidence interval: 24.2-33.1, p < 0.001), compared with all other germline missense variants in the same 216 genes. Both supervised learning models showed high correlation with pathogenicity assessments in the training dataset. There was high area under precision-recall curve values of 0.847 and 0.829 for logistic regression and random forest models, respectively, when applied to the test dataset. With the use of cancer and germline datasets and supervised learning techniques, our study shows that cancer mutation data can be leveraged to improve the interpretation of germline missense variation potentially causing rare Mendelian disorders.

AUTHOR SUMMARY Our study introduces an approach to improve the interpretation of rare genetic variation, specifically missense variants that can alter proteins and cause disease. We found that published evidence from somatic cancer sequencing studies may be relevant to understanding the impact of the same variant in the context of rare inherited (Mendelian) disorders. By using widely available datasets, we noted that many cancer driver mutations have also been observed as rare germline variants associated with inherited disorders. This intersection led us to employ machine learning techniques to assess how cancer mutation data can predict the pathogenicity of germline variants. We trained machine learning models and tested them on a separate dataset curated by searching public and private genome-wide sequencing data from over a million participants. Our models were able to successfully identify pathogenic genetic changes, demonstrating strong performance in predicting disease-causing variants. This study highlights that cancer mutation data can enhance the interpretation of rare missense variants, aiding in the diagnosis and understanding of rare diseases. Integrating this approach into current genetic classification frameworks could be beneficial, and opens new avenues for leveraging existing cancer research to benefit broader genetic research and diagnostics for rare genetic conditions.

Competing Interest Statement

SW is an employee of Genomics England Limited. MMM is an employee of GeneDx, LLC. The remaining authors have no potential conflicts of interest to declare.

Funding Statement

The study was funded by SickKids Research Institute, Canadian Institutes of Health Research, and the University of Toronto McLaughlin Centre. The funders had no role in the design and conduct of the study.

Author Declarations

I confirm all relevant ethical guidelines have been followed, and any necessary IRB and/or ethics committee approvals have been obtained.

Yes

The details of the IRB/oversight body that provided approval or exemption for the research described are given below:

The Research Ethics Board at the Hospital for Sick Children gave ethical approval for this secondary use data study. The Institutional Review Board of GeneDx gave ethical approval for the use of de-identified data from GeneDx and was assessed in accordance with an IRB-approved protocol (WIRB #20171030).

I confirm that all necessary patient/participant consent has been obtained and the appropriate institutional forms have been archived, and that any patient/participant/sample identifiers included were not known to anyone (e.g., hospital staff, patients or participants themselves) outside the research group so cannot be used to identify individuals.

Yes

I understand that all clinical trials and any other prospective interventional studies must be registered with an ICMJE-approved registry, such as ClinicalTrials.gov. I confirm that any such study reported in the manuscript has been registered and the trial registration ID is provided (note: if posting a prospective study registered retrospectively, please provide a statement in the trial ID field explaining why the study was not registered in advance).

Yes

I have followed all appropriate research reporting guidelines, such as any relevant EQUATOR Network research reporting checklist(s) and other pertinent material, if applicable.

Yes

LIST OF ABBREVIATIONS

The copyright holder for this preprint is the author/funder, who has granted medRxiv a license to display the preprint in perpetuity. It is made available under a CC-BY-NC-ND 4.0 International license.

bioRxiv and medRxiv thank the following for their generous financial support:

The Chan Zuckerberg Initiative, Cold Spring Harbor Laboratory, the Sergey Brin Family Foundation, California Institute of Technology, Centre National de la Recherche Scientifique, Fred Hutchinson Cancer Center, Imperial College London, Massachusetts Institute of Technology, Stanford University, University of Washington, and Vrije Universiteit Amsterdam.

Posted October 28, 2024.

Download PDF

Author Declarations

Supplementary Material

Data/Code

Revision Summary

Citation Tools

Get QR code

Tweet Widget

Subject Area

Genetic and Genomic Medicine

Reviews and Context

Comment

TRIP Peer Reviews

Community Reviews

Automated Services

Blogs/Media

Author Videos

Subject Areas

All Articles

Addiction Medicine (418)
Allergy and Immunology (740)
Anesthesia (217)
Cardiovascular Medicine (3175)
Dentistry and Oral Medicine (355)
Dermatology (268)
Emergency Medicine (469)
Endocrinology (including Diabetes Mellitus and Metabolic Disease) (1129)
Epidemiology (13147)
Forensic Medicine (17)
Gastroenterology (878)
Genetic and Genomic Medicine (4980)
Geriatric Medicine (458)
Health Economics (762)
Health Informatics (3134)
Health Policy (1115)
Health Systems and Quality Improvement (1157)
Hematology (418)
HIV/AIDS (988)
Infectious Diseases (except HIV/AIDS) (14451)
Intensive Care and Critical Care Medicine (897)
Medical Education (462)
Medical Ethics (121)
Nephrology (511)
Neurology (4727)
Nursing (252)
Nutrition (700)
Obstetrics and Gynecology (859)
Occupational and Environmental Health (773)
Oncology (2435)
Ophthalmology (692)
Orthopedics (272)
Otolaryngology (335)
Pain Medicine (315)
Palliative Medicine (88)
Pathology (523)
Pediatrics (1263)
Pharmacology and Therapeutics (535)
Primary Care Research (536)
Psychiatry and Clinical Psychology (4062)
Public and Global Health (7296)
Radiology and Imaging (1635)
Rehabilitation Medicine and Physical Therapy (974)
Respiratory Medicine (953)
Rheumatology (468)
Sexual and Reproductive Health (486)
Sports Medicine (409)
Surgery (527)
Toxicology (66)
Transplantation (226)
Urology (196)

Comments

medRxiv aims to provide a venue for anyone to comment on a medRxiv preprint. Comments are moderated for offensive or irrelevant content (this can take ~24 h). Please avoid duplicate submissions and read our Comment Policy before commenting. The content of a comment is not endorsed by medRxiv.

medRxiv aims to inform readers about online discussion of this preprint occurring elsewhere. The content at the links below is not endorsed by either medRxiv or the preprint's authors.

Community reviews for this article:

There are no community reviews for this paper.

Automated Evaluations

Certain services provide automated analysis of preprints. Analyses invited by the authors are displayed at the top of this tab. Those done independently of authors are shown underneath . None of these analyses is endorsed by medRxiv.

Automated Evaluations:

There are no automated evaluations for this paper.

Leveraging cancer mutation data to predict the pathogenicity of germline missense variants

ABSTRACT

Competing Interest Statement

Funding Statement

Author Declarations

Footnotes

AVAILABILITY OF DATA AND MATERIALS

LIST OF ABBREVIATIONS

Subject Area

Citation Manager Formats

Leveraging cancer mutation data to predict the pathogenicity of germline missense variants

ABSTRACT

Competing Interest Statement

Funding Statement

Author Declarations

Footnotes

AVAILABILITY OF DATA AND MATERIALS

LIST OF ABBREVIATIONS

Subject Area

Follow this preprint