Abstract
Around five percent of the population is affected by a rare disease, most often due to genetic variation. A genetic test is the quickest path to a diagnosis, yet most patients endure years of diagnostic odyssey before being tested, if they are tested at all. Identifying patients who are likely to have a genetic disease, and who therefore need genetic testing, is paramount to improving diagnosis and treatment. Although thousands of genetic diseases with specific phenotypic presentations have been described, a common feature among them is the presence of multiple rare phenotypes that often span organ systems. Here, we hypothesize that these patients can be identified from longitudinal clinical data in the electronic health record (EHR). We used diagnostic information from the EHRs of 2,286 patients who received a chromosomal microarray and 9,144 matched controls to train and test a prediction model. The model achieved high prediction accuracy in a held-out test sample (AUROC = 0.97, AUPR = 0.92) and in 172,265 hospital patients among whom cases were defined broadly as those interacting with a genetics provider (AUROC = 0.9, AUPR = 0.63). Among a subset of 6,445 genotyped patients, the 46 patients carrying a known pathogenic copy number variant (CNV) received high predicted probabilities (median = 0.97). Compared to the current nonsystematic approach, our model identified many more patients needing a genetic test while increasing the proportion with a putative genetic disease. Taken together, we demonstrate that phenotypic patterns representative of a genetic disease can be captured from EHR data and provide an opportunity to systematize decision making on genetic testing to speed up diagnosis, improve care, and reduce costs.
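For readers less familiar with the two metrics reported above, the following is a minimal sketch of how AUROC and AUPR can be computed from predicted probabilities. It is not the authors' code: the function names `auroc` and `average_precision` and the toy labels and scores are hypothetical, and AUPR is approximated here by average precision without special handling of tied scores. The only figures taken from the abstract are the case and control counts, which imply a 1:4 ratio and therefore a chance-level AUPR of 0.2 in the training and test data.

```python
from typing import Sequence


def auroc(labels: Sequence[int], scores: Sequence[float]) -> float:
    """Probability that a random case outscores a random control (ties count half)."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one case and one control")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


def average_precision(labels: Sequence[int], scores: Sequence[float]) -> float:
    """Mean of the precision values at the rank of each case (an estimate of AUPR)."""
    n_pos = sum(labels)
    if n_pos == 0:
        raise ValueError("need at least one case")
    ranked = sorted(zip(scores, labels), key=lambda pair: -pair[0])
    true_pos = 0
    total = 0.0
    for rank, (_, y) in enumerate(ranked, start=1):
        if y == 1:
            true_pos += 1
            total += true_pos / rank
    return total / n_pos


# Hypothetical toy data: 3 cases and 7 controls, not drawn from the study.
toy_labels = [1, 1, 0, 1, 0, 0, 0, 0, 0, 0]
toy_scores = [0.95, 0.80, 0.70, 0.60, 0.40, 0.30, 0.20, 0.15, 0.10, 0.05]

toy_auroc = auroc(toy_labels, toy_scores)
toy_aupr = average_precision(toy_labels, toy_scores)

# Chance-level AUPR equals the case prevalence: 2,286 cases and 9,144 controls.
chance_aupr = 2286 / (2286 + 9144)

print(f"toy AUROC = {toy_auroc:.3f}, toy AUPR = {toy_aupr:.3f}")
print(f"chance-level AUPR at a 1:4 case:control ratio = {chance_aupr:.2f}")
```

Because chance-level AUPR tracks prevalence, AUPR values are best compared against the case fraction of the cohort in which they were measured, whereas chance-level AUROC is 0.5 regardless of prevalence.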
Competing Interest Statement
The authors have declared no competing interest.
Funding Statement
This work was supported by R01MH111776 (DMR), R01MH113362 (NJC, DMR), R01LM010685 (LB), and U01HG009068 (NJC). This study makes use of data generated by the DECIPHER community. A full list of centres who contributed to the generation of the data is available from https://decipher.sanger.ac.uk and via email from decipher@sanger.ac.uk. Funding for the project was provided by Wellcome. The dataset(s) used for the analyses described were obtained from Vanderbilt University Medical Center's BioVU, which is supported by numerous sources: institutional funding, private agencies, and federal grants. These include the NIH-funded Shared Instrumentation Grant S10RR025141 and CTSA grants UL1TR002243, UL1TR000445, and UL1RR024975. Genomic data are also supported by investigator-led projects that include U01HG004798, R01NS032830, RC2GM092618, P50GM115305, U01HG006378, U19HL065962, and R01HD074711, and by additional funding sources listed at https://victr.vanderbilt.edu/pub/biovu/.
Author Declarations
I confirm all relevant ethical guidelines have been followed, and any necessary IRB and/or ethics committee approvals have been obtained.
Yes
The details of the IRB/oversight body that provided approval or exemption for the research described are given below:
Vanderbilt University Medical Center Institutional Review Board #200939
All necessary patient/participant consent has been obtained and the appropriate institutional forms have been archived.
Yes
I understand that all clinical trials and any other prospective interventional studies must be registered with an ICMJE-approved registry, such as ClinicalTrials.gov. I confirm that any such study reported in the manuscript has been registered and the trial registration ID is provided (note: if posting a prospective study registered retrospectively, please provide a statement in the trial ID field explaining why the study was not registered in advance).
Yes
I have followed all appropriate research reporting guidelines and uploaded the relevant EQUATOR Network research reporting checklist(s) and other pertinent material as supplementary files, if applicable.
Yes
Data Availability
Code and scripts are available at https://github.com/RuderferLab/chromosomalMicroarray