Abstract
Background Identifying patients who would benefit from whole genome sequencing (WGS) is difficult and time-consuming due to complex eligibility criteria, lack of neonatologist familiarity with WGS ordering, and evolving clinical features. In previous work, we showed that MPSE, the Mendelian Phenotype Search Engine, can provide automated prioritization of probands for WGS while maintaining current diagnostic rates. MPSE is now in use in multiple hospital networks, but questions still surround how to best prioritize patients for WGS.
Methods Here we use the clinical histories of 2,885 neonatal intensive care unit (NICU) admissions from two institutions to explore further questions about how best to prioritize NICU patients for WGS. First, we ask whether changes to the machine learning (ML) classifier and the clinical natural language processing (CNLP) tools used for generating patient phenotype descriptions might improve MPSE’s performance. Second, we explore the utility of using alternative data types as inputs to MPSE. Lastly, we conduct a longitudinal analysis of MPSE’s ability to identify probands for WGS.
Results Eight different ML classifiers, five CNLP tools, and four previously untested alternative data types were used to train and validate MPSE models. MPSE achieved high predictive performance across multiple classifiers (max AUC=0.93), CNLP tools (max AUC=0.91), and input data types (max AUC=0.91). Longitudinal analysis of MPSE scores revealed a significant separation between cases and controls, and between diagnostic and non-diagnostic cases, within 48 hours of NICU admission.
Conclusions MPSE provides a highly flexible and portable framework for automated prioritization of critically ill newborns for WGS. We find that MPSE’s performance is largely agnostic with respect to CNLP tools. Moreover, structured data such as ICD codes can serve as an effective alternative input to MPSE when access to clinical notes or CNLP pipelines is problematic. Finally, MPSE can identify children most likely to benefit from WGS within 48 hours of admission to the NICU, a critical window for maximally impactful care.
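As a reading aid for the AUC values reported above: AUC can be interpreted as the probability that a randomly chosen case receives a higher score than a randomly chosen control, with ties counted as one half. The sketch below computes it by direct pairwise comparison. It is a minimal illustration only, not MPSE's implementation; the function name `auc` and the example scores and labels are hypothetical and are not drawn from the study data.

```python
def auc(scores, labels):
    """Area under the ROC curve via pairwise comparison of cases and controls.

    scores: numeric model scores (higher = more likely to be a case)
    labels: 1 for a case, 0 for a control
    """
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one case and one control")

    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0   # case ranked above control
            elif p == n:
                wins += 0.5   # tie counts as half
    return wins / (len(pos) * len(neg))


# Hypothetical scores for three cases and four controls (illustrative only).
scores = [0.90, 0.80, 0.35, 0.60, 0.30, 0.20, 0.10]
labels = [1, 1, 1, 0, 0, 0, 0]
example_auc = auc(scores, labels)
print(round(example_auc, 3))
```

In this toy example, 11 of the 12 case-control pairs are ordered correctly, so the AUC is 11/12. The pairwise loop is quadratic in cohort size; for larger cohorts a rank-based computation gives the same value more efficiently.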
Competing Interest Statement
M.Y. is a co-founder and consultant for Fabric Genomics Inc. M.R. is a shareholder of Fabric Genomics Inc. E.F. is an employee of Fabric Genomics Inc. B.M. and J.H. have received consulting fees and stock grants from Fabric Genomics Inc. The remaining authors declare that they have no competing interests.
Funding Statement
The preparation of this manuscript was supported by a National Library of Medicine training grant (grant number T15LM007124), NIH grant UL1TR002550 from NCATS to E.J. Topol (with sub-award to Rady Children's Institute for Genomic Medicine), and the Warren Alpert Foundation. The Utah NeoSeq Project was funded by the Center for Genomic Medicine at the University of Utah Health, ARUP Laboratories, the Ben B. and Iris M. Margolis Foundation, the R. Harold Burton Foundation, and the Mark Miller Foundation. This work utilized resources and support from the Center for High Performance Computing at the University of Utah. The computational resources used were partially funded by the NIH Shared Instrumentation grant 1S10OD021644-01A1. The content is solely the responsibility of the authors and does not necessarily represent the official views of the National Library of Medicine or the National Institutes of Health.
Author Declarations
I confirm all relevant ethical guidelines have been followed, and any necessary IRB and/or ethics committee approvals have been obtained.
Yes
The details of the IRB/oversight body that provided approval or exemption for the research described are given below:
The need for Institutional Review Board approval at Rady Children's Hospital for the current study was waived, as all data used in this project had previously been generated as part of IRB-approved studies and none of the results reported in this manuscript can be used to identify individual patients. The studies from which cases were derived were previously approved by the Institutional Review Boards of Rady Children's Hospital. The University of Utah Institutional Review Board approved the use of human subjects for this research, under a waiver of the requirement to obtain informed consent.
I confirm that all necessary patient/participant consent has been obtained and the appropriate institutional forms have been archived, and that any patient/participant/sample identifiers included were not known to anyone (e.g., hospital staff, patients or participants themselves) outside the research group so cannot be used to identify individuals.
Yes
I understand that all clinical trials and any other prospective interventional studies must be registered with an ICMJE-approved registry, such as ClinicalTrials.gov. I confirm that any such study reported in the manuscript has been registered and the trial registration ID is provided (note: if posting a prospective study registered retrospectively, please provide a statement in the trial ID field explaining why the study was not registered in advance).
Yes
I have followed all appropriate research reporting guidelines, such as any relevant EQUATOR Network research reporting checklist(s) and other pertinent material, if applicable.
Yes
Data Availability
All data produced in the present study are available upon reasonable request to the authors. MPSE source code, pre-trained models, documentation, and synthetic datasets are available to the public on GitHub (https://github.com/Yandell-Lab/MPSE).