RT Journal Article
SR Electronic
T1 A systematic review on machine learning approaches in the diagnosis of rare genetic diseases
JF medRxiv
FD Cold Spring Harbor Laboratory Press
SP 2023.01.30.23285203
DO 10.1101/2023.01.30.23285203
A1 Roman-Naranjo, P
A1 Parra-Perez, AM
A1 Lopez-Escamez, JA
YR 2023
UL http://medrxiv.org/content/early/2023/01/31/2023.01.30.23285203.abstract
AB Background The diagnosis of rare genetic diseases is often challenging due to the complexity of the genetic underpinnings of these conditions and the limited availability of diagnostic tools. Machine learning (ML) algorithms have the potential to improve the accuracy and speed of diagnosis by analyzing large amounts of genomic data and identifying complex multiallelic patterns that may be associated with specific diseases. In this systematic review, we aimed to identify the methodological trends and the ML application areas in rare genetic diseases. Methods We performed a systematic review of the literature following the PRISMA guidelines to search for studies that used ML approaches to enhance the diagnosis of rare genetic diseases. Studies that used DNA-based sequencing data and a variety of ML algorithms were included, summarized, and analyzed using bibliometric methods, visualization tools, and a feature co-occurrence analysis. Findings Our search identified 22 studies that met the inclusion criteria. We found that exome sequencing was the most frequently used sequencing technology (59%), and rare neoplastic diseases were the most prevalent disease scenario (59%). In rare neoplasms, the most frequent applications of ML models were the differential diagnosis or stratification of patients (38.5%) and the identification of somatic mutations (30.8%). In other rare diseases, the most frequent goals were the prioritization of rare variants or genes (55.5%) and the identification of biallelic or digenic inheritance (33.3%). The most frequently employed method was the random forest algorithm (54.5%).
In addition, the features of the datasets needed for training these algorithms were distinctive depending on the goal pursued, including the mutational load in each gene for the differential diagnosis of patients, or the combination of genotype features and sequence-derived features (such as GC-content) for the identification of somatic mutations. Conclusions ML algorithms based on sequencing data are mainly used for the diagnosis of rare neoplastic diseases, with random forest being the most common approach. We identified key features in the datasets used for training these ML models according to the objective pursued. These features can support the development of future ML models in the diagnosis of rare genetic diseases. Competing Interest Statement The authors have declared no competing interest. Clinical Protocols https://www.crd.york.ac.uk/prospero/display_record.php?RecordID=360247 Funding Statement JALE has received funds from Instituto de Salud Carlos III (Grant# PI20-1126), CIBERER (Grant# PIT21_GCV21), Andalusian University, Research and Innovation Department (PY20-00303, EPIMEN), Andalusian Health Department (Grant# PI027-2020), Asociacion Sindrome de Meniere Espana (ASMES) and Meniere Society, UK. PRN is supported by PY20-00303 Grant (EPIMEN).
AMPP is a PhD student in the Biomedicine Program at Universidad de Granada and his salary was supported by Andalusian University, Research and Innovation Department (Grant# PREDOC2021/00343). Author Declarations I confirm all relevant ethical guidelines have been followed, and any necessary IRB and/or ethics committee approvals have been obtained. Yes. I confirm that all necessary patient/participant consent has been obtained and the appropriate institutional forms have been archived, and that any patient/participant/sample identifiers included were not known to anyone (e.g., hospital staff, patients or participants themselves) outside the research group so cannot be used to identify individuals. Yes. I understand that all clinical trials and any other prospective interventional studies must be registered with an ICMJE-approved registry, such as ClinicalTrials.gov. I confirm that any such study reported in the manuscript has been registered and the trial registration ID is provided (note: if posting a prospective study registered retrospectively, please provide a statement in the trial ID field explaining why the study was not registered in advance). Yes. I have followed all appropriate research reporting guidelines and uploaded the relevant EQUATOR Network research reporting checklist(s) and other pertinent material as supplementary files, if applicable. Yes. All data produced in the present work are contained in the manuscript.