PT - JOURNAL ARTICLE AU - Yang, Siyue AU - Varghese, Paul AU - Stephenson, Ellen AU - Tu, Karen AU - Gronsbell, Jessica TI - Machine Learning Approaches for Electronic Health Records Phenotyping: A Methodical Review AID - 10.1101/2022.04.23.22274218 DP - 2022 Jan 01 TA - medRxiv PG - 2022.04.23.22274218 4099 - http://medrxiv.org/content/early/2022/10/27/2022.04.23.22274218.short 4100 - http://medrxiv.org/content/early/2022/10/27/2022.04.23.22274218.full AB - Objective Accurate and rapid phenotyping is a prerequisite to leveraging electronic health records (EHRs) for biomedical research. While early phenotyping relied on rule-based algorithms curated by experts, machine learning (ML) approaches have emerged as an alternative to improve scalability across phenotypes and healthcare settings. This study evaluates ML-based phenotyping with respect to (i) the data sources used, (ii) the phenotypes considered, (iii) the methods applied, and (iv) the reporting and evaluation methods used.Materials and Methods We searched PubMed and Web of Science for articles published between 2018 and 2022. After screening 850 articles, we recorded 37 variables on 100 studies.Results Most studies utilized data from a single institution and included information in clinical notes. Although chronic conditions were most commonly considered, ML also enabled characterization of nuanced phenotypes such as social determinants of health. Supervised deep learning was the most popular ML paradigm, while semi-supervised and weakly-supervised learning were applied to expedite algorithm development and unsupervised learning to facilitate phenotype discovery. ML approaches did not uniformly outperform rule-based algorithms, but deep learning offered marginal improvement over traditional ML for many conditions.Discussion Despite the progress in ML-based phenotyping, most articles focused on binary phenotypes and few articles evaluated external validity or used multi-institution data. Study settings were infrequently reported and analytic code was rarely released.Conclusion Continued research in ML-based phenotyping is warranted, with emphasis on characterizing nuanced phenotypes, establishing reporting and evaluation standards, and developing methods to accommodate misclassified phenotypes due to algorithm errors in downstream applications.Competing Interest StatementThe authors have declared no competing interest.Funding StatementThis study was funded by RGPIN-2021-03734 and a Connaught New Researcher Award.Author DeclarationsI confirm all relevant ethical guidelines have been followed, and any necessary IRB and/or ethics committee approvals have been obtained.YesI confirm that all necessary patient/participant consent has been obtained and the appropriate institutional forms have been archived, and that any patient/participant/sample identifiers included were not known to anyone (e.g., hospital staff, patients or participants themselves) outside the research group so cannot be used to identify individuals.YesI understand that all clinical trials and any other prospective interventional studies must be registered with an ICMJE-approved registry, such as ClinicalTrials.gov. I confirm that any such study reported in the manuscript has been registered and the trial registration ID is provided (note: if posting a prospective study registered retrospectively, please provide a statement in the trial ID field explaining why the study was not registered in advance).YesI have followed all appropriate research reporting guidelines and uploaded the relevant EQUATOR Network research reporting checklist(s) and other pertinent material as supplementary files, if applicable.YesAll data produced in the present study are available upon reasonable request to the authors