Abstract
Background Pharmacoepidemiology and population health studies using secondary analysis of electronic health care records (EHR) must define study variables through available electronic data. Defining a study variable starts with the identification of a phenotype, which is a defined set of criteria used to identify specific traits or medical conditions. In the real-world data perspective, a phenotype library is a collection of code lists or algorithms that standardize these sets of criteria. We conducted a systematic review of existing phenotype libraries to appraise their attributes, accessibility, interoperability, and portability.
Methods We systematically searched three databases (Scopus, PubMed, and Web of Science) through June 2024 to identify studies on key characteristics of phenotype libraries. The search combined MeSH terms related to “electronic health records,” “phenotype algorithm,” and “phenotype library”. Study parameters extracted included library size, vocabularies, phenotype construction tools, validation and library management processes, and portability across sites.
Findings Of 134 articles, 26 met eligibility criteria, leaving nine articles related to eight unique phenotype libraries: the Health Data Research UK (HDR UK) Phenotype Library (CALIBER), Centralized Interactive Phenomics Resource (CIPHER), ClinicalCodes Library, Manitoba Centre for Health Policy (MCHP) Concept Dictionary, Observational Health Data Sciences and Informatics (OHDSI) ATLAS, Open CodeLists, Phenotype Execution and Modeling Architecture (PhEMA) Workbench, and Phenotype KnowledgeBase (PheKB). These libraries varied widely in size and vocabularies. Each library created rule-based phenotypes, though OHDSI and CIPHER also used machine learning. All libraries are both human- and machine-readable. Validation processes varied and were applied to only some libraries. All libraries used a web-based platform and met at least the minimum requirements for library management, including phenotype definitions, metadata (if applicable), and version control.
Interpretations We observed large variations in library features including phenotype construction. Transparency about phenotypes and creating computable phenotypes enhance portability and streamline the effective reuse of phenotypes for different systems.
Funding This investigation was supported by a Fellowship awarded by VAC4EU (Vaccine Collaboration for Europe) Phenotype Representation Model: An International and Streamlined Approach to Enhance RWE Studies (grant nr 2023/0001).
Evidence before this study Electronic health data have been used extensively in epidemiology and health data science research for decades, as they offer a wealth of detailed real-world data which may be used to address important evidence gaps. The importance of such data sources has been strongly highlighted following the COVID-19 pandemic, which saw a massive increase in the number of scientific investigations using electronic data to rapidly produce evidence to guide health policy. The recent development of multiple phenotype libraries is an important advancement in the field. Libraries serve as repositories for the construction and re-use of phenotypes built with electronic health data, including diagnostic codes, laboratory values, and demographic information. To our knowledge, a systematic review to identify and describe all existing phenotype libraries has not been undertaken following the COVID-19 pandemic.
Added value of this study This study provides a comprehensive systematic review which identifies and describes all currently existing phenotype libraries. We summarize and compare phenotype construction processes, data sources, user interfaces, portability, and algorithm validation practices across eight individual phenotype libraries. We highlight how these libraries facilitate robust, transparent, and open scientific practices in digital health research, and identify potential opportunities for innovation. This systematic review serves as an important benchmark study, providing central documentation and description of phenotype libraries built to date.
Implications of all the available evidence The use of phenotype libraries and collaboratively constructed and validated phenotypes in health research may greatly improve the robustness and impact of health research. We describe how libraries can currently be used to improve research practice, as well as how existing libraries
Competing Interest Statement
CC declares doctoral funding awarded from the University of Oxford and GSK UK (2019-2023). MCJMS is head of the Data Science and Biostatistics Department at University Medical Center Utrecht, which conducts studies for the European Medicines Agency and several vaccine manufacturers, all according to the ENCePP Code of Conduct.
Funding Statement
This investigation was supported by a Fellowship awarded by VAC4EU (Vaccine Collaboration for Europe) Phenotype Representation Model: An International and Streamlined Approach to Enhance RWE Studies (grant nr 2023/0001).
Author Declarations
I confirm all relevant ethical guidelines have been followed, and any necessary IRB and/or ethics committee approvals have been obtained.
Yes
I confirm that all necessary patient/participant consent has been obtained and the appropriate institutional forms have been archived, and that any patient/participant/sample identifiers included were not known to anyone (e.g., hospital staff, patients or participants themselves) outside the research group so cannot be used to identify individuals.
Yes
I understand that all clinical trials and any other prospective interventional studies must be registered with an ICMJE-approved registry, such as ClinicalTrials.gov. I confirm that any such study reported in the manuscript has been registered and the trial registration ID is provided (note: if posting a prospective study registered retrospectively, please provide a statement in the trial ID field explaining why the study was not registered in advance).
Yes
I have followed all appropriate research reporting guidelines, such as any relevant EQUATOR Network research reporting checklist(s) and other pertinent material, if applicable.
Yes
Data Availability
All data produced in the present work are contained in the manuscript.