Developing and Evaluating Pediatric Phecodes (Peds-Phecodes) for High-Throughput Phenotyping Using Electronic Health Records
============================================================================================================================

* Monika E. Grabowska
* Sara L. Van Driest
* Jamie R. Robinson
* Anna E. Patrick
* Chris Guardo
* Srushti Gangireddy
* Henry Ong
* QiPing Feng
* Robert Carroll
* Prince J. Kannankeril
* Wei-Qi Wei

## ABSTRACT

**Objective** Pediatric patients have different diseases and outcomes than adults; however, existing phecodes do not capture the distinctive pediatric spectrum of disease. We aim to develop specialized pediatric phecodes (Peds-Phecodes) to enable efficient, large-scale phenotypic analyses of pediatric patients.

**Materials and Methods** We adopted a hybrid data- and knowledge-driven approach leveraging electronic health records (EHRs) and genetic data from Vanderbilt University Medical Center to modify the most recent version of phecodes to better capture pediatric phenotypes. First, we compared the prevalence of patient diagnoses in pediatric and adult populations to identify disease phenotypes differentially affecting children and adults. We then used clinical domain knowledge to remove phecodes representing phenotypes unlikely to affect pediatric patients and create new phecodes for phenotypes relevant to the pediatric population. We further compared phenome-wide association study (PheWAS) outcomes replicating known pediatric genotype-phenotype associations between Peds-Phecodes and phecodes.

**Results** The Peds-Phecodes aggregate 15,533 ICD-9-CM codes and 82,949 ICD-10-CM codes into 2,051 distinct phecodes. Peds-Phecodes replicated more known pediatric genotype-phenotype associations than phecodes (248 versus 192 out of 687 SNPs, p<0.001).

**Discussion** We introduce Peds-Phecodes, a high-throughput EHR phenotyping tool tailored for use in pediatric populations.
We successfully validated the Peds-Phecodes using genetic replication studies. Our findings also reveal the potential use of Peds-Phecodes in detecting novel genotype-phenotype associations for pediatric conditions. We expect that Peds-Phecodes will facilitate large-scale phenomic and genomic analyses in pediatric populations.

**Conclusion** Peds-Phecodes capture higher-quality pediatric phenotypes and deliver superior PheWAS outcomes compared to phecodes.

Keywords

* phecodes
* pediatrics
* electronic health records (EHRs)
* phenotyping
* phenome-wide association study (PheWAS)
* genomics

## BACKGROUND AND SIGNIFICANCE

The pediatric spectrum of disease is distinct from its adult counterpart.[1] Children experience a variety of illnesses not commonly observed in adults. These include congenital anomalies, genetic disorders with early mortality, and certain infectious diseases.[2,3] Conversely, adults experience many diseases unlikely to affect pediatric patients, such as Alzheimer’s disease, breast and prostate cancer, osteoarthritis, and many other conditions associated with aging.[4] Although pediatric data have rapidly accumulated in electronic health records (EHRs), pediatric patients have not been prioritized in developing high-throughput phenotyping tools such as phecodes. This gap prevents researchers from performing focused large-scale analyses of pediatric data, including pediatric-specific phenome-wide association studies (PheWAS) and genome-wide association studies (GWAS), and contributes to missed opportunities for scientific discovery.

The development of phecodes represents a key effort in EHR phenotyping.
Phecodes aggregate relevant International Classification of Diseases (ICD-9-CM, ICD-10-CM, and ICD-10) codes into distinct phenotypes to better represent clinically meaningful diseases and traits (e.g., grouping ICD-9-CM codes 162.* representing lung cancer and ICD-9-CM codes V10.1* representing a history of lung cancer).[5,6] Phecodes are represented using numeric codes arranged in a three-level hierarchy, allowing phenotypes to be captured at various levels of granularity. Root phecodes, located at the top of the phecode hierarchy, provide the broadest phenotype definitions and are represented using whole numbers. These root phecodes can then branch into progressively more detailed sub-phecodes, indicated by decimal digits. In the current phecodes (version 1.2), up to two levels of additional phenotypic granularity (i.e., two decimal places) are available for each root phecode. Phecodes with a single digit following the decimal point are referred to as level 1 sub-phecodes; level 2 sub-phecodes have two digits after the decimal point. For example, root phecode 008 “Intestinal infection” branches into level 1 sub-phecode 008.5 “Bacterial enteritis”, which then branches into level 2 sub-phecode 008.51 “Intestinal E. coli” (Supplementary Figure 1).

We previously demonstrated that phecodes produced superior results in PheWAS compared to other coding systems, including ICD and the Clinical Classifications Software.[5] Since their introduction, phecodes have been globally used in multiple studies to both replicate known genotype-phenotype associations and discover new ones.[7–9] Phecodes have also been used beyond genetics to study long-term disease effects.[10] The latest version of phecodes can be found at: [https://wei-lab.app.vumc.org/phecode-data/phecodes](https://wei-lab.app.vumc.org/phecode-data/phecodes)

While the existing phecodes are valuable in biomedical research, they need to be optimized for use in pediatric phenomic analyses.
Phecodes were developed using population-based diagnoses predominantly from adult patients, which do not accurately reflect pediatric conditions. Modifying the existing phecodes to capture diseases primarily affecting pediatric patients with increased granularity and exclude age-related diseases uncommon in the pediatric population presents an opportunity to increase statistical power for identifying signals more efficiently and accurately. In this study, we use current phecodes as a starting point to create Peds-Phecodes, specialized pediatric phecodes that more appropriately reflect the unique spectrum of pediatric disease, and evaluate their performance in large-scale PheWAS analyses using real-world data.

## MATERIALS AND METHODS

### Data source

We used de-identified EHRs from Vanderbilt University Medical Center’s (VUMC) Synthetic Derivative, a repository of rich, longitudinal clinical information encompassing data from >3.5 million patients, including data from >1 million pediatric individuals.[11] For our validation analyses, we also used data from VUMC’s DNA biobank, BioVU,[12] which links the genetic data of >100,000 individuals to their de-identified EHR data (>50,000 individuals with available pediatric EHR data).

### Pediatric phecode (Peds-Phecodes) development

We mapped ICD diagnoses captured while patients were <18 years (pediatric) and ≥18 years (adult) in VUMC’s EHR to current phecodes (N = 1,866) and compared the prevalence of each phecode in the pediatric and adult populations. We used the chi-square test to assess for significant differences in prevalence between the two groups if both proportions were over 5%. Otherwise, we used Fisher’s exact test to assess the difference. We used a Bonferroni-corrected significance threshold to adjust for multiple testing.
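
The test-selection rule above can be sketched in a few lines of standard-library Python. This is a minimal illustration, not the authors' code: the function names, the uncorrected chi-square statistic, the two-sided Fisher definition, and the family-wise alpha of 0.05 are all assumptions made for the example.

```python
import math


def chi_square_2x2_p(a, b, c, d):
    """Pearson chi-square p-value (1 df, no continuity correction) for [[a, b], [c, d]]."""
    n = a + b + c + d
    margins = (a + b) * (c + d) * (a + c) * (b + d)
    if margins == 0:
        return 1.0
    chi2 = n * (a * d - b * c) ** 2 / margins
    # Survival function of a chi-square variable with 1 df.
    return math.erfc(math.sqrt(chi2 / 2))


def log_comb(n, k):
    """Natural log of the binomial coefficient C(n, k)."""
    return math.lgamma(n + 1) - math.lgamma(k + 1) - math.lgamma(n - k + 1)


def fisher_exact_2x2_p(a, b, c, d):
    """Two-sided Fisher's exact p-value: sum of tables no more likely than the observed one."""
    row1, row2, col1 = a + b, c + d, a + c
    log_denom = log_comb(row1 + row2, col1)

    def prob(x):
        return math.exp(log_comb(row1, x) + log_comb(row2, col1 - x) - log_denom)

    lo, hi = max(0, col1 - row2), min(row1, col1)
    probs = [prob(x) for x in range(lo, hi + 1)]
    p_obs = prob(a)
    # Small relative tolerance so ties with the observed table are counted.
    return min(1.0, sum(p for p in probs if p <= p_obs * (1 + 1e-7)))


def compare_prevalence(ped_cases, ped_total, adult_cases, adult_total, cutoff=0.05):
    """Chi-square if both prevalences exceed the cutoff (5% in the text), else Fisher."""
    a, b = ped_cases, ped_total - ped_cases
    c, d = adult_cases, adult_total - adult_cases
    if ped_cases / ped_total > cutoff and adult_cases / adult_total > cutoff:
        return "chi-square", chi_square_2x2_p(a, b, c, d)
    return "fisher", fisher_exact_2x2_p(a, b, c, d)


def bonferroni_alpha(alpha, n_tests):
    """Per-test significance threshold; alpha = 0.05 is an assumed family-wise level."""
    return alpha / n_tests


# Hypothetical counts for one phecode, tested against a threshold for 1,866 phecodes.
method, p_value = compare_prevalence(300, 1000, 100, 1000)
significant = p_value < bonferroni_alpha(0.05, 1866)
```

In practice a statistics package would normally replace the hand-written tests; they are spelled out here only to keep the sketch self-contained.
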
Additionally, we flagged phecodes with low pediatric patient representation (N<50 pediatric patients) as phecodes for potential removal, as 50 cases has been suggested to be the minimum sample size required to detect an association in PheWAS analyses of binary traits.[13]

We adopted a hybrid data- and knowledge-driven approach to develop the pediatric phecodes (Peds-Phecodes) (Figure 1). Because of the three-tiered hierarchical structure of the phecodes (root phecodes, level 1 sub-phecodes, and level 2 sub-phecodes), an important component of this work involved ensuring that all modified phecodes continued to belong to the correct phecode “tree”.

![Figure 1.](https://www.medrxiv.org/content/medrxiv/early/2023/08/24/2023.08.22.23294435/F1.medium.gif)

Figure 1. Workflow for Peds-Phecodes development. We used a hybrid data- and knowledge-driven approach combining patient diagnosis counts from VUMC with clinician-led manual review to prune out phenotypes with little pediatric relevance (left) and create new phecodes for important pediatric phenotypes (right).

From bottom-up, we pruned the phecodes without sufficient pediatric patient representation to identify potential phecodes for removal. We iteratively remapped each sub-phecode with <50 pediatric patients (from a study population of >1 million pediatric patients) to the phecode directly above it in the phecode hierarchy, starting at the most granular level of the hierarchy (level 2 sub-phecodes) and working up to the root phecode. In this way, we collapsed uncommon sub-phecodes while preserving the underlying hierarchical structure of the phecodes. For example, sub-phecodes 153.2 “Colon cancer” and 153.3 “Malignant neoplasm of rectum, rectosigmoid junction, and anus” were consolidated into their root phecode 153 “Colorectal cancer”.
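
The bottom-up collapse described above can be illustrated with a short Python sketch. It is a simplified reading of the procedure, not the authors' implementation: the function names are hypothetical, the patient counts are invented, and counts are assumed to be already rolled up so that each phecode's count includes patients in its sub-phecodes.

```python
def parent(phecode):
    """Return the phecode one level up in the hierarchy, or None for a root phecode."""
    root, _, decimals = phecode.partition(".")
    if not decimals:
        return None
    return root if len(decimals) == 1 else f"{root}.{decimals[:-1]}"


def collapse_low_count(patient_counts, min_patients=50):
    """Map each phecode to itself or to its nearest ancestor with enough patients.

    Root phecodes are never remapped here; sparse roots are left for manual review.
    """
    mapping = {}
    for code in patient_counts:
        target = code
        # Climb the hierarchy while the current target is a sparse sub-phecode.
        while parent(target) is not None and patient_counts.get(target, 0) < min_patients:
            target = parent(target)
        mapping[code] = target
    return mapping


def flag_roots_for_review(patient_counts, min_patients=50):
    """List root phecodes below the threshold so clinicians can decide their fate."""
    return sorted(
        code
        for code, n in patient_counts.items()
        if parent(code) is None and n < min_patients
    )


# Invented pediatric patient counts, for illustration only.
counts = {
    "153": 120, "153.2": 30, "153.3": 12,
    "008": 5000, "008.5": 900, "008.51": 40,
    "796": 3,
}
remapped = collapse_low_count(counts)
needs_review = flag_roots_for_review(counts)
```

With these made-up counts, 153.2 and 153.3 both collapse into root 153, 008.51 collapses into 008.5, and root 796 is only flagged, since removing a root phecode is a clinical judgment rather than an automatic step.
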
Two pediatricians (SLV and PJK) reviewed the remaining low-count phecodes to identify those that represented diseases not applicable to the pediatric population (e.g., phecode 453 “Chronic venous hypertension” and phecode 796 “Elevated prostate specific antigen”), which were then removed. Low pediatric count phecodes representing rare diseases (e.g., phecode 209 “Neuroendocrine tumors”) were preserved.

From top-down, we focused on the phecodes with significantly higher prevalence in our pediatric cohort. For each phecode with higher pediatric prevalence, the two pediatricians reviewed the distribution of pediatric and adult patient diagnoses mapped to the phecode and provided recommendations for new phecodes reflecting diseases of particular importance to the pediatric population (Supplementary Figure 2). To provide a clearer picture of the intra-phecode ICD distributions, ICD-10-CM diagnoses were converted to ICD-9-CM using the General Equivalence Mappings (GEMs) provided by the Centers for Medicare & Medicaid Services.[14] Both the ICD-9-CM codes and matching ICD-10-CM codes were remapped to the newly created phecodes. We also mapped several ICD-10-CM codes without ICD-9-CM analogs (e.g., ICD-10-CM codes related to COVID-19) to new phecodes.

### PheWAS analyses

To validate the Peds-Phecodes, we conducted PheWAS analyses using Peds-Phecodes and the current phecodes independently and compared their ability to replicate known genotype-phenotype associations from previous pediatric studies. We queried the NHGRI-EBI GWAS Catalog[15] to find genetic variants associated with pediatric phenotypes for our PheWAS replication studies. We identified 687 SNPs in the GWAS Catalog that could be investigated using VUMC’s DNA biobank.
These SNPs were associated with eight different pediatric phenotypes: congenital heart disease, pyloric stenosis, Hirschsprung disease, hypospadias, café-au-lait spots (observed in neurofibromatosis type I), pediatric eosinophilic esophagitis, childhood-onset asthma, and juvenile idiopathic arthritis. We performed PheWAS analyses for all 687 SNPs using binary logistic regression, adjusting for sex and race. We only used diagnoses made during the pediatric age window (i.e., ICD-9-CM and ICD-10-CM codes recorded at <18 years of age) in creating the phenotypes for PheWAS. We required a minimum of two ICD codes (both recorded at <18 years of age) for a patient to be counted as a case.

As in previous studies, replication was defined as the detection of signals related to the phenotype of interest with concordant effect directionality and p<0.05.[6] We evaluated replication using both an exact and approximate phenotype match. In the exact phenotype match, we focused only on the PheWAS signal for the singular phecode best representing the phenotype of the genetic association from the GWAS Catalog (e.g., Peds-Phecode 747.1113 “Transposition of great vessels” and phecode 747.13 “Congenital anomalies of great vessels” for replication of the association between rs150246290 and transposition of the great arteries[16]). In the approximate match, we broadened our replication phenotype, examining the signals for all sub-phecodes stemming from the root phecode best matching the phenotype of the known association. For example, the approximated replication phenotype for the association between rs150246290 and transposition of the great arteries was represented by root phecode 747 “Cardiac and circulatory congenital anomalies” and all its sub-phecodes (N = 57 sub-phecodes in Peds-Phecodes, N = 5 in the current phecodes).

Beyond replication, we also evaluated the ability of the Peds-Phecodes to detect novel genetic associations compared to the existing phecodes.
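
To make the replication criterion defined earlier in this section concrete (a concordant effect direction and p<0.05, under either an exact or an approximate phenotype match), the following Python sketch may help. The result-record layout, the function names, and all effect sizes and p-values are hypothetical; only the phecode identifiers come from the text.

```python
def root_of(phecode):
    """Root phecode, i.e. the part before the decimal point."""
    return phecode.split(".")[0]


def is_replicated(results, target, known_sign, mode="exact", alpha=0.05):
    """Check whether any PheWAS signal replicates a known association.

    results    : list of dicts with keys "phecode", "beta" (log odds ratio), "p"
    target     : phecode best representing the known phenotype
    known_sign : +1 or -1, direction of the published effect
    mode       : "exact" uses only `target`; "approximate" uses the root of
                 `target` together with all of its sub-phecodes
    """
    if mode not in ("exact", "approximate"):
        raise ValueError("mode must be 'exact' or 'approximate'")
    for row in results:
        if mode == "exact":
            matches = row["phecode"] == target
        else:
            matches = root_of(row["phecode"]) == root_of(target)
        concordant = row["beta"] * known_sign > 0
        if matches and concordant and row["p"] < alpha:
            return True
    return False


# Made-up PheWAS output for a single SNP.
phewas_rows = [
    {"phecode": "747.13", "beta": 0.40, "p": 0.20},
    {"phecode": "747.1", "beta": 0.30, "p": 0.01},
    {"phecode": "747", "beta": -0.20, "p": 0.001},
    {"phecode": "530.1", "beta": 0.50, "p": 0.0001},
]
exact_hit = is_replicated(phewas_rows, "747.13", known_sign=+1, mode="exact")
approx_hit = is_replicated(phewas_rows, "747.13", known_sign=+1, mode="approximate")
```

In this invented example the exact match fails because the 747.13 signal does not reach p<0.05, while the approximate match succeeds through sub-phecode 747.1. The significant signal at root 747 does not count, because its effect direction is discordant.
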
We compared the overlapping and unique signals generated in each PheWAS and examined the significant signals detected using Peds-Phecodes but not phecodes (p