Abstract
Rare diseases (RDs) are uncommon as individual diagnoses, but as a group contribute to an enormous disease burden globally. However, partly due the low prevalence and high diversity of individual RDs, this category of diseases is understudied and under-resourced. The advent of large, standardised genetics databases has enabled high-throughput, comprehensive approaches that uncover new insights into the multi-scale aetiology of thousands of diseases. Here, using the Human Phenotype Ontology (9,677 annotated phenotypes) and multiple single-cell transcriptomic atlases (77 human cell types and 38 mouse cell types), we conducted >688,000 enrichment tests (x100,000 bootstrap iterations each) to identify >13,888 genetically supported cell type-phenotype associations. Our results recapitulate well-known cell type-phenotype relationships, and extend our understanding of these diseases by pinpointing the genes linking phenotypes to specific cell (sub)types. We also reveal novel cell type-phenotype relationships across disparate branches of clinical disease (e.g. the nervous, cardiovascular, and immune systems). Next, we introduce a computational pipeline to prioritise gene targets with high cell type-specificity to minimise off-target effects and maximise therapeutic potential. To broaden the impact of our study, we have released two R packages to fully replicate our analyses, as well as a series of interactive web apps so that stakeholders from a variety of backgrounds may further explore and utilise our findings. Together, we present a promising avenue for systematically and robustly uncovering the multi-scale aetiology of RDs at scale.
Introduction
In aggregate, there are over 10,000 recognised rare diseases (RDs) that affect at least 400 million people globally 1, contributing to an enormous disease burden (1 in 10-20 people) 2. As over 70% of RDs have a known genetic component 3, the increasing availability of phenotypic and genetic datasets presents an opportunity to apply a systematic approach to investigate many RDs at once. One of the most comprehensive resources for RD research is the Human Phenotype Ontology (HPO), which currently contains over 15,200 human phenotypic abnormalities and subtraits, each assigned to unique identifiers and mapped to hierarchically related terms both within the HPO and across other ontologies 4–6. Since 2008, the HPO has been continuously updated using knowledge from the medical literature, as well as by integrating databases of expert validated gene-phenotype relationships, such as OMIM 7–9, Orphanet, and DECIPHER 10. Currently, the HPO contains gene annotations for over 9,600 phenotypes. In tandem, the single-cell omics technologies have led to an explosion of cell type signature atlases 11.
Here, we strategically combine and extend these resources to identify cell type-specific therapeutic gene targets for thousands of RD with the goal of greatly improving efficacy and limiting side-effects. We provide a fully reproducible framework to systematically and robustly identify the primary cell types associated with RDs and RD-associated phenotypes. Specifically, we used Expression Weighted Cell type Enrichment (EWCE) 12 to perform a series of bootstrapped cell types enrichment tests between the gene lists of 6,173 HPO phenotypes (after quality control filtering) and each of the 77 cell types derived from a single-cell RNA seq (scRNA-seq) reference atlas of 15 organs across the developing human body 13. From this, we identified 8,379 significant cell type-phenotype associations across 2,832 unique phenotypes. Each of the 77 cell types were enriched for at least one phenotype.
We then leveraged our cell type-phenotype results to systematically propose cell type- and gene-specific candidates to target in gene therapies. Specifically, we focused on candidates suitable for recombinant adeno-associated virus (rAAV) vectors to transduce single-stranded DNA into target cells 14–18, which have recently shown success in treating several rare diseases such as Leber’s congenital amaurosis (LCA) 19, 20, spinal muscular atrophy (SMA) 21, and amyotrophic lateral sclerosis (ALS) 22. Together, in this study we uncover the molecular aetiology of thousands of phenotypes and diseases at once, and provide a systematic, evidence-based framework for identifying novel therapeutic targets in each of these diseases.
Additionally, we created two accompanying open-source R packages (HPOExplorer and MultiEWCE) to navigate the HPO data, search and postprocess the enrichment results from this study, and facilitate novel analyses of multiple gene lists in parallel. Finally, to make our data easily accessible we developed a series of interactive web app that allows exploration of all our results: (https://neurogenomics.github.io/rare_disease_celltyping_apps/home/) We hope that these tools will ensure reproducibility and facilitate future analyses as more phenotypic, genotypic, and transcriptomic data becomes available.
Results
Summary
Within the results using the Descartes cell type signature reference, 8,379 / 475,321 (1.76%) of bootstrapped enrichment tests across 77 cell types and 6,173 phenotypes revealed significant cell type-phenotype associations after multiple-testing correction. Within these results, 2,832 phenotypes were significantly enriched for at least one cell type after multiple testing correction (q ≤0.05) and 5,989 were nominally significant (p ≤ 0.05). Hereafter we will only refer to the results that were significant after multiple testing correction. The number of enriched cell types per phenotype suggest reasonable specificity of the enrichment strategy (percentages are shown relative to all 77 cell types): min=1 (1.30%), median=2 (2.60%), mean=2.959 (3.84%), max=27 (35.07%). This was also true for the number of enriched phenotypes per cell type (percentages are shown relative to all 6,173 phenotype gene lists): min=1 (0.016%), median=90 (1.46%), mean=108.8 (1.76%), max=338 (5.48%).
With the Tabula Muris enrichment results, 5,509 / 213,028 (2.58%) tests were significant after multiple-testing correction (q ≤ 0.05), and 20,221 tests were nominally significant (p ≤ 0.05). All 38 cell types were enriched in at least one phenotype, and 2,579 / 5,509 (46.81%) phenotypes were enriched in at least one cell type. The proportion of enriched cell types per phenotype were comparable to those observed using the Descartes dataset (percentages are shown relative to all 38 cell types): min=1 (2.632%), median=1 (2.632%), mean=2.136 (5.621%), max=17 (44.737%). The number of enriched phenotypes per cell type were slightly greater than that the results using Descartes (percentages are shown relative to all 6,173 phenotype gene lists): min=2 (0.032%), median=119 (1.93%), mean=145 (2.35%), max=506 (8.20 %).
Enrichment tests highlight expected as well as novel cell type-phenotype associations
We first sought to confirm that our enrichment analyses were able to recover well-established cell type-phenotype relationships. We expected the terms “Abnormality of the nervous system”, “Abnormality of the cardiovascular system”, and “Abnormality of the immune system” to be strongly associated with neural cells, heart cells, and immune cells, respectively. Indeed, a hypergeometric test showed that terms related to (i) the “Abnormality of the nervous system” showed an overrepresentation in all nervous system related cell types (21 cell types), with the strongest enrichment in limbic system neurons (n enrichments = 194, p = 5.74-44; Fig. 1C); (ii) the “Abnormality of the cardiovascular system” were overrepresented in 3/4 cardiovascular related cell types with cardiomyocytes being the most enriched (n enrichments = 94, p = 1.23-53; Fig. 1C); (iii) the “Abnormality of the immune system” were overrepresented in 9/10 immune cell types with lymphoid cells being the most enriched (n enrichments = 111, p = 5.23-56; Fig. 1C). Additionally, cell types that hierarchically clustered together (based on transcriptomic similarity) were also significantly associated with a particular term. For instance, the cluster of nervous system related cell types were enriched for terms related to the abnormality of the nervous system (n enrichments = 1768, p < 2.23-308; Fig. 1C).
Somewhat unexpectedly, a significant number of phenotypes related to the “Abnormality of the cardiovascular system” were associated with hepatoblasts (n enrichments = 17, p = 0.027; Fig. 1C). On closer inspection, these phenotypes were associated with damage to arteries caused by lipid deposition, such as cerebral artery atherosclerosis and myocardial steatosis. Given the prominent role that the liver plays in lipid metabolism 23, it is therefore logical that dysfunction of hepatoblasts would be implicated in abnormal cardiovascular phenotypes.
To further demonstrate that our approach finds expected cell type-phenotype associations, we extracted all the HPO terms enriched within excitatory neurons, cardiomyocytes, and antigen presenting cells, and show that the more significantly associated terms within these cell types were disproportionately related to the expected parent term (Fig. 1D-F). Taking excitatory neurons as an example, the more significant the association between the phenotype and excitatory neuron, the more likely this association was related to the abnormality of the nervous system (r = 0.82, p = 0.03; Fig. 1d).
Specific phenotypes are associated with fewer cell types and genes, but higher cell type-specificity of gene expression
We reasoned that lower ontology levels representing more specific phenotypes were likely to be associated with fewer cell types. In contrast, phenotypes with higher ontology levels would tend to be more broad and enriched for a wider variety of cell types. We confirmed that this is the case by counting the number of associated cell types with phenotypes in each ontology level, observing a strong positive correlation between ontology level and the number of associated cell types (Spearman’s rank correlation coefficient, r = 0.33, p < 2.2 x 1016; Fig. 2A). In addition, lower ontology levels were associated with fewer genes (Spearman’s rank correlation coefficient, r = 0.55, p < 2.2 x 1016; Fig. 2C) but the cell type-specificity of expression of the associated genes increased (Spearman’s rank correlation coefficient, r = -0.65, p < 2.2 x 1016; Fig. 2B).
Just as observed for the broader phenotypes (e.g. “abnormality of the immune system”), we expected more specific phenotypes, such as recurrent infections, to also be associated with their expected cell types. Extracting all children terms of recurrent infections, which includes 72 HPO terms at ontology levels ranging from 0 to 3 (relative to each other), we predicted that these would be primarily enriched within immune system-related cell types. As predicted, significant enrichments were found in immune related cell types, but also in less anticipated cell types (Fig. 3). “Recurrent staphylococcal infections” were enriched within myeloid cells (p = 0.0098; Fig. 3B), an association that has been previously documented in the literature 24–27, whereas “Neisserial infections” highlighted a novel association with hepatoblasts (p = 0.013; Fig. 3B). To confirm this association, we repeated the analysis using an independent scRNA-seq dataset from mouse (Tabula Muris) 28 and found a similar enrichment for “Recurrent Neisserial infections” in two hepatic cell types, namely Kupffer cells (p = 0.0094) and hepatoblasts (p = 2.23 x 10-308).
Exemplar results identify known associations while revealing novel multi-scale disease mechanisms
Here we highlight several exemplary results that recapitulate known aspects of disease aetiology while revealing a more comprehensive view of the disease by connecting mechanisms at multiple scales: phenotype ancestors (groups), phenotypes, cell types, genes. One such example is the association between respiratory failure and bronchiolar cells, alveolar epithelial cells, ciliated epithelial cells, and skeletal muscle cells. Specifically, the two airway epithelial cells initiate local and systemic inflammation, which lead to alveolar hypoventilation and eventual respiratory failure 29. The weakening of the diaphragm, the primary respiratory muscle, can independently lead to life-threatening respiratory failure 30. Additional cell type specificity filtering and sorting identified the gene CCNO acting via ciliated epithelial cells as the most promising target for respiratory failure.
As a second example, “Recurrent Neisserial infections” were significantly enriched for both alpha-Fetoprotein (AFP) / Albumin (ALB) -positive cells (fold-change=11.517, p=0.00010, q=0.00847) and hepatoblasts (fold-change=9.902, p=0.00016, q=0.0125). In both cell types, the associations with the phenotype are mediated by the same set of complement system genes: C7, C5, C6, C8B, CFB, CFI, and MBL2. Hepatoblasts are the precursor cells to hepatocytes (the primary cell type of the liver). AFP/ALB-positive cells are a canonical biomarker for liver damage or hepatocarcinoma in adults, but are also produced in normally developing foetuses 31.
Third, “Mental deterioration” is a phenotype characterised by “Loss of previously present mental abilities, generally in adults” that is associated with several forms of amyloidosis, leukodystrophy, and a variety of other degenerative neurological conditions (https://hpo.jax.org/app/browse/term/HP:0001268). As expected, “Mental deterioration” was strongly associated with neurons of the central nervous system (excitatory, inhibitory, limbic system, and Purkinje neurons). However, amacrine and ganglion cells of the retina were also significantly enriched, primarily mediated through the genes SNORD118, APOE, CHCHD10 and CSTB.
Prioritising cell type-specific gene targets for severe disease phenotypes
Next, we identified putative cell type-specific gene targets for several severe disease phenotypes. After all filtering and sorting steps, there remained 62 gene targets associated with 78 phenotypes across 26 cell types (Fig. 4). These prioritised targets were then visualised as a directed graph (Fig. 5). Grouped by higher-order ontology category, “Abnormality of the nervous system” had the greatest number of enriched phenotypes (26 phenotypes, 36 genes), followed by “Abnormality of the cardiovascular system” (17 phenotypes, 15 genes), “Abnormality of the musculoskeletal system” (8 phenotypes, 16 genes), “Abnormality of the respiratory system” (6 phenotypes, 5 genes), and ”Abnormality of the eye” (5 phenotypes, 15 genes).
Within the “Abnormality of the nervous system” category, 11 different “Abnormality of higher mental function” / “Neurodevelopmental abnormality” phenotypes survived the prioritisation filters, including: “Coma”, “Developmental regression”, “Global developmental delay”, “Intellectual disability”, “Intellectual disability, mild”, “Intellectual disability, moderate”, “Intellectual disability, severe”, “Mild global developmental delay”, “Neurodevelopmental abnormality”, “Neurodevelopmental delay”, “Severe global developmental delay”. The most common cell types enriched within these phenotypes were excitatory and granule neurons (both enriched in 6/11 phenotypes), followed by Inhibitory neurons (5/11 phenotypes). Across these phenotypes, the most commonly appearing genes were SOX3 (appearing in 17 cell type-phenotype associations), SOX2 (12 associations), POU3F4 (9 associations), and FOXH1 (8 associations), and . However, none of the “Mental deterioration” targets survived the filters due to the low cell type specificity (median quantile=6/40) and expression levels (median quantile=6/40) of the target genes. Unlike the other phenotypes in these categories, “Coma” was strongly enriched for islet endocrine cells (Fig. 5E). This association was mediated through genes critical for glucose regulation, such as INS and KCNJ11.
The “Abnormality of the nervous system” phenotypes also included non-cognitive phenotypes. Specifically, within the “Abnormality of movement” subcategory, the phenotype “Inability to walk”, which was enriched for both excitatory neurons (p<2.23 x 10-308, q<2.23 x 10-308, fold-change=1.832, prioritised gene target=FOXG1) and Schwann cells (p=0.00071, q=0.0421, fold-change=1.546, prioritised gene target=NHLRC1). Second, the “Seizure” subcategory included the phenotype “Status epilepticus”, which was enriched for excitatory neurons (p<2.23 x 10-308, q<2.23 x 10-308, fold-change=2.147, prioritised gene target=FOXG1). Finally, there were 13 phenotypes belonging to “Morphological central nervous system abnormality”, which included a variety of neuroanatomical features (e.g. “Cerebellar atrophy”, “Lissencephaly” and “Hypoplasia of the corpus callosum”) and “Stroke” (enriched for cardiomyocytes and stellate cells).
17 phenotypes within the “Abnormality of the cardiovascular system” category remained after the target prioritisation filtering. Of those, “Arrhythmia” showed strong enrichment for Cardiomyocytes (p<2.23 x 10-308, q<2.23 x 10-308, fold-change=2.915; Fig. 5B), with six prioritised target genes (NPPA, TNNC1 NKX2-5, TCAP, KCNA5). Of those genes, NKX2-5 is annotated within the HPO as being very frequently associated with arrhythmia (∼72% of cases on average). NKX2-5 is a transcription factor previously demonstrated to have highly specific expression in heart tissue, which is in congruence with the fact that this gene belongs to the top quantile (40) within both our specificity quantile and mean expression quantile metrics. This gene was in fact the first known genetic risk factor for congenital heart disease, and its expression is necessary not only for the development of cardiomyocytes but also the continued functioning of heart cells into adulthood 32, 33.
Within the “Abnormality of the musculoskeletal system” there were 9 unique phenotypes that survived the prioritisation pipeline: “Generalized hypotonia”, “Spasticity”, “Hypotonia”, “Distal amyotrophy”, “Spastic tetraplegia”, “Abnormality of upper limb joint”, “Aplasia/hypoplasia of the extremities”, “Aplasia/hypoplasia involving bones of the extremities”. As an example, “Hypotonia” was highly enriched for a variety of neuronal and glial cell types (Fig. 5D).
Finally, for a more comprehensive list of putative targets across a wider variety of phenotypes, we removed or relaxed many of the default arguments in our prioritisation pipeline (see Methods for details). This yielded putative therapeutic targets for 1,307 phenotypes across 37 cell types and 246 genes. Across all phenotypes, excitatory neurons were commonly implicated (236 phenotypes), followed by antigen presenting cells (214 phenotypes), cardiomyocytes (183 phenotypes), limbic system neurons (173 phenotypes), enteric nervous system (ENS) glia (167 phenotypes), and and ganglion cells (163 phenotypes).
Both the reduced and the extended versions of the prioritised targets network, as well as all code to reproduce them, are available as an interactive report online: https://neurogenomics.github.io/RareDiseasePrioritisation/reports/prioritise_targets
Genetic correlations reveal cross-phenotypic pleiotropy
Pairwise correlations between all phenotypes within the (reduced) prioritised targets based on gene annotation overlap. Phenotypes were then grouped into three k-mean clusters. Based on the phenotypes present in each cluster, cluster 1 appeared to be a mixture of different phenotype categories and included a subcluster of vascular abnormalities across multiple anatomical systems (e.g. “Peripheral arteriovenous fistula”, “Abnormal cerebral vascular morphology”, “Stroke”). Clusters 2 and 3 corresponded closely to “Abnormality of the nervous system” and “Abnormality of the cardiovascular system”, respectively. Within cluster 3 there was a subcluster corresponding to congenital heart conditions (e.g. “Abnormal aortic valve cusp morphology” and “Congenital malformation of the great arteries”).
Discussion
By applying our systematic approach to the HPO and the Descartes cell type atlas, we identified 8,379 significant cell type-phenotype associations across 2,832 unique phenotypes. All 77 cell types were enriched in at least one phenotype. Enriched phenotypes were distributed across all ontology levels within the HPO (Fig. 1B). From more general terms such as “Abnormality of the immune system”, down to more specialised (children) terms such as “Mycobacterial infections”. Enrichment tests revealed expected as well as novel cell type-phenotype associations. Whilst terms related to the “Abnormality of the cardiovascular system” highlighted cardiomyocytes, a significant number of these terms were also associated with hepatoblasts. More specific phenotypes with lower ontology levels were associated with fewer cell types and genes, but higher cell type-specificity of gene expression. Taken together, this suggests that the lower ontology level phenotypes may provide more viable avenues for therapeutics discovery. Being able to model the phenotype using a small set of prioritised genes and a specific cell type will not only benefit the RDs associated to the phenotype in question, but can also guide repurposing of the cell type-specific gene therapy for treatment of similar phenotypes, or even unrelated phenotypes that are underlied by genes and/or cell types with similar functions.
We then applied our computational prioritisation pipeline eto identify putative cell type-specific gene targets with the greatest chance of success as therapeutics (Fig. 5). This pipeline was based on a variety of relevant criteria, including disease severity, cell type-specificity, and gene-disease association frequency (Fig. 4). These targets spanned 78 severe disease phenotypes from Tiers 1 and 2, as classified by Lazarin et al. 34. For the phenotype “Respiratory failure”, we prioritised the gene CCNO acting via ciliated epithelial cells. A review of the literature revealed that Primary Ciliary Dyskinesia (PCD) is known to act via cilia of the respiratory epithelium 35, and is especially severe in patients with mutations to the CCNO gene 36–38. This example result demonstrates our approach is capable of identifying true positive cell type-phenotype relationships. However, based on a search of the literature and ClinicalTrials.gov (see Supplementary Information for search result links), it appears that therapeutics targeting CCNO have yet to be developed.
As a second example, “Coma” was found to be closely associated with islet endocrine cells, which regulate the secretion of insulin and glucagon, hormones that play a role in blood glucose homeostasis 39. Our target prioritisation procedure narrowed down the results to two gene targets for “Coma”; the insulin gene (INS) and potassium inwardly rectifying channel subfamily J member 11 (KCNJ11), which encodes for K-ATP channels that trigger insulin release in response to circulating glucose levels. Mutations in either of these genes can cause permanent neonatal diabetes and induce diabetic coma 40, a condition that can occur when blood glucose levels become too high or too low 41–43. This result provides further evidence that our framework recovers valid relationships between phenotypes, cell types, and genes.
For the phenotype “Mental deterioration” we report expected associations with central nervous system neurons, as well as less expected associations with amacrine and ganglion cells of the retina. Prior studies have shown that visual impairment or blindness is prevalent in people with intellectual disability (>8.5-fold increased risk) 44–46 and pathological cognitive decline in older individuals 47. These results suggest that not only are these phenotypes correlated with one another, but they are causally related to the same genetic risk factors. Although “mental deterioration” did not survive the final target gene prioritisation filters, six other intellectual disability phenotypes did (Table 1). Within these five phenotypes, the most commonly appearing genes were SOX3 (12 times across multiple cell types), SOX2 (10 times), FOXH1 (7 times), and POU3F4 (7 times). All four of these genes are transcription factors that play an important role in the development of the nervous system. The disruption of SOX3 has been implicated in a variety of intellectual disabilities through observations in both patients and experimental models 48, 49. SOX2 has also been implicated in nervous system development and its disruption can lead to profound deficits in cognition, vision, and motor function 50. Interestingly, POU3F4 has primarily been implicated in the development of semicircular canals and inherited deafness 51–54 but has also been linked to deficits in cognition and mental health (including attention deficit hyperactivity and developmental language disorder) which are significantly comorbid with this form of deafness 55. These non-auditory deficits are more profound than those observed in controls with other forms of deafness, suggesting that POU3F4 provides a common molecular aetiology underlying aspects of both central and peripheral nervous system development 56, 57. Confirming the relevance of these results, all enriched cell types within intellectual disability phenotypes were neuronal or glial cells. However, given the ethical implications and technical constraints of treating a genetic disorder during gestation, these gene targets should be considered candidates for further preclinical research, rather than therapeutic targets in developing human embryos.
Phenotypes at both higher- and lower levels of the HPO ontology were predominantly associated with their expected cell types (Figs. 1C-F). This testified to the credibility of this approach and allowed us to explore novel findings. One such example is the association of “Recurrent Neisserial infections” with hepatoblasts. Whilst unexpected, a convincing explanation involves the complement system, a key driver of innate immune response to Neisserial infections. Hepatocytes, which derive from hepatoblasts, produce the majority of complement proteins 58, and Kupffer cells express complement receptors 59. In addition, individuals with deficits in complement are at high risk for Neisserial infections 60, 61, and a genome-wide association study in those with a Neisserial infection identified risk variants within complement proteins 62. Indeed, all seven of the genes mediating this cell type-phenotype association (C7, C5, C6, C8B, CFB, CFI, and MBL2) are part of the complement system. While the potential of therapeutically targeting complement in RDs (including Neisserial infections) has been proposed previously 63, 64, performing this in a gene- and cell type-specific manner may help to improve efficacy and reduce toxicity (e.g. due to off-target effects). Importantly, there are over 56 known genes within the complement system (see Supplementary Information) 65, highlighting the need for a systematic, evidence-based approach to identify effective gene targets.
Finally, we interrogated shared genetic mechanisms between our prioritised RDs and other phenotypes (Fig. 6). This allowed us to infer which phenotypes tend to co-occur in patients due to pleiotropy, in which mutations in the same gene cause multiple phenotypes. Sometimes the links between the phenotypes are expected due to their being highly related to one another within the HPO, as is the case for multiple phenotypes of abnormal corpus callosum morphology (a subcluster within cluster 2; Fig 6). However other phenotypic relationships are less immediately obvious, such as that between “Abnormality of neuronal migration” and “Abnormality of mouth shape”, or between “Severe global developmental delay” and “Optic atrophy”. These insights may help to improve diagnostic criteria for various RDs while simultaneously revealing the cell type-specific genetic mechanisms underlying their clinical comorbidity.
Conclusions
Across the 77 cell types and 6,173 RD-associated phenotypes investigated, more than 8,000 significant cell type-phenotype associations were observed. The examples we have highlighted above align with what is expected, already known, or at least has a plausible biological explanation. Furthermore, he terms “abnormality of the cardiovascular system” and “recurrent Neisserial infections” were both associated with liver cell types, highlighting the potential in investigating and treating RDs collectively. Within the >8,000 enrichments we identified, there will be many previously understudied or unknown links between RD-associated phenotypes and specific cell types. In addition to prioritising cell type-specific gene targets, our approach presents an opportunity to therapeutically treat multiple phenotypes via the same target. This may be especially effective in patients that express more than one disease phenotype, as is frequently the case 66. Taken together, this reflects the utility and potential of our approach in advancing understanding, modelling, and treatment of RDs.
For the impact of our results to be fully realised, it was essential that they could be easily accessed and navigated by domain experts, clinicians, and patients alike. To facilitate this, we developed a publicly available interactive web app (https://neurogenomics.github.io/rare_disease_celltyping_apps/home). Importantly, this web app does not require any coding expertise to search for, visualise, and download relevant subsets of our enrichment results. Together with the reproducible workflows available as R packages, we aim to make our high-throughput findings useful to a wide variety of RD stakeholders and facilitate the extension of these analyses as new RD data becomes available over time. Ultimately, we hope that this work will help to overcome some of the difficulties that have hindered RD research in the past and accelerate the development of effective therapeutics across a wide variety of disorders.
Methods
Cell type-phenotype associations
In this study, the gene by cell type specificity matrix was constructed using the Descartes human cell atlas of fetal gene expression, which contains 377,456 cells representing 77 distinct cell types 13. To independently replicate our findings, we also used the Tabula Muris murine whole-body dataset, made up of 100,605 cells representing 38 distinct cell types from 20 organs and tissues 28. Genes from the Tabula Muris dataset were converted to human orthologs using the One2One R package, and genes without 1:1 mouse:human orthologs were dropped. For each cell type, the specificity metric was obtained by dividing the expression of each gene by the sum of the expression of that gene in all cell types. The target gene sets used here are obtained from the HPO, such that each phenotype has its own associated gene set. If a given gene set is significantly enriched in a cell type, then it is likely that the cell type plays a role in the pathology and therefore may be a valuable target for future research.
We used EWCE (v1.1.0) to evaluate significant cell type-phenotype associations 12. EWCE takes as input a gene by cell type specificity matrix, a target gene list of length n, referred to as, and a set of background genes referred to as. Where is the specific expression of gene in cell type. is the number of, and is an expression of in cell (indexed from). As EWCE requires input genes per test, 6,173 HPO gene lists remained after filtering.
Variable definitions
target gene list
length of target gene list
: background gene list
: gene identity
cell type identity
: cell index
: number of cell types
: specific expression of gene in cell type
: expression of gene in cell
: gene by cell type specificity matrix
: disease-associated phenotype identity
Genes with very low expression were considered to be uninformative and were therefore removed before computing the specificity matrix (mean < 0.2 across all cell types).
We then summed the specificity scores of the genes in to get each gene’s total expression specificity score in a given cell (). This is done for all cells, enabling us to quantify the level of specific expression of gene list (indexed by) in each cell type ().
Bootstrapping was then used to determine the probability of cell type-specific enrichment for each cell type in target gene list. We used 100,000 bootstrap iterations to ensure robustness and reduce the rate of false positive associations. To do this, the same cell type-specific expression calculation described above is then calculated for 100,000 random gene sets in each cell type. This gives a probability distribution of cell type-specific expression for gene sets of length in any given. The mean and standard deviation of this distribution are normalised (centred to 0 and 1 respectively) and then used to calculate a Z-score. We can then determine the probability of enrichment of in based on the number of bootstrap gene lists that have a higher cell type specific expression than. Gene sets with higher specific expression than most random gene sets of the same length have a high probability of enrichment in a given cell type. This procedure was repeated for each RD-associated phenotype.
In total, 475,321 EWCE enrichment tests were performed using the Descartes cell type signature reference (6,173 phenotypes x 77 cell types). An additional 213,028 tests were performed using the Tabula Muris cell type reference for the phenotypes that had at least four remaining genes after removing genes without 1:1 mouse:human orthologs (5,606 phenotypes x 38 cell types). Within the results from each cell type reference, EWCE p-values were corrected with the Benjamini-Hochberg method to produce q-values 67. To facilitate these analyses and to make them more easily reproducible by others, we created several open-source R packages. MultiEWCE (https://github.com/neurogenomics/MultiEWCE) facilitates the analysis of multiple gene lists across many computing cores in parallel, reducing the time necessary to complete large-scale enrichment testing. HPOExplorer (https://github.com/neurogenomics/HPOExplorer) aids in managing and querying the directed acyclic ontology graph within the HPO.
Interactive website
The landing page for the website was made using HTML and CSS, and the web apps were created using the Shiny Web application framework for R and deployed on the ShinyApps server. The website can be accessed here: https://neurogenomics.github.io/rare_disease_celltyping_apps/home
Gene therapy target identification
We developed a systematic and automated strategy for identifying putative cell type-specific gene targets for each phenotype based on a series of filters at phenotype, cell type, and gene levels. The entire target prioritisation procedure can be replicated with a single function: MultiEWCE::prioritise_targets. This function automates all of the reference data gathering (e.g. phenotype metadata, cell type metadata, cell type signature reference, gene lengths, severity tiers) and takes a variety of arguments at each step for greater customisability. Default parameters for all arguments can be found in the function documentation.
Descriptions of each step in the prioritisation pipeline are as follows:
start: All cell type-phenotype association results.
q_threshold: Keep only results that were significant after multiple-testing correction (q<0.05).
fold_threshold: Keep only results with fold change>=1.
keep_ont_levels: Keep only phenotypes at certain absolute ontology levels within the HPO.
keep_onsets: Keep only phenotypes with postnatal age of onsets to circumvent technical and ethical challenges associated with antenatal gene therapeutics delivery.
keep_tiers: Keep only phenotypes with high severity Tiers.
a. We used a combination of manual curation and automated text-based substring queries to assign each phenotype a severity Tier as characterised in a survey of healthcare professionals 34.
b. Tier 1: Diseases that shortened life span in adolescence or earlier or resulted in intellectual disability.
c. Tier 2: Diseases that shortened lifespan prematurely in adulthood, or resulted in impaired mobility or internal physical malformation.
d. Tier 3: Diseases causing sensory impairments (hearing, vision, touch, pain, or other), immunodeficiency/cancer, mental illness, or dysmorphic features.
e. Tier 4: Diseases that reduce fertility. Of the 49 phenotypes that were available in this severity ranking, we selected three that were classified as Tier 1 (the most severe disease category): mental deterioration, coma and respiratory failure.
severity_threshold: Keep only phenotypes with mean severity score equal to or below the threshold.
Severity scores were computed by assigning each severity modifier term found in the HPO annotations a numerical value. In order of increasing severity:
HP:0012825 “Mild” (Severity_score=4)
HP:0012827 “Borderline” (Severity_score=3)
HP:0012828 “Severe” (Severity_score=2)
HP:0012829 “Profound” (Severity_score=1)
pheno_frequency_threshold: Keep only phenotypes with mean frequency equal to or above the threshold (i.e. how frequently a phenotype is associated with any diseases in which it occurs).
Keep phenotypes with a mean frequency ≥10% or are NA by default.
keep_celltypes: Keep only terminally differentiated cell types.
Of the 77 cell types tested in the Descartes cell type reference, the 40 terminally differentiated cell types were identified through a literature search. Of these, three (extravillous trophoblasts, syncytiotrophoblasts and trophoblast giant cells) were excluded as they only played a role in pregnancy 68–70, which would raise additional technical and ethical challenges as rAAV therapy has not yet been used to target foetuses in clinical trials.
keep_seqnames: Remove genes on non-standard chromosomes.
Only keep chromosomes 1-22, X, and Y.
gene_size: Keep only genes <4.3kb in length.
Due to limitations in the length of the gene that can be carried by the rAAV vector, genes with a length of >4.3kb were excluded.
keep_biotypes: Keep only genes belonging to certain biotypes (e.g. “protein_coding”, “processed_transcript”, “snRNA”, “lincRNA”, “snoRNA”, “IG_C_gene”).
Keep all biotypes by default.
gene_frequency_threshold: Keep only genes at or above a certain mean frequency threshold (i.e. how frequently a gene is associated with a given phenotype when observed within a disease).
Keep genes with a mean frequency ≥10% or are NA by default.
keep_specificity_quantiles: Keep only genes in top specificity quantiles from the cell type dataset.
To further narrow down genes, we extracted relevant metrics from the Descartes reference for each gene in each cell type. These included mean expression, specificity, and specificity quantiles (using 40 bins). Only genes with the most specific quantiles (39-40) were included for further analysis, as cell type-specific genes may be less likely to have off-target effects in other cell types.
keep_mean_exp_quantiles: Keep only genes in top mean expression quantiles from the cell type dataset.
top_n: Sort candidate targets by a preferred order of metrics and only return the top N targets per cell type-phenotype combination.
Finally, results were sorted by the following columns (in order of precedence, where 1=ascending order and -1=descending order):
b. “tier”=1
c. “tier_auto”=1
d. “Severity_score_mean”=1
e. “q”=1
f. “fold_change”=-1
g. “specificity_quantile”=-1
h. “mean_exp_quantile”=-1
i. “specificity”=-1
j. “mean_exp”=-1
k. “pheno_freq_mean”=-1
l. “gene_freq_mean”=-1
m. “width”=1
end: Final table of prioritised cell type- / phenotype-specific gene targets.
Finally, for more comprehensive target search, the we removed the filters for onsets (keep_onsets=NULL), Tier (keep_tiers=NULL), severity (severity_threshold=NULL), as well as relaxed the filters for phenotype frequency threshold (pheno_frequency_threshold=c(10,NA)), gene frequency threshold (gene_frequency_threshold = c(10,NA)), gene specificity quantiles (keep_specificity_quantiles = seq(20,40)), and gene expression quantiles (keep_mean_exp_quantiles = seq(20,40)).
Phenotype x phenotype genetic correlations
Lastly, we computed genetic correlations between all phenotypes that appeared within the reduced list of prioritised targets. For this analysis, the complete gene lists for each phenotype were extracted from the HPO (not just the genes present in the prioritised targets list) and recast into a binary gene x phenotype matrix, where 0 indicated the absence of a gene-phenotype association and 1 indicated the presence of a gene-phenotype association. Pairwise Pearson correlations were then computed between all phenotypes to generate a phenotype x phenotype matrix. Hierarchical clustering was performed on the resulting correlation matrix and visualised as a heatmap using MultiEWCE::correlation_heatmap, which utilises the R package ComplexHeatmap 71. Cluster group assignment was determined using 1,000 iterations of k-means where k=3.
Data and Code Availability
All data and code is made freely available through preexisting databases and/or GitHub repositories / software associated with this publication.
Human Phenotype Ontology
Descartes scRNA-seq atlas
Tabula Muris scRNA-seq atlas
Web app
HPOExplorer
MultiEWCE
EWCE
Code to replicate analyses
Results for all enrichment tests
Cell type-specific gene target prioritisation
Data Availability
All data and code is made freely available through preexisting databases and/or GitHub repositories / software associated with this publication.
https://neurogenomics.github.io/rare_disease_celltyping_apps/home
https://github.com/neurogenomics/MultiEWCE
https://github.com/neurogenomics/HPOExplorer
https://doi.org/doi:10.18129/B9.bioc.EWCE
https://github.com/neurogenomics/rare_disease_celltyping
https://github.com/neurogenomics/rare_disease_celltyping/tree/master/results
https://neurogenomics.github.io/RareDiseasePrioritisation/reports/prioritise_targets
https://descartes.brotmanbaty.org/bbi/human-gene-expression-during-development
Funding
This work was supported by a UK Dementia Research Institute (UK DRI) Future Leaders Fellowship [MR/T04327X/1] and the UK DRI which receives its funding from UK DRI Ltd, funded by the UK Medical Research Council, Alzheimer’s Society and Alzheimer’s Research UK.
Supplementary Information
Links
ClinicalTrials.gov search for “Primary Ciliary dyskinesia”: https://clinicaltrials.gov/ct2/results?cond=primary+ciliary+dyskinesia
ClinicalTrials.gov search for “CCNO”: https://clinicaltrials.gov/ct2/results?cond=&term=ccno
Complement system gene list: https://www.genenames.org/data/genegroup/#!/group/492
Supplementary Tables
Table S1. Prioritised targets
Cell type- and gene- specific targets for each phenotype. Targets were prioritised using the filtering and sorting procedure implemented in the MultiEWCE::prioritise_targets function.
An interactive version of this table (with sorting, searching, and downloading features) is available online: https://neurogenomics.github.io/RareDiseasePrioritisation/reports/prioritise_targets
Table S2. Cell type groupings
Cell type groupings for testing overrepresentation of nervous system related cell types, immune related cell types, and cardiovascular related cell types from the Descartes dataset in the HPO branches “Abnormality of the nervous system”, “Abnormality of the immune system”, and “Abnormality of the cardiovascular system”, respectively.
Table S3. Cell type-branch enrichment tests
Hypergeometric test results for overrepresentation of cell type-phenotype associations by HPO branch. The selected branches were children terms of “Phenotypic abnormality” and each phenotype was annotated to a branch if it was a descendant of the branch e.g. recurrent infections was annotated to “Abnormality of the immune system”. Terms that were not descendants of “Phenotypic abnormality” e.g. those related to “Mode of inheritance”, were not included in this analysis. Hypergeometric p-values were corrected with the Benjamini-Hochberg method 67 to produce q-values.
Acknowledgements
For the purpose of open access, the authors has applied a creative commons attribution (CC BY) licence (where permitted by UKRI, ‘open government licence’ or ‘creative commons attribution no-derivatives (CC BY-ND) licence’ may be stated instead) to any author accepted manuscript version arising.
Footnotes
↵* Co-first authors