Abstract
Early characterization of drug targets associated with disease can greatly reduce clinical failures attributed to lack of safety or efficacy. As single-cell RNA sequencing (scRNA-seq) of human tissues becomes increasingly common for disease profiling, the insights obtained from this data could influence target selection strategies. Whilst the use of scRNA-seq to understand target biology is well established, the impact of single-cell data in increasing the probability of candidate therapeutic targets to successfully advance from research to clinic has not been fully characterized. Inspired by previous work on an association between genetic evidence and clinical success, we used retrospective analysis of known drug target genes to identify potential predictors of target clinical success from scRNA-seq data. Particularly, we investigated whether successful drug targets are associated with cell type specific expression in a disease-relevant tissue (cell type specificity) or cell type specific over-expression in disease patients compared to healthy controls (disease cell specificity). Analysing scRNA-seq data across 30 diseases and 13 tissues, we found that both classes of scRNA-seq support significantly increase the odds of clinical success for gene-disease pairs. We estimate that combined they could approximately triple the chances of a target reaching phase III. Importantly, scRNA-seq analysis identifies a larger and complementary target space to that of direct genetic evidence. In particular, scRNA-seq support is more likely to prioritize therapeutically tractable classes of genes such as membrane-bound proteins. Our study suggests that scRNA-seq-derived information on cell type- and disease-specific expression can be leveraged to identify tractable and disease-relevant targets, with increased probability of success in the clinic.
Introduction
Drug discovery begins with the identification of candidate targets, drug-binding molecules whose modulation is hypothesized to be useful for the treatment of disease [1]. The discovery and development of a novel drug for a candidate target progresses through the following steps: target validation, compound screening and lead identification, characterization of mechanism of action, indication(s) selection, safety and efficacy clinical trials, and finally, in successful cases, regulatory approval. Development of a single new drug takes an average of 12-15 years, and costs (including concurrent program failures) are estimated to range from 900 million to 2.6 billion USD per success [2,3]. A drug discovery program can fail at any step between early research and regulatory approval, and it is estimated that in >90% of cases failures can be attributed to suboptimal target selection for a given disease, resulting in safety or efficacy issues [4]. Together, these observations point to the need to improve the strategies and the data used in early stages of drug discovery to support the selection of candidate therapeutic targets and thereby increase the likelihood of clinical success.
Single-cell RNA sequencing (scRNA-seq) data is a particularly promising source of evidence for target selection, providing cell-level resolution of molecular profiles in disease-relevant tissues. Single cell technologies have already been applied extensively to characterize disease biology, in emerging diseases like COVID-19 [5,6], cancer [7–10], and common complex diseases across tissues [11–14]. The rapidly growing body of disease-relevant scRNA-seq data has already begun to inform the development of novel diagnostics and cell-targeting precision therapies [15]. This led us to ask to what extent information on cell type specific expression can boost the selection of promising drug targets.
Retrospective analysis of known drug targets has been used to identify features predictive of target success. Notably, such analyses have shown that targets linked to genetic variants associated with the relevant disease are twice as likely to reach clinical approval as targets with no genetic support [16–18]. These studies greatly impacted decision-making in biotech and pharmaceutical industries. Out of 428 newly FDA-approved drugs from 2013 to 2022, 271 (63%) are backed by direct or indirect human genetic evidence [19,20]. Even though establishing whether this influenced their discovery or development phases is difficult, 250 out of 271 genetics-backed drugs had publicly accessible genetic support before approval.
Given this precedent, in this work we used retrospective analysis to identify potential predictors of target clinical success from scRNA-seq data. We investigated two cell type specific expression modes that are commonly used in scRNA-seq disease analysis and can support target discovery. The modes include cell type specific expression in a disease-relevant tissue (hereafter cell type specificity) and cell type specific over-expression in disease patients compared to healthy controls (hereafter disease cell specificity). We used a uniform workflow to identify cell type specific and disease cell specific target-disease pairs across 30 complex diseases in 13 disease-relevant tissues using the CZ CellxGene Discover database [21]. We then evaluated how scRNA-seq supported target-disease associations correlate with target success in clinical trials, benchmarking against direct genetic associations as reported from the Open Targets platform [22]. We found that scRNA-seq support significantly increased the odds of clinical success for target-disease pairs and identified a complementary target space to that of direct genetic evidence. These results highlight the value of scRNA-seq data as a key resource, complementary to genetics, to increase probability of clinical success in drug development.
Results
Definition of scRNA-seq support for targets
As a cause or consequence of disease, pathology arises when cells of a particular type develop abnormal traits within a disease-relevant tissue. Safe and effective therapies should precisely target these aberrant cells, without eliciting on-target toxicities in other cells and tissues. Given this need, scRNA-seq data can support target prioritization by identifying genes expressed in a cell type specific manner in tissue from healthy and diseased individuals. We aimed to assess whether cell type specific genes, as identified by scRNA-seq analysis, are more likely to be targets of clinically successful drugs. We considered diseases for which scRNA-seq data was available via the CZ CellxGene Discover database [21]. We defined a disease-relevant (DR) tissue for each disease term. Of the 58 disease terms in the CellxGene database, 30 terms were retained for association analysis, based on availability of data from disease-relevant tissue and overlap with OpenTargets disease annotation terms (see Supplementary Table 1 for a complete list of diseases and reasons to exclude from analysis). The most prevalent diseases were lung and immune disorders (Figure 1A). For each disease term, we collected gene expression count matrices and coarse cell type labels, harmonized using the Cell Ontology [23] (Figure 1B, Supplementary Figure 1, see Methods), for disease-relevant tissue samples from healthy and diseased individuals (Supplementary Table 2).
We next defined two classes of scRNA-seq supported genes for target discovery: (1) cell type specific genes in healthy disease-relevant tissue (cell type specific) and (2) genes specifically over expressed in a cell type in tissue from disease patients, compared to healthy tissue (disease cell specific) (Figure 1C). We reasoned that drugs targeting cell type specific genes inhibit expansion and function of normal cells acquiring aberrant phenotypes in disease. For example, the GLP-1 receptor, targeted by commonly used anti-diabetic drugs, is normally expressed in pancreatic beta cells, which become dysfunctional in disease [13]. Conversely, drugs targeting disease cell specific genes suppress aberrant gene programmes directly. For example, inflammatory bowel disease patients are treated with antibodies targeting the tumor necrosis factor (TNF) which is over-expressed in regulatory T cells and other immune subtypes in disease [24].
Enrichment of clinically successful targets in genes with scRNA-seq support
For each disease, we identified cell type specific and disease cell specific genes with highly variable gene (HVG) selection and differential expression (DE) analysis, aggregating mRNA counts across cell types and donors (Figure 1D, see Methods). With this analysis across 30 diseases, we annotated 33654 gene-disease (G-D) pairs as cell type specific and 60851 G-D pairs as disease cell specific (Supplementary Figure 2). To associate scRNA-seq support with clinical success, we extracted information about targets of drugs approved or in trial from the Open Targets platform [1,22,25] (n = 2358 drugs for which the studied diseases are an approved or investigational indication). Across diseases, we annotated 2925 G-D pairs as safe (passed phase I), of which 1646 pairs were also effective (passed phase II), and 601 pairs were also approved (passed phase III) (Supplementary Figure 2, Supplementary Table 3).
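The count aggregation step mentioned above can be illustrated with a minimal sketch. It assumes that per-cell counts are available as (donor, cell type, gene counts) records; the function name `pseudobulk`, the input layout, and the toy counts are hypothetical and are not taken from the study's workflow, which is described in Methods.

```python
from collections import defaultdict


def pseudobulk(cells):
    """Sum raw counts over all cells sharing the same (donor, cell_type).

    `cells` is an iterable of (donor, cell_type, {gene: count}) tuples.
    This input layout is an assumption made for illustration only.
    """
    totals = defaultdict(lambda: defaultdict(int))
    for donor, cell_type, counts in cells:
        for gene, n in counts.items():
            totals[(donor, cell_type)][gene] += n
    # Convert nested defaultdicts to plain dicts for easier inspection
    return {group: dict(genes) for group, genes in totals.items()}


# Toy example with invented counts
cells = [
    ("donor1", "B cell", {"CD79A": 5, "TNF": 0}),
    ("donor1", "B cell", {"CD79A": 3, "TNF": 1}),
    ("donor1", "T cell", {"CD79A": 0, "TNF": 2}),
    ("donor2", "B cell", {"CD79A": 4, "TNF": 0}),
]
profiles = pseudobulk(cells)
```

Each resulting profile represents one donor and cell type combination, which is the unit on which a donor-aware differential expression test would then operate.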
We then computed the odds of clinical success, with or without support from scRNA-seq data (Figure 1C, see Methods). Of note, our analyses are disease-specific: we count successful G–D pairs with corresponding scRNA-seq support from analysis of healthy and diseased individuals in the disease-relevant tissue. For example, a gene that is found to be cell type specific in esophagus is not considered as having scRNA-seq support for pulmonary fibrosis.
To enumerate the space of possible G-D pairs, we multiplied the number of diseases considered (N=30) by the size of a “universe” of genes. We define four different universes: all protein-coding genes (N=19620), representing the space of genes that are typically analysed in scRNA-seq data; genes that are antibody-tractable (N=12527) or small molecule-tractable (N=6550) based on Open Targets tractability assessment, representing genes that are tractable by any therapeutic agent; and finally, genes already targeted by therapies in clinical trial for any indication (known drug targets, N=936), representing demonstrably druggable proteins (Supplementary Figure 3).
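As a worked illustration of this enumeration, the sketch below multiplies the number of diseases by each universe size quoted above. The dictionary keys are our own shorthand labels rather than identifiers from the study.

```python
N_DISEASES = 30

# Universe sizes as quoted in the text; key names are illustrative labels
universes = {
    "protein_coding": 19620,
    "antibody_tractable": 12527,
    "small_molecule_tractable": 6550,
    "known_drug_targets": 936,
}

# Size of the possible gene-disease (G-D) pair space for each universe
gd_space = {name: N_DISEASES * n_genes for name, n_genes in universes.items()}

for name, size in gd_space.items():
    print(f"{name}: {size} possible G-D pairs")
```

Under this definition the protein-coding universe spans 588600 possible G-D pairs, while restricting to known drug targets reduces the space to 28080 pairs.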
Out of 2925 target-indication pairs which passed at least phase I, 858 were prioritized as either cell type specific or disease cell specific by scRNA-seq analysis (Figure 2A). Considering protein-coding genes, antibody- and small molecule-tractable genes, cell type specific and disease cell specific G-D pairs with scRNA-seq support were always significantly enriched in targets of safe, effective, or approved drugs (Figure 2B, Supplementary Table 4). Out of 2840 protein-coding G-D pairs passing phase I, 356 (12%) were cell type specific in the DR tissue (OR=2.47, p-value = 3.57e-46) and 594 (20%) were disease cell specific (OR=2.34, p-value=4.43e-64). The enrichment of disease cell specific genes in clinically successful targets was the highest amongst antibody-tractable genes. When restricting the analysis to known drug targets, only disease cell specific genes were significantly enriched in effective and approved targets (Figure 2B). This might indicate that specific expression in the disease-relevant tissue is already implicitly used by drug discovery programmes for selecting targets that progress to clinical development. Combining both classes of scRNA-seq support (cell type and disease specific genes) led to significantly higher association with success in phase I and effectiveness (phase II) than each class individually, especially for protein coding and small molecule tractable targets (Figure 2B).
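The enrichment statistics reported above rest on a 2x2 contingency table of G-D pairs (supported or not, clinically successful or not). The following is a minimal sketch of how an odds ratio and a one-sided Fisher exact p-value can be computed from such a table. It is a generic illustration: the function names and toy counts are hypothetical, and the exact test and table construction used in the study are specified in Methods and may differ.

```python
from math import comb


def odds_ratio(a, b, c, d):
    """Odds ratio for a 2x2 table.

    a: supported and successful      b: supported, not successful
    c: unsupported but successful    d: unsupported, not successful
    """
    return (a * d) / (b * c)


def fisher_greater(a, b, c, d):
    """One-sided Fisher exact p-value, P(X >= a), from the hypergeometric
    distribution with fixed row and column totals."""
    n = a + b + c + d
    supported = a + b
    successful = a + c
    upper = min(supported, successful)
    tail = sum(
        comb(supported, k) * comb(n - supported, successful - k)
        for k in range(a, upper + 1)
    )
    return tail / comb(n, successful)


# Toy table with invented counts (not the study's data)
toy_or = odds_ratio(3, 1, 1, 3)
toy_p = fisher_greater(3, 1, 1, 3)
```

With the toy table the odds ratio is 9 and the one-sided p-value is 17/70. An odds ratio above 1 indicates that supported pairs are over-represented among successful ones.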
Comparison between scRNA-seq supported and genetic supported targets
We compared genes supported by scRNA-seq with genes associated to the disease by human genetics data, using the Open Targets direct genetic association score [22,26]. Throughout the manuscript, we refer to genes that are prioritized by either genetic association, cell type specificity or disease cell specificity as genes with “omic support”. Consistent with previous findings [16,18], genetic-supported genes were strongly associated with clinical success (Figure 2B, OR for approved targets = 5.94, p-value = 1.8e-11). Cell type and disease specific protein-coding genes were as likely to be targets of drugs passing phase I and II as those that have genetic support. In contrast, for targets that are clinically approved (i.e. passed phase III), genetic evidence gave stronger prediction. The identification of genetic evidence as a predictor of clinical success may have biased recent programs toward development of genetically supported drugs, noting that only a subset of the drugs under consideration here were approved in the last 10 years (Supplementary Figure 4).
We observed several differences between scRNA-seq supported targets and targets supported by genetics. Firstly, scRNA-seq supports a larger number of successful target-disease pairs. Amongst the G-D space of safe targets (2925 G-D pairs), 29.3% are scRNA-seq supported, while only 2.3% are directly supported by genetics (Figure 2A). Secondly, we found that different sources of omic evidence support distinct target spaces: only 24% of safe G-D pairs targeted with genetic support overlap with either kind of scRNA-seq evidence (Figure 2C). We tested for association between clinical success and support from both genetic and scRNA-seq, but due to the limited overlap, this analysis likely lacked sufficient statistical power to detect significant differences compared to using genetics alone (Supplementary Figure 5A). Thirdly, genetic and scRNA-seq support were predictive of clinical success in different classes of tractable targets (Supplementary Figure 5B). Genetic support increased chances of approval up to 20-fold for kinases and catalytic receptors but was notably less predictive of success than scRNA-seq support for other classes, such as transporters and rhodopsin-like GPCRs. These classes of genes show high tolerance to loss-of-function mutations (Supplementary Figure 5C), whereas it has been reported that genes associated with GWAS variants are under strong evolutionary constraints [27]. Furthermore, at the compound-level we found that drugs targeting scRNA-seq supported genes are approved or in trial for a significantly higher number of indications, compared to not supported targets (Adjusted R2 = 0.167; p = 2.086e-7, see Methods) (Supplementary Figure 6; Supplementary Table 5). Genetic association was not associated with significantly higher number of indications per drug.
We also observed significant differences when considering the genes with omic support that are not already in clinical development (unexplored supported genes). A large fraction of scRNA-seq supported genes, and especially cell type specific genes, are considered tractable by therapeutic agents (Figure 2D). Across all diseases considered, on average 77% of cell type and disease cell specific genes are antibody tractable, against 51% of genes supported by genetic association (t-test p-value = 5.9e-08). Genetic-supported genes showed a slightly higher average fraction of small molecule tractable genes (40% against 31%, t-test p-value = 0.02), although this was mainly driven by a few diseases (Supplementary Figure 7A). This indicates that scRNA-seq support prioritizes genes with therapeutic potential, especially membrane-bound proteins. This difference between genetic and scRNA-seq support could at least in part be explained by differences in evolutionary constraints: antibody tractable genes have significantly higher tolerance to loss-of-function than non-tractable genes, while small molecule tractable genes are significantly more constrained (Supplementary Figure 7B). This could be due to stronger evolutionary constraints on the sequences of proteins with small molecule binding pockets, as compared to larger, flatter surfaces of protein-protein interaction interfaces [28].
Robustness of association of scRNA-seq support and clinical success
We next tested the robustness of association with clinical success to several parameters used for the definition of genes with scRNA-seq support. Firstly, in our scRNA-seq analysis workflow we do not test for differential expression across all genes, but we pre-select highly variable genes before each comparison (see Methods), as per standard practice for DE analysis [29]. To independently quantify the impact of feature selection before DE analysis, we computed enrichment of successful targets considering only genes selected as highly variable genes for each disease scRNA-seq dataset. DE testing led to significant enrichment of successful targets also within selected HVGs, although with lower odds-ratios (Supplementary Figure 8A). This suggests that both HVG selection and DE testing on scRNA-seq data enrich for successful targets.
Next, we explored the relationship between cell type specificity and differential expression fold change between cell types and disease conditions. Estimated fold changes in gene expression between cell types are higher than those observed in the comparison between disease and healthy states within cell types (Supplementary Figure 8B). Notably, genes significantly over-expressed in a cell type at lower log-fold changes are often ubiquitously highly expressed, while those at higher fold changes are genuinely cell type specific (Supplementary Figure 8C) and more likely to be successful targets (Supplementary Figure 8D, left). Conversely, most disease cell specific genes, including successful clinical targets, are over-expressed in disease patients at low fold changes (Supplementary Figure 8D, right).
According to our definition, disease cell specific genes include both those over-expressed in disease within one or a small subset of cell types and genes over-expressed across multiple cell types. Since the latter category may also be identifiable through bulk expression analysis on whole tissue, we explored whether both tissue-level and cell type-level DE genes contribute to the enrichment of clinically successful targets. To explore this, we aggregated scRNA-seq counts to estimate bulk tissue expression per donor and compared this to genes specifically pinpointed through cell type-aware DE analysis (Supplementary Figure 9A). 74% of disease cell specific successful targets (passing at least phase I) could be identified only with cell type-level DE analysis (Supplementary Figure 9B). In other words, single cell rather than bulk expression data is required to identify most disease cell specific genes. Both tissue-level and cell type-level disease cell specific genes were significantly more likely to be targets of successful drugs (Supplementary Figure 9C). The OR was slightly higher for tissue-level disease markers compared to those only detectable with cell type-aware analysis. This is expected, since bulk expression profiling methods have been incorporated in target discovery pipelines for many years, whilst single cell data has only become available more recently. In addition, we confirmed that drug targets are more strongly enriched in up-regulated genes than down-regulated genes (Supplementary Figure 9A). This aligns with the fact that 890 (73.0%) of 1219 drugs past phase I and 474 (69.5%) of the 695 drugs in phase III or phase IV trials for the diseases in this analysis are categorized as inhibitors, antagonists, degraders, blockers and/or negative regulators of their targets.
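The comparison between tissue-level and cell type-level detection can be sketched as a set operation. The example below assumes that DE testing has already produced one gene set per cell type and one gene set from the aggregated tissue-level analysis; the function name, gene labels, and counts are hypothetical.

```python
def split_by_detection_level(cell_type_de, tissue_de):
    """Partition disease cell specific genes by how they can be detected.

    cell_type_de: dict mapping cell type -> set of over-expressed genes
    tissue_de:    set of genes over-expressed in the tissue-level analysis
    """
    any_cell_type = set().union(*cell_type_de.values())
    return {
        # Found only when cell type labels are used
        "cell_type_only": any_cell_type - tissue_de,
        # Also recoverable from aggregated (bulk-like) profiles
        "tissue_level": any_cell_type & tissue_de,
    }


# Toy example with invented gene labels
cell_type_de = {
    "B cell": {"G1", "G2"},
    "T cell": {"G2", "G3", "G4"},
}
tissue_de = {"G2", "G5"}

split = split_by_detection_level(cell_type_de, tissue_de)
n_total = len(split["cell_type_only"]) + len(split["tissue_level"])
fraction_cell_type_only = len(split["cell_type_only"]) / n_total
```

In this toy case three of the four disease cell specific genes would be missed by a tissue-level analysis, the same kind of fraction as the 74% reported above for successful targets.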
We note that our analysis may be constrained by a lack of consistently curated cell type annotations across various scRNA-seq disease datasets. We use cell type labels based on the Cell Ontology [23], leading to broad and possibly inconsistent cell type annotations. The preferred annotation strategy in several data integration studies which re-use public scRNA-seq data is to cluster gene expression profiles in different datasets de novo and manually re-annotate clusters [30,31]. We hypothesised that accurate cell type annotations could further improve the ability to prioritize cell type specific genes for target discovery. We explored this hypothesis through analysis of three lung diseases (pneumonia, cystic fibrosis and pulmonary fibrosis) for which curated fine-grained annotations from data integration projects are available in the extended Human Lung Cell Atlas (eHLCA) dataset [30] (Supplementary Figure 10A). We computed cell type specific and disease cell specific genes using Cell Ontology-based annotations and eHLCA fine annotations and compared the enrichment of successful targets between these two gene sets. The gene sets with scRNA-seq support obtained by testing on fine or coarse annotations were largely overlapping (Supplementary Figure 10B). The fraction of recovered successful targets and the odds of clinical success were comparable, with slightly increased odds of success when using fine annotations to detect cell type specific genes (Supplementary Figure 10C). For disease specific expression, the odds of success were slightly decreased with fine-grained annotation, possibly because in this case differences between health and disease may manifest as changes in cell type proportions rather than within-cluster differential gene expression.
Target analysis in diseases with scRNA-seq support
Considering the 24 diseases with at least one target with an approved drug, genetic support was significantly associated with clinical success (targets of effective drugs) for 6 indications, cell type specificity for 10 indications and disease cell specificity for 9 indications (Figure 3A, Supplementary Figure 11, Supplementary Table 6). We considered technical factors influencing the variability across diseases in targets supported by scRNA-seq. Firstly, the total number of supported targets correlates with the number of cell types considered in differential expression analysis (Supplementary Figure 12A). For disease cell specific genes, the number of cell types that can be tested is significantly dependent on the number of disease patients in the scRNA-seq cohort (R2 = 0.39, p-value = 1.87e-11). Indeed, we found that with a larger patient cohort we detected more disease cell specific genes (Supplementary Figure 12B). Moreover, when the datasets included at least 10 disease patients, a greater proportion of the supported genes were successful targets (Supplementary Figure 12C). These results support the notion that larger patient cohorts can improve accuracy of detection of disease cell specific targets. Conversely, cell type specific genes appear less dependent on the numbers of donors for the disease-relevant tissue dataset (Supplementary Figure 12B).
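The dependence on cohort size above is summarised by an R2 value. As a minimal sketch, and assuming a simple least-squares linear fit (the model actually used is not stated in this section), the coefficient of determination can be computed as follows; the function name and toy data are hypothetical.

```python
def r_squared(x, y):
    """Coefficient of determination for a least-squares line y ~ a + b*x."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    # Residual and total sums of squares
    ss_res = sum((yi - (intercept + slope * xi)) ** 2 for xi, yi in zip(x, y))
    ss_tot = sum((yi - mean_y) ** 2 for yi in y)
    return 1 - ss_res / ss_tot


# Toy data: invented patient counts and numbers of testable cell types
n_patients = [1, 2, 3]
n_cell_types = [1, 3, 2]
toy_r2 = r_squared(n_patients, n_cell_types)
```

For the toy data the fit explains a quarter of the variance (R2 = 0.25).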
As an exemplar disease with high-quality scRNA-seq data, we examined the characteristics of supported targets for systemic lupus erythematosus (SLE). SLE, commonly referred to as lupus, is a chronic autoimmune disease that can affect various organs and tissues. SLE is characterised by auto-antibody production that triggers inflammation and tissue damage. Current therapy options for SLE include broad acting non-steroidal anti-inflammatory drugs, corticosteroids, and immunosuppressants such as methotrexate and azathioprine to control the immune system’s activity. In addition, newer cell-targeted biologics like belimumab, which targets B-lymphocyte stimulator protein encoded by TNFSF13B, have been approved for treating certain patients with SLE [32,33].
In SLE, many genes have been associated with the disease through genetic analyses (Supplementary Figure 2). However, these genes are not significantly enriched for effective drug targets (Figure 3A). Disease cell specific genes point to drugs with systemic immuno-suppressant effects such as paracetamol (targeting FAAH, PTGS2), inhibitors of DNA replication (targeting polymerases and tubulin genes), and B cell stimulators (targeting TNFSF13B, CD40LG) (Figure 3B). Cell type specific known targets include genes acting in disease-relevant cells, such as toll-like receptors, which are involved in autoantibody production in B cells [34]. The unexplored supported genes prioritized by different omic support classes are all enriched in immune-function gene sets. However, we noticed that different data types prioritize genes with distinct molecular functions (Supplementary Figure 13A-C). For example, different support classes prioritize different genes involved in interferon gamma signalling: genetic association prioritizes genes encoding DNA binding proteins and transcription factors in the pathway, including SMAD and IRF transcription factors; disease cell specific genes are induced by interferon signalling downstream in the pathway, including IFIT and ISG genes. Cell type specific genes include chemokines and membrane bound receptors (e.g. KLRK1, CMKLR1, IL2RB) (Supplementary Figure 13D).
As a second example, we examined supported targets in pulmonary emphysema. Pulmonary emphysema is a condition characterized by the gradual destruction of the air sacs (alveoli) in the lungs, resulting in enlarged and rigid air spaces that impair gas exchange [35]. When pulmonary emphysema is coupled with inflammation of the airways, the two conditions are known as chronic obstructive pulmonary disease (COPD). The primary therapy options include bronchodilators, such as short- or long-acting agonists of beta-2-adrenergic receptors that cause the relaxation of airway smooth muscles and anticholinergic medications that inhibit bronchoconstriction [36]. Oral phosphodiesterase protein family inhibitors such as Roflumilast are similarly used to manage smooth muscle relaxation, vasodilatory, and bronchodilatory effects in patients with pulmonary emphysema and COPD. Inhaled corticosteroids may be used as an add-on therapy to reduce local inflammation.
In our analysis, known drug targets were not supported by direct genetic evidence (Figure 3A). Given that pulmonary emphysema is a stage of a progressive lung disease, the absence of robust genetic evidence could be attributed to limited size of patient cohorts at this specific stage of disease. Despite single cell data being available only from 3 patient samples, multiple safe, effective, and approved therapeutic targets were prioritised using our analysis as cell type specific in the disease-relevant tissue (lung) (Figure 3C). For example, angiotensin II receptor (encoded by AGTR1 gene) antagonist Sacubitril/Valsartan is an effective drug in patients with pulmonary hypertension/emphysema [37], despite it being predominantly used for the treatment of cardiac diseases. Even though AGTR1 lacked genetic association with lung disease or function, our analysis suggests that AGTR1 is specifically expressed in lung smooth muscle cells and fibroblasts in scRNA-seq data (Supplementary Figure 14). AGTR1 presents an example of targets where single cell data analysis might enable interpretability of cell type relevance for disease progression.
We also found that for broad therapeutics that affect a family of genes, single cell data could provide evidence for the most relevant family members based on specificity of expression in the disease-relevant tissues. For example, the non-selective inhibitor Roflumilast targets all phosphodiesterase-4 genes (PDE4A-D), however, only PDE4C shows selective expression in activated smooth muscle cells and alveolar type 2 cells in the lung (Supplementary Figure 14). Non-selective inhibitors can cause multiple side effects. In the case of Roflumilast, expression of PDE4B and PDE4D in the sensory nerves is thought to be responsible for nausea side effects [38,39]. Therefore, single cell data can provide rationale for development of selective PDE4C inhibitors for the treatment of pulmonary emphysema and other lung conditions associated with hypertension.
Discussion
Lack of efficacy and safety are the leading causes for phase II and III clinical trial failures [40]. Additionally, a promising target may fail to progress to phase I because of multiple reasons. These include inability to establish a mechanistic link between target biology and indication (target validation failure), insufficient promising chemicals, and/or safety risks found during pharmacokinetic and early toxicology studies [4]. Taken together, all these different causes account for the limited probability of a candidate therapeutic target and its cognate drug passing all stages of pre-clinical, clinical research, and regulatory approval (2005-2010 industry average: 5% [4]). Data-driven frameworks in drug discovery can effectively mitigate some of these risks, as demonstrated by the use of genetics data to support target-disease associations [16], but attrition from target ID to clinic remains high [4]. To further increase chances of success, target discovery workflows increasingly access additional information aggregated from pre-clinical data resources, including data from animal models, over-expression in disease-relevant bulk tissue samples, disease pathway analyses, and other bioinformatics resources, as exemplified by the Open Targets Platform [1]. Characterizing the potential impact and biases of different data sources for target credentialing pipelines is critical to push new technologies to translational applications.
Single-cell technologies, along with the growing availability of large, shared single-cell datasets on diseases and healthy controls [21], have opened up unprecedented opportunities to understand target biology at cellular resolution across disease areas and in diverse patient populations. Single-cell RNA-seq has been applied to investigate pathways driving onset and progression of diseases [41–43], to understand the mechanism of action of different therapeutics [44,45], and to discover biomarkers for patient stratification [46]. This suggests a remarkable depth and breadth of information extractable from scRNA-seq datasets that could support drug discovery.
The goal of this study was to measure how much using single-cell RNA sequencing data from disease-relevant tissues can improve the chances of success for therapeutics by systematically identifying connections between targets and diseases. By aggregating data for 30 diseases affecting 13 tissues, we found that candidate target genes supported by scRNA-seq evidence have approximately three times the chances to lead to clinically successful therapies (Figure 2B).
The association between scRNA-seq support and target clinical success is in line with the fact that human diseases are typically tissue and cell type specific [47]. For example, tissue and cell-type specific eQTLs are enriched for disease-associated SNPs [48–50]. Given the typical timeframes of drug development, it is highly unlikely that any of the targets considered have been initially prioritized or validated using single-cell transcriptomics. While it is possible that other types of tissue-level transcriptomic data have driven decisions in target development, we do not expect these instances to significantly bias the results of our analysis on cell type specific expression. Furthermore, we found that scRNA-seq supported targets were more likely to pass phase I and II than to reach approval. It is possible that cell type specificity is a better indicator of low toxicity than broad efficacy, although this question remains to be further explored.
We compared targets prioritized by scRNA-seq with those prioritized by genetic evidence, which has been highlighted as an important predictor of clinical success [16,18]. Consistent with previous results, for the diseases and target sets included in this analysis, we observed a strong and statistically significant association between direct genetic support for target-disease pairs and clinical development success (Figure 2B). Previous work has highlighted that targets supported by human genetic data are more likely to be successful [16]. It is likely that this has led the pharmaceutical industry to allocate greater resources to development of drugs for these targets and has therefore created a bias amongst the targets in clinical development. However, we also find that direct genetic association support exists only for a subset of target-disease pairs with drugs in clinical development, and scRNA-seq support exists for a larger set of target-disease pairs, with few targets supported by both types of omic evidence (Figure 2A; Supplementary Figure 4). These complementary sets of targets have distinct molecular and druggability characteristics (Figure 2C, Supplementary Figure 5B). For example, we observed that genetic support tends to prioritize evolutionarily conserved genes (Supplementary Figure 5B-C, Supplementary Figure 7B), as previously reported [27]. Loss-of-function-tolerant classes of druggable targets, such as GPCRs and transporters, are instead prioritized by cell type or disease cell specificity, although scRNA-seq data might be biased towards other classes, such as highly expressed genes. We speculate that cell type specificity might prioritize targets of therapies managing symptoms or modulating disease-relevant biological processes parallel to or downstream of genetic causation, which are seldom prioritized by genetic analysis [19,20].
Importantly, detecting associations between genetic variants and disease requires data from hundreds to thousands of individuals. In our analysis, association between clinical success and scRNA-seq support was drawn from analysis of tissue from tens of individuals, and we show that increasing the size of the scRNA-seq cohort to hundreds of patients increases the fraction of prioritized successful targets even further (Supplementary Figure 12C).
In this study, we considered two distinct patterns of cell type specific expression: cell type specific expression in disease-relevant tissue (cell type specificity) and cell type specific over-expression in disease-relevant tissue from disease patients compared to controls (disease cell specificity). Both classes of genes were significantly associated with clinical success in several diseases (Figure 3A). Cell type specific targets were less dependent on technical features of the scRNA-seq dataset (Figure 2A, Supplementary Figure 12). This is important because measuring cell type specificity does not require patient data, and this could be computed systematically on open resources such as the Human Cell Atlas Data Portal (data.humancellatlas.org) or the CZ CellxGene database [21].
When considering disease cell specific genes, we found that both genes over-expressed in disease within small subsets of cell types, and genes over-expressed at tissue-level, contribute to the association with clinical success (Supplementary Figure 9). Bulk transcriptomics methods have been used for longer in clinical development pipelines and this is reflected in stronger associations with success, although most disease cell specific successful targets were only identified with cell type-aware analysis. Of note, in this study we define disease cell specificity with naïve cell type-level differential expression analysis, where technical effects are only partially mitigated. We expect that improved experimental design and statistical methods to recover expression differences in scRNA-seq in normal and diseased tissues [51–53] and to distinguish disease-associated cell states [54–56] could further improve the set of target genes and will be highly impactful for target discovery programmes.
Our study is not free of limitations. We rely on the Cell Ontology-based cell type labels [23] provided by data curators upon submission to the CZ CellxGene Discover database. This approach has two primary drawbacks. Firstly, the Cell Ontology’s incompleteness may result in labelling rare tissue-specific subpopulations with broad cell type terms. Secondly, inconsistencies may arise as different data curators use the same term for transcriptionally distinct cells or conflicting terms for identical phenotypes. While our label harmonization strategy addresses the latter issue to some extent, it introduces coarser annotations. We anticipate that these issues will be mitigated by increased availability of expertly curated cell type annotations across human tissues, and by unified models for cell type annotation [57]. These will not only enhance the identification of promising drug targets (Supplementary Figure 10) but also facilitate more precise identification of disease-relevant cell types and cellular mechanisms. Additionally, our analysis encompassed both historical and active clinical development data for drug targets, for some of which the ultimate outcomes are still unknown. Finally, we did not account for the similarity between indications, which is important when considering related diseases where genetic association may be lacking for a specific indication (e.g. pulmonary emphysema) but is present for related traits (e.g. lung function).
Looking forward, more sophisticated analyses of cell atlases will further boost drug discovery efforts. For example, analysis of drug target expression patterns across cell types has been used to assess re-purposing potential and on-target toxicities [58]. Methods to infer differentiation trajectories [59,60], cell-cell interactions [61,62], regulatory networks [63], and immune repertoires [64] provide additional unexplored space for novel targets. Furthermore, we envision that high-resolution spatial transcriptomics will provide an added level of insight into drug target relevance based on their expression and disease tissue context [65–67]. Insights on cell type and disease cell specific targets gained using high-throughput genomics will inform the design of next generation precision therapeutics, for example antibody-drug conjugates or lipid nanoparticle-mRNA vaccines. Overall, our study provides a framework to assess the potential impact of alternative data analysis methods and modalities on target discovery.
In summary, our work indicates that single-cell data can be a valuable tool for guiding the process of drug target prioritisation and enhancing our understanding of the cellular basis of safe, effective, and approved treatments for diseases.
Methods
Single-cell RNA-seq data collection from CZ CellxGene Discover platform
To select a set of diseases and scRNA-seq datasets, we downloaded cell- and dataset-level metadata for all Homo sapiens datasets from the CZ CellxGene Discover database, using the cellxgene_census Python API (census version: 2023-07-25) [21]. Disease-relevant (DR) tissues were manually annotated for the 58 disease terms in the database. We excluded datasets profiled with targeted scRNA-seq assays (BD Rhapsody), inDrop and STRT-seq. We further excluded fetal samples, based on Human Developmental Stage Ontology [68], where available, and by manual curation for 12 datasets where stages were annotated as “unknown”. Ten disease terms were grouped into 4 broader terms (Supplementary Table 1).
After curation, 30 disease terms were retained for association analysis. Reasons to exclude diseases included: missing overlapping disease terms in Open Targets, missing data from DR tissue, data available from fewer than 3 donors with the disease, and download errors (see Supplementary Table 1 for a complete list of diseases and reasons to exclude from analysis). After selecting suitable datasets, for each disease we downloaded full transcriptome gene expression profiles for all cells from the DR tissue from healthy donors and disease patients, as well as cell type labels (Cell Ontology terms [23]) and sample-level technical metadata (scRNA-seq assay and suspension type, Supplementary Figure 15).
To ensure consistency in granularity of cell type annotations across studies, we implemented a rollup procedure on the Cell Ontology tree, by relabelling cells with parent terms if a given term is a descendant of another term in the dataset (see example outcome in Supplementary Figure 1). For each term, the search for parent terms was limited only to a level of depth in the ontology tree given by the total number of ancestors of the term divided by a factor of 5. For example, if a term had 20 ancestors in the ontology tree, we searched for the 4 closest parent terms in the dataset for relabelling. We recognize that this step reduces the resolution of cell type annotations, yielding broader and partially redundant annotation labels. However, it mitigates the need for batch correction, clustering, and manual cell type annotation across 30 datasets. We defined the cell type labels used after roll-up as high-level cell type annotations.
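The roll-up procedure can be illustrated with a minimal sketch. The ontology below is a hypothetical toy hierarchy, not the real Cell Ontology, and the function names are illustrative. We also assume that a term is relabelled with the closest ancestor found in the dataset within the search window, and that relabelling is done in a single pass; the text above does not specify these details.

```python
from collections import deque

# Hypothetical toy hierarchy (child -> parents); NOT the real Cell Ontology.
TOY_PARENTS = {
    "cell": [],
    "native cell": ["cell"],
    "hematopoietic cell": ["native cell"],
    "leukocyte": ["hematopoietic cell"],
    "lymphocyte": ["leukocyte"],
    "T cell": ["lymphocyte"],
    "B cell": ["lymphocyte"],
    "CD4 T cell": ["T cell"],
}


def ancestors_by_distance(term, parents):
    """Return all ancestors of `term`, closest first (breadth-first search)."""
    seen, order = set(), []
    queue = deque(parents.get(term, []))
    while queue:
        node = queue.popleft()
        if node in seen:
            continue
        seen.add(node)
        order.append(node)
        queue.extend(parents.get(node, []))
    return order


def rollup(labels, parents, factor=5):
    """Map each label to its closest ancestor present in the dataset.

    The search is limited to the (number of ancestors // factor) closest
    ancestors; labels with no such ancestor in the dataset are kept as-is.
    Choosing the closest match is an assumption of this sketch.
    """
    present = set(labels)
    mapping = {}
    for term in present:
        ancestors = ancestors_by_distance(term, parents)
        window = ancestors[: len(ancestors) // factor]
        mapping[term] = next((a for a in window if a in present), term)
    return mapping


dataset_labels = ["T cell", "CD4 T cell", "B cell"]
high_level = rollup(dataset_labels, TOY_PARENTS)
```

In this toy example "CD4 T cell" has 6 ancestors, so only its closest ancestor ("T cell") is searched; because "T cell" is present in the dataset, the finer label is rolled up to it.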
Differential expression analysis and extraction of scRNA-seq supported gene-disease pairs
We identified cell type specific and disease cell specific genes for each disease using differential expression (DE) analysis.
For each disease dataset, we aggregated cell-level gene expression profiles summing counts and size factors (total counts per cell) by donor and high-level cell type annotations (hereafter, pseudo-bulks), following best practice recommendations for DE analysis on scRNA-seq data [69,70]. Only cell types found in at least 3 healthy donors (and 3 disease donors for disease cell specificity analysis) were included in DE testing. To identify cell type specific genes, we selected pseudo-bulks from healthy donors from the disease-relevant tissue and we tested for DE between pseudo-bulks of one cell type against all other cell types. To identify disease cell specific genes, for each cell type we tested for DE between diseased donors and healthy donors. For each test, we selected the top 5,000 highly variable genes amongst considered pseudo-bulks, using the method implemented in the R package scran [71]. We tested for differential expression between groups with the edgeR quasi-likelihood test [72] using the implementation in the R package glmGamPoi [73]. In all tests, we modelled the number of cells per pseudo-bulk as a confounder, as well as suspension type (cell or nuclei) and scRNA-seq assay where possible (when the confounder was not perfectly collinear with the disease label). After DE analysis, we obtained the effect size (log-fold change, logFC) and Benjamini-Hochberg adjusted p-values for each tested gene in each tested cell type.
We annotated a gene-disease (G-D) pair as cell type specific when the gene is significantly over-expressed in at least one cell type compared to all other cell types in healthy disease-relevant tissue (adjusted p-value < 0.01, logFC > 5). The choice of logFC threshold was motivated by the observation that genes significantly over-expressed at lower log-fold changes are often ubiquitously highly expressed, while those at higher fold changes are genuinely cell type specific (Supplementary Figure 8C). We annotated a G-D pair as disease cell specific when the gene is significantly over-expressed in disease in at least one cell type in disease-relevant tissue (adjusted p-value < 0.01, logFC > 0.5). The total number of supported G-D pairs for each disease is shown in Supplementary Figure 2. We annotated a G-D pair as cell type and disease cell specific if supported by both classes of scRNA-seq support.
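As a minimal sketch of these annotation rules for a single disease, the thresholds can be applied to DE results as below. The gene names, record layout and function name are hypothetical; only the thresholds are taken from the text above.

```python
def supported_genes(de_results, min_logfc, max_adj_pval=0.01):
    """Genes significantly over-expressed in at least one tested cell type."""
    return {
        row["gene"]
        for row in de_results
        if row["adj_pval"] < max_adj_pval and row["logFC"] > min_logfc
    }


# Hypothetical DE output: one cell type vs all others, healthy donors only.
celltype_de = [
    {"gene": "G1", "cell_type": "T cell", "logFC": 6.2, "adj_pval": 1e-4},
    {"gene": "G2", "cell_type": "T cell", "logFC": 2.0, "adj_pval": 1e-6},
    {"gene": "G3", "cell_type": "B cell", "logFC": 7.0, "adj_pval": 0.2},
]
# Hypothetical DE output: disease vs healthy donors within each cell type.
disease_de = [
    {"gene": "G1", "cell_type": "T cell", "logFC": 0.8, "adj_pval": 0.001},
    {"gene": "G2", "cell_type": "B cell", "logFC": 1.2, "adj_pval": 0.005},
]

cell_type_specific = supported_genes(celltype_de, min_logfc=5)
disease_cell_specific = supported_genes(disease_de, min_logfc=0.5)
# Genes supported by both classes of scRNA-seq evidence.
both = cell_type_specific & disease_cell_specific
```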
Known drug relationships from Open Targets
Open Targets direct association evidence was accessed via download from the Open Targets Platform (version 23.02) [1,25]. Downloads used for this analysis were the ‘Diseases’ and ‘Direct Associations by Type’ tables. Experimental Factor Ontology (EFO) disease terms used in Open Targets were mapped to their corresponding terms used in the CellxGene database (MONDO IDs) using the ontology tree available in the Open Biological and Biomedical Ontology Foundry (https://obofoundry.org/ontology/mondo.html). We annotated G-D pairs for which approved or clinical candidate drugs exist using the ChEMBL evidence score from the Open Targets Platform. Briefly, each G-D pair is assigned a score between 0 and 1 based on clinical precedence, then the score is down-weighted by half if the clinical trial has stopped early for negative results (no effect of the drug) or safety and side effects concerns. Following the ChEMBL evidence scoring in Open Targets (https://platform-docs.opentargets.org/evidence#chembl), we classified G-D pairs with a ChEMBL evidence score > 0.1 as safe (> phase I), pairs with score > 0.2 as effective (> phase II), and pairs with score > 0.7 as approved (> phase III). While we do not explicitly exclude gene-disease pairs supported by failed trials, the down-weighting in Open Targets ensures that targets that failed in early clinical trials are excluded, and that targets that failed in phase III are at most classified as passing phase II.
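The classification of G-D pairs by score can be sketched as below. The function names and the example score are hypothetical; the thresholds and the halving rule are the ones described above, and the sketch does not reproduce how Open Targets derives the score itself.

```python
def downweight_if_stopped(score, stopped_early):
    """Halve the score when the trial stopped early for negative results or safety."""
    return score / 2 if stopped_early else score


def clinical_status(score):
    """Classify a G-D pair from its ChEMBL evidence score (thresholds from the text)."""
    return {
        "is_safe": score > 0.1,       # passed phase I
        "is_effective": score > 0.2,  # passed phase II
        "is_approved": score > 0.7,   # passed phase III
    }


# Hypothetical pair with a score of 0.7 before down-weighting, whose trial
# was stopped early.
status = clinical_status(downweight_if_stopped(0.7, stopped_early=True))
```

With these hypothetical inputs the halved score is 0.35, so the pair is classified as safe and effective but not approved.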
Genetic association
We annotated G-D pairs with genetic support using the genetic direct association score provided in Open Targets, aggregating evidence for association of genes and rare and common variants from several sources (https://platform-docs.opentargets.org/evidence) [1]. We classified as supported by genetics any G-D pair with genetic association score > 0.
Association between omic evidence and clinical success
To test for association between omic evidence (cell type specificity, disease cell specificity, genetic association) and clinical success (passing clinical phase I, II or III), we computed the odds ratio and Fisher exact test p-value under the null hypothesis that the true ratio between the odds of being a successful G-D pair with omic support and of being successful without support is 1. In all association tests, drug indications for clinical success and data for omic support are aligned by disease. To compute odds ratios, 95% confidence intervals and p-values, we used the odds ratio calculation implementation in the Python package scipy [74].
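For illustration, the test can be sketched with the Python standard library alone. This is not the scipy implementation used in the analysis: the sketch reports the sample odds ratio, which may differ from the estimate returned by scipy, and it omits confidence intervals. The counts are hypothetical, and the one-sided alternative (odds ratio greater than 1) follows the p-value description in Supplementary Table 4.

```python
from math import comb


def two_by_two(n_universe, n_success, n_supported, n_supported_success):
    """Build the 2x2 table (a, b, c, d) for one disease and one gene universe.

    a: supported and successful      b: supported, not successful
    c: unsupported and successful    d: unsupported, not successful
    """
    a = n_supported_success
    b = n_supported - a
    c = n_success - a
    d = n_universe - n_supported - c
    return a, b, c, d


def sample_odds_ratio(a, b, c, d):
    """Sample (unconditional) odds ratio; infinite if b * c is zero."""
    return (a * d) / (b * c) if b * c else float("inf")


def fisher_greater_pvalue(a, b, c, d):
    """One-sided Fisher exact p-value (alternative: odds ratio > 1).

    Sums hypergeometric probabilities of tables with at least `a`
    supported-and-successful pairs, keeping the margins fixed.
    """
    n = a + b + c + d
    row1 = a + b  # supported pairs
    col1 = a + c  # successful pairs
    upper = min(row1, col1)
    numerator = sum(
        comb(col1, k) * comb(n - col1, row1 - k) for k in range(a, upper + 1)
    )
    return numerator / comb(n, row1)


# Hypothetical counts: 1000 genes, 20 successful targets, 100 supported
# genes, 8 of which are successful.
table = two_by_two(1000, 20, 100, 8)
odds = sample_odds_ratio(*table)
pval = fisher_greater_pvalue(*table)
```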
To enumerate the space of possible G-D pairs for odds ratios analysis, we used the following gene sets as “gene universes”: protein-coding genes (N=19620) were obtained from Ensembl v108; antibody-tractable (N=12527) and small molecule-tractable (N=6550) genes, based on the Open Targets’ druggability assessment (https://platform-docs.opentargets.org/target/tractability), were obtained from Minikel et al. [18]; genes targeted by therapies in clinical trials for any indication (known drug targets, N=936) were obtained from Open Targets v23.02; sets of typically druggable targets (Supplementary Figure 5B-C) were obtained from Minikel et al. [18]. Unless otherwise specified, odds ratios shown in the manuscript were computed using protein-coding genes as the gene universe.
Drug-level analysis
We extracted compound-level data from Open Targets for 17,095 drug molecules together with their year of first approval, list of indications, list of targets, and maximum clinical phase using Open Targets “molecule” and “mechanismOfAction” data objects. Among these drugs, we identified those that had any of the 30 diseases considered in the target-level analysis in their list of approved or investigational indications (n = 2358 drugs). We then further narrowed this list to drugs in phase II or greater (n = 1219) and to drugs in phase III or phase IV clinical trials (n = 695) for the 30 diseases considered in this analysis. Drugs were annotated as having single cell or direct genetic association support for the considered indications if any of their target-disease pairings had this evidence in the preceding target-disease evidence analysis. To examine the number of indications for each drug for one of the 30 diseases in our analysis with genetic or scRNA-seq support, we aggregated Open Targets drug information and counted the total number of approved or investigational indications for each of these drugs.
We used a multiple linear regression model to investigate the possible associations of single cell support, and direct genetic support with the number of indications approved or under investigation per drug, accounting for year of the clinical trial as a confounder (Supplementary Figure 6). To satisfy model assumptions, log(number of indications per drug) was used as the dependent variable to address right-skew in number of indications. Single cell and genetic evidence could be synergistic, so an interaction term was used between these during modelling (Supplementary Table 5).
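As a sketch of the model described above, and assuming standard treatment coding for the two binary support indicators, the regression for drug i can be written as follows. The symbols are introduced here for illustration and the base of the logarithm is not specified in the text.

```latex
\log(n_i) = \beta_0
          + \beta_1 \,\mathrm{year}_i
          + \beta_2 \,\mathrm{sc}_i
          + \beta_3 \,\mathrm{gen}_i
          + \beta_4 \,(\mathrm{sc}_i \times \mathrm{gen}_i)
          + \varepsilon_i
```

Here n_i is the number of approved or investigational indications of drug i, year_i is the year covariate, sc_i and gen_i are 0/1 indicators of single cell and direct genetic support, and the coefficient of the product term captures any synergy between the two types of evidence.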
Comparison of fine annotation and ontology-based annotation on lung diseases
To compare gene-disease pairs prioritized with ontology-based annotation and with uniform integration-based annotations, we downloaded the extended Human Lung Cell Atlas (eHLCA) [30] using the CellxGene census API (CellxGene census datasetID: 9f222629-9e39-47d0-b83f-e08d610c7479), selecting normal lung and patient data for 3 diseases (pneumonia, cystic fibrosis and pulmonary fibrosis). These diseases were selected because all scRNA-seq data considered in the ontology-based analysis was included in the eHLCA dataset, therefore allowing us to compare the impact of annotations on matched data. We pseudo-bulked each disease dataset using the finest author-provided annotation (column: ann_finest_level in CellxGene metadata) and performed differential expression analysis as described above.
Disease-specific target analysis
To categorize the targets supported by different classes of omic evidence in systemic lupus erythematosus and pulmonary emphysema, we used the annotation of tractable gene classes as defined by Minikel et al. [18]. Gene ontology enrichment analysis was performed using the Enrichr method [75] as implemented in the Python package GSEApy [76]. The categorization of IFN-gamma pathway genes into receptors, transcription factors, targets, and secreted proteins (Supplementary Figure 13D) was obtained from OmniPath [77] and Dorothea [78,79].
Data availability
All scRNA-seq data analysed in this study is available via the CZ CellxGene Discover database and CxG Census API (https://chanzuckerberg.github.io/cellxgene-census/, version: 2023-07-25). Data on clinical precedence for known drugs for each target-disease pair, as well as gene-disease genetic association scores, was downloaded from Open Targets (version 23.02, https://platform.opentargets.org/downloads/data). Data on gene tolerance to loss-of-function mutations (LOEUF, loss-of-function observed/expected upper bound fraction) was extracted from gnomAD.v2.1’s pLoF metrics by gene data [80] (https://gnomad.broadinstitute.org/downloads). Gene sets used as universes for association analysis are available at https://github.com/emdann/sc_target_evidence/blob/master/data/universe_genes.csv. Processed datasets and analysis outputs are available as supplementary tables and via figshare (doi:10.6084/m9.figshare.25360129).
Code availability
All code to reproduce data downloads, processing and analysis is available at https://github.com/emdann/sc_target_evidence.
Author contributions
ED, ET, RE, GG, VS, EdR and SAT conceptualized the study. ET performed curation of Open Targets data and drug-level data analysis. ED performed curation and processing of scRNA-seq data, differential expression analysis, statistical analysis of association between omic evidence and clinical success, and disease-level target analysis. All authors interpreted the results. ED and ET made the figures. ED, ET, RE and EdR wrote the original manuscript draft. All authors edited and approved the final version of the manuscript. EdR and SAT supervised the work.
Conflicts of interest
ED has consulted for Ensocell Therapeutics. ET, GG, FN, EdR are employees of Sanofi and own Sanofi stock. VS has been leading the application of single-cell biology for drug development at Sanofi since 2018 and owns Sanofi stock. RE is a co-founder and employee of Ensocell Therapeutics. SAT has consulted for or been a member of scientific advisory boards at Qiagen, Sanofi, GlaxoSmithKline and ForeSite Labs. She is a consultant and equity holder for TransitionBio and Ensocell Therapeutics.
Supplementary Tables
Supplementary Table 1: Table of diseases available in CZ CellxGene database considered for study
[disease] name of disease used in study
[disease_ontology_id] MONDO identifier for disease used in study
[disease_relevant_tissue] Manually curated annotation for disease-relevant tissue
[disease_name_original] Name of disease found in CZ CellxGene database
[disease_ontology_id_original] MONDO identifier for disease found in CZ CellxGene database
[reason2exclude] if not NA, description of reason to exclude disease from final analysis
Supplementary Table 2: Sample-level metadata for scRNA-seq datasets from CZ CellxGene database used in study
[assay] scRNA-seq protocol
[tissue] original tissue annotation
[tissue_general] high-level mapping of a tissue
[suspension type] indicates whether cells or nuclei were isolated
[disease] disease condition of donor
[dataset_id] Identifier for dataset in CellXGene Census
[donor_id] Identifier for donor in dataset
[development_stage_ontology_term_id] Human Developmental Stages ontology term for age of donor
[sample_id] sample identifier (donor, assay, tissue)
[disease_name_original] name of disease found in CZ CellxGene database
[disease_ontology_id_original] MONDO identifier for disease found in CZ CellxGene database
[disease_ontology_id] MONDO identifier for disease used in study
[disease_relevant_tissue] Manually curated annotation for disease-relevant tissue
Supplementary Table 3: Table of target-disease pairs with annotation of clinical success and omic support
[gene_id] Ensembl ID for gene
[disease_ontology_id] MONDO identifier for disease
[disease] name of disease
[gene_name] gene name
[gene_class] annotation of tractable gene classes
[genetic_association] OpenTargets genetic association score (https://platform-docs.opentargets.org/evidence#evidence-data-sources)
[known_drug] OpenTargets known drug score (https://platform-docs.opentargets.org/evidence#evidence-data-sources)
[is_druggable, is_safe, is_effective, is_approved] clinical status for each gene-disease pair
[GWAS_evidence] is gene-disease pair supported by genetic association
[ct_marker_evidence] is gene-disease pair supported by cell type specificity
[disease_evidence] is gene-disease pair supported by disease cell specificity
[ct_marker_and_disease_evidence] is gene-disease pair supported by cell type and disease cell specificity
[disease_evidence_celltype] is gene-disease pair supported by disease cell specificity (celltype-level)
[disease_evidence_tissue] is gene-disease pair supported by disease cell specificity (tissue-level)
Supplementary Table 4: Results of association analysis between omic support and clinical success across diseases
[odds_ratio] Odds ratio of association between evidence and clinical success
[ci_low] 95% confidence interval of odds ratio (bottom)
[ci_high] 95% confidence interval of odds ratio (top)
[pval] Fisher exact test p-value for enrichment (alternative hypothesis: odds ratio higher than 1)
[n_success] Number of successful gene-disease pairs
[n_insuccess] Number of not successful gene-disease pairs
[n_supported_approved] Number of successful gene-disease pairs supported by omic evidence
[n_supported] Total number of gene-disease pairs supported by omic evidence
[evidence] omic support class (all_sc_evidence indicates cell type and disease cell specific genes)
[clinical status] Clinical success class
[universe] Name of considered gene universe
[universe_size] Number of genes in gene universe
Supplementary Table 5: Results of multiple linear regression model predicting log(number of investigational or approved indications of a drug) from its year of first approval, drug target-disease support by any single cell evidence, and drug target-disease support by any direct genetic association.
Supplementary Table 6: Results of association analysis between omic support and clinical success for each disease (gene universe: protein-coding genes)
[odds_ratio] Odds ratio of association between evidence and clinical success
[ci_low] 95% confidence interval of odds ratio (bottom)
[ci_high] 95% confidence interval of odds ratio (top)
[pval] Fisher exact test p-value for enrichment (alternative hypothesis: odds ratio higher than 1)
[n_success] Number of successful gene-disease pairs
[n_insuccess] Number of not successful gene-disease pairs
[n_supported_approved] Number of successful gene-disease pairs supported by omic evidence
[n_supported] Total number of gene-disease pairs supported by omic evidence
[evidence] omic support class (all_sc_evidence indicates cell type and disease cell specific genes)
[clinical status] Clinical success class
[disease_ontology_id] MONDO identifier for disease
[disease] name of disease
[disease_relevant_tissue] Manually curated annotation for disease-relevant tissue
Supplementary Figures
Acknowledgements
We thank Jeffrey Greves and members of the Teichmann group for valuable discussions on this project. ED, KBM and SAT acknowledge Wellcome Sanger core funding (WT206194).
References
- 1.
- 2.
- 3.
- 4.
- 5.
- 6.
- 7.
- 8.
- 9.
- 10.
- 11.
- 12.
- 13.
- 14.
- 15.
- 16.
- 17.
- 18.
- 19.
- 20.
- 21.
- 22.
- 23.
- 24.
- 25.
- 26.
- 27.
- 28.
- 29.
- 30.
- 31.
- 32.
- 33.
- 34.
- 35.
- 36.
- 37.
- 38.
- 39.
- 40.
- 41.
- 42.
- 43.
- 44.
- 45.
- 46.
- 47.
- 48.
- 49.
- 50.
- 51.
- 52.
- 53.
- 54.
- 55.
- 56.
- 57.
- 58.
- 59.
- 60.
- 61.
- 62.
- 63.
- 64.
- 65.
- 66.
- 67.
- 68.
- 69.
- 70.
- 71.
- 72.
- 73.
- 74.
- 75.
- 76.
- 77.
- 78.
- 79.
- 80.