Abstract
The cost of drug discovery and development is driven primarily by failure1, with just ∼10% of clinical programs eventually receiving approval2–4. We previously estimated that human genetic evidence doubles the success rate from clinical development to approval5. In this study we leverage the growth in genetic evidence over the past decade to better understand the characteristics that distinguish clinical success and failure. We estimate the probability of success for drug mechanisms with genetic support is 2.6 times greater than those without. This relative success varies among therapy areas and development phases, and improves with increasing confidence in the causal gene, but is largely unaffected by genetic effect size, minor allele frequency, or year of discovery. These results suggest we are far from reaching peak genetic insights to aid the discovery of targets for more effective drugs.
Human genetics is one of the only forms of scientific evidence that can demonstrate the causal role of genes in human disease. It provides a crucial tool for identifying and prioritizing potential drug targets, providing insights into the expected effect (or lack thereof6) of pharmacological engagement, dose-response relationships7–10, and safety risks6,11–13. Nonetheless, many questions remain about the application of human genetics in drug discovery. Genome-wide association studies (GWAS) of common, complex traits, including many diseases, generally identify variants of small effect. This contributed to early skepticism of the value of GWAS14. Anecdotally, such variants can point to highly successful drug targets7–9, and yet, genetic support from GWAS is somewhat less predictive of drug target advancement than support from Mendelian disease5,15.
In this paper we investigate several open questions regarding the use of genetic evidence for prioritizing drug discovery. We explore the characteristics of genetic associations that are more likely to differentiate successful from unsuccessful drug mechanisms, exploring how they differ across therapy areas and among discovery and development phases. We also investigate how close we may be to saturating the insights we can gain from genetic studies for drug discovery and how much of the genetically-supported drug discovery space remains clinically unexplored.
To characterize the drug development pipeline, we filtered Citeline Pharmaprojects for monotherapy programs added since 2000 annotated with a highest phase reached and assigned both a human gene target (usually the gene encoding the drug target protein) and an indication defined in Medical Subject Headings (MeSH) ontology. This resulted in 29,476 target-indication (T-I) pairs for analysis (Extended Data Fig. 1A). Multiple sources of human genetic associations totaled 81,939 unique gene-trait (G-T) pairs, with traits also mapped to MeSH terms. Intersecting these datasets yielded an overlap of 2,166 T-I and G-T pairs (7.3%) where the indication and the trait MeSH terms had a similarity ≥0.8; we defined these T-I pairs as possessing genetic support (Extended Data Fig. 1B, 2A, see Methods). The probability of having genetic support, or P(G), was higher for launched T-I pairs than those in historical or active clinical development (Figure 1A). In each phase, P(G) was higher than previously reported5,15, owing, as expected15,16, more to new G-T discoveries than to changes in drug pipeline composition (Extended Data Fig. 3A-F). For ensuing analyses, we considered both historical and active programs. We defined success at each phase as a T-I pair transitioning to the next development phase (e.g. from phase I to II), and we also considered overall success, defined as advancing from phase I to a launched drug. We defined relative success (RS) as the ratio of the probability of success, P(S), with genetic support to the probability of success without genetic support (see Methods). We tested the sensitivity of RS to various characteristics of genetic evidence. RS was sensitive to the indication-trait similarity threshold (Extended Data Fig. 2A), which we set to 0.8 for all analyses herein. RS was >2 for all sources of human genetic evidence examined (Figure 1B).
RS was highest for OMIM (RS = 3.7), in agreement with prior reports5,15; this was not the result of a higher success rate for orphan drug programs (Extended Data Fig. 2B), a designation commonly acquired for rare diseases. Rather, it may owe partly to the difference in confidence in causal gene assignment between Mendelian conditions and GWAS, supported by the observation that the RS for Open Targets Genetics (OTG) associations was sensitive to the confidence in variant-to-gene mapping as reflected in the minimum share of locus-to-gene (L2G) score (Fig. 1C). The differences common and rare disease programs face in regulatory and reimbursement environments4 and differing proportions of drug modalities9 likely contribute as well. OMIM and GWAS support were synergistic with one another (Fig. S2B). Somatic evidence from IntOGen had an RS of 2.3 in oncology (Extended Data Fig. 2C), similar to GWAS, but analyses below are limited to germline genetic evidence unless otherwise noted.
As sample sizes grow ever larger with a corresponding increase in the number of unique G-T associations, some expect17 GWAS findings to become less useful for the purpose of drug target selection. We explored this in several ways. We investigated the year that genetic support for a T-I pair was first discovered, under the expectation that more common and larger effects are discovered earlier. Although there was a slightly higher RS for discoveries from 2007-2010 that was largely driven by early lipid and cardiovascular-related associations, the effect of year was overall non-significant (P = 0.46, Fig. 1D). Results were similar when replicate associations or OMIM discoveries were included (Extended Data Fig. 2D-F). We next divided up GWAS-supported drug programs by the number of unique traits associated with each gene. RS nominally increased with the number of associated genes, by 0.048 per gene (P = 0.024, Fig. 1D). This is unlikely to be due to successful genetically-supported programs inspiring other programs, as most genetic support was discovered retrospectively (Extended Data Fig. 2G); the few examples of drug programs prospectively motivated by genetic evidence were primarily for Mendelian diseases9. There were no statistically significant associations with estimated effect sizes (P = 0.90 and 0.57, for quantitative and binary traits, respectively; Fig. 1D, Extended Data Fig. 2H) or with minor allele frequency (P = 0.26, Fig. 1D). That ever larger GWAS can continue to uncover support for successful targets is also illustrated by two recent large GWAS in type 2 diabetes (T2D)18,19 (Extended Data Fig. 4). When these GWAS quadrupled the number of T2D associated genes from 217 to 862, new genetic support was identified for 7 of 95 mechanisms in clinical development while the number supported increased from 5 to 7 out of 12 launched drug mechanisms.
Previously5, we observed significant heterogeneity amongst therapy areas in the fraction of approved drug mechanisms with genetic support, but did not investigate the impact on probability of success5. Here, our estimates of RS from phase I to launch showed significant heterogeneity (P < 1.0e-15): nearly all therapy areas had estimates greater than one, 11 of 17 were >2, and hematology, metabolic, respiratory, and endocrine were >3 (Fig. 2A-E). In most therapy areas, the impact of genetic evidence was most pronounced in phases II and III and least pronounced in phase I, corresponding to the capacity to demonstrate clinical efficacy in later development phases. Accordingly, therapy areas differed in P(G) and in whether P(G) increased throughout clinical development or only at launch (Extended Data Fig. 5); data source and other properties of genetic evidence including year of discovery and effect size also differed (Extended Data Fig. 6). We also found that genetic evidence differentiated likelihood to progress from preclinical to clinical development for metabolic diseases (RS = 1.38, 95% CI = 1.25 – 1.54), which may reflect preclinical models that are more predictive of clinical outcomes. Probability of genetic support by therapy area was correlated with probability of success, or P(S) (ρ = 0.59, P = 0.013) and with RS (ρ = 0.72, P = 0.0011; Extended Data Fig. 7), which led us to explore how the sheer quantity of genetic evidence available within therapy areas (Fig. 2F, Extended Data Fig. 8A) may influence this. We found that therapy areas with more possible gene-indication (G-I) pairs supported by genetic evidence had significantly higher RS (ρ = 0.71, P = 0.0010, Fig. 2G), although respiratory and endocrine were notable outliers with high RS despite fewer associations.
We hypothesized that genetic support might be most pronounced for drug mechanisms with disease-modifying effects, as opposed to those that manage symptoms, and that the proportion of such drugs differs by therapy area20,21. We were unable to find data with these descriptions available for a sufficient number of drug mechanisms to analyze, but we reasoned that targets of disease-modifying drugs are more likely to be specific to a disease, whereas targets of symptom-managing drugs are more likely to be applied across many indications. We therefore examined the number and diversity of all-time launched indications per target. Launched T-I pairs are heavily skewed towards a few targets (Fig. 2H). Of 450 launched targets, the 42 with ≥10 launched indications comprise 713 (39%) of 1,806 launched T-I pairs (Fig. 2H). Many of these are used across diverse indications for management of symptoms such as inflammatory and immune responses (NR3C1, IFNAR2), pain (PTGS2, OPRM1), mood (SLC6A4), or parasympathetic response (CHRM3). The count of launched indications was inversely correlated with the mean similarity of those indications (ρ = −0.72, P = 4.4e-84; Fig. 2H). Among T-I pairs, the probability of having genetic support increased as the number of approved indications decreased (P = 6.3e-7) and as the similarity of a target’s approved indications increased (P = 1.8e-5, Fig. 2I). We observed a corresponding impact on RS, which increased in therapy areas where the similarity among approved indications increased, and decreased with increasing indications per target (ρ = 0.74, P = 0.0010, and ρ = −0.62, P = 0.0080, respectively, Fig. 2J-K).
Only 4.8% (284/5,968) of T-I pairs active in phase I-III possess human germline genetic support (Figure 1A), similar to T-I pairs no longer in development (4.2%, 560/13,355), a difference that was not statistically significant (P = 0.080). We estimated (see Methods) that only 1.1% of all genetically supported G-I relationships have been explored clinically (Fig. 3A), or 2.1% when restricting to the most similar indication. Given that the vast majority of proteins are classically “undruggable”, we explored the proportion of genetically supported G-I pairs that had been developed to at least phase I, as a function of therapy area across several classes of tractability and relevant protein families22 (Fig. 3A). Within therapy areas, oncology kinases with germline evidence were the most saturated: 109 of 250 (44%) of all genetically supported G-I pairs had reached at least phase I; GPCRs for psychiatric indications were also notable (14/53, 26%). Grouping by target rather than G-I pair, 3.6% of genetically supported targets have been pursued for any genetically supported indication (Extended Data Figure 8). Of possible genetically supported G-I pairs, most (68%) arose from OTG associations, mostly within the past 5 years (Fig. 2F). Such low utilization is partly due to recent emergence of most genetic evidence (Extended Data Fig. 2F-G, 7A), since drug programs prospectively supported by human genetics have had a mean lag time from genetic association of 13 years to first trial21 and 21 years to approval9. Because some types of targets may be more readily tractable by antagonists than agonists, we also grouped by target and examined human genetic evidence by direction of effect for tumor suppressors versus oncogenes (Fig. 3B), identifying a few substrata for which a majority of genetically supported targets had been pursued to at least phase I for at least one genetically supported indication. 
Oncogene kinases received the most attention, with 19/25 (76%) reaching phase I.
To focus on demonstrably druggable proteins, we further restricted the analysis to targets with both i) any program reaching phase I, and ii) ≥1 genetically supported indication. Out of 1,147 qualifying targets, only 373 (33%) had been pursued for one or more supported indications (Fig. 3C), and most of these (307, or 27% of qualifying targets) were pursued both for indications with and without genetic support. Overall, an overwhelming majority of development effort has been for unsupported indications, at a 17:1 ratio. Within this subset of targets, we asked whether genetic support was predictive of which indications would advance the furthest. Grouping active and historical programs by D-I pair, we found that the odds of advancing to a later stage in the pipeline are 82% higher for indications with genetic support (P = 8.6e-73, Fig. 3D).
While there have been anecdotes such as HMGCR to argue that genetic effect size may not matter in prioritizing drug targets, here we provide systematic evidence that small effect size, recent year of discovery, increasing number of genes identified, or higher associated allele frequency do not diminish the value of GWAS evidence to differentiate clinical success rates. One likely reason is that genetic effect size on a phenotype rarely accounts for the magnitude of genetic effect on gene expression, protein function, or some other molecular intermediate. In some circumstances, genetic effect sizes can yield insights into anticipated drug effects. This is best illustrated for cardiovascular disease therapies, where genetic effects on cholesterol and disease risk and treatment outcomes are correlated23. A limitation is that, other than Genebass, we did not include whole exome or whole genome sequencing association studies, which may be more likely to pinpoint causal variants. Moreover, all of our analyses are naïve to direction of genetic effect (gain versus loss of gene function) as this is unknown or unannotated in most datasets utilized here.
Our results argue for continuing investment to expand GWAS-like evidence, particularly for many complex diseases with treatment options that fail to modify disease. Although genetic evidence has value across most therapy areas, its benefit is more pronounced in some areas than others. Furthermore, it is possible that the therapy areas where genetic evidence had a lower impact have seen more focus on symptom management. If so, we would predict that for drugs aimed at disease modification, human genetics should ultimately prove highly valuable across therapy areas.
The focus of this work has been on the relative success of drug programs with and without genetic evidence, limited to drug mechanisms that have entered clinical development. This metric does not address the probability that a gene associated with a disease, if targeted, will yield a successful drug. At the early stage of target selection, is evidence of a large loss of function effect in one gene usually a better choice than a small non-coding SNP effect on the same phenotype in another? We explored this question for the T2D studies referenced above. Of the 7 targets of launched drugs with genetic evidence, 4 had Mendelian evidence (in addition to pre-2020 GWAS evidence), out of a total of 19 Mendelian genes related to T2D (21%). One launched T2D target had only GWAS (and no Mendelian) evidence among 217 GWAS associated genes prior to 2020 (0.46%), while two launched targets were among 645 new GWAS associations since 2020 (0.31%). At least in this example, the “yield” of genetic evidence for successful drug mechanisms was greatest for genes with Mendelian effects, but similar between earlier and later GWAS. Clearly, the fact that genetic associations differentiate clinical stage drug targets from launched ones does not mean that a large fraction of associations will be fruitful. Moreover, genetically supported targets may be more likely to require upregulation, to be druggable only by more challenging modalities4,9, or to enjoy narrower use across indications. More work is required to better understand the challenges of target identification and prioritization given the genetic evidence precondition.
The utility of human genetic evidence in drug discovery has had firm theoretical and empirical footing for several years5,7,15. If the benefit of this evidence were canceled out by competitive crowding24, then currently active clinical phases should have higher rates of genetic support than their corresponding historical phases, and might look similar to, or even higher than, approved pairs. Instead, we find that active programs possess genetic support only slightly more often than historical programs and remain less enriched for genetic support than approved drugs. Meanwhile, only a tiny fraction of classically druggable genetically supported G-I pairs have been pursued even among targets with clinical development reported. Human genetics thus represents a growing opportunity for novel target selection and improving indication selection for existing drugs and drug candidates. Increasing emphasis on drug mechanisms with supporting genetic evidence is expected to increase success rates and lower the cost of drug discovery and development.
Methods
Definition of metrics
Except where otherwise noted, we define genetic support of a drug mechanism (i.e. a target-indication or T-I pair) as a genetic association mapped to the corresponding target gene for a trait that is ≥0.8 similar to the indication (see MeSH term similarity below). We defined the probability of genetic support, or P(G), as the proportion of drug mechanisms satisfying the above definition of genetic support. Probability of success, or P(S), is the proportion of programs in one phase that advance to a subsequent phase (for instance, phase I to phase II). Overall P(S) from phase I to launched is the product of P(S) at each individual phase. Relative success, or RS, is the ratio of P(S) for programs with genetic support to P(S) for programs lacking genetic support, which is equivalent to a relative risk or risk ratio. Thus, if N denotes the total number of programs that have reached the reference phase, and X denotes the number of those that advance to a later phase of interest, and the subscripts G and !G indicate the presence or absence of genetic support, then $P(G) = N_G / (N_G + N_{!G})$; $P(S) = (X_G + X_{!G}) / (N_G + N_{!G})$; $RS = (X_G / N_G) / (X_{!G} / N_{!G})$. RS from phase I to launched is the product of RS at each individual phase. The count of “programs” for X and N is target-indication (T-I) pairs throughout, except for Figure 3D, which uses drug-indication pairs (D-I) in order to specifically interrogate P(G) where the same drug has been developed for different indications. For clarity, we note that where other recent studies22,25 have examined the fold enrichment and overlap between genes with human genetic support and genes encoding a drug target, without regard to similarity, herein all of our analyses are conditioned on the similarity between the drug’s indication and the genetically associated trait.
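As an illustration of the definitions above, the following minimal Python sketch computes P(G), P(S), and RS from the four counts for one phase transition, and multiplies per-phase RS values to obtain overall RS. It is not the code used in this study; the class and function names are hypothetical, and the example counts are invented purely to show the arithmetic.

```python
from dataclasses import dataclass
from math import prod
from typing import Iterable


@dataclass(frozen=True)
class PhaseCounts:
    """Counts of T-I pairs for one phase transition (names are illustrative)."""
    n_g: int      # N_G: pairs with genetic support that reached the reference phase
    x_g: int      # X_G: of those, pairs that advanced to the later phase
    n_not_g: int  # N_!G: pairs without genetic support that reached the reference phase
    x_not_g: int  # X_!G: of those, pairs that advanced to the later phase


def p_genetic_support(c: PhaseCounts) -> float:
    """P(G) = N_G / (N_G + N_!G)."""
    return c.n_g / (c.n_g + c.n_not_g)


def p_success(c: PhaseCounts) -> float:
    """P(S) = (X_G + X_!G) / (N_G + N_!G)."""
    return (c.x_g + c.x_not_g) / (c.n_g + c.n_not_g)


def relative_success(c: PhaseCounts) -> float:
    """RS = (X_G / N_G) / (X_!G / N_!G), i.e. a risk ratio."""
    return (c.x_g / c.n_g) / (c.x_not_g / c.n_not_g)


def overall_relative_success(phases: Iterable[PhaseCounts]) -> float:
    """Overall RS is the product of the per-phase RS values."""
    return prod(relative_success(c) for c in phases)


# Invented counts, for illustration only (not data from this study).
example = PhaseCounts(n_g=100, x_g=60, n_not_g=1000, x_not_g=300)
```

With these invented counts, 60% of supported pairs and 30% of unsupported pairs advance, so `relative_success(example)` evaluates to 2.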
Drug development pipeline
Citeline Pharmaprojects26 is a curated database of drug development programs including preclinical, all clinical phases, and launched drugs. It was queried via API (Dec 22, 2022) to obtain information on drugs, targets, indications, phases reached, and current development status. T-I pair was the unit of analysis throughout, except where otherwise indicated in the text (D-I pairs were examined in Figure 3D). Current development status was defined as “active” if the T-I pair had at least one drug still in active development, and “historical” if development of all drugs for the T-I pair had ceased. Targets were defined as genes; as most drugs do not directly target DNA, this usually refers to the gene encoding the protein target that is bound or modulated by the drug. We removed combination therapies, diagnostic indications, and programs with no human target or no indication assigned. For most analyses, only programs added to the database since 2000 were included, while for the count and similarity of launched indications per target, we used all launches for all time. Indications were considered to possess “genetic insight” (meaning the human genetics of this trait or similar traits have been successfully studied) if they had ≥0.8 similarity to i) an OMIM or IntOGen disease, or ii) a GWAS trait with at least 3 independently associated loci, based on lead SNP positions rounded to the nearest 1 Mb. For calculating relative success, we used the number of T-I pairs with genetic insight as the denominator. The rationale for this choice is to focus on indications where there exists the opportunity for human genetic evidence, consistent with the filter applied previously5. However, we observe that our findings are not especially sensitive to the presence of this filter, with RS decreasing by just 0.17 when the filter is removed (Extended Data Fig. 3G-H).
Note that the criteria for determining “genetic insight” are distinct from, and much looser than, the criteria for mapping GWAS hits to genes (see locus-to-gene or L2G scores under Open Targets Genetics below). Many drugs had more than one target assigned, in which case all targets were retained for target-indication pair analyses. As a sensitivity test, running all analyses restricted to only drugs with exactly one target assigned yielded very similar results (Figures S1-S11).
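The locus-counting step of the “genetic insight” criterion can be sketched as follows. This is an illustrative reading of the description above rather than the code used here: the function names are hypothetical, and we assume that loci are distinguished by chromosome plus rounded position and that exact half-megabase positions round up.

```python
from typing import Iterable, Tuple

BIN_SIZE_BP = 1_000_000  # lead SNP positions are rounded to the nearest 1 Mb


def count_independent_loci(lead_snps: Iterable[Tuple[str, int]]) -> int:
    """Count distinct (chromosome, position rounded to nearest Mb) bins.

    Assumption: integer half-up rounding; the source does not specify
    how exact .5 Mb positions are handled.
    """
    bins = {
        (chrom, (pos + BIN_SIZE_BP // 2) // BIN_SIZE_BP)
        for chrom, pos in lead_snps
    }
    return len(bins)


def gwas_trait_has_insight(lead_snps: Iterable[Tuple[str, int]], min_loci: int = 3) -> bool:
    """A GWAS trait qualifies if it has at least `min_loci` independent loci."""
    return count_independent_loci(lead_snps) >= min_loci


# Invented positions: two SNPs on chromosome 1 fall in the same 1 Mb bin.
example_snps = [("1", 1_200_000), ("1", 1_400_000), ("1", 5_600_000), ("2", 1_300_000)]
```

Here the four invented lead SNPs collapse to three loci, so the trait would meet the threshold of 3.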
OMIM
Online Mendelian Inheritance in Man (OMIM) is a curated database of Mendelian gene-disease associations. The OMIM Gene Map (downloaded Sep 21, 2023) contained 8,671 unique gene-phenotype links. We restricted to entries with phenotype mapping code 3 (“the molecular basis for the disorder is known; a mutation has been found in the gene”), removed phenotypes with no MIM number or no gene symbol assigned, and removed duplicate combinations of gene MIM and phenotype MIM. We used regular expression matching to further filter out phenotypes containing the terms “somatic”, “susceptibility”, or “response” (drug response associations) and those flagged as questionable (“?”), or representing non-disease phenotypes (“[”). A set of OMIM phenotypes are flagged as denoting susceptibility rather than causation (“{”); this category includes low-penetrance or high allele frequency association assertions that we wished to exclude, but also germline heterozygous loss-of-function mutations in tumor suppressor genes, where the underlying mechanism of disease initiation is loss of heterozygosity, which we wished to include. We therefore also filtered out phenotypes containing “{” except for those that did contain the terms “cancer”, “neoplasm”, “tumor”, or “malignant” and did not contain the term “somatic”. Remaining entries present in OMIM as of 2021 were further evaluated for validity by two curators, and gene-disease combinations for which a disease association was deemed not to have been established were excluded from all analyses. All of the above filters left 5,670 unique gene-trait links. MeSH terms for OMIM phenotypes were then mapped using the EFO OWL database using an approach previously described27, with additional mappings from Orphanet, full text matches to the full MeSH vocabulary, and finally, manual curation, for a cumulative mapping rate of 93% (5,297/5,670). 
Because sometimes distinct phenotype MIM numbers mapped to the same MeSH term, this yielded 4,510 unique gene-MeSH links.
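The text-based OMIM phenotype filters described above can be approximated with a few regular expressions. The sketch below is one possible reading, not the authors' implementation: it assumes case-insensitive matching, assumes the “?”, “[”, and “{” flags may be detected anywhere in the phenotype string, and uses a hypothetical function name. The phenotype mapping code, MIM number, curation, and MeSH mapping steps are not shown.

```python
import re

# Terms whose presence excludes a phenotype (assumed case-insensitive).
EXCLUDED_TERMS = re.compile(r"somatic|susceptibility|response", re.IGNORECASE)
# "?" flags questionable entries; "[" flags non-disease phenotypes.
QUESTIONABLE_OR_NONDISEASE = re.compile(r"[?\[]")
# "{"-flagged phenotypes are retained only if they mention one of these terms.
CANCER_TERMS = re.compile(r"cancer|neoplasm|tumor|malignant", re.IGNORECASE)


def keep_phenotype(phenotype: str) -> bool:
    """Return True if an OMIM phenotype string passes the text filters."""
    if EXCLUDED_TERMS.search(phenotype):
        return False
    if QUESTIONABLE_OR_NONDISEASE.search(phenotype):
        return False
    if "{" in phenotype:
        # "somatic" was already excluded above, so only the cancer-term
        # condition remains to be checked for "{"-flagged entries.
        return bool(CANCER_TERMS.search(phenotype))
    return True
```

For example, under these assumptions an illustrative string such as "{Example cancer syndrome}" is kept, whereas "{Example trait, resistance to}" and "?Example disorder" are removed.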
Open Targets Genetics
Open Targets Genetics (OTG) is a database of GWAS hits from published studies and biobanks. OTG version 8 (October 12, 2022) variant-to-disease (V2D), locus-to-gene (L2G), variant index, and study index data were downloaded from EBI. Traits with multiple EFO IDs were excluded as these generally represent conditional, epistasis, or other complex phenotypes that would lack mappings in the MeSH vocabulary. Of the top 100 traits with the greatest number of genes mapped, we excluded 76 as having no clear disease relevance (example: “red cell distribution width”) or no obvious marginal value (example: excluded “trunk predicted mass” because “body mass index” was already included). Remaining traits were mapped to MeSH using the EFO OWL database, full text queries to the MeSH API, mappings already manually curated in PICCOLO (see below) or new manual curation. In total, 25,124/49,599 unique traits (51%) were successfully mapped to a MeSH ID. We included associations with P < 5e-8. OTG L2G scores used for gene mapping are based on a machine learning model trained on gold standard causal genes28; inputs to that model include distance, functional annotations, eQTLs, and chromatin interactions. Note that we do not utilize Mendelian randomization29 to map causal genes, and even gene mappings with high L2G scores are necessarily imperfect. OTG provides an L2G score for the triplet of each study or trait with each hit and each possible causal gene. We defined L2G share as the proportion of the total L2G score assigned to each gene among all potentially causal genes for that trait-hit combination. In sensitivity analyses we considered L2G share thresholds from 10% to 100% (Figure 1C and Extended Data Fig. 3A), but main analyses used only genes with ≥50% L2G share (which are also the top-ranked genes for their respective associations).
OTG links were parsed to determine the source of each OTG data point: the EBI GWAS catalog30 (N=136,503 hits with L2G share ≥0.5), Neale UK BioBank (http://www.nealelab.is/uk-biobank; N=19,139), FinnGen R631 (N=2,338), or SAIGE (N=1,229).
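To make the L2G share definition concrete, the sketch below normalizes the L2G scores of all candidate genes for a single trait-hit combination and retains genes meeting a share threshold. It is a minimal illustration with hypothetical function names, gene labels, and scores, and does not reproduce how OTG data were actually parsed.

```python
from typing import Dict, List


def l2g_shares(l2g_scores: Dict[str, float]) -> Dict[str, float]:
    """Each gene's L2G score as a proportion of the summed scores at one hit."""
    total = sum(l2g_scores.values())
    if total <= 0:
        # No usable scores at this hit: no gene receives any share.
        return {gene: 0.0 for gene in l2g_scores}
    return {gene: score / total for gene, score in l2g_scores.items()}


def genes_passing_share(l2g_scores: Dict[str, float], min_share: float = 0.5) -> List[str]:
    """Genes whose L2G share meets the threshold (0.5 in the main analyses)."""
    return [g for g, share in l2g_shares(l2g_scores).items() if share >= min_share]


# Invented scores for one trait-hit combination with three candidate genes.
example_scores = {"GENE_A": 0.75, "GENE_B": 0.125, "GENE_C": 0.125}
```

With these invented scores, GENE_A holds 75% of the total and is the only gene retained at the 0.5 threshold.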
PICCOLO
PICCOLO32 is a database of GWAS hits with gene mapping based on tests for colocalization without full summary statistics by using Probabilistic Identification of Causal SNPs (PICS) and a reference dataset of SNP linkage disequilibrium values. As described32, gene mapping utilizes QTL data from GTEx (N=7,162) and a variety of other published sources (N=6,552). We included hits with GWAS P < 5e-8, and with eQTL P < 1e-5, and H4 ≥ 0.9, as these thresholds were determined empirically32 to strongly predict colocalization results.
Genebass
Genebass33 is a database of genetic associations based on exome sequencing. Genebass data from 394,841 UK Biobank participants (the “500K” release) were queried using Hail (October 19, 2023). We used hits from four models: pLoF (predicted loss-of-function) or missense|LC (missense and low confidence LoF), each with SKAT or burden tests, filtering for P < 1e-5. Because the traits in Genebass are from UK Biobank, which is included in OTG, we used the OTG MeSH mappings established above.
IntOGen
IntOGen is a database of enrichments of somatic genetic mutations within cancer types. We used the driver genes and cohort information tables (May 31, 2023). IntOGen assigns each gene a mechanism in each tumor type; occasionally a gene will be classified as a tumor suppressor in one type and an oncogene in another. We grouped by gene and assigned each gene its modal classification across cancers. MeSH mappings were curated manually.
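Assigning each gene its modal classification can be sketched as below. The input layout, role labels, and function name are hypothetical, and the handling of ties is an assumption on our part (the first classification encountered wins), as tie-breaking is not described above.

```python
from collections import Counter, defaultdict
from typing import Dict, Iterable, Tuple


def modal_role_per_gene(records: Iterable[Tuple[str, str]]) -> Dict[str, str]:
    """Map each gene to its most frequent classification across tumor types.

    `records` holds one (gene, role) entry per gene and tumor type.
    Assumption: ties are broken in favor of the role seen first.
    """
    roles_by_gene = defaultdict(Counter)
    for gene, role in records:
        roles_by_gene[gene][role] += 1
    return {
        gene: counts.most_common(1)[0][0]
        for gene, counts in roles_by_gene.items()
    }


# Invented records: GENE_X is called an oncogene in two of three tumor types.
example_records = [
    ("GENE_X", "oncogene"),
    ("GENE_X", "tumor_suppressor"),
    ("GENE_X", "oncogene"),
    ("GENE_Y", "tumor_suppressor"),
]
```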
MeSH term similarity
MeSH terms in either Pharmaprojects or the genetic associations datasets that were Supplementary Concept Records (IDs beginning in “C”) were mapped to their respective preferred main headings (IDs beginning in “D”). A matrix of all possible combinations of drug indication MeSH IDs and genetic association MeSH IDs was constructed. MeSH term Lin and Resnik similarities were computed for each pair as described34,35. Similarities of −1, indicating infinite distance between two concepts, were assigned as 0. The two scores were regressed against each other across all term pairs, and the Resnik scores were adjusted by a multiplier such that both scores had a range from 0 to 1 and their regression had a slope of 1. The two scores were then averaged to obtain a combined similarity score. Similarity scores were successfully calculated for 1,006/1,013 (99.3%) of unique MeSH terms for Pharmaprojects indications, corresponding to 99.67% of Pharmaprojects T-I pairs, and for 2,260/2,262 (99.9%) unique MeSH terms for genetic associations, corresponding to >99.9% of associations.
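The final combination step can be illustrated with the short sketch below. It takes precomputed Lin and Resnik similarities for one term pair and a Resnik multiplier as inputs; how that multiplier is derived from the regression is not reproduced here, so it is passed in as a parameter. The function names and example values are hypothetical.

```python
SIMILARITY_THRESHOLD = 0.8  # threshold used to define genetic support


def combined_similarity(lin: float, resnik: float, resnik_multiplier: float) -> float:
    """Average of the Lin score and the rescaled Resnik score for one term pair.

    Scores of -1 (infinite distance between concepts) are treated as 0.
    `resnik_multiplier` is assumed to have been derived elsewhere so that the
    rescaled Resnik score lies between 0 and 1.
    """
    lin = 0.0 if lin == -1 else lin
    resnik = 0.0 if resnik == -1 else resnik
    return (lin + resnik * resnik_multiplier) / 2


def is_similar(lin: float, resnik: float, resnik_multiplier: float) -> bool:
    """True if the combined similarity meets the 0.8 threshold."""
    return combined_similarity(lin, resnik, resnik_multiplier) >= SIMILARITY_THRESHOLD
```

For instance, with an invented Lin score of 0.8, a raw Resnik score of 4.0, and a multiplier of 0.25, the rescaled Resnik score is 1.0 and the combined similarity is 0.9, which would meet the threshold.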
Therapeutic areas
MeSH terms for Pharmaprojects indications were mapped onto 16 top-level headings under the Diseases [C] and Psychiatry and Psychology [F] branches of the MeSH tree (https://meshb.nlm.nih.gov/treeView), plus an “other”. The signs/symptoms area corresponds to C23 Pathological Conditions, Signs, and Symptoms and contains entries such as inflammation and pain. Many MeSH terms map to >1 tree position; these multiples were retained and counted towards each therapy area, except for the following conditions: for terms mapped to oncology, we deleted their mappings to all other areas; and “other” was used only for terms that mapped to no other areas.
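The two exceptions described above (oncology overrides all other areas, and “other” is a fallback) can be expressed compactly. The sketch below assumes each MeSH term has already been resolved to a set of therapy area labels; the labels and the function name are illustrative.

```python
from typing import Iterable, Set


def assign_therapy_areas(mapped_areas: Iterable[str]) -> Set[str]:
    """Apply the oncology and 'other' rules to one term's therapy areas."""
    areas = set(mapped_areas)
    if "oncology" in areas:
        # Terms mapped to oncology lose their mappings to all other areas.
        return {"oncology"}
    if not areas:
        # "other" is used only for terms that map to no other area.
        return {"other"}
    # Otherwise multiple mappings are retained and counted in each area.
    return areas
```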
Analysis of type 2 diabetes GWAS
We included 19 genes from OMIM linked to Mendelian forms of diabetes or syndromes with diabetic features. For Vujkovic et al. 2020 (ref. 18), we considered as novel any genes with a novel nearest gene, novel coding variant, or a novel lead SNP colocalized with an eQTL with H4 ≥0.9. Non-novel nearest genes, coding variants, and colocalized lead SNPs were considered established variants. For Suzuki et al. 2023 (ref. 19), we used the available L2G scores that OTG had assigned for the same lead SNPs in previously reported GWAS for other phenotypes, yielding mapped genes with L2G share > 0.5 for 27% of loci. Genes were considered novel if absent from the Vujkovic analysis. Together, these approaches identified 217 established GWAS genes and 645 novel ones (469 from Vujkovic and 176 from Suzuki). We identified 347 unique drug targets in Pharmaprojects reported with a type 2 diabetes or diabetes mellitus indication, including 25 approved. We reviewed the list of approved drugs and eliminated those where there were questions around the relevance of the drug or target to T2D (AKR1B1, AR, DRD1, HMGCR, IGF1R, LPL, SLC5A1). Because Pharmaprojects ordinarily specifies the receptor as target for protein or peptide replacement therapies, we also remapped the minority of programs where the ligand, rather than receptor, had been listed as target (changing INS to INSR, GCG to GCGR). To assess the proportion of programs with genetic support, we first grouped by drug and selected just one target, preferring the target with the earliest genetic support (OMIM, then established GWAS, then novel GWAS, then none). Next we grouped by target and selected its highest phase reached. Finally, we grouped by highest phase reached and counted the number of unique targets.
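The three grouping steps at the end of this procedure can be sketched as follows. This is an interpretation under stated assumptions rather than the analysis code: the record layout, phase labels, and function name are hypothetical; each record is assumed to carry the phase reached by that drug; and ties between equally supported targets of one drug are resolved in favor of the first one listed.

```python
from collections import Counter
from typing import Dict, Iterable, Tuple

# Preference order for choosing one target per drug (earliest support first).
SUPPORT_RANK = {"OMIM": 0, "established GWAS": 1, "novel GWAS": 2, "none": 3}
# Hypothetical phase labels, ordered from earliest to latest.
PHASE_RANK = {"Preclinical": 0, "Phase I": 1, "Phase II": 2, "Phase III": 3, "Launched": 4}

Program = Tuple[str, str, str, str]  # (drug, target, support, phase)


def count_targets_by_phase(programs: Iterable[Program]) -> Dict[Tuple[str, str], int]:
    """Count unique targets per (highest phase reached, genetic support)."""
    # Step 1: group by drug and keep the target with the earliest genetic support.
    one_target_per_drug = {}
    for drug, target, support, phase in programs:
        current = one_target_per_drug.get(drug)
        if current is None or SUPPORT_RANK[support] < SUPPORT_RANK[current[1]]:
            one_target_per_drug[drug] = (target, support, phase)

    # Step 2: group by target and keep its highest phase reached.
    highest_phase_per_target = {}
    for target, support, phase in one_target_per_drug.values():
        current = highest_phase_per_target.get(target)
        if current is None or PHASE_RANK[phase] > PHASE_RANK[current[1]]:
            highest_phase_per_target[target] = (support, phase)

    # Step 3: group by highest phase (and support) and count unique targets.
    counts = Counter(
        (phase, support) for support, phase in highest_phase_per_target.values()
    )
    return dict(counts)


# Invented programs: drug1 has two listed targets; T2 has the earlier support.
example_programs = [
    ("drug1", "T1", "none", "Launched"),
    ("drug1", "T2", "OMIM", "Launched"),
    ("drug2", "T2", "OMIM", "Phase II"),
    ("drug3", "T3", "novel GWAS", "Phase I"),
]
```

In this invented example, drug1 is attributed to T2, T2's highest phase is Launched, and the result contains one launched OMIM-supported target and one phase I target with novel GWAS support.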
Universe of possible genetically supported gene-indication pairs
In all of our analyses, targets are defined as human gene symbols, but we use the term gene-indication pair (G-I) to refer to possible genes that one might attempt to target with a drug, and target-indication pair (T-I) to refer to genes that are the targets of actual drug candidates in development. To enumerate the space of possible G-I pairs, we multiplied the N=769 Pharmaprojects indications considered here by the “universe” of N=19,338 protein-coding genes, yielding a space of N=14,870,922 possible G-I pairs. Of these, N=101,954 (0.69%) qualify as having genetic support per our criteria. A total of 16,808 T-I pairs have reached at least Phase I in an active or historical program, of which 1,155 (6.9%) are genetically supported. This represents an enrichment compared to random chance (OR = 11.0, P < 1.0e-15, Fisher exact test), but in absolute terms, only 1.1% of genetically supported G-I pairs have been pursued. A genetically supported G-I pair may be less likely to attract drug development interest if the indication already has many other potential targets, and/or if the indication is only the second-most similar to the gene’s associated trait. Removing associations with many GWAS hits and restricting to the single most similar indication left a space of 34,190 possible genetically supported G-I pairs, 719 (2.1%) of which had been pursued. This small percentage might yet be perceived to reflect competitive saturation, if the vast majority of indications are undevelopable and/or the vast majority of targets are undruggable. We therefore asked what proportion of genetically supported G-I pairs had been developed to at least Phase I, as a function of therapy area cross-tabulated against Open Targets predicted tractability status or membership in canonically “druggable” protein families, using families from ref. 22 as well as UniProt pkinfam for kinases36. We also grouped at the level of gene, rather than G-I pair (Extended Data Fig. 8).
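The counts in this paragraph can be checked with a few lines of arithmetic. The sketch below, in Python using only the standard library, rebuilds the proportions and a 2×2 table from the numbers quoted above. The variable names and the exact layout of the table are assumptions of this sketch. The simple cross-product odds ratio computed here is not the authors' Fisher exact test, so it need not match the reported OR exactly.

```python
# Counts quoted in the text above.
n_indications = 769
n_genes = 19_338
n_gi = n_indications * n_genes      # all possible G-I pairs
n_gi_supported = 101_954            # genetically supported G-I pairs
n_ti = 16_808                       # T-I pairs reaching at least Phase I
n_ti_supported = 1_155              # of those, genetically supported

pct_supported = 100 * n_gi_supported / n_gi               # share of G-I space with support
pct_ti_supported = 100 * n_ti_supported / n_ti            # share of pursued pairs with support
pct_supported_pursued = 100 * n_ti_supported / n_gi_supported  # share of supported pairs pursued

# Assumed 2x2 layout: pursued vs. not pursued, supported vs. unsupported.
a = n_ti_supported                          # pursued, supported
b = n_ti - n_ti_supported                   # pursued, unsupported
c = n_gi_supported - n_ti_supported         # not pursued, supported
d = n_gi - n_gi_supported - b               # not pursued, unsupported

# Cross-product (sample) odds ratio; a Fisher exact test in R instead
# reports a conditional maximum-likelihood estimate.
cross_product_or = (a * d) / (b * c)
```

The P value itself requires an exact test, for example `fisher.test` in R or `scipy.stats.fisher_exact` in Python, applied to the same four cells.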
Druggability and protein families
Antibody and small molecule druggability status was taken from Open Targets37. For antibody tractability, Clinical Precedence, Predicted Tractable – High Confidence, and Predicted Tractable – Medium to Low Confidence were included. For small molecules, Clinical Precedence, Discovery Precedence, and Predicted Tractable were included. Protein families were from sources described previously22, plus the pkinfam kinase list from UniProt36. To make these lists non-overlapping, genes that were both kinases and also either enzymes, ion channels, or nuclear receptors were considered to be kinases only.
Statistics
Analyses were conducted in R 4.2.0. For binomial proportions P(G) and P(S), error bars are Wilson 95% confidence intervals, except for P(S) for phase I–launch, where the Wald method is used to compute the confidence intervals on the product of the individual probabilities of success at each phase. RS at each individual phase uses Wilson 95% confidence intervals, while RS for phase I–launch is defined as a product of the three phase-wise risk ratios, with Katz 95% confidence intervals. Effects of continuous variables on probability of launch were assessed using logistic regression. Differences in RS between therapy areas were tested using the Cochran-Mantel-Haenszel chi-square test (cmh.test from the R lawstat package). Pipeline progression of drug-indication pairs conditioned on the highest phase reached by a drug was modeled using an ordinal logit model (polr with Hess=TRUE from the R MASS package). Correlations across therapy areas were tested by weighted Pearson’s correlation (wtd.cor from the R weights package); to control for the amount of data available in each therapy area, the number of genetically supported T-I pairs having reached at least phase I was used as the weight. Enrichments of T-I pairs in the utilization analysis were tested using Fisher’s exact test. All statistical tests were two-sided.
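For readers unfamiliar with the two interval types named above, the following is a minimal Python sketch of the textbook Wilson score interval for a binomial proportion and the Katz log-method interval for a risk ratio. The function names and the example counts are hypothetical, and the sketch does not reproduce how the authors combined intervals across phases in R.

```python
import math
from statistics import NormalDist

# Two-sided 95% critical value of the standard normal distribution.
Z95 = NormalDist().inv_cdf(0.975)


def wilson_ci(successes, n, z=Z95):
    """Wilson score interval for a binomial proportion successes / n."""
    p = successes / n
    denom = 1 + z**2 / n
    center = (p + z**2 / (2 * n)) / denom
    half_width = z * math.sqrt(p * (1 - p) / n + z**2 / (4 * n**2)) / denom
    return center - half_width, center + half_width


def katz_rr_ci(x1, n1, x0, n0, z=Z95):
    """Risk ratio (x1/n1) / (x0/n0) with a Katz log-method interval.

    x1, n1: successes and total in the exposed (e.g. supported) group.
    x0, n0: successes and total in the comparison group.
    """
    rr = (x1 / n1) / (x0 / n0)
    # Standard error of log(RR) under the Katz method.
    se_log_rr = math.sqrt(1 / x1 - 1 / n1 + 1 / x0 - 1 / n0)
    lower = rr * math.exp(-z * se_log_rr)
    upper = rr * math.exp(z * se_log_rr)
    return rr, lower, upper


# Hypothetical counts, for illustration only.
example_wilson = wilson_ci(5, 10)
example_rr = katz_rr_ci(20, 100, 10, 100)
```

Unlike the Wald interval, the Wilson interval stays within [0, 1] and remains usable when the observed proportion is close to 0 or 1.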
Source code availability and data availability
An analytical dataset and source code are available at https://github.com/ericminikel/genetic_support/ and are sufficient to reproduce all figures and statistics herein.
Data Availability
All source code and data produced in the present work are contained in the supplementary material and online at https://github.com/ericminikel/genetic_support/
SUPPLEMENTARY FIGURES
Figures S1-S11. The three main and eight extended data figures restricted to drugs with one target only.
Footnotes
This manuscript has been extensively revised to improve clarity and address feedback. Changes include updating of OMIM and IntOGen data to latest available as well as additional analyses and figures to better describe the study, data, and results. None of the changes had a material impact on the results and conclusions.