Abstract
Genome-wide association studies have successfully discovered thousands of common variants associated with human diseases and traits, but the landscape of rare variation in human disease has not been explored at scale. Exome sequencing studies of population biobanks provide an opportunity to systematically evaluate the impact of rare coding variation across a wide range of phenotypes to discover genes and allelic series relevant to human health and disease. Here, we present results from systematic association analyses of 3,700 phenotypes using single-variant and gene tests of 281,850 individuals in the UK Biobank with exome sequence data. We find that the discovery of genetic associations is tightly linked to frequency as well as correlated with metrics of deleteriousness and natural selection. We highlight biological findings elucidated by these data and release the dataset as a public resource alongside a browser framework for rapidly exploring rare variant association results.
Introduction
Coding variation has been the most readily interpretable class of genomic variation since the development of the gene model and mapping of the human genome. As such, it has facilitated the mapping and interpretation of variants with immediate clinical importance such as the American College of Medical Genetics actionable variant list (1). More recently, exome sequencing has yielded the discovery of specific causal variants for hundreds of rare diseases, particularly dominant acting de novo variants for severe diseases (2).
As the sample sizes of exome sequencing datasets continue to grow, so do the opportunities to identify associations between rare variants and phenotypes (both complex traits and diseases). In complex diseases, identifying causal genetic factors for a given disease can provide direct insight into the potential for therapeutic avenues. For instance, gain-of-function variants in PCSK9 have been demonstrated to increase LDL levels and thus risk for cardiovascular disease (3). Accordingly, loss-of-function (LoF) variants are protective for cardiovascular disease (4), and less than 15 years after the discovery of this effect, therapeutic approaches to inhibit PCSK9 have been brought to market (5).
Deeply phenotyped biobanks present a unique opportunity to simultaneously analyze multiple diseases and traits within a single cohort. These datasets enable the discovery of new disease genes with therapeutic potential at a large scale across phenotypes. For instance, a recent joint association analysis of the UK Biobank and the FinnGen study cohorts revealed rare variants in ANGPTL7 that protect against glaucoma (6). The UK Biobank is a collection of over 500,000 participants with standardized, detailed phenotypic data on which GWAS have been run extensively. Recent studies have leveraged previous releases of the exome sequence data to explore various aspects of rare variant associations in this dataset (7–9). Here, we present and publicly release results from a systematic, large-scale rare variant association analysis of 3,700 phenotypes in more than 300,000 exome sequenced individuals.
Generating high-quality exome data for rare variant associations
We built an end-to-end pipeline for read mapping, processing, joint variant calling, quality control, and mixed model association analysis, and applied this pipeline to 302,325 individuals with exome sequence data from the UK Biobank. The read mapping and processing pipeline adopted the GATK Best Practices pipeline (GRCh38), and the resulting gVCF files were joint-called using a scalable implementation in Hail (Supplementary Information; Fig. S1) (10). We processed a set of 3,700 phenotypes including 1,117 quantitative traits as well as 2,583 binary traits with at least 200 cases, which included 681 disease endpoints based on ICD-10 codes (Fig. S2).
After performing quality control (QC) in a similar but augmented (e.g. array concordance; see Supplementary Information) manner as for the Genome Aggregation Database (gnomAD) (11), we generated a high-quality dataset of 286,310 individuals (Figs. S3 to S5; table S1), including 281,850 individuals of European ancestry in which we find 20,343,543 high-quality variants (Fig. S6). For each of 19,591 protein-coding genes, we considered up to three functional annotation categories: predicted LoF (pLoF), missense (including low-confidence pLoF variants and in-frame indels), and synonymous, resulting in 57,650 groups for association testing (i.e., one group per gene and functional annotation category).
Creating a high-quality set of rare variant associations
We performed group tests using the mixed model framework SAIGE-GENE (12), which includes single-variant tests and gene-based burden (mean) and SKAT-O (hybrid variance/mean) tests (Fig. S7). In total, we performed up to 7,575,993 single-variant tests and 57,650 group tests for each of 3,700 phenotypes (Fig. 1). Additionally, we randomly generated 314 heritable phenotypes to test the asymptotic properties of the mixed-model association testing framework (Figs. S8 to S9), and to determine empirical p-value thresholds for Type I error control. Based on this analysis, for each phenotype, we consider genome-wide p-value thresholds of 2.5 × 10−8 for SKAT-O tests, 6.7 × 10−7 for burden tests, and 8 × 10−9 for single-variant tests (see Supplementary Information; Fig. S10), corresponding to approximately 0.05 expected false positives per phenotype.
Quality control of rare variant association tests. The number of phenotypes (A), variants (B), and groups (i.e., gene-annotation pairs; C) before and after quality control. After quality control, the number of variants (D) and genes (E) are broken down by annotation and frequency bin (alternate allele frequency for variants, cumulative allele frequency for genes).
We performed extensive quality control on these summary statistics (Fig. 1; Table S2; Supplementary Information), including an 0.01% minor and cumulative allele frequency filter for variants and genes, respectively, as well as genomic control (lambda GC) for each phenotype and each gene (Figs. S11 to S15). Further, we pruned to a set of 1,362 high-quality independent phenotypes encompassing 559 continuous traits and 803 binary traits, including 310 ICD codes (Figs. 1A, S16; Table S2). We confirm the robustness of our results by comparing them to a previous large-scale study of height (Tables S3 to S5, Fig. S17) and red blood cell phenotypes (Table S6), for which our analysis replicates the majority of associations with consistent direction of effect (13, 14).
We filtered to 349,941 variants, including 8,865 pLoF variants, 208,723 missense variants, and 132,353 synonymous variants with a cohort frequency of at least 0.01% (corresponding to an allele count of approximately 60; Fig. 1B). For group tests, we filter to a high-quality set of 40,492 gene tests with at least 20X coverage (Fig. S13) and an aggregate allele frequency of at least 0.01% for pLoF (N=9,558 genes), missense (N=15,457), and synonymous (N=15,477) (Fig. 1C).
Using these criteria, we identified a total of 27,421, and 4,560 associations meeting our p-value threshold, with a mean of 20.1 and 3.35 associations per phenotype, for single-variant tests and group tests, respectively (disease results shown in Fig. 2A-B). Comparing the group test results to single-variant association test results, we find 1,069 associations (on average 0.8 per phenotype) from group tests where no association reached our p-value threshold for any single variant in the corresponding gene (Fig. 2C).
Rare variant association testing is enhanced by group tests. (A-B), Manhattan plots depict the distribution of p-values for all single-variant (A) and SKAT-O gene-based (B) associations, where the minimum p-value across phenotypes in each ICD chapter is shown. (C): The number of gene-level associations per phenotype is shown as a barplot, broken down by trait type (left) and normalized within each trait type (right). The single-variant tests are grouped into genes where at least one associated variant is necessary to be “Significant by variant” which is shown alongside group tests (“Significant by gene”) as well as genes where an association is found both for group and single-variant tests.
Displaying rare variant associations
The utility of human genetic variation datasets are substantially enhanced when made accessible in the form of online portals that enable non-technical domain experts to quickly browse, interpret, and export results for downstream follow-up (15). We extended our gnomAD browser toolkit to create the genebass (gene-biobank association summary statistics) browser (https://genebass.org), a new, highly interactive tool for exploring large numbers of gene-based PheWAS analysis results. This resource provides users with direct access to all 3,700 phenotypes, serving up 609,538,421 gene-level association statistics (across 19,591 genes, 3 annotation sets, and 3 burden tests) and 20,800,574,337 single variant association statistics across 7,575,993 exome variants. For completeness, the released dataset includes all association statistics, including pre-quality controlled data, but we provide functionality to filter to only the highest quality data presented herein. Our web application features a novel layout and navigational scheme for rapidly browsing phenome-wide associations by integrating results across genes and variants. Customizable controls, plots, and tables enable flexible filtering and visualization of phenotypes, genes, and variants of interest; results can be exported for downstream analyses; and variant associations across traits can be compared to inform pathways associated with complex traits and develop therapeutic hypotheses (see Supplementary Information).
Frequency and selection affect the landscape of rare variant associations
A major complicating factor in the analysis and interpretation of association statistics, particularly from rare variants, is the relationship between natural selection, allele frequency, effect size, and power for discovery. Sham et al. showed that the power to detect association is proportional to the variance explained of a biallelic variant (16). Specifically, for a continuous trait the variance explained of a biallelic variant that is purely additive is 2pqa2 where p is the allele frequency, q = 1-p and a is the allelic effect of the variant. Thus, for a fixed effect size, a more common variant will capture more variance and by extension show stronger association.
However, the process of negative selection will tend to decrease the frequency of functional damaging variants, suggesting that variants with large effect sizes are more likely to be rare. Indeed, partitioned heritability analyses for common variants support the presence of these countervailing forces, as comparatively lower frequency variants have larger absolute effect sizes but this growth in effect size is slower than the loss in variance explained from their lower frequency (17). In evaluating the landscape of rare variant association, we observe a similar pattern with increasing proportion of variants associated with at least one phenotype as frequency increases (Fig. 3A); however, within each frequency category, we observe the effect of functional annotation, a known correlate for deleteriousness, on the proportion of variants with at least one association.
The influence of variant allele frequency and functional annotation in exome association testing. The proportion of single variants (A) and genes (B) with at least one suggestive hit is shown broken down by allele frequency category (A) or cumulative allele frequency category (B), each shown below the plot, broken down by functional annotation. This metric is also plotted by the proportion expressed across transcripts for splice variants (C), and ClinVar pathogenicity status (D).
Comparing the number of associations by variant annotation within each allele frequency category, we find that pLoF variants have a larger number of associations than missense variants, followed by synonymous variants for single-variant tests (Fig. 3A) as well as group tests (Fig. 3B). Within missense variants, variant deleteriousness as predicted by PolyPhen2 (18) is correlated with the number of associations meeting our p-value threshold (Fig. S18). For splice donor variants, we find a correlation between the proportion expressed across transcripts (pext) (19) and the number of associations (Fig. 3C). Additionally, the pathogenicity level of ClinVar variants is correlated with phenotypic association (Fig. 3D).
Gene function influences association statistics
We examined the phenotypic impact of gene categories previously known to have functional relevance and/or a role in disease. In particular, we find that 470 genes previously implicated in developmental delay (20) are more likely to be associated with a phenotype in the UK Biobank (Fisher’s exact p = 2 × 10−3; Fig. 4). Further, we observe a correlation between selection against pLoFs in a gene and the phenotypic impact of pLoFs in that gene: specifically, constrained genes (i.e., those in the highest decile of LOEUF, a metric of loss-of-function intolerance) are more likely to be associated with a phenotype (5.9%) than a frequency-matched set of genes in the genome (2.0%; Fisher’s exact p = 2.1 × 10−6; Fig. 4). Similarly, autosomal dominant and autosomal recessive genes, as well as genes with previously established hits in the GWAS catalog and FDA approved drug targets, show an increased phenotypic impact of pLoFs and missense variants. Finally, cell essential genes show an increased proportion of associated phenotypes through a pLoF mechanism, while non-essential genes do not show an enrichment.
The effect of gene function on the landscape of rare variant associations. The proportion of gene-annotation pairs with at least one association (SKAT-O p < 2.5 × 10−8) is shown for a number of gene categories, each compared to a background set of genes matched on cumulative allele frequency. Error bars represent 95% confidence intervals. Asterisks denote a significant difference between the background set and test set (* and ** indicate p < 0.05 and p < 0.001, respectively).
Biological insights from rare variant association results
The biological information encapsulated in this dataset is extremely high-dimensional, and we release the full dataset of results for the benefit of the community. Here, we highlight a set of known and putative associations as examples of the power of this dataset. First, we recapitulate many known associations from previous studies, including associations between PCSK9 and LDL cholesterol (pLoF burden p = 3 × 10−94), COL1A1 and bone density (pLoF burden p = 1.5 × 10−8) (21), KLF1 and several red blood cell traits (pLoF burden < 2 × 10−8) (22), and LRP5 (Wnt coreceptor) and bone density and osteoporosis phenotypes (pLoF burden < 5 × 10−7) (23).
Finally, we highlight novel biological signals identified in the exome dataset, enabled by the results browser. In particular, we find an association between predicted loss-of-function of SCRIB and white matter integrity of tapetum (Fig. 5). Notably, this association is not significant at any single pLoF variant, but when aggregated into a SKAT-O or burden group test, the overall ablation of the transcript is associated at a p-value of 6 × 10−13 (Fig. 5A). This provides additional context to a signal observed in a recent GWAS of white matter integrity (24) averaged across regions of the brain, as well as in the body of corpus callosum (Fig. 5B). To our knowledge, this gene has not been associated in previous genome-wide association studies, although it is a constrained gene (pLI = 0.93) that shows evidence for neural tube defects in mice (25) with ultra-rare occurrences in humans (26).
Refined association between SCRIB and white matter integrity of tapetum. (A): The association between pLoF variants in SCRIB with mean OD (orientation dispersion index) in tapetum on FA (fractional anisotropy) skeleton (from dMRI data). 6 rare pLoF variants are discovered, all of which have a positive beta value (bottom). In aggregate, these variants show an association at p = 6 × 10−13. (B): A Manhattan plot of a previous GWAS of FA averaged across brain regions (top), body of corpus callosum (middle), and splenium of corpus callosum (bottom). Horizontal dashed line indicates a GWAS genome-wide significance threshold (5 × 10− 8), and vertical line indicates the location of SCRIB.
Discussion
We have generated rare variant association analysis summary statistics for 3,700 phenotypes and made these data available to the public, via bulk data downloads as well as a public-facing browser (https://genebass.org).
There are a number of limitations to our analysis. Although we have performed extensive QC to improve the reliability of these results, we urge caution in interpreting association results, particularly for the rarest binary traits (prevalence < 10−4) and for ultra-rare variants (frequency < 10−4). For pLoF variants, the median cumulative allele frequency across genes is approximately 1.5 × 10−4, suggesting that group tests at current sample sizes are only powered to detect individual gene effects for quantitative traits that capture at least 0.02% of variance, as well as diseases and traits that have a high prevalence (well above 10%; Fig. S10). This is further observed in the lack of asymptotic properties of the mixed-model tests for rarer binary traits (Fig. S9). Nonetheless, global biological trends are apparent, such as the relative ordering of functional impact (pLoF > missense > synonymous; Fig. 3), highlighting that the ability to accurately annotate variants with the functional consequences on a gene is critical to powering discovery in rare variant analysis. Further, measures of natural selection at the gene level continue to highlight that certain classes of genes, such as LoF-intolerant genes, are clearly enriched for phenotypic associations.
Finally, these association analyses were only performed for individuals of European ancestry, the largest group in the dataset. Notably, these analyses only interrogate a slice of human genetic diversity, and expanding to additional ancestries has been shown to increase power and resolution for genetic discovery (27–29); however, as the sample sizes of non-European individuals in the UK Biobank dataset are very limited, these analyses would be underpowered for most binary traits including many disease outcomes. Additional datasets with large sample sizes of diverse individuals will be required to overcome these limitations.
Biobank teams
Abbvie
Steering team: Jeff Waring, Howard Jacob, J. Wade Davis
Data management team: A. Jason Grundstad, Silvia Orozco
Extended Scientific team: Bridget Riley-Gillis, Sahar Esmaeeli, Fedik Rahimov, Ali Abbasi, Nizar Smaoui, Xiuwen Zheng, Emily King, John Lee, Reza Hammond, Mark Reppell, Hyun Ji Noh
Biogen
Steering team: Ellen Tsai, Christopher D. Whelan, Paola Bronson, David Sexton, Sally John, Heiko Runz
Data management team: Eric Marshall, Mehool Patel, Saranya Duraisamy, Timothy Swan
Extended Scientific team: Denis Baird, Chia-Yen Chen, Susan Eaton, Jake Gagnon, Feng Gao, Cynthia Gubbels, Yunfeng Huang, Varant Kupelian, Kejie Li, Dawei Liu, Stephanie Loomis, Helen McLaughlin, Adele Mitchell, Nilanjana Sadhu, Benjamin Sun, Ruoyu Tian
Pfizer
Steering team: Hye In Kim, Xinli Hu, Morten Sogaard, A. Katrina Loomis, Eric Fauman, Melissa R. Miller
Data Management team: Jay Bergeron, Andrew Hill, Juha Sarimaa, Zhan Ye, Xing Chen
Extended Scientific team: Yi-Pin Lai, Jean-Philippe Fortin, Joanne Berghout, Robert Moccia, Craig L. Hyde
Acknowledgements
We thank Danielle Ciofani and Cathy Marshall for their efforts in launching this project. We thank the participants and leadership of the UK Biobank: this work was done under UK Biobank applications 26041 and 48511.