Abstract
The genome-wide burdens of deletions, loss-of-function mutations, and duplications correlate with many traits. Curiously, for most of these traits, variants that decrease expression have the same genome-wide average direction of effect as variants that increase expression. This seemingly contradicts the intuition that, at individual genes, reducing expression should have the opposite effect on a phenotype as increasing expression. To understand this paradox, we introduce a concept called the gene dosage response curve (GDRC) that relates changes in gene expression to expected changes in phenotype. We show that, for many traits, GDRCs are systematically biased in one trait direction relative to the other and, surprisingly, that as many as 40% of GDRCs are non-monotone, with large increases and decreases in expression affecting the trait in the same direction. We develop a simple theoretical model that explains this bias in trait direction. Our results have broad implications for complex traits, drug discovery, and statistical genetics.
Introduction
Genome-wide association studies (GWASs) have identified thousands of variants associated with complex traits [1, 2], most of which are in non-coding regions of the genome [3]. This suggests that a substantial proportion of phenotypic variance is likely explained by variation in gene expression [3–8].
One source of variation in gene expression is gain or loss of functional copies of a gene [9]. Such gene dosage changes can have phenotypic consequences, with aneuploidies being an extreme example [10–13]. Copy number variants (CNVs), which typically perturb the dosage of a few genes, are another source of expression variation [14]. CNVs segregate in humans [15–18] and have measurable effects on various traits and disorders [19–27], presumably by varying the expression of one or more of the genes with atypical copy numbers [9].
There has been substantial work associating the total genome-wide CNV burden to traits [19–22, 24, 25, 27–29]. An individual’s genome-wide burden is estimated by counting the total number of CNVs carried by an individual, regardless of which genes are affected. Associating this burden with a trait can be thought of as roughly estimating how much deleting or duplicating a random gene affects the trait. The large number of traits significantly associated with genome-wide deletion burden suggests that the effect of reducing the expression of a randomly chosen gene is often biased in a particular trait direction. Interestingly, genome-wide duplication burden and genome-wide deletion burden often have the same direction of effect [20–22]. That is, increasing expression has the same effect as decreasing expression on average (Appendix A). This seemingly contradicts the intuition that, for a given gene, increasing expression should have the opposite effect of decreasing expression.
To make sense of these genome-wide CNV burden test results from the perspective of what is happening at individual genes, we estimated the gene-level effects of loss-of-function (LoF) variants, deletions, and duplications using whole-exome sequencing (WES) and whole-genome sequencing (WGS) data from the UK Biobank (UKB) [30, 31].
Our analysis of 91 continuous traits revealed that, across traits, the average effects of deletions and duplications are often non-zero and usually affect the trait in the same direction. Although the data confirm the intuition that for most genes deletions have opposite effects to duplications, we identified a substantial number of cases where deletions and duplications of a single gene affect a trait in the same direction. To explain these counterintuitive observations, we introduce the concept of gene dosage response curves (GDRCs), study their properties, and develop a simple model of their evolution.
Results
Gene-level burden tests for loss-of-function variants and duplications
To better understand how variants with different effects on dosage impact traits, we performed burden tests for 410 continuous traits using LoF variants, deletions, and duplications in the UKB. In the analysis presented, we focused on LoF variant burden tests because they are better powered than deletions, but deletions showed broadly similar patterns of effect (Appendix F). We took care to keep the analysis of the different variant types as similar as possible so that the burden effect sizes could be interpreted jointly. We used the gene burden test implemented in REGENIE [32]. Continuous traits were selected from various blood biochemistry, blood count, blood metabolite, and anthropometric measurements available in the UKB (Data Availability). We accounted for sex, age, batch effects, and other sources of confounding by including appropriate covariates (Methods) and performed quality control steps (Appendix F). The summary statistics from these burden tests are provided as a resource to the community (Data Availability). We selected a subset of 91 continuous traits for analysis in the paper (Data Availability).
Gene dosage response curves could explain paradoxical patterns in burden tests
Estimating the effect of genes on complex traits is central to understanding the mechanisms underlying trait biology. Changes in gene expression, which we refer to here as ‘gene dosage’, are often assumed to have a linear effect on a trait (Figure 1A). For example, transcriptome-wide association studies (TWASs) use a linear model for the relationship between predicted expression and trait value [5, 6]. However, LoF and duplication burden tests assess phenotypic effects at opposite ends of the dosage spectrum and under a linear model would have opposite directions of effect. Thus, a linear model is incompatible with the genome-wide burden effects of LoF variants and duplications having the same sign (Appendix A). Recent work on the effects of gene dosage on molecular traits shows these relationships are often non-linear [33, 34]; as we will show below, an assumption of nonlinearity is necessary but not sufficient to explain these observations.
We next extended the observations of directionality by estimating the effect of each individual gene on the trait using burden tests, and then averaging those estimates across all genes for a trait. We found that the estimated average burden effect for LoF variants was almost always in the same direction as that of duplications (Figure 1B). Curiously, however, when we look at top hits from GWASs, we find no evidence for such an effect (Figures 1C and F.8).
Taken together, these observations provide contradictory information about the relationship between dosage and trait. To explain these observations, we need to understand the gene-level heterogeneity present in the dosage-response relationships. We envision the relationship between gene expression and trait as a gene dosage response curve (GDRC): the continuous relationship between gene expression and average trait value. The core idea is that if a gene is involved in the biology of a trait, then different baseline expression values for the gene will result in different trait values on average. A given point on the GDRC is the hypothetical mean trait value of individuals whose gene expression is set to a particular level. In other words, the GDRC characterizes the relationship between expression and trait value for all possible dosage perturbations. The GDRC acts as a conceptual framework to unify various estimates of gene effects, allowing us to understand the source of non-linearity in the data.
Averaging over genetic backgrounds, a particular variant will have a specific effect on expression and thus correspond to a specific part of the GDRC. We define the origin to be the mean trait and expression value in the population (Appendix A). A burden estimate of the LoF effect, ,is an estimate of a point on the left tail of the curve, while a burden estimate of the duplication effect, ,is an estimate of a point on the right tail of the curve (Figure 1A). The effect estimated by TWAS, ,is the slope of a linear approximation to the GDRC in the region of expression levels spanned by common expression quantitative trait loci (eQTL). While we visualize LoF variants and duplications as having specific dosage effects on expression (50% and 150% respectively), our theoretical arguments will require only that they generally decrease, or increase, expression respectively to understand the general characteristics of GDRCs.
Intuition suggests that GDRCs should be monotone, meaning that reducing and increasing expression should have opposite phenotypic effects (Figure 1D) [24, 25, 27, 35, 36]. The genome-wide average burden effects that we computed can be interpreted as points on the average GDRC obtained by averaging the GDRC of each gene for a given trait. If average deletion and duplication effects are in the same direction, it would imply a non-monotone GDRC. It therefore seems surprising that aggregating presumably monotone GDRCs would result in a non-monotone average GDRC.
We suggest two models that could explain observations of non-monotone average GDRCs. In the monotone model, the non-monotone average GDRC can be explained using only monotone gene-level GDRCs if the monotone curves are systematically buffered against one trait direction (Figure 1E). In the non-monotone model, a subset of GDRCs are non-monotone, and are more often non-monotone in a particular direction, driving the average GDRC to be non-monotone (Figure 1F). These models are not mutually exclusive and may both contribute to the non-monotone average GDRC observed for a trait.
To understand the relative contribution of these models to the average burden effects, we started by looking at top genes for various traits (Figure 2). As we expected, most top genes have monotone effects on traits. However, surprisingly, a sizable minority of the top hits had non-monotone effects. Thus, both models could plausibly contribute to the observed average GDRCs.
Most traits are monotone on average
To better understand the contribution of our proposed models to the non-monotone average burden effects (Figures 1E and 1F), we wanted to understand the overall monotonicity of GDRCs, beyond just significant hits.
We started with the observation that monotone GDRCs imply that LoF variants and duplications should have effects in opposite directions (Figure 3A), which we visualized by plotting the LoF burden effect against the duplication burden effect. If a duplication and an LoF variant have a large effect in the same direction, the GDRC would be non-monotone and the product of their effects would be positive. If the GDRC is monotone, then LoF variants and duplications will have opposite effects, and their product will be negative. We define ϕ to be a normalized version of the negative product of the effect sizes and term this the monotonicity (Methods, Appendix B). ϕ is positive if genes with large effects on a trait tend to have monotone effects on average.
Across traits, we found that the burden signal was explained predominantly by monotone GDRCs, although five traits had significantly negative monotonicity estimates (Figures 3B and F.9). To get a sense of the average level of monotonicity across traits, we inferred a distribution over the monotonicity estimates across traits, which confirmed that both positive and negative ϕ may be observed (Figure 3B). 71% of the density of the inferred distribution was over positive values of ϕ, while the rest was negative.
Our monotonicity estimates do not explain the non-monotone average burden effect. For example, while forced vital capacity (FVC) has a high degree of non-monotonicity and a non-monotone average GDRC, height and cystatin C both have a high degree of monotonicity but also have non-monotone average GDRCs (Figure 3B). However, the predominance of positive values of ϕ suggests that most of the overall burden signal comes from monotone genes. This matches our biological expectations of monotone gene-level effects, and is potentially consistent with the monotone model.
Non-monotone genes align with the direction of average effects
The distribution of monotonicity estimates suggests that the genome-wide signal is composed of both monotone and non-monotone GDRCs. To systematically assess if we could detect non-monotone GDRCs, we ascertained genes with marginally significant LoF and duplication effects (|z| > 2) in any trait. We identified 633 gene-trait association pairs across all of our traits, of which 287 (45.3%) exhibited a non-monotone relationship (Figure 4A). This suggests that the burden data is potentially consistent with the non-monotone model.
The non-monotone model (Figure 1F) proposes that there are more non-monotone curves aligned with the average burden effect than not. To assess this, we polarized ascertained gene-trait pairs based on the direction of the average burden effect. Consistent with our model, non-monotone genes were significantly more likely to be found in the quadrant concordant with the average effect (Figures 4A and 4B). As a specific example of a gene that falls in the concordant quadrant, the average LoF and duplication burden effect is to reduce hand grip strength (Table F.1), and one of the top non-monotone genes, G6PC1, reduces hand grip strength upon perturbation in either direction (Figure 2).
The presence of non-monotone GDRCs that contribute strongly to the average burden effect is surprising. We can use a simple mechanistic model to see why such non-monotone GDRCs are unexpected. Consider a gene that affects a trait via a single biological pathway (Figure 4C). Intuition about biological mechanisms suggests that each step along a pathway should be monotone, but possibly non-linear. For instance, protein levels are often buffered against large changes in gene expression [9], which results in a non-linear, but monotone response of protein levels to gene expression (Figure 4C). Similarly, perturbations of the expression levels of transcription factors can result in non-linear changes in the expression of their targets [33, 34]. As the dosage effect percolates through a pathway towards the focal trait, these curves compose with one another to form the relationship between gene dosage and trait. If each curve along the pathway is monotone, then the overall relationship between gene expression and trait is mathematically guaranteed to also be monotone (Figure 4C, Appendix A).
However, non-monotone GDRCs can arise if a gene’s effects flow through two or more pathways (Figure 4D). 3-hydroxybutyrate levels, which increase in blood during ketogenesis, provide an interpretable example. CYP11B1 has a non-monotone relationship with 3-hydroxybutyrate levels in the blood, with both LoF variants and duplications increasing 3-hydroxybutyrate on average ,CYP11B1 encodes a key adrenal enzyme involved in the production of cortisol. Deficiency of the enzyme encoded by CYP11B1, referred to as 11β-hydroxylase deficiency, is a cause of a congenital adrenal condition that results in reduced cortisol levels and subsequent hypoglycemia [37]. Without sufficient glucose levels, the body generates increased amounts of ketones, especially 3-hydroxybutyrate, to serve as an alternative energy source [38]. Meanwhile, duplications in CYP11B1 can lead to increased production of cortisol, which over time can lead to the development of Cushing’s syndrome and insulin resistance [39, 40]. This insulin resistance can then lead to a state of perceived hypoglycemia, resulting in increased ketone levels, especially increased 3-hydroxybutyrate, in the blood [41]. Generalizing these biological principles, non-monotone GDRCs can arise whenever the effect of a gene propagates through multiple pathways to affect the trait (Figure 4D).
Overall, non-monotone GDRCs exist and can have reasonable biological underpinnings. The number of non-monotone GDRCs is imbalanced, and suggestively aligns with the average burden effect across traits. This provides support that our second model contributes to the average burden effect.
Genes are buffered against one trait direction
The analysis of monotonicity and the imbalance in non-monotone GDRCs suggest that both the monotone and non-monotone models are plausible explanations of the non-monotone average GDRCs. If the mixture of GDRCs for the trait are depressed in one direction, then the average GDRC will be non-monotone (Figure 5A). Another interpretation of these curves is that traits are more likely to be perturbed in one direction than the other regardless of the direction of expression perturbation. Therefore, this is a form of buffering that occurs at the trait level. We refer to this as trait buffering. We endeavored to quantify how much each of our models contributed to the non-monotone average GDRC by measuring and decomposing the trait buffering for various traits.
In the burden effect plot, curves experiencing trait buffering preferentially map to a region that is away from a diagonal line (Figure 5A). The diagonal line represents perfectly linear GDRCs (Appendix E). Points above the diagonal line correspond to GDRCs that are buffered against negative trait values, and points below the diagonal line correspond to GDRCs buffered against positive trait values.
To develop some preliminary intuition about the trait buffering model, we estimated γLoF and γDup for each gene. Since burden effect estimates are noisy for individual genes, we performed Bayesian inference using a flexible multivariate adaptive shrinkage (MASH) prior [42], which models the joint distribution of γLoF and γDup as a mixture of bivariate normal distributions. This prior distribution is learned from the data. We used posterior samples of γLoF and γDup for each gene to better understand the distribution of genes on the burden effect plot. As an example, the posterior means of γLoF and γDup for height are displayed in Figure 5B. Both monotone and non-monotone GDRCs are predominantly below the diagonal, concordant with the direction of the average burden effect.
To formalize our intuition, we developed a statistic to measure this trait buffering, ξ, that aggregates the signed strength of deviation from the diagonal line across all GDRCs (Appendix E). We calculate the statistic using posterior samples of γLoF and γDup.
The sign of ξ indicates which direction is being buffered against. For example, our model predicts that height should have a negative value of ξ and cystatin C should have a positive value of ξ. If GDRCs are not preferentially buffered against either direction of a trait, our statistic should be close to zero. Indeed this measure of trait buffering showed strong concordance with both the average LoF and duplication burden effects (Figure 5C).
To understand the extent to which our proposed models (Figures 1E and 1F) contributed to trait buffering, we decomposed ξ into the contribution from non-monotone and monotone GDRCs (Appendix E). These measures revealed that both monotone and non-monotone curves contribute to trait buffering (Figure 5D), with non-monotone GDRCs contributing a median of 80% to trait buffering.
Overall, traits with large non-monotone average burden effects have GDRCs that are buffered against one direction of the trait. Both of our proposed models (Figures 1E and 1F) contribute to produce the non-monotone average GDRCs, with non-monotone GDRCs contributing more strongly than monotone GDRCs.
A dysregulation phenotype may be upstream of many complex traits
To recapitulate, we found evidence for two models contributing to the observed average burden effects. These models imply that GDRCs for complex traits are organized in a manner that results in trait buffering.
Two observations initially seem to imply that the complex traits with non-monotone average GDRCs are under directional selection. First, the GDRCs for these traits are buffered against a particular trait direction. Second, the non-monotone average GDRC suggests that any variant should have an average non-zero effect in the same direction regardless of whether the variant increases or decreases expression.
To see why these observations are signatures of directional selection, we consider how the average expression of a given gene will evolve over time in response to directional selection. Suppose a gene affects a trait that is under negative selection (Figure 6A). Then, selection will tune the expression of the gene until the trait is minimized, resulting in non-monotone GDRCs where both deletions and duplications increase the trait. Thus, the non-monotone average GDRC and trait buffering in the direction of the average burden effects suggests that the GDRCs of these complex traits are shaped by directional selection.
However, prior analyses using GWAS data for many of these traits suggest that they are under a different mode of selection – stabilizing selection [43–47]. Additionally, we do not see a non-zero average effect in GWAS data (Figure 1C, Appendix F), inconsistent with directional selection.
Stabilizing selection affects GDRCs differently than directional selection. Suppose a gene affects a trait under stabilizing selection (Figure 6B). Stabilizing selection acts to maintain an optimum trait value, and will tune the expression of the gene to reach the optimum trait value. Unlike in the case of directional selection, the local shape of the GDRC around the optimum trait value is not restricted to minima. In this case, we do not expect any particular GDRC shape to dominate, and we do not expect LoF variants or duplications to be biased in a particular direction.
To explain the incongruous non-zero average burden effects and zero average GWAS effect, we propose that traits with non-monotone average GDRCs are downstream of an unobserved trait under negative selection that we refer to as ‘dysregulation’ (Figure 6C). If the focal trait is downstream of dysregulation, the GDRC of the gene for the focal trait may be shaped by both forms of selection depending on if the gene is a direct or indirect regulator of the focal trait (Figure 6D). For genes that act primarily on the trait via dysregulation, the GDRCs will be non-monotone and in the direction of increasing dysregulation. For genes that directly modulate the focal trait, we expect no particular preference for monotone or non-monotone GDRCs. Genes that act on the focal trait both directly and via dysregulation explain the monotone curves that experience trait buffering.
Overall, this model explains the non-monotone average effects that we observe in the data. Depending on the balance between direct regulation or indirect regulation via dysregulation, we can still have predominantly monotone GDRCs but have a consistent bias toward one direction of the focal trait.
Our model also explains why the non-zero average effect is apparent in LoF variants and duplications across various traits, but is not discernible in GWAS effect sizes. Consider two variants that have the exact same effect on the focal trait. One variant affects the trait directly, while the other affects the trait via dysregulation. Under our model, the second variant that acts via dysregulation can only affect the focal trait in one direction because it has a non-monotone GDRC, while the first variant has no such restriction. Both variants will experience the exact same fitness consequences from stabilizing selection on the trait. However, the variant acting via dysregulation will experience additional fitness consequences. Thus, holding all other factors equal, the variant acting via dysregulation will have a lower allele frequency and be more challenging to detect in a GWAS, thus making it less likely to contribute to a non-zero average effect.
We believe that this unobserved latent phenotype is a measure of dysregulation because the down-stream traits with large non-monotone average burden effects are readouts of general dysfunction. The traits with significant non-monotone average effects (Table F.1) include measures of organ dysfunction such as alkaline phosphatase, cystatin C, and C-reactive protein, which all have positive average LoF and duplication effects. Put another way, both deletion or duplication of a random gene tend to increase disease risk and measures of organ dysfunction. Similarly, measures of lung capacity and muscle strength have negative average LoF and duplication effects. Overall, the average effect is often in the direction of dysregulation across the continuous traits we studied.
Recent analyses of predicted biological age in the UKB also support the presence of a dysregulation phenotype [48, 49]. Age is associated with general dysregulation and disease [50]. Strong predictors of biological age in predictive models include FEV1, cystatin C, HbA1c, alkaline phosphatase, and hand grip strength [48, 49], suggesting that these continuous traits with large non-monotone average effects may act as readouts of dysregulation.
Discussion
In this study, we interpreted dosage-perturbing rare variants through the lens of the GDRC. Our analysis suggests that both monotone and non-monotone GDRCs contribute to the concordant average burden effects of LoF and duplication variants. This results in GDRCs that are buffered against one trait direction. We hypothesized that this may be explained by the effect of genes on an upstream dysregulation phenotype. The shape of GDRCs may thus be molded by the effects of directional selection.
Implications of the observed characteristics of gene dosage response curves
In a “linear world” where dosage has a linear relationship with trait, ascertaining any point on the GDRC along the dosage spectrum would carry all the information about the gene’s effect on trait. However, biology is inherently non-linear, and our analysis shows that these assumptions break down for variants with large effects. This has significant implications for drug design. If the GDRC of the target gene is non-monotone, perturbation in either direction may only increase disease risk. This occasional lack of concordance between LoF and duplication burden effects motivates the use of assays that test both ends of the dosage spectrum [51]. The shape of the GDRC implied by the burden estimates can also be useful prior for fine-mapping causal variants [52] and linking variants to genes [53–57].
The assumption of linearity in TWAS [5, 6, 58] and Mendelian randomization [59] approaches may cause estimated effect sizes to be close to zero for non-monotone genes. The ability of TWAS to recover trait-associated genes with non-linear GDRCs needs to be explored further. Yet, TWAS-type methods would be invaluable for understanding the behavior of the GDRC in the physiological range of expression, thus motivating the continued development of such methods.
Towards the inference of gene-level curves across tissues and contexts
Moving from genetic associations towards mechanistic insights for complex traits remains challenging. The GDRC captures a large fraction of the biological insights that we want to extract from association studies. Although inference of the GDRC may be challenging, it is worth considering what aspects of the GDRC we can estimate using population-level and experimental approaches.
In principle, each tissue or cell type has its own specific GDRC for a gene-trait pair. In our approach, we focused on the global GDRC – which we could interpret as a weighted sum of tissue-specific GDRCs – by using variants that affect all tissues and contexts to understand the aggregate effect of dosage perturbation on trait. Modern approaches that estimate the effect of genes in specific tissues or cell types, such as TWASs, can be thought of as estimating or approximating aspects of tissue-specific GDRCs.
Even with methodological advances, it is unlikely that we will be able to infer all points of the GDRC of a gene with population-level data alone. The effects of selection make it challenging to detect variants in genes with large effects on traits [60, 61]. This suggests that the ability to effectively infer the GDRC will vary across the genome, and likely be most challenging for genes that are particularly important to the biology of traits.
These limitations of population-level data suggest that sequence-to-expression models [62–65] and experimental approaches [66–68] will play a critical role in fleshing out the GDRC for key genes. Missense variants, which in aggregate have a measurable effect on complex traits [30], are challenging to integrate into the GDRC framework. Accurate computational effect prediction [69] or experimental approaches such as deep mutational scanning [70] are critical for translating the effect of missense variants into an “effective dosage”, which would allow them to be placed on the GDRC. Similarly, for common variants, we are restricted to using dosage effect estimates from eQTL studies in tissues that have been assayed, which are often post-mortem and from adult donors [71]. However, we cannot feasibly assay all tissues and cell types in all contexts. Thus, predicting common variant effects from sequence and multi-omic context or using experimental approaches such as massively parallel reporter assays [67] will be necessary to estimate full GDRCs.
In summary, we have explored the nature of the relationship between gene dosage and trait for various complex traits. Our final model explains the initial observation of non-monotone average effects and provides intuition for how these relationships may be shaped by natural selection. The insights about GDRCs underlying complex traits will be informative for various downstream applications, including therapeutics and methods development.
Methods
The goal of our data analysis approach was to estimate burden summary statistics for LoF variants, whole-gene deletions, whole-gene duplications, and partial deletions using a consistent set of decisions and tools. This consistency between the burden tests allowed us to compare effect sizes across the variant classes. We performed burden test using synonymous variants as a negative control to estimate any residual confounding, which we detail in Appendix F. We improved upon previous LoF burden approaches by using variable frequency filters based on gene constraint (Appendix F). We also improved upon prior duplication burden tests in the UKB [23–25, 27] by using copy number calls from WGS data rather than microarray data. CNV calls from WGS have the advantage of more accurate copy number estimation, higher resolution of CNV breakpoints, and better detection of rare CNVs [72].
To make our approach amenable to researchers without access to individual level data, we modeled the summary statistics rather than the underlying genotype data. We use the regression with summary statistics (RSS) likelihood [73], which provides a likelihood model for association test summary statistics. To our knowledge, this is the first use of the RSS likelihood for burden regression. To demonstrate its applicability and correctness, we performed extensive simulations, which are catalogued in Appendix D.
Individuals
We followed a protocol similar to a prior analysis of rare LoF variants in the UKB [30]. We use all 460K individuals with WES data in the UKB data. This is slightly larger than the fraction used by Backman et al., who focused on 430K individuals with genetic similarity to the EUR “superpopulation” as defined by the 1000 Genomes Project. Although relatedness and population structure can introduce environmental confounding, our pilot analyses with synonymous variants (Appendix F) indicated minimal levels of uncorrected confounding.
Phenotypes
We curated 410 continuous phenotypes in the UKB on which to perform burden tests, of which 91 were used for the main analysis in the paper. We required phenotypes to have measurements in at least 40K individuals. We also excluded any traits where more than 50% of the instances match the mode of the trait. This is to remove quasi-categorical traits, and is more relaxed compared to Backman et al., who exclude traits with a cutoff of 20%. For phenotypes that were collected more than once in returning visits, we use the first instance of the observation for each individual. Backman et al. took the mean across such instances, but we believe that this would reduce the noise for some individuals more than others in a not completely at random fashion. It would also make it challenging to interpret the relationship between mean phenotype and covariates that are observed at specific instances, such as age. Some of our traits had arrayed measurements, but they were all taken in the same visit so we decided to use the mean value across the arrayed items.
Separately, we acquired access to the full 500K baseline metabolomics data from the nuclear magnetic resonance (NMR) metabolomics data set generated by Nightingale Health PLC [74].
Genotypes
We included 15 genotyping principal components (PCs), estimated previously in the UKB [75]. Backman et al. performed ancestry-specific analyses and included 10 genotyping PCs for each ancestry. Since we used all 460K WES samples, we included an additional 5 PCs to address potential stratification. Our analysis of synonymous variants (Appendix F) supported this choice.
Following the lead of Backman et al., we included 20 rare variant PCs, by which we mean PCs derived from rare WES variants. We subsampled 300K variants uniformly at random from the rare fraction, which we defined as variants with minor allele count (MAC) greater than 20 and minor allele frequency (MAF) less than 1%. We used an approximate principal components analysis (PCA) algorithm implemented in FlashPCA2 [76].
We restricted our burden analysis to LoF sites with MAF less than 1%. High-confidence LoF sites were annotated using Ensembl’s Variant Effect Predictor [77] with the LOFTEE plugin [78]. LOFTEE filters remove annotated LoF sites that might escape nonsense-mediated decay (NMD). Instead of using progressively stricter MAF cutoffs as in Backman et al., we use all LoF sites with MAF less than 1% but require that the misannotation probability be less than 10%. The misannotation probability [61] takes into account both the frequency of the site and the constraint experienced by the gene to determine a posterior probability of being misannotated as a LoF. These probabilities were previously reported for LoF-introducing single nucleotide polymorphisms (SNPs) for genes with estimated shet values [61]. A small number of genes do not have shet estimates, and not all variants in the WES data were SNPs. For genes with missing shet values, shet was imputed using the mean shet value. Then, for all LoF variants missing a misannotation probability, misannotation probabilities were imputed using k-nearest neighbors (kNN) regression with 10 nearest neighbors as determined by allele frequency and shet. A 10% cutoff on this probability retains 96.46% of variants. We demonstrated that relative to standard frequency cutoffs, using a misan-notation probability cutoff increases the signal in the burden tests without increasing the size of standard errors (Appendix F).
CNVs were previously called in the UKB using Illumina’s DRAGEN software [31, 79]. A deletion was defined as any loss that affected the entire gene body with an additional 1 Kbp flanking region. A duplication was defined as any gain that affected the entire gene body with an additional 1 Kbp flanking region. Any deletions that affected only part of the gene body were categorized as potential loss-of-function (pLoF) alleles. For deletions and pLoF variants, the alternative allele coded the number of copy loss events. For duplications, a copy number of two was coded as the reference homozygote and a copy number greater than two was coded as a heterozygote. We observed that some samples had a large amount of genome-wide copy gain or loss, which is likely a genotyping error (Appendix F). To account for this, we removed any sample with greater than 10 Mbp of deleted or duplicated genome sequence.
We used GENCODE version 39 to define gene intervals [80]. We excluded genes on the Y chromosome and the mitochondrial genome.
Burden Tests
We used REGENIE to perform gene-level burden tests [32]. For the whole-genome regression, which is the first step in REGENIE, SNPs from the genotyping array were pruned with a 1000 variant sliding window with 100 variant shifts and an R2 threshold of 0.9. SNPs were then filtered to have MAF > 1%, genotype missingness < 10%, and Hardy-Weinberg equilibrium (HWE) test p-value < 10−15.
For LoF burden tests, we used the following covariates: age, sex, age-by-sex, age squared, 15 geno-typing PCs, 20 rare variant PCs, and WES batch. For the duplication, deletion, or pLoF burden tests, we used the following covariates: age, sex, age-by-sex, age squared, 15 genotyping PCs, 20 rare variant PCs, and WGS batch. When analyzing NMR metabolites, we additionally included the spectrometer and processing batch as covariates.
After whole-genome regression, we used the second step of REGENIE to perform burden tests. Phenotypes were rank-inverse-normal transformed in both the first and second step. The same covariates were used in both steps.
Monotonicity Model
We modeled the standardized trait of interest Y in N individuals as where X and Z are the burden genotypes from LoF variants and duplications respectively. The error ϵ was assumed to be drawn from a normal distribution. Polygenic signal, population stratification, and other sources of confounding were accounted for when performing the burden tests by including covariates. We built a hierarchical model by specifying a distribution on the latent effect sizes of the jth gene:
This allowed us to model the quantities of interest. The average effect of deletions and duplications is represented by and respectively. We defined the monotonicity, ϕ ∈ [−1, 1], to be a quantity that is proportional to the negative uncentered second moment of the latent effect sizes,
Thus, if genes with large effects tend to have opposite signs on average, ϕ will be closer to 1. Similarly, if genes with large effects tend to have the same sign on average, ϕ will be closer to −1.
The observed burden effect sizes for the jth gene, and ,are noisy estimators of the underlying latent effect size. Furthermore, due to correlation between duplication burden genotypes, is not an unbiased estimator of γ2j because the observed effect size is inflated by the effect of nearby genes. To account for these concerns, we utilized the RSS likelihood [73]. An extensive description of the model is provided in Appendix B.
We developed unbiased estimators of and .Since the monotonicity is a complex function of the parameters of the prior, it was challenging to derive unbiased estimators. Instead, we opted to use maximum likelihood estimation for ϕ, which guarantees consistency. Inference and simulations under this model are described in Appendices C and D respectively.
Buffering Model
Since the monotonicity, ϕ, is a function of first and second moments of the latent effect sizes, we used a normal distribution as the prior. However, our buffering model proposed a more complex hypothesis about the distribution of the latent effect sizes, which required a more flexible prior over the latent effect sizes.
To address this, we modified our monotonicity model to use the MASH model prior [42], which is a flexible multivariate prior. We continued to use the RSS likelihood to model the observed effect sizes. MASH uses a mixture of K multivariate normal distributions to model the latent space:
Since the true latent distribution is unknown, MASH uses a grid of fixed covariance matrices and learns the mixture weights π using empirical Bayes. In its original implementation, MASH assumes independence between observed samples, allowing for efficient optimization. Since the RSS likelihood necessarily induces dependencies between the observations via linkage disequilibrium (LD), we implemented a stochastic approximation expectation maximization (SAEM) algorithm to optimize π [81, 82]. An alternative approach to this model has been proposed that uses variational inference [83–85].
As we defined it, the amount of directional buffering can be estimated by taking the dot product of the latent effect size with the normal vector of the hyperplane separating the two types of buffering, which we explain further in Appendix E. We used the posterior distribution over γ to estimate which corresponds to the posterior expectation of the dot product given the observed data. To determine how much is contributed to ξ by monotone or non-monotone genes, we conditioned on the type of the GDRC:
Using the law of total expectation, we can decompose ξ into the contribution from non-monotone and monotone GDRCs as
ξ is zero under the null model where GDRCs are equally likely to be buffered against increasing or decreasing the trait. The details of the model can be found in Appendix E.
Data Availability
All genetic and health data was acquired from the UK Biobank, a biomedical database containing information from half a million UK participants. Data for genetic association analysis of continuous traits was acquired under application 52374. Data for genetic association analysis of metabolite traits was acquired under application 30418. These data are available upon application to the UK Biobank. All summary statistics generated from genetic association analysis and other processed data files are deposited in Zenodo.
Data Analysis
Throughout, we use unbiased maximum likelihood estimates (MLEs) for the duplication burden estimates when they are displayed in the paper, which account for LD with neighboring genes.
In Figure 1B, we used total least squares to fit a regression line with no intercept. Weights in the regression corresponded to the inverse variance of the estimates. To provide a better fit, we used multiple initial parameter values. We retrieved GWAS summary statistics from the UKB to generate trumpet plots (Figures 1C and F.8). The collection and analysis of these data is described in Appendix F. We fit a locally estimated scatterplot smoothing (LOESS) curve to visualize the conditional mean of the derived allele effect given the derived allele frequency.
In Figure 2, top gene-trait pairs were ascertained with |z| > 3 for both LoF and duplication burden estimates. Gene-trait pairs were then ordered by since we are more confident in the LoF estimate, and the top trait was chosen per gene. Thus, each gene was only represented once in the plot. Then, we ordered gene-trait pairs by |zLoFzDup| and chose the top hit per trait. By doing so, each trait also appears only once in the plot.
We inferred a distribution of effects for ϕ across traits for Figure 3B using a hierarchical model that was fit using empirical Bayes. The model is described in Appendix F.
In Figure 4A, top hits were ascertained as gene-trait pairs with |zLoF| > 2 and |zDup| > 2 for traits with significant average burden effects (Table F.1). Then, to plot polarized Z scores, we multiplied the Z score by the sign of the average burden effect. This means that genes with positive burden effect estimates are in the same direction as the average burden effect. Since both the LoF and duplication burden effects are noisy estimates, we use the first PC of the Z scores to represent the trend of the data. The data was not centered or scaled before running PCA.
We estimated confident separation from the diagonal line in Figure 5B using the local false sign rate [86]. This rate is calculated using the number of posterior samples that fall on either side of the diagonal for a given gene. A rate of less than 0.05 implies that less than 5% of the posterior density is on one side of the diagonal. Figure 5C represents the sample plot as Figure 1B. Points are colored based on the sign of ξ and the local false sign rate. Traits with significant average burden effects (Table F.1) were used in Figure 5D. The proportion of contribution from non-monotone GDRCs was estimated as
Data Availability
All genetic and health data was acquired from the UK Biobank, a biomedical database containing information from half a million UK participants (https://www.ukbiobank.ac.uk/). Data for genetic association analysis of continuous traits was acquired under application 52374. Data for genetic association analysis of metabolite traits was acquired under application 30418. These data are available upon application to the UK Biobank. All summary statistics generated from genetic association analysis and other processed data files are deposited in Zenodo at https://doi.org/10.5281/zenodo.13852455.
Code Availability
Scripts used to process and analyze data are deposited in Zenodo at https://doi.org/10.5281/zenodo.13864152. Scripts were executed either on the UK Biobank Research Analysis Platform (https://ukbiobank.dnanexus.com/) or on the Sherlock High-Performance Computing Cluster managed by the Stanford Research Computing Center (https://www.sherlock.stanford.edu/ ).
Author Contributions
N.M. Methodology, Software, Formal Analysis, Data Curation, Writing – Original Draft, Visualization. C.J.S. Methodology, Software, Writing – Original Draft, Visualization. H.Z. Methodology, Software, Writing – Original Draft, Visualization. J.P.S. Conceptualization, Methodology, Writing – Original Draft, Supervision. J.K.P. Conceptualization, Methodology, Writing – Original Draft, Supervision, Project Administration, Funding Acquisition.
Acknowledgments
We would like to thank the members of the Pritchard lab for their feedback on this project and manuscript. In addition, we thank Yi Ding and Zeyun Lu (Gusev Lab) for reading over and providing feedback on the initial draft. This research has been conducted using data from the UK Biobank (https://www.ukbiobank.ac.uk/), a major biomedical database. We would like to thank all participants and researchers involved in the UK Biobank for their contribution, and Nightingale Health PLC for early access to the NMR metabolomics data. Computing for this project was performed on the Sherlock High-Performance Computing Cluster. We would like to thank Stanford University and the Stanford Research Computing Center for providing computational resources and support that contributed to this research. N.M. was supported by a National Science Foundation Graduate Research Fellowship and Stanford’s Knight-Hennessy Scholars Program. C.J.S. was supported by Stanford’s Knight-Hennessy Scholars Program and the Stanford Center for Computational, Evolutionary and Human Genomics. H.Z. was supported by Stanford Biology Department’s graduate student assistantship. This work was supported by the National Institutes of Health (R01HG011432 and R01HG008140 to J.K.P.).
Appendices
A Gene Dosage Response Curves
Genome-WIDE BURDEN
We retrieved genome-wide burden effect estimates from work by Auwerx et al., where burden was defined as the number of genes affected by deletions or duplications [1]. We subset to traits where both the genome-wide deletion and duplication burden effects had a p-value of less than 0.05. In all cases, the deletion and duplication burden effect is in the same direction.
Scale of the dosage and trait
The scale and shape of the curve depend on how gene expression and the trait are measured. We assume that expression is measured on a logarithmic scale relative to the population mean. This allows us to define the expression effect of loss-of-function (LoF) and duplication variants on a logarithmic scale such that expression-decreasing variants have negative expression values and expression-increasing variants have positive expression values. A heterozygous LoF genotype thus has an expression effect of , and a duplication resulting in a copy number of three has an expression effect of .For traits, we chose to use an inverse rank normal transform. By centering the trait, we implicitly define trait-decreasing variants as having negative effects and trait-increasing variants as having positive effects.
The consequence of centering both expression and trait is that the gene dosage response curve (GDRC) always passes through the origin.
Composition of monotone curves
Let γ : E ↦ Y be the GDRC for a gene. Suppose that the dosage effect percolates to the trait via a path, as diagrammed in Figure 4C. Consider a path consisting of two edges with associated functions f : ℝ → ℝ and g : ℝ → ℝ. The GDRC along this path will then be Y = γ (E) = f (g (E)) = (f ° g) (E). Here, f and g can are the curves of individual arrows along the path.
Now, suppose that g is a monotone increasing function. That is, x ≤ y implies that g (x) ≤ g (y) for any x, y ∈ ℝ. Similarly, suppose that f is a monotone increasing function. Then, x ≤ y implies that g (x) ≤ g (y), which implies that f (g (x)) ≤ f (g (y)). Thus, the composition (f ° g) is also monotone increasing. Similar arguments for monotone decreasing functions or pairs of increasing and decreasing functions demonstrate that the composition of monotone functions is also monotone.
B Monotonicity Model
Here, we endeavor to jointly model burden summary statistics from different points on the dosage spectrum. Burden summary statistics are often reported as an effect size and a standard error from regression. We use hierarchical models to account for the sampling error and pool information across genes.
Let Y ∈ ℝN be a standardized phenotype of interest, measured in N individuals. Let X, Z ∈ ℝN×M represent the burden genotype matrices for two variant classes respectively. By variant class, we mean a set of variants with a dosage effect in the same direction, such as deletions, duplications, or LoF variants. We assume that there are M genes that are polymorphic for both classes of variants. The genotypes are encoded as copies of the dosage-perturbing allele.
We use the following linear system to model the phenotype:
Here, γ1, γ2 ∈ ℝM are the unobserved, per-allele effect sizes of perturbations of each gene on the phenotype. The residual error is represented by ϵ ∈ ℝN and is assumed to be drawn from an isotropic normal distribution,
Distribution OF Effect Sizes
We model the latent effect sizes as coming from an uncentered normal distribution. Effect sizes between genes are assumed to be independent, but effect sizes for the same gene covary between the variant classes. That is to say, the effect of a deletion or LoF variant is assumed to carry some information about the effect of a duplication of the same gene. This covariance structure is represented with diagonal matrices Σ11, Σ12, and Σ22.
Suppose we collect the block matrices above into and Σ ∈ ℝ2M×2M. Then, we can succinctly state that
Let γ·j ∈ ℝ2 represent the latent effect sizes for the jth gene. For instance, the first coordinate may represent the LoF effect, and the second coordinate may represent the duplication effect. Then, the marginal distribution is written as
Parameter OF Interest
We consider a gene to have a “monotone” effect on trait if the dosage-reducing alleles have an opposite direction of effect compared to dosage-increasing alleles. To this end, we define the monotonicity ϕ as a value proportional to the negative uncentered second moment of the effect sizes,
Thus, a positive ϕ represents monotone behavior. To compare monotonicity across traits, we restrict its codomain to [−1, 1]. We do this using the Cauchy-Schwarz inequality, which guarantees that so that
When the effect sizes are centered (that is,), this is equivalent to the negative correlation coefficient,
Distribution OF Genotypes
We begin by assuming that there is no linkage disequilibrium (LD) between the different variant classes. This is reasonable to assume because different variants arise through different mutational processes, and burden genotypes represent aggregates of multiple rare variants which generally have negligible LD [2]. However, this assumption does not apply across genes within a variant class. Within duplications, for instance, large duplications consisting of multiple genes will induce LD between the burden genotypes.
We assume that the M burden genotypes are in Hardy-Weinberg equilibrium (HWE), which implies a binomial likelihood. Let pj represent the allele frequency of the jth gene’s dosage-perturbing allele for the first variant class and let qj represent the allele frequency of the jth gene’s dosage-perturbing allele for the second variant class. Without loss of generality, we assume that the genotypes are centered but not scaled:
There is no LD between variant classes,
However, there is LD between the same variant class. The correlation matrices R1 and R2 are used to represent this LD. Let and represent the heterozygosity of the burden genotypes for the jth gene. Then, where
Since LoF variants affect individual genes, and we restrict ourselves to analyzing rare variants, we assume that the LoF burden genotypes are independent and that there is no LD between them. That is, R1 = I. Since deletions and duplications can span multiple genes, the burden genotypes are not independent.
Burden Regression
Burden test results are reported as summary statistics from regression between the burden genotype and the phenotype. Regression is performed on centered and scaled phenotypes. The marginal summary statistics are calculated based on the following model for each gene:
The effect size is approximately the following. Note that and ,so
The standard error is approximately the following. Here, we assume that any individual gene explains a small fraction of the total variance. That is, and . Thus,
Therefore, the Z scores are
C Monotonicity Inference
Maximum Likelihood Estimation
For inference, we will use the summary statistics directly rather than the underlying genotype data. That is, we are given the estimated effect sizes (and) and their standard errors (Ŝ1 and Ŝ2). To simplify notation, we represent the prior as where θ and ω represent values used to parameterize the priors.
Next, the likelihood of the estimated effect sizes is defined using the approximate regression with summary statistics (RSS) likelihood [3]. Under this likelihood, the estimated effect sizes are patterned by the LD in the cohort. Furthermore, the standard error of the sampling distribution is used as an estimate of the dispersion around the mean. Zhu et al. showed that this likelihood asymptotically approaches the sampling distribution of the estimated effect sizes [3]. Let and , and let and represent the in-sample correlation matrices of the two variant classes respectively. Then, in notation reflecting the RSS likelihood, we define
Then, the RSS likelihood for the observed effect sizes is
Some genes are in perfect LD with each other. That is, the sample correlation matrix is not strictly positive definite. To improve numerical stability and to account for perfect LD, we project the data onto the linear subspace of dimension L < M spanned by the LD matrix (that is, a projection orthogonal to the null space). Consider the following eigendecomposition, with a matrix with orthogonal columns of eigenvectors U ∈ ℝM×L and a diagonal matrix of eigenvalues Λ ∈ ℝL×L,
In practice, these are derived by dropping small eigenvalues from the numerical eigendecomposition of . Rather than modeling the observations , we model the projected data,
Note that under this model, the marginal likelihood of the estimates is
To simplify notation, let and let K = AΣA⊤ + Λ. Then,
The pseudoinverse A+ is a useful quantity in downstream derivations. Since the rotation of the data is defined, we can construct a pseudoinverse using components of the eigendecomposition. The pseudoinverse is
This pseudoinverse is a right pseudoinverse, as can be seen by
The other properties of the pseudoinverse are readily confirmed using matrix algebra.
Inference
We maximize the marginal likelihood of the observed effect sizes with respect to the parameters of the model. The likelihood is
The log likelihood is
We are interested in obtaining the maximum likelihood estimates and such that
For maximum likelihood estimation of θ and γ, we use natural gradient ascent [4]. This is an optimization approach that uses both first- and second-order information about the log likelihood via the gradient and Fisher information matrix. The gradient of the log likelihood function with respect to is
Using the matrix chain rule, the derivative of the log likelihood with respect to one of the mean parameters is
Therefore, it follows that the gradient of the log likelihood with respect to the mean parameters is
The matrix derivative of the log likelihood with respect to K is
The derivative of K with respect to a given covariance parameter is
Using the matrix chain rule, the derivative with respect to a given covariance parameter is
The Fisher information for multivariate normal distributions has a special form [5] such that the mean and covariance parameters do not share any information,
The Fisher information of the mean parameter is
The Fisher information of the covariance parameters is
Natural gradient ascent involves Newton-Raphson updates with the Fisher information matrix. A dampening parameter 0 < αt ≤ 1 is chosen using a backtracking line search [6] at each iteration to improve stability and serve as a stopping condition,
In practice, we estimate the gradients and Fisher information matrix for each chromosome separately and sum them up because effect estimates are assumed to be independent across chromosomes. The derivatives and are computed using automatic differentiation[7,8]
Uncertainty Estimation
Let and be the maximum likelihood estimates (MLEs) of θ and ω respectively. Let ψ (θ, ω) be any continuous map from the parameters to a real number. ψ can itself be interpreted as a parameter of the model. In our model, ϕ is an example of such a parameter. A natural estimate for ψ is to map the MLEs,
By the properties of the MLE, the estimator converges to the following in distribution.
By the delta method, the estimator for ψ converges to which can be used to derive approximate confidence intervals. Confidence intervals for ϕ were estimated using this method.
Fixed-Effects Model
In addition to consistent estimators provided by maximum likelihood estimation, it is useful to have unbiased estimators of parameters as well. In the fixed-effects model, we assume that γ is fixed, implying that the observations have the following likelihood:
For unbiased estimation, the projection matrix A+A is used frequently throughout our derivations. To see that it is a projection matrix, note that
Then, it follows that
This makes A+A an idempotent matrix and implies that it is a projection matrix.
Mean Effect
Suppose that the mean effect size is γ1 for the first variant class and γ2 for the second variant class. Then, are unbiased estimators for these mean effect sizes under certain conditions discussed below. Since the estimator is the sum of normal random variables, we can easily construct confidence intervals from the variance. A straightforward computation shows that the variances of the estimators are
These estimators are guaranteed to be unbiased if Ŝ−2 [1 0]⊤ and Ŝ−2 [0 1]⊤ are orthogonal to the null space of U⊤.
To see that and are unbiased under these conditions, we will focus on the former without loss of generality. In expectation,
The last equality holds because A+A is block diagonal. Note here that A+A is a projection matrix and has the form A+A = Ŝ2UU⊤Ŝ−2. Therefore, for our estimator to be unbiased, we require that which is true if Ŝ−2 [1 0]⊤ is orthogonal to the null space of U⊤. A similar derivation demonstrates that for to be unbiased, Ŝ−2 [0 1]⊤ must be orthogonal to the null space of U⊤. We tested this numerically with the eigendecomposition involving our specific LD matrix.
Mean Squared Effect
The mean squared effect is a useful measure of the amount of burden signal present for each trait. In this paper, we estimate the squared effect for the LoF burden effects. The LoF burden genotypes are assumed to not be in LD, meaning that the likelihood for the jth gene is
Since the second moment of a univariate normal random variable is we note that is an unbiased estimate of the squared burden effect size.
We invoke the Central Limit Theorem, noting that each element in the sum is an independent and unbiased estimate of the quantity of interest, to derive approximate confidence intervals. We use an empirical estimate of the standard error to derive the approximate confidence intervals.
Random-Effects Model
A random-effects model is required to develop estimators for the covariance components. We use an approach inspired by linkage disequilibrium score regression (LDSC) [9], and similar to that outlined in [3]. We assume that γ is random, with finite first and second moments. We also assume that the latent effect sizes are independent given Ŝ and . We make the following minimal assumptions about first and second moments of the joint distribution of the latent effect sizes:
The Z scores from the burden regression can be defined as
By the properties of the normal distribution, the Z scores are distributed as and Z scores for the two variant classes are independent,
Without loss of generality for the second variant class, we focus on the jth gene of the first variant class. Let be the jith element of . The marginal distribution of the Z score for a given gene is then
We define the scaled LD score for the jth gene as
It is also useful to define
The marginal distribution of the Z scores between the two variant classes is
We define the cross-variant LD score as
It is also useful to define
Covariance Components
Similar to LDSC, we now proceed by evaluating the expected squared Z score to develop estimators for and . We integrate over the latent effect size, γ, using the law of total expectation and the law of total variance. In the following, the conditioning on ,and ℓ1j is implicit to simplify notation. The expected squared Z score is
The first term introduces the LD score into the expectation,
The second term introduces a constant,
The last term is a bias introduced by the uncentered prior,
Taken together, the expected squared Z score is
Given the relationship of the expected squared Z score with we propose the following method-of moments (MoM) estimator. A similar derivation follows for . We use , which is unbiased for the random effects model as well. Our estimators are
We can use the expected product of Z scores to build an estimator for the covariance σ12. We integrate over the latent effect size γ using the law of total expectation and the law of total covariance. In the following, the conditioning on ,and ℓj is implicit to simplify notation. The expected product of Z scores is
The first term introduces the cross-variant LD score into the expectation,
The second term does not contribute to the conditional expectation because the covariance between observed effect sizes is zero given the latent effect size:
The last term is bias introduced by the uncentered prior,
Taken together, the expected product of Z scores is
This suggests the use of the following unbiased estimator for σ12,
Monotonicity
These estimators of individual parameters in the model allow us to define the MoM estimator for monotonicity as
MoM estimators are not unbiased, but they are consistent. Since the estimator represents a function of various estimators, we used the bootstrap with 1000 iterations to estimate the standard error of the estimate. If the estimate was undefined due to a negative square root, we set to zero.
D Simulations
We used the UK Biobank (UKB) LoF and duplication burden genotypes for simulation. These simulations are not from the models we used to derive the various estimators. Thus, these tests are fair evaluations of our estimators and also demonstrate the validity of using the RSS likelihood to model burden summary statistics. Some of our estimators assume fixed effect sizes, while others assume random effect sizes. We performed two separate suites of simulations for these estimators.
The fixed-effect simulations assumed that each gene had some fixed effect via LoF variants and via duplication variants. We also performed fixed-effect simulations where each gene had a unique effect, with a genome-wide average effect of and respectively. The random-effect simulations assumed that the effect size for each gene was generated from
We simulated effect sizes for genes that had both LoF and duplication burden genotypes, corresponding to passing our genotyping filters described in our methods. Once effect sizes were assigned to each gene, we used PLINK [10] to score each sample based on the effect sizes. This provided us with genetic values for each individual under the simulation. Next, we added noise using draws from a normal distribution such that the expected variance of the phenotype would be one.
We used PLINK to run a single-variant association scan with the burden genotypes. We included 15 genotyping principal components (PCs), 20 rare variant PCs, and genotyping batch as covariates. Since we simulated from the actual genotypes, we had to include these covariates to control for confounding from population structure. Each parameter value was tested with 100 simulations. When perturbing one parameter, the other parameters were held constant at .and .The default variance values are based on the order of magnitude of burden heritability estimates from prior work [11].
Unbiased Average Burden Effect Estimators
Fixed-effect simulations were used to test and .These estimators were convincingly unbiased for the various realistic values that we simulated (Figure D.1). We also developed estimators for the variance of these fixed-effect average burden effect estimators. Confidence intervals built using these variance estimates are well-calibrated (Figures D.2 and D.3). We also tested the estimators under a regime where each gene had a unique effect size, with relatively unbiased behavior and good calibration (Figures D.4, D.5, and D.6).
Unbiased Average Squared Burden Effect Estimator
We used the fixed-effect simulations with different gene-level effects to test .The estimator was relatively unbiased for various realistic values that we simulated (Figure D.7). The approximate confidence intervals using the Central Limit Theorem are well-calibrated (Figure D.8).
Method-of-Moments Monotonicity Estimator
Our covariance component estimators were close to unbiased (Figure D.9). We noted a slight upward bias in the LoF burden effect variance estimator .However, the magnitude of this bias was only around 5% of the true parameter value. We believe that this represents residual confounding that is not corrected for using the genotyping and rare variant PCs used in our simulations.
Our MoM estimator for monotonicity was surprisingly close to unbiased, even for extreme deviations from zero (Figure D.10). The confidence intervals estimated using the bootstrap were also relatively well-calibrated (Figure D.11).
Maximum Likelihood Average Burden Effect Estimators
For the maximum likelihood estimation approach, we used random-effect simulations to test and .These estimators were also convincingly unbiased, similar to our unbiased estimators (Figure D.12). The approximate confidence intervals are well-calibrated (Figures D.13 and D.14).
maximum Likelihood Monotonicity Estimator
Our covariance component estimators from the maximum likelihood approach had a similar performance to the unbiased estimators (Figure D.15). We noted a slight upward bias in the LoF burden effect variance estimator, similar to the unbiased estimator. The bias was again only around 5% of the true parameter value.
The maximum likelihood estimate for monotonicity was slightly biased towards zero for very large values of ϕ (Figure D.16). The magnitude of bias was small compared to the true value, and the approximate confidence intervals using the delta method remained well-calibrated (Figure D.17). The variance of the estimator was approximately half that of the MoM estimator (Figure D.10).
E Buffering Model
In the buffering model, we are interested in evaluating if GDRCs are systematically buffered against one trait direction. This could occur if non-monotone GDRCs are preferentially present in one direction over the other. This can also occur if monotone GDRCs achieve larger values for the trait in one direction compared to the other.
We use the multivariate adaptive shrinkage (MASH) model [12] as a flexible prior for our latent effect sizes. The MASH prior consists of a mixture of multivariate normal distributions. Suppose represents a fixed set of covariance matrices in ℝ2×2 for K mixture components. We use bivariate covariance matrices over a grid of variances and correlation values. Let π ∈ ΔK−1 represent the mixture weights. Then, the gene-level prior is
The prior over the entire set of M genes is
The likelihood model remains the same as before.
The joint likelihood is
Inference USING Empirical Bayes
In our model, the mixture weights, π, are unknown. Following MASH [12], we use an empirical Bayes approach that combines both frequentist and Bayesian techniques. The first step, the frequentist step of empirical Bayes, involves maximizing the marginal likelihood to estimate ,which we do using stochastic approximation expectation maximization (SAEM) [13, 14]. The second step, the Bayesian step of empirical Bayes, uses as a plug-in estimate of π for downstream posterior sampling.
The ideal approach to obtain the MLE of π is to directly maximize the marginal likelihood
However, this integral is intractable. Instead, we can use expectation maximization (EM) to maximize this marginal likelihood [15]. Suppose π(t) represents the estimate at the tth iteration of the EM algorithm. The expectation step involves evaluating
Since the analytical form of Q (π |π)(t) is also intractable under our model, we use SAEM to approximate the expectation. We sample using Hamiltonian Monte Carlo [16, 17]. Then, we approximate the expectation with
To speed up computation, we use stochastic gradient ascent, with each chromosome representing one mini-batch [18]. In the mini-batch approach, each expectation is a convex sum of the previous expectation and the current expectation. Let (ct)t≥1 be a positive, decreasing sequence such that and . We use in the expectation step of EM. To define a valid recursion, we set .Maximization, the second step in EM, is performed using a fixed number of iterations of stochastic gradient ascent with a fixed step size to obtain
Gradients were estimated using automatic differentiation [7, 8]. After some number of iterations, T, until convergence, we set .
We use Hamiltonian Monte Carlo [16, 17] to sample from the posterior distribution of γ. Following the empirical Bayes approach, is used as a plug-in estimate. That is, we draw samples . The estimand of interest is which is estimated using the posterior samples. The estimand represents the dot product of γ·j with the normal vector of the diagonal separating the two types of buffering within our model (Figure E.1). To estimate the buffering from monotone curves (ξm) versus non-monotone curves (ξnm), we use posterior samples that are only monotone or non-monotone respectively. That is,
F Data Analysis
Synonymous Variants
Stratification or other confounding can inflate the estimated effect sizes from association analyses. We can ensure that we are adequately controlling for confounding by estimating the effect of synonymous variants, which are expected to not affect protein function and downstream traits.
We used synonymous variants to test the effect of using different subsets of individuals. The original analysis of LoF variants in the UKB [19] used a subset of around 430K individuals with genetic similarity to the EUR superpopulation from the 1000 Genomes Project. In addition to a population with genetic similarity to EUR (called EUR for brevity), we tested the subset of 390K unrelated individuals in the EUR subset (called unrelated) and the subset of all 460K individuals with whole-exome sequencing (WES) data (called WES). This allowed us to determine if relatedness or population stratification inflated our estimates of effect size. The EUR population was defined using self-reported information and boundaries in genotyping PC space from prior genetic analysis in the UKB [20]: −20 ≤ PC1 ≤ 40 and −25 ≤ PC2 ≤ 10 (Array items 1 and 2 from field 22009 in the UKB) for either self-identified White British or self-identified non-British White (Field 21000 in the UKB).
Additionally, we also used synonymous variants to test the effect of using different numbers of geno-typing PCs. The original analysis used 10 genotyping PCs when performing association analyses within populations with high genetic similarity [19]. In addition to 10 genotyping PCs, we tested 15 and 20 geno-typing PCs since we planned to use as many individuals in the UKB as possible, which might introduce additional confounding due to population stratification.
To test for stratification, we ran synonymous variant burden tests on a subset of nine continuous traits: height, body mass index (BMI), low-density lipoprotein cholesterol (LDL-C), mean corpuscular hemoglobin (MCH), red blood cell distribution width (RDW), forced vital capacity (FVC), creatinine, cystatin C, and the north coordinate of the place of birth in the United Kingdom (NC). Strong effects for NC should be a good measure of uncontrolled stratification.
Number OF Genotyping Principal Components
Synonymous variants are expected to not have any effect on traits. Therefore, the mean squared effect of synonymous variant burden associations should provide an estimate of inflation in effect sizes due to other sources of confounding. The gold standard for genetic association analysis is using the cohort of unrelated individuals with high genetic similarity. Compared to this cohort, the amount of inflation was indistinguishable in the EUR and WES cohorts (Figure F.1). Thus, we decided to use the WES cohort to maximize our sample size.
The effect of including 10 genotyping PCs was also indistinguishable from 15 or 20 genotyping PCs (Figure F.1). The original analysis included 10 genotyping PCs [19], but performed analyses in cohorts of high genetic similarity. Since we were including all individuals in the UKB, we decided to conservatively use 15 genotyping PCs.
Inflation FROM Confounding
Since we detected a significant mean squared effect for NC, we were concerned about inflation of effect sizes due to confounding. To test the effect of this inflation, we compared the mean squared effect of LoF variant burden associations with the mean squared effect of synonymous variant burden associations across various traits. We noted that the mean squared effect of LoF burden associations was an order of magnitude larger than the effect of synonymous burden associations (Figure F.2). Thus, we concluded that although confounding is likely present in our association tests, the signal is at least 10 times greater than the bias.
Utility of Misannotation Probability
Burden tests often use various maximum minor allele frequency (MAF) filters. For instance, Backman et al. used MAF filters of 1%, 0.1%, 0.01%, and 0.001% [19]. Presumably, such filters assume that increasingly stringent filters will increase the true positive LoF variants that are aggregated into the burden genotype, as LoF variants that are at high frequency in the population might represent misannotated LoF variants. However, such filters result in asymmetric behavior across genes depending on their constraint. For example, the highly stringent 0.001% filter will reduce the false-positive rate in highly constrained genes, but will remove true LoF variants in unconstrained genes. Ideally, a gene under high constraint should use a stringent MAF filter, while a gene under low constraint should use a liberal MAF filter.
To account for this, Zeng et al. have calculated misannotation probabilities for all potential single nucleotide polymorphisms (SNPs) in genes for which they had estimated shet values [21]. We used various misannotation probability filters instead of MAF filters. We tested filters of 10%, 5%, and 1% misannotation probability for all variants with MAF < 1%. We estimated the total signal, as measured by the mean squared effect size, in various buckets of gene constraint (Figure F.3). Increasingly stringent misannotation probability filters increased the signal across various selection buckets. In addition, the various filters were as or more effective than both the 1% and 0.001% MAF filters from [19].
Increasing stringency with the misannotation probability filters does increase the amount of estimation noise in as more LoF variants are removed from the burden genotypes (Figure F.4). We found that a misannotation probability filter of 10% provided signal comparable to or better than a MAF filter of 0.001% with a minimal increase in noise. This filter was used in all subsequent analyses.
CNV Genotyping Error
When we first estimated the LD matrices for the duplication burden genotypes, we noticed large amounts of long-range LD (Figure F.5).
We reasoned that this is driven by a few poor-quality samples with systematically higher copy number. In their initial report on the UKB whole-genome sequencing (WGS) data, Li et al. reported an average structural variant (SV) burden of 3.6 Mbp per haploid genome [22]. Assuming that the total copy number variant (CNV) burden is on a similar order of magnitude, we decided to exclude any samples with more than 10 Mbp of either deletion or duplication burden. Most samples had less than 10 Mbp of duplications or deletions. However, some samples had upwards of 100 Mbp of affected sequence (Figure F.6).
Filtering out 2,674 samples with deletion or duplication burden greater than 10 Mbp resulted in more reasonable duplication LD estimates (Figure F.7).
Estimating CNV Linkage Disequilibrium
Multiple covariates need to be accounted for when performing burden tests. Suppose that Zc ∈ ℝN×C represents the covariate values for the deletion or duplication burden tests for C covariates. The burden test model used to derive summary statistics in REGENIE is where λc ∈ ℝC are the effects of the covariates on the phenotype. To remove these effects before estimating the LD matrix, we partial out the covariates by projecting the burden genotypes into the subspace orthogonal to the column space of the covariates. Specifically, we define which removes the effect of the covariates by removing their contribution to the burden genotypes using the projection matrix .Then, we estimate the correlation matrix using these augmented burden genotypes Z′.
Significant Average Effects
We estimated average burden effects using unbiased estimators. Traits with significant LoF and duplication burden effects were ascertained at an α = 0.05 level (Table F.1).
Lack OF Average Effect IN Genome-Wide Association Studies
For traits with significant average burden effects (Table F.1), we downloaded genome-wide association study (GWAS) summary statistics from the Neale Lab (https://www.nealelab.is/uk-biobank, version 3, both sexes) based on 337,199 White British individuals in the UKB. In their processing pipeline, phenotypic values were inverse-rank normal transformed. Age, age squared, inferred sex, age-by-sex, age-squared-by-sex, and 20 genotyping PCs were included as covariates.
We estimated conditionally independent hits and their effect sizes using GCTA-COJO [23] with parameters --cojo-p 5e-8 --cojo-slct. We used the genotypes of 10,000 unrelated White British individuals in the UKB to compute the LD reference panel. We polarized significant trait-associated variants into ancestral and derived states with variation features obtained from the Ensembl Variation database [24].
In the trumpet plots of derived allele effect versus derived allele frequency (Figures 1B and F.8), we use top significant conditionally independent hits per locus detected in a GWAS. One concern in such a plot is that some top hits may be tagging SNPs, and the causal SNPs may be in negative or positive LD with the tagging SNPs. Tagging SNPs in negative LD with the causal SNP can mask an underlying true non-zero average effect.
To address these concerns, we point to recent work done to understand selection using common variation [25]. Under neutral coalescent simulations, the causal and tagging SNPs are expected to be in positive LD with high probability for derived allele frequencies in the interval [0, 0.5] as shown in Figure S14 of [25]. The trumpet plot continues to display a nearly zero average effect in this interval, suggesting that the underlying causal SNPs also do not show an average effect on the trait.
Analysis OF Monotonicity Effect Estimates
To better understand the distribution of monotonicity estimates, we used a hierarchical model to infer the distribution of effects. Let represent the observed monotonicity estimates for T traits with standard errors . Suppose that
We assume that ϕi is drawn from a transformed beta distribution over the interval [−1, 1],
The likelihood for the transformed beta distribution can be derived using the transformation of random variables approach. Suppose that U ∼ Beta (α, β). We define V as for a, b ∈ ℝ and a < b. Note that the support of V is [a, b]. We then define
The inverse transform is and the derivative is
The likelihood is therefore
We use an empirical Bayes approach to fit the prior distribution. We do so by maximizing the evidence,
We used Riemann integration to numerically evaluate the evidence and used gradient ascent with a fixed number of steps to estimate and .Gradients were estimated using automatic differentiation [7, 8].
We estimated using both the MLE and the MoM estimates. The main results for the MLEs are presented in Figure 2B. The point estimates and confidence intervals for the MLEs are displayed in Figure F.9.
The same analysis was conducted with the MoM estimates, which provided a similar support for monotonicity across traits (Figure F.10). 99.96% of the density of the inferred effect size distribution was over the positive part of the domain of ϕ. For this analysis, we removed traits with estimates greater than one. We also excluded traits where the estimate was undefined, which can occur with negative estimates of F.11. or . The point estimates and confidence intervals for the MoM estimates are displayed in Figure
Concordance of LoF and Deletion effects
To estimate the extent to which the burden tests using LoF variants are concordant with burden tests performed using deletions, we estimated the genetic correlation between the effects from the two variant classes. We did so by performing maximum likelihood estimation using the monotonicity model. When we assume that the average effects are zero ,the correlation is similar to the genetic correlation derived under the LDSC model [26]. While we assume a normal prior distribution over the latent effect sizes, the original LDSC model assumes no functional prior form and only specifies finite first and second moments.
We derived the MLE using the same inference machinery used for monotonicity, assuming fixed zero average effect sizes. Approximate confidence intervals were estimated using the delta method.
Next, similar to the monotonicity analysis, we estimated the latent distribution of genetic correlations by fitting a transformed beta distribution. That is, let represent the observed genetic correlation estimates for T traits with standard errors .Suppose that
We assume that ρi is drawn from a transformed beta distribution over the interval [−1, 1],
We estimate and using the same strategy as the monotonicity estimates.
This analysis demonstrates that LoF variant and deletion burden tests are markedly similar (Figure F.12). 93.53% of the density of the inferred distribution of genetic correlations is over the positive part of the domain of ρ. For this analysis, we removed traits with ĉi < 10−6 as these might represent unstable estimates from the optimization procedure. The estimated genetic correlation for these estimates was ,meaning that dropping them from our analysis biased our inferred distribution towards zero.
References
- [1].↵
- [2].↵
- [3].↵
- [4].↵
- [5].↵
- [6].↵
- [7].↵
- [8].↵
- [9].↵
- [10].↵
- [11].↵
- [12].↵
- [13].↵
- [14].↵
- [15].↵
- [16].↵
- [17].↵
- [18].↵
- [19].↵
- [20].↵
- [21].↵
- [22].↵
- [23].↵
- [24].↵
- [25].↵
- [26].↵
- [27].↵
- [28].
- [29].↵
- [30].↵
- [31].↵
- [32].↵
- [33].↵
- [34].↵
- [35].↵
- [36].↵
- [37].↵
- [38].↵
- [39].↵
- [40].↵
- [41].↵
- [42].↵
- [43].↵
- [44].
- [45].
- [46].
- [47].↵
- [48].↵
- [49].↵
- [50].↵
- [51].↵
- [52].↵
- [53].↵
- [54].
- [55].
- [56].
- [57].↵
- [58].↵
- [59].↵
- [60].↵
- [61].↵
- [62].↵
- [63].
- [64].
- [65].↵
- [66].↵
- [67].↵
- [68].↵
- [69].↵
- [70].↵
- [71].↵
- [72].↵
- [73].↵
- [74].↵
- [75].↵
- [76].↵
- [77].↵
- [78].↵
- [79].↵
- [80].↵
- [81].↵
- [82].↵
- [83].↵
- [84].
- [85].↵
- [86].↵
Appendix References
- [1].
- [2].
- [3].
- [4].
- [5].
- [6].
- [7].
- [8].
- [9].
- [10].
- [11].
- [12].
- [13].
- [14].
- [15].
- [16].
- [17].
- [18].
- [19].
- [20].
- [21].
- [22].
- [23].
- [24].
- [25].
- [26].