Abstract
Rare genetic variants in LDLR, APOB and PCSK9 are known causes of familial hypercholesterolaemia and it is expected that rare variants in other genes will also have effects on hyperlipidaemia risk although such genes remain to be identified. The UK Biobank consists of a sample of 500,000 volunteers and exome sequence data is available for 50,000 of them. 11,490 of these were classified as hyperlipidaemia cases on the basis of having a relevant diagnosis recorded and/or taking lipid-lowering medication while the remaining 38,463 were treated as controls. Variants in each gene were assigned weights according to rarity and predicted impact and overall weighted burden scores were compared between cases and controls, including population principal components as covariates. One biologically plausible gene, HUWE1, produced statistically significant evidence for association after correction for testing 22,028 genes with a signed log10 p value (SLP) of −6.15, suggesting a protective effect of variants in this gene. Other genes with uncorrected p<0.001 are arguably also of interest, including LDLR (SLP=3.67), RBP2 (SLP=3.14), NPFFR1 (SLP=3.02) and ACOT9 (SLP=-3.19). Gene set analysis indicated that rare variants in genes involved in metabolism and energy can influence hyperlipidaemia risk. Overall, the results provide some leads which might be followed up with functional studies and which could be tested in additional data sets as these become available. This research has been conducted using the UK Biobank Resource.
Introduction
Hyperlipidaemia is an important risk factor for cardiovascular disease which is modified by genetic variation and some of the genes involved have been identified (Sharifi et al., 2017). Rare variants with a large dominant effect in LDLR, APOB and PCSK9 cause 40% of cases of familial hypercholesterolaemia (FH), a severe form of hyperlipidaemia. A whole genome sequencing study of over 16,000 subjects failed to identify rare variants in any additional genes influencing lipid levels, though yielded a total of 26 loci containing common variants which were genome-wide significant, largely replicating findings from previous genome wide association studies (GWAS) (Natarajan et al., 2018; Willer et al., 2013). A GWAS of medication usage in 320,000 European UK Biobank participants identified 55 independent SNPs associated with taking statins, including 19 previously shown to be associated with low density lipoprotein cholesterol (LDLC), and showed that these were enriched in a number of gene sets involved with lipid metabolism (Wu et al., 2019). Another GWAS of European UK Biobank participants identified hundreds of genome-wide significant independent SNPs associated with lipid levels (Richardson et al., 2020).
Although common genetic variation makes a substantial overall contribution to the variance in lipid levels, selection pressures mean that common variants individually have small effects and it can be difficult to elucidate the biological mechanisms underlying their association. It is to be expected that rare variants will also have effects and it is possible that some of them might have large effect sizes. Such analyses may be better able to identify specific genes rather than genetic loci and may indicate a direction of effect. The UK Biobank sample (http://www.ukbiobank.ac.uk/about-biobank-uk/) contains information about medication usage and clinical diagnoses, both as reported by participants and as extracted from their health records. Exome sequence data is available for 50,000 subjects and a gene-wise weighted burden analysis of rare variants was carried out to attempt to identify genes associated with a hyperlipidaemia phenotype, defined as subjects with a diagnosis of hyperlipidaemia and/or taking cholesterol-lowering medication.
Methods
The UK Biobank dataset was downloaded along with the variant call files for 49,953 subjects who had undergone exome-sequencing and genotyped using the GRCh38 assembly with coverage 20X at 94.6% of sites on average (Hout et al., 2019). UK Biobank had obtained ethics approval from the Research Ethics Committee (REC; approval number: 11/NW/0382) and informed consent from all participants. All variants were annotated using VEP, PolyPhen and SIFT (Adzhubei et al., 2013; Kumar et al., 2009; McLaren et al., 2016). To obtain population principal components reflecting ancestry, version 1.90beta of plink (https://www.cog-genomics.org/plink2) was run with the options --maf 0.1 --pca header tabs --make-rel (Chang et al., 2015; Purcell et al., 2007, 2009).
The hyperlipidaemia phenotype was determined from four sources in the dataset: self-reported high cholesterol; reporting taking cholesterol lowering medication; reporting taking a named statin; having an ICD10 diagnosis for hyperlipidaemia in hospital records or as a cause of death. Subjects in any of these categories were deemed to be cases with hyperlipidaemia while all other subjects were taken to be controls.
SCOREASSOC was used to carry out a weighted burden analysis to test whether, in each gene, sequence variants which were rarer and/or predicted to have more severe functional effects occurred more commonly in cases than controls. Attention was restricted to rare variants with minor allele frequency (MAF) <= 0.01. As previously described, variants were weighted by MAF so that variants with MAF=0.01 were given a weight of 1 while very rare variants with MAF close to zero were given a weight of 10 (Curtis, 2020). Variants were also weighted according to their functional annotation using the default weights provided with the GENEVARASSOC program, which was used to generate input files for weighted burden analysis by SCOREASSOC (Curtis, 2016, 2012). For example, a weight of 5 was assigned for a synonymous variant, 10 for a non-synonymous variant and 20 for a stop gained variant. Additionally, 10 was added to the weight if the PolyPhen annotation was possibly or probably damaging and also if the SIFT annotation was deleterious, meaning that a non-synonymous variant annotated as both damaging and deleterious would be assigned an overall weight of 30. The full set of weights is shown in Supplementary Table S1, copied from the previous reports which used this method (Curtis et al., 2019, 2018). The weight due to MAF and the weight due to functional annotation were then multiplied together to provide an overall weight for each variant. Variants were excluded if there were more than 10% of genotypes missing in the controls or if the heterozygote count was smaller than both homozygote counts in the controls. If a subject was not genotyped for a variant then they were assigned the subject-wise average score for that variant. For each subject a gene-wise weighted burden score was derived as the sum of the variant-wise weights, each multiplied by the number of alleles of the variant which the given subject possessed. For variants on the X chromosome, hemizygous males were treated as homozygotes.
For each gene, a ridge regression analysis was carried out with lamda=1 to test whether the gene-wise variant burden score was associated with the hyperlipidaemia phenotype. To do this, SCOREASSOC first calculates the likelihood for the phenotypes as predicted by the first 20 population principal components and then calculates the likelihood using a model which additionally incorporates the gene-wise burden scores. It then carries out a likelihood ratio test assuming that twice the natural log of the likelihood ratio follows a chi-squared distribution with one degree of freedom to produce a p value. The statistical significance is summarised as a signed log p value (SLP) which is the log base 10 of the p value given a positive sign if the score is higher in cases and negative if it is higher in controls. We have shown that incorporating population principal components in this way satisfactorily controls for test statistic inflation when applied to this heterogeneous dataset (Curtis, 2020).
Gene set analyses were carried out using the 1454 “all GO gene sets, gene symbols” pathways as listed in the file c5.all.v5.0.symbols.gmt downloaded from the Molecular Signatures Database at http://www.broadinstitute.org/gsea/msigdb/collections.jsp (Subramanian et al., 2005). For each set of genes, the natural logs of the gene-wise p values were summed according to Fisher’s method to produce a chi-squared statistic with degrees of freedom equal to twice the number of genes in the set. The p value associated with this chi-squared statistic was expressed as a minus log10 p (MLP) as a test of association of the set with the hyperlipidaemia phenotype.
Results
There were 11,490 cases with a diagnosis of hyperlipidaemia and/or taking cholesterol-lowering medication and 38,463 controls. There were 22,028 genes for which there were qualifying variants and the QQ plot for the SLPs obtained for each gene is shown in Figure 1. This shows that the test is well-behaved and conforms well with the expected distribution.
Table 1 shows the results for all genes with an absolute value of SLP exceeding 3 (equivalent to p<0.001). By chance, from 22,028 genes one would expect 11 to have SLP greater than 3 and 11 to have SLP less than −3, whereas the actual numbers are 15 and 28. Applying a Bonferroni correction to test for genome-wide statistical significance would yield a threshold of log10(22,028/0.05)=5.6 for the absolute value of the SLP and this is achieved by only two genes, HUWE1 and CXorf56. For both these genes the SLP was negative, indicating that rare, functional variants were enriched in controls compared with cases and suggesting that impairment of these genes might be protective against hyperlipidaemia.
HUWE1 (SLP=-6.15) codes for an ubiquitin protein ligase which has functions in development and tumorigenesis and which regulates the ABCG1 and ABCG4 lipid transporters such that over-expression of HUWE1 reduces ABCG1 and ABCG4 protein levels and their cholesterol export activity while silencing the gene has the reverse effects (Aleidi et al., 2015). It is also mediates the ubiquitination of peroxisome proliferator-activated receptor α (PPARα), thereby reducing its function (Zhao et al., 2018). PPARα is a transcriptional factor which promotes hepatic lipid catabolism by stimulating fatty acid oxidation and ketogenesis in response to nutrient starvation.
CXorf56 (SLP=-5.61) is a brain expressed gene which is not well characterised but has recently been implicated as a cause of intellectual disability (Rocha et al., 2020). Subjects with variants damaging this gene are reported to have intellectual disability and sometimes related features including epilepsy and alopecia but there are no reports of abnormalities of lipid metabolism.
No other gene was formally genome-wide significant and it is reasonable to assume that for most of those listed in the table the SLPs represent chance findings. However for a few it is possible that there is a real biological effect which is being picked up. The most obvious of these is LDLR, with SLP=3.67. It is well known that variants in this gene can cause familial hyperlipidaemia (Sharifi et al., 2017). It is worth noting however that the other two genes implicated in autosomal dominant familial hyperlipidaemia, APOB (SLP=0.11) and PCSK9 (SLP=-0.66) did not produce any evidence of association. The same applies to HMGCR (SLP=-0.38), which codes for the target of statins.
Others of the genes with magnitude of SLP exceeding 3 (equivalent to p<0.001) which might be of interest are as follows.
RBP2 (SLP=3.14) is expressed in the gut and its product is responsible for uptake of vitamin A but recent studies in mice have shown that it also influences body weight, the response to glucose challenge and hepatic triglyceride levels while Rbp2 deficient mice have increased adiposity, with larger adipocytes, and decreased energy expenditure (Lee et al., 2020). The authors suggest a signalling role for RBP2 and if similar mechanisms were present in humans then it is plausible that variants disrupting RBP2 could lead to increased risk of hyperlipidaemia.
STAT5B (SLP=3.11) codes for a transcription factor which is responsible for mediating the signal from various ligands, including growth hormone, and is also involved in adipogenesis (Goupille et al., 2016; Wakao et al., 2011). Two common variants in STAT5B, rs8082391 and rs8064638, have previously been reported to be associated with total cholesterol and low-density lipoprotein cholesterol while mice deficient in hepatic Stat5a/b had reduced serum cholesterol (Kornfeld et al., 2011). Additionally, a small candidate gene study claimed that SNPs in STAT5B were associated with lipid changes in response to growth hormone replacement therapy (Makimura et al., 2011).
However no SNPs close to STAT5B were genome-wide significantly associated with low density lipoprotein cholesterol or triglyceride levels in 180,000 UK Biobank subjects so the early SNP association results may have been false positives (Richardson et al., 2020). The mouse phenotype remains of interest, although it might suggest that variants in the gene should reduce, rather than increase, the risk of hyperlipidaemia
NPFFR1 (SLP=3.02) codes for a receptor for a variety of neuropeptides including NPAF and NPFF and is expressed on adipocytes, where NPFF or NPAF treatment results in increased expression of adrenergic receptors and potentiation of the response to beta agonists to increase adenylyl cyclase activity (Lefrère et al., 2002). It is also expressed in the brain and NPFF has been shown to be anorexigenic in rats and chicks (Cline et al., 2007; Murase et al., 1996). Knockout of Npffr1 in mice has differential effects according to sex including increased susceptibility to high fat diet with impaired glucose tolerance in males but increased weight and sensitivity to obesogenic insults in females (Leon et al., 2018). Overall it seems possible that impaired functioning of this receptor might lead to metabolic abnormalities which predispose to hyperlipidaemia.
ACOT9 (SLP=-3.19) codes for a mitochondrial acyl-CoA thioesterase which has recently been proposed to have a key role in liver lipogenesis and risk of non-alcoholic fatty liver disease (NAFLD) (Steensels et al., 2020). It is a regulator of lipid accumulation and its expression is higher in NAFLD patients, while Acot9 deficient mice are protected against weight gain, hepatic glucose production, steatosis and steatohepatitis in the setting of excess nutrition. These observations are consistent with the possibility that variants in ACOT9 might tend to be protective against metabolic responses resulting in hyperlipidaemia.
GK (SLP=-3.31) codes for glycerol kinase and a number of variants in the gene have been reported to cause X-linked glycerol kinase deficiency syndromes which can very mild or which may involve symptoms including vomiting, hypoglycaemia, hyperketonaemia and intellectual disability (Sjarif et al., 1998). Its product has recently been shown to enhance hepatic liver metabolism and increased expression in mice leads to increased blood levels of cholesterol and triglycerides (Miao et al., 2020). Thus it is possible that variants mildly impairing GK but not sufficient to cause clinical glycerol kinase deficiency might be somewhat protective against hyperlipidaemia.
The gene sets with MLP>3 (equivalent to uncorrected p<0.001) are shown in Table 2. Given that 1454 sets were tested the critical value for the MLP to reach if a Bonferroni correction were applied would be log10(1454/0.05)=4.46, though this may be somewhat conservative given that some gene sets overlap with each other. This threshold is reached by the set HISTONE_MODIFICATION which contains 23 genes, one of which is HUWE1 (SLP=-6.15). No other genes in the set appear likely to have a direct role in risk of hyperlipidaemia and it appears that the result for the set is essentially driven by this single gene. The other set to be formally statistically significant is GENERATION_OF_PRECURSOR_METABOLITES_AND_ENERGY with MLP=5.47, implying that variants disrupting the function of one or more genes in this set can impact hyperlipidaemia risk. The set contains 120 genes, none of which individually produced an SLP with magnitude greater than 3. However Table 3 shows the members of this set which have individual SLP>=1.3, equivalent to an uncorrected p value of 0.05, and a number of them appear to be of interest, as detailed below. These genes all have negative SLPs, suggesting that rare variants impacting their functioning might tend to be protective against hyperlipidaemia.
ADIPOQ (SLP=-1.37) codes for adiponectin, an adipokine which is expressed in adipocytes which influences fat metabolism, and common variants in the gene are associated with cardiovascular disease risk and lipid levels (Liu et al., 2018; Salazar-Tortosa et al., 2020; Wang et al., 2019; Zhang et al., 2019). Adiponectin has a hypoglycaemic effect and can reverse the insulin resistance associated with both lipoatrophy and obesity (Berg et al., 2001; Yamauchi et al., 2001). A rare intronic variant, rs74577862, has been reported to be associated with decreased levels of adiponectin and increased risk of atherosclerosis (Chen et al., 2017; UK10K Consortium et al., 2015).
SURF1 (SLP=-1.71) codes for a protein involved in the assembly of cytochrome c oxidase complex and variants in it can cause Leigh syndrome, a fatal neurological condition (Lee et al., 2012). However mice completely lacking in Surf1 have increased longevity, lower adiposity and enhanced fatty acid oxidation. These findings suggest that variants which moderately impact SURF1 in humans might result in a more favourable lipid profile.
ADRB3 (SLP=-1.81) codes for an adrenoceptor which promotes lipolysis and thermogenesis in response to sympathetic nerve stimulation and, as reviewed recently, the common variant Trp64Arg is associated with obesity and levels of serum lipids and adipokines (Luo et al., 2020).
GYG2 (SLP=-2.00) has, like SURF1, been reported to be a cause of Leigh syndrome (Imagawa et al., 2014). Knockdown in adipocytes causes reduction in total lipid and number of lipid droplets per cell as well as increased lipolysis (Kerr et al., 2019).
PHKA1 (SLP=-2.29) and PHKA2 (SLP=-2.66) code for subunits of phosphorylase kinase and abnormalities in them are known to cause glycogen storage diseases with very variable phenotypes which can include hypoglycaemia, hypercholesterolaemia and hypertriglyceridaemia (Beauchamp et al., 2007; Preisler et al., 2012).
Inspection of the detailed results for all the genes mentioned revealed that the SLPs obtained tended to arise from the cumulative effects of many rare variants. For no gene was it possible to identify any individual variant which appeared to be a main driver for the overall differences in burden scores between cases and controls.
The SLPs for all genes and the MLPs for all gene sets are provided in supplementary tables S2 and S3.
Discussion
This analysis identifies one gene, HUWE1, which reaches conventional standards for statistical significance after correction for multiple testing and which could plausibly have a role in influencing risk of hyperlipidaemia. An increased burden of rare functional variants is found in controls versus cases, implying that impaired functioning reduces hyperlipidaemia risk and suggesting that theoretically it might be considered a target for lipid-lowering pharmacotherapy, although one caveat is that it can act as either a tumour suppressor or as an oncogene (Crawford et al., 2020).
Other genes with uncorrected p values less than 0.001 are not formally statistically significant but might still be worth further investigation. The result for LDLR (SLP=3.67) demonstrates that even genes known to influence lipid levels may not achieve definitive results with a dataset of this size. Taking account of other information available, RBP2 (SLP=3.14), NPFFR1 (SLP=3.02) and ACOT9 (SLP=-3.19) are especially noteworthy.
The significant results from the gene set analysis confirm that the variants tending to affect genes involved in metabolism and energy influence hyperlipidaemia risk but it is more difficult to say exactly which genes are involved. The results for ADIPOQ (SLP=-1.37) and ADRB3 (SLP=-1.81) are in the opposite direction to what might be expected, since the literature would predict that disruption of these genes could promote hyperlipidaemia. On the other hand, it seems plausible that reduced function of SURF1 (SLP=-1.71) or GYG2 (SLP=-2.00) could be protective.
Overall, this analysis demonstrates that next generation sequence data does not provide a magic bullet for identifying rare variants influencing complex traits, even with large sample sizes. On the other hand, there are some potentially interesting results which are close to the threshold for formal significance such that it would be quite reasonable to expect that further definitive results could be obtained from a somewhat larger sample. At time of writing, exome sequencing of 200,000 UK Biobank samples has been completed and this data should be released soon and may well provide further insights.
Data Availability
The raw data is available on application to UK Biobank. Detailed results with variant counts cannot be made available because they might be used for subject identification.
Conflicts of interest
The author declares he has no conflict of interest.
Data availability
The raw data is available on application to UK Biobank. Detailed results with variant counts cannot be made available because they might be used for subject identification.
Acknowledgments
This research has been conducted using the UK Biobank Resource. The author wishes to acknowledge the staff supporting the High Performance Computing Cluster, Computer Science Department, University College London. This work was carried out in part using resources provided by BBSRC equipment grant BB/R01356X/1.