ABSTRACT
Retinol is a fat-soluble vitamin that plays an essential role in many biological processes throughout the human lifespan. Previous work has characterised genetic influences on circulating retinol; however, small sample sizes have limited our ability to fully appreciate the genetic architecture of this trait. In this study, we performed the largest genome-wide association study (GWAS) of retinol to date in up to 22,274 participants. We identified eight common variant loci associated with retinol, as well as a rare-variant signal. An integrative gene prioritisation pipeline supported novel retinol-associated genes outside of the main retinol transport complex (RBP4:TTR) related to lipid biology, energy homeostasis, and endocrine signalling. Genetic proxies of circulating retinol were then used to estimate causal relationships with almost 20,000 clinical phenotypes via a phenome-wide Mendelian randomisation study (MR-pheWAS). The MR-pheWAS suggested that retinol may exert causal effects on inflammation, adiposity, ocular measures, the microbiome, and MRI-derived brain phenotypes, amongst several others. Conversely, circulating retinol may be causally influenced by factors including lipids and renal function. Finally, we demonstrated how a retinol polygenic score could identify individuals who are more likely to fall outside of the normative range of circulating retinol for a given age. In summary, this study provides a comprehensive evaluation of the genetics of circulating retinol, as well as revealing traits which should be prioritised for further clinical investigation with respect to retinol related therapies or nutritional intervention.
INTRODUCTION
Vitamin A is an essential micronutrient that is involved in a range of important biological processes, including, vision, immune function, cell division, and neurodevelopment1, 2. Vitamin A does not refer to a single compound, but rather to a group of compounds that encompasses retinol (all-trans retinol), retinoids (metabolites of retinol, such as retinaldehyde and retinoic acid), and provitamin carotenoids (beta-carotene, alpha-carotene, and beta-cryptoxanthin). Retinol is the form of vitamin A dietarily consumed from animal products, along with retinyl ester3, while plant-based materials contain precursors termed carotenoids that can be converted to retinaldehyde4. Retinoic acid, an oxidised form of retinaldehyde, is a particularly potent signalling molecule that regulates the expression of thousands of genes after binding to nuclear receptors including the retinoic-acid receptor and retinoid-X receptor subgroups5, 6.
The majority of dietary retinol is delivered to the liver, which is the primary organ responsible for its storage and metabolism. Retinol binding protein 4 (RBP4) is the major systemic transporter of retinol after hepatic secretion, facilitating delivery of retinol throughout the body3, 7, 8. RBP4 in turn complexes with the tetramer protein transthyretin (TTR), which stabilises circulating RBP4 and reduces renal filtration9. Notably, retinol can also be delivered directly to target tissues through other mechanisms, such as its postprandial packaging into lipid chylomicrons, as reviewed elsewhere3.
The role of retinoid-related interventions in human disease for individuals who are not retinol deficient has been of long-standing interest. Synthetic retinoids that are structurally similar to retinol/retinoic acid are approved for dermatological indications (e.g., adapalene) and some cancers (e.g., bexarotene)10, 11, with continued interest in repurposing these compounds across a range of other indications, including neuropsychiatry2, 12. Currently, retinol supplementation is not specifically indicated unless an individual is deficient, which is rare in high-income countries, though much more common in low-income countries. Numerous observational or randomised controlled studies have explored the effects of supplementation, a high vitamin A diet, and/or measured circulating retinol in a variety of disease contexts. However, the data from these efforts have often either been null or conflicting between studies12–16. Despite this, recent observational evidence suggests a relationship between a greater serum retinol abundance and lower mortality in a large, prospective 30-year follow up study17.
Genetics provides a powerful tool to better characterise factors that influence the abundance of circulating retinol in serum. Moreover, estimated genetic effects on retinol can be utilised to understand potential causal relationships with human health and disease, which may be informative for supplementation, dietary intervention, or drug repurposing18, 19. Family studies have suggested that circulating retinol is significantly heritable20, albeit estimated in small sample sizes. Similarly, dedicated genome-wide association studies (GWAS) of circulating retinol have also been limited to very modest sample sizes21, 22. In 2011, Mondul et al. published findings of two genome-wide significant loci associated with retinol (N = 5006). These loci were plausibly mapped to the genes RBP4 and TTR, respectively, which form the primary retinol transport complex22. There has been comparatively little progress in further characterising the genetic architecture of retinol since that time relative to other micronutrients like vitamin D, for which large sample size GWAS (N > 400,000) have been released23, 24. The recent adoption of untargeted high-throughput metabolomics platforms with coverage for retinol in some existing genotyped cohorts presents a new opportunity to boost statistical power. As a result, in the current study, we aim to perform the largest GWAS of circulating retinol to-date to identify novel loci and leverage these data to study how retinol relates to health and disease.
RESULTS
The common and rare variant genetic architecture of circulating retinol
We integrated common and rare variant data from up to 22,274 individuals of European ancestry in our discovery meta-analyses to estimate genetic effects on circulating retinol (Figure 1A-B, Online Methods). Firstly, variants from the INTERVAL and METSIM studies were meta-analysed (NMeta = 17,268 – termed METSIM+INTERVAL, Figure 1, Online Methods). After harmonisation, there were 8,173,975 common and 5,091,050 rare [minor allele frequency (MAF) < 1%] overlapping variants between INTERVAL and METSIM, respectively. As retinol effect sizes were estimated in the same units in both studies (plasma SD units, quantified by the same instrument), we conducted both a fixed-effects inverse-variance weighted (IVW) meta-analysis and a sample-size weighted meta-analysis of Z scores (Stouffer’s method). Secondly, we conducted an additional meta-analysis (NMeta = 22,274) which also included data from two other studies (ATBC and PLCO), termed the METSIM+INTERVAL+ATBC+PLCO meta-analysis. However, there were markedly fewer variants available in this meta-analysis after imputing the ATBC and PLCO summary statistics and harmonising with METSIM+INTERVAL (NVar = 3,896,351, Online Methods). As a result, we focused on the METSIM+INTERVAL meta-analysis as the primary discovery dataset.
Considering common variants from the HapMap3 panel (MAF > 0.05, outside of the major histocompatibility complex (MHC) region), we observed a relatively subtle inflation of retinol signals across the genome as indexed by the mean X2 statistic, with mean X2 values around 1.03-1.04, regardless of the meta-analysis considered (Supplementary Table 1). Common variant SNP heritability (h2SNP) estimated using the linkage disequilibrium (LD) score regression (LDSR) approach and the 1000 genomes European reference panel was between 6%-7% and nominally statistically significant, although with somewhat large standard errors (2.6%-3.2%) (Figure 1C, Supplementary Table 2). SNP heritability estimates of circulating retinol increased to between 10%-13%, but were still noisy, using two alternate models to estimate h2SNP and the UK Biobank (UKBB) as the LD reference (Figure 1C, Online Methods, Supplementary Table 2). Partitioned h2SNP across tissues and cell-types demonstrated nominal enrichment in biologically logical contexts such as liver, adipose, pancreas, and blood (Supplementary Figure 1). The somewhat large h2SNP standard errors are likely a product of sample size; however, we then conducted further analysis to explore the extent of the polygenic signal associated with retinol across the genome using an Empirical Bayes’ method (Online Methods). This method was utilised to model the number of non-null effects on retinol genome-wide stratified by bins of LD scores that index the extent of LD a variant exhibits with other variants (Figure 1D). Across all LD score bins, we estimated that the mean fraction of common variants across the genome with non-null effects on retinol was between 1.4% to 2.4%, depending on the modelling parameters used. In line with expectation, the proportion of non-null retinol effects was very high (> 50%) when considering the variants that display the most extensive LD (highest LD score bins). The application of these analyses to two GWAS of another vitamin (25-hydroxyvitamin D3) with either comparable or much larger sample sizes23, 25, suggested that the less-polygenic architecture of retinol observed in this study may become more diffuse across the genome with greater sample sizes, in line with many other quantitative traits (Supplementary Figure 2). However, we caution that further investigation will be required as these data become available to confirm this.
Next, we processed the common variant results of both meta-analyses to identify genome-wide significant loci associated with circulating retinol (PGWAS < 5 × 10-8). In the primary discovery meta-analysis (METSIM+INTERVAL, Stouffer’s method), we uncovered eight genome-wide significant loci, six of which were not reported in the previous Mondul et al. retinol GWAS (Table 1). The absolute effect sizes of these lead SNPs were between 0.066 and 0.172 SD in circulating retinol per effect allele, as derived from the IVW meta-analysis. We observed minimal heterogeneity between the two cohorts for these lead SNPs, although heterogeneity was slightly more marked for rs6601299. Replication was then attempted in the TwinsUK cohort (N up to 1621, Online methods). Considering the mean association across all timepoints retinol was measured, as well as the twin pairs separately, 7 out of the 8 lead SNPs in the loci had effect sizes in the same direction, which was greater than expected by chance alone (Binomial P = 0.035, Supplementary Table 3).
We also investigated the effect of retinol associated lead SNPs on factors associated with dietary intake of retinoids. Using a GWAS of retinol intake derived from a self-reported 24-hour dietary recall in the UK Biobank (UKBB, N = 62,991)26, we found no evidence to suggest that any of the effect of these genetic signals on circulating retinol is mediated through influencing dietary intake behaviours (Supplementary Table 4, Supplementary Figure 3).
In the larger sample-size meta-analysis with fewer variants available across all input datasets (METSIM+INTERVAL+ATBC+PLCO), six of the eight genome-wide significant loci from the smaller meta-analysis were available to test for association. Five of the six loci available became more statistically significant in this larger meta-analysis, whilst the chromosome 8 locus obtained a very similar level of statistical significance in both meta-analyses (Table 2). It should be noted that there was relatively large heterogeneity for the lead SNP at the TTR locus on chromosome 18 locus between cohorts. This was due to a more significant association in the ATBC+PLCO GWAS, although this region was still very strongly associated in METSIM and INTERVAL in the same direction.
We also estimated rare variant (frequency < 1%) effects on circulating retinol using variants available in the METSIM+INTERVAL meta-analysis. Despite relatively low power to detect rare variant association, we identified a genome-wide significant rare variant signal on chromosome five (chr5:86765041:T:C, dbSNP ID: rs138675130) associated with reduced circulating plasma retinol per C allele (-0.441 SD, SE = 0.0709, P = 6.37 × 10-9) with no significant heterogeneity between the contributing studies. The frequency of this C allele in Europeans (gnomAD v.3.1.2) ranges from 0.5% in non-Finnish Europeans to 0.8% in Finns, and whilst it is marginally rarer in Africans and South Asians, it is entirely absent in the East Asian and Middle Eastern populations in that database. The variant is intergenic and in FinnGen release 8, the C allele was associated at phenome-wide significance (P < 1 × 10-5) with increased odds of benign neoplasm of the eye and adnexa, as well as whooping cough. The closest canonical transcription start site to this variant is that of COX7C, which encodes a subunit of a terminal component of the mitochondrial respiratory chain. We also uncovered several suggestively significant rare variant signals (P < 1 × 10-5), including three non-synonymous variants in the genes FREM2, NAXD, and CHD1 (Supplementary Table 5). Of these, the rare NAXD non-synonymous allele suggestively associated with lower retinol (rs3742192) had some in silico evidence to suggest deleteriousness (Supplementary Table 5), although it is classed as benign in ClinVar. Finally, to boost power, we also statistically aggregated rare variants to genes (Online Methods). There were no significant retinol genes after Bonferroni correction of these gene level association results (P < 3.02 × 10-7); however, there were two novel suggestively associated genes (P < 3.02 × 10-5), GALM and ZDHHC18 (Supplementary Table 6).
Prioritisation of retinol-associated genes reveals novel mechanisms influencing circulating retinol
In common variant loci, prioritising causal genes can be difficult due to confounding factors like linkage disequilibrium. For circulating retinol, we employed a multi-faceted approach to prioritise genes that are confidently associated. Firstly, we sought to interrogate the eight genome-wide significant loci uncovered in the main discovery meta-analysis (METSIM+INTERVAL) to uncover plausible causal genes. This was achieved by adapting an integrative pipeline developed in previous work that considers annotation, probabilistic finemapping, integrative scoring, and in silico prediction (Online Methods, Supplementary Table 7). We describe these results further for each locus in the supplementary text but summarise the prioritisation evidence forthwith. There was quite consistent evidence in four loci for a likely causal gene (GCKR, FOXP2, RBP4, and TTR). RBP4 (chromosome 10 locus) and TTR (chromosome 18 locus) were previously associated with retinol in the only other dedicated GWAS of this trait and form the complex that transports retinol in serum22, thereby, having a direct biological link to retinol abundance. GCKR, a gene encoding a protein that binds to and regulates the key metabolic enzyme glucokinase, was confidently the causal gene for the locus on chromsome 2 with the rs1260326 lead SNP. This gene is known to have a large and varied metabolic association profile due to the role of glucokinase in glycaemic and lipid related processes, amongst others27–29. Interestingly, the lead SNP, and most likely causal variant derived from probabilistic finemapping, was a common missense allele (rs1260326), whereby the retinol increasing T allele corresponds to a substitution of leucine for proline in the GCKR protein. Previous experimental work suggests that this variant impacts GCKR affinity for glucokinase27, 30.
The transcription factor gene FOXP2 was also strongly supported by multiple lines of evidence as a causal gene for the locus on chromosome seven. The role of this gene in the brain and in relation to neurological phenotypes like language has been extensively studied31, but less so in the periphery despite relatively high expression across many different systemic tissues. Therefore, we analysed RNA-sequencing data of FOXP2 overexpression in a cell-line not derived from the central nervous system (human osteosarcoma epithelial cell line) and revealed the transcriptional correlates of FOXP2 overexpression were enriched for a broad range of pathways related to factors including extracellular matrix biology, glycosylation, and interleukin signalling, amongst many others (Supplementary Tables 8-11, Online Methods).
The remaining four loci exhibited less clear evidence of which gene to prioritise, although all point to potentially interesting functional mechanisms. On chromosome eight, there is some evidence to support PPP1R3B as a gene that influences retinol, which encodes a catalytic subunit of the phosphatase PP1 that is implicated in relevant metabolic processes like glycogen synthesis32. However, other lines of evidence point to a role of long-noncoding RNA in this locus. The loci on chromosomes 16 and 20 are noteworthy as the closest transcription start sites to the respective lead SNPs are two transcription factors (TF) from the Maf family (MAF and MAFB). Interestingly, MAFB has been shown to regulate both TTR and RBP4 expression in various tissue contexts from human or murine studies33, 34. As only some lines of evidence support these two TF, further functional characterisation of these two loci is warranted. Interestingly, in the locus on chromosome 16, two of the other genes with some evidence for a retinol-related function (MAFTRR and LINC01229) have been shown experimentally to regulate MAF expression and are also associated with other biochemical traits like urate, further supporting the role of the Maf family on retinol abundance in serum35. Finally, the remaining locus on chromosome 2 (2:122078406-122084285) had the least interpretable functional prioritisation results. The closest transcription start site to the lead SNP was another TF (TFCP2L1) that has broad physiological roles including in the kidney36.
We then sought to expand our scope for gene discovery beyond genome-wide significant retinol-associated loci through further integration of genetics with transcriptomics and proteomics (Online Methods, Supplementary Tables 12-13). Firstly, we leveraged multivariate models of genetically regulated expression (GReX) to perform a transcriptome-wide (liver, whole blood, adipose, small intestine, pancreas, and breast mammary tissue) and proteome-wide (plasma) association study (TWAS/PWAS) of circulating retinol. Tissues for the TWAS were selected based on prior knowledge of retinol biology and the results of the partitioned heritability analyses (Online Methods), whilst plasma was the only relevant tissue available for PWAS. After applying multiple-testing correction to the TWAS and PWAS individually (FDR < 0.05) and testing whether there was a shared causal variant via colocalisation [Posterior probability (PP) of a shared causal variant (H4), PPH4 > 0.8], we identified strong evidence of four additional retinol-associated genes outside of genome-wide significant loci (at least +/-1 megabase away). These were as follows: MLXIPL, which binds to carbohydrate response elements to regulate triglycerides37-39; GSK3B, a gene that encodes a member of the glycogen synthase kinase family involved in metabolism and glycaemic homeostasis40; the tankyrase gene (TNKS) implicated in processes like Wnt signalling41; and INHBC, part of the inhibin family of proteins with important endocrine functionality42. Genetically predicted mRNA expression of MLXIPL in adipose, pancreas, and breast mammary tissue was inversely associated with circulating retinol levels. Conversely, TWAS analyses revealed that genetically predicted expression of GSK3B and TNKS was positively associated with circulating retinol levels. Finally, genetically predicted plasma protein expression of INHBC showed a positive correlation with retinol levels (ZPWAS = 4.72). We then used a more conservative approach whereby finemapped variants that influence protein expression (pQTLs) were used as instrumental variables (IV) to estimate the causal effect of plasma proteins on retinol using Mendelian randomisation (MR, Online Methods, Supplementary Table 14). We applied the same filters to the results (FDR < 0.05 and PPH4 > 0.8). These analyses further supported that upregulated INHBC likely increases serum retinol, with each SD increase in plasma protein expression associated with a small but highly statistically significant [0.05 per SD in expression, 95% CI: 0.03, 0.07] impact on circulating retinol. In line with expectation, pQTL-MR, and subsequent colocalisation, further genetically validates that elevated RBP4 protein abundance correspondingly increases serum retinol with somewhat large effect (0.6 [95% CI: 0.48, 0.72] SD in retinol per SD in plasma RBP4 expression). There was also evidence that RBP4 protein expression and retinol colocalise under the hypothesis of a single causal variant (PPH4 = 1). Considering the eight genes prioritised in this and the previous section (RBP4, GCKR, FOXP2, TTR, MLXIPL, GSK3B, TNKS, INHBC), we found that these genes exhibited upregulated expression in the liver (PAdjusted < 0.05) upon analysing data from 54 GTEx tissues. This further consolidates the salience of hepatic processing to genetic influences on circulating retinol abundance. Pathway analyses of these genes demonstrated that they were enriched amongst factors involved in carbohydrate metabolism (Supplementary Table 15).
Wide-ranging evidence of causal effects of retinol across the human clinical phenome
The role of retinol in human health and disease has been of long-standing interest. However, most evidence has been observational, limiting the ability for causal inference. Moreover, randomised controlled trials (RCT) of interventions like retinol supplementation and synthetic retinoids have only been performed for a small fraction of the traits implicated through observational studies. We sought to increase our understanding of causal effects of retinol on human health by leveraging genetic variants associated with retinol uncovered in this study as IVs. Given certain assumptions are met, these genetic proxies of retinol can be utilised to estimate causal effects of circulating retinol at scale using Mendelian randomisation (MR) (Online Methods). Firstly, we utilised a single IV in RBP4 (rs10882283), as this gene has a clear and unambiguous association with circulating retinol levels, and therefore, is less likely to be prone to horizontal pleiotropy than other retinol-associated loci. We do caution, however, that RBP4 does exert some other functionality that may not be directly related to retinol transport43, although this gene is still likely the best available single IV associated with circulating retinol at genome-wide significance. We utilised this RBP4 IV (METSIM+INTERVAL meta-analysis effect size) to estimate the effect of retinol on over 19,500 outcomes in the IEUGWAS database (IEUGWASdb, Online Methods, Supplementary Table 16). After multiple-testing correction (FDR < 0.01), retinol was found to putatively exert causal effects on outcomes including several lipid traits, leukocyte counts (total leukocytes and neutrophils), reticulocytes, optic disc area, and resting-state connectivity of a functional MRI (fMRI) derived network edge. For instance, each SD in circulating retinol was associated with a -0.1 SD [95% CI: -0.14, -0.06] decrease in leukocyte count, whilst this unit retinol increase was estimated to increase optic disc area by 0.22 SD [95% CI: 0.12, 0.32]. We further interrogated a representative subset of these associations using colocalisation to test if the signals are driven by a shared causal variant in RBP4 (Supplementary Table 17). Colocalisation strongly supported that the effect of circulating retinol on leukocyte count, optic disc area and the edge in the fMRI network arises from a shared causal variant in RBP4. In contrast, the effect of retinol on lipids through RBP4 was shown to likely arise from linkage due to the proximal gene FFAR4 that encodes a free-fatty acid receptor. Therefore, the lipid findings most likely do not represent a causal impact of circulating retinol, at least through RBP4, but rather the influence of FFAR4.
To boost power for discovery of causal retinol effects, we then utilised all independent (LD r2 < 0.001) genome-wide significant SNPs as IVs (Online Methods). Firstly, we used the inverse-variance weighted estimator with multiplicative random effects (IVW-MRE) to estimate the effect of retinol on IEUGWASdb outcomes for which at least six of the IVs were available (> 17,000 outcome phenotypes). There was moderate positive correlation between the IVW-MRE and single RBP4 estimates across all traits tested (r = 0.43, Supplementary Figure 4). Whilst well powered, the IVW-MRE assumes all IVs are valid, which is unlikely in practice. Therefore, we developed a pipeline to prioritise the most confident causal relationships that survive multiple-testing correction considering the IVW-MRE estimates (FDR < 0.01, Online Methods, Figure 2A, Supplementary Tables 18-20). This was comprised of three tiers, with Tier #1 being the highest level of evidence. Retinol/outcome pairings that were assigned a tier had to exhibit no significant heterogeneity between IV-outcome effects, a non-significant MR-Egger intercept (which screens for unbalanced pleiotropy), and not be driven by a single IV. Four additional MR methods with differing assumptions were then applied in this study (Online Methods). From the trait pairings that passed the above heterogeneity and pleiotropy filters, Tier #1 traits were those for which all five methods were statistically significant, whilst Tier #2 traits had 4/5 significant methods, and Tier #3 3/5 methods significant. There were no Tier #1 retinol/outcome trait pairings, but several Tier #2 and Tier #3 trait pairings (Figure 2B-C), with all of them directionally consistent with the estimates from the single RBP4 IV, supporting their validity. Broadly, we found that retinol increased body fat related measures, resting-state fMRI connectivity of several network edges, as well as food consumption phenotypes related to carbohydrates. Retinol also exhibited evidence of a relationship with the cortical thickness and surface area of several brain regions, as well as microbiome composition and keratometry measurements. These results were dominated by continuous traits, for which we are better powered, however, there were some binary traits assigned Tier #2 or Tier #3 evidence. Specifically, retinol was associated with decreased odds of tuberculosis sequalae and coxarthrosis (arthrosis of the hip), whilst it was associated with increased odds of non-specific skin eruptions, adverse asthma/COPD medication effects, and dental problems. We caution that all these inferred associations require further investigation, and should be treated with requisite caution, as reviewed elsewhere19, 44, 45. A consideration of this approach that leverages multiplicative random effects with relatively few IVs (< 10), is the potential influence of residual standard errors below 1 on the estimation of the IVW standard errors. We see some evidence of this impact on the IVW-MRE estimates for Tier #2 and Tier #3 traits – as these traits exhibit no significant heterogeneity between IV estimates. Specifically, whilst all fixed effect IVW results are still highly statistically significant for these traits, they have larger standard errors than the IVW-MRE, indicative of residual standard errors < 1. This is a function of the MRE not scaling the standard error of the IVW by the model’s residual standard error like in the fixed effects model. We discuss this further in the supplementary text and in supplementary figure 5. However, these issues only impact the P-value of the IVW-MRE relative to that of the fixed effects, and Tier #2/Tier #3 traits still exhibit non-zero evidence across multiple methods and no indication of a single IV driving the association. Moreover, using instead a fixed effects IVW estimator as the primary test for trait pairings with no heterogeneity (Cochran’s Q P < 0.05) yields similar outcomes being prioritised as most statistically significant after FDR correction (Supplementary Text). It also important to consider when interpreting these estimates that MR approaches are only interpretable under the assumptions they make, and as a result, any potential causal relationship reported here requires further validation, ideally using a randomised control trial design.
We hypothesised that the putative effect of retinol on body fat could be one explanation for its relationship in this study to brain phenotypes beyond a direct effect of retinoid signalling, particularly given that obesity and adiposity have been linked with MRI related indices46. However, using IVs for body fat percentage, we did not find any strong evidence that it is causally related to any of the retinol-associated brain regions (Supplementary Table 21), suggesting a direct effect of retinol on these regions/networks or a relationship induced through some other unobserved confounder. Reverse causality for these Tier #2 and Tier #3 exposures was then also considered, although for binary traits this should be treated purely as a test of the null hypothesis given the difficulties in using binary traits as IVs47. There was no strong evidence for reverse causality of any of these traits (Supplementary Table 22). One exception to this was in relation to expression of the protein PEAR1, for which there was very nominal evidence of bidirectional effects.
A limitation of the above phenome-wide analyses is that the multiple-testing burden that arises from the inclusion of over 17,000 traits may obscure retinol effects on binary disease phenotypes, as these are usually less powered than GWAS of continuous traits. To overcome this, we also applied the above pipeline using all retinol IVs to 1141 binary endpoints with at least 1000 cases in FinnGen release 8 (not featured in IEUGWASdb), allowing a phenome-wide analysis of electronic health record derived binary outcomes. There were eight disease phenotypes that retinol was associated with after multiple testing correction (Figure 3A, FDR < 0.01), which increased to 19 with an exploratory FDR < 0.1 threshold (Supplementary Table 23). After applying the above pipeline to these results that considers heterogeneity, pleiotropy, and consistency across MR methods, there were four disease endpoints with Tier #3 evidence (Figure 3B, Supplementary Table 24). Specifically, retinol was estimated to increase the odds of congenital malformations of the heart and great arteries, whilst it was protective for type 2 diabetes with coma and inflammatory liver disease. As these traits had tier #3 evidence, there was some inconsistency in the strength of the results across different MR methods, and therefore, these relationships should be interpreted cautiously. One of the most active areas of research in retinol epidemiology is the relationship between retinol and cancer risk48. The estimated effect of circulating retinol on the odds of any malignant neoplasm was not significantly different than one - OR = 0.97 [95% CI: 0.91, 1.04], P = 0.423 (Supplementary Figure 6). However, there was some indication of a protective effect of retinol on squamous non-small cell lung cancer, which approached the threshold for statistical significance after FDR correction - OR = 0.64 [95% CI: 0.51, 0.80], P = 8.19 × 10-5, q = 0.01. There was also some data to support retinol having effects on other respiratory neoplasm endpoints (Supplementary Figure 6). Given previous observational evidence that retinol is protective for lung cancer48, as well as some data supporting the use of synthetic retinoids like bexarotene in lung neoplasms49, this relationship warrants further exploration.
We also investigated evidence for bidirectional effects involving these Tier #3 traits that retinol is genetically predicted to influence. As described above, using binary traits as exposures in MR should be treated with suitable caution and are often underpowered. The use of either genome-wide significant or suggestively significant SNPs (P < 1 × 10-5) as IVs did not indicate evidence of reverse causality of these diseases to retinol (Supplementary Table 25). However, given some of the statistical limitations of these analyses, such effects warrant further consideration.
Genetic evidence that lipids and kidney function influence circulating retinol
It is also clinically valuable to understand exposures and diseases that impact circulating retinol abundance. To explore this in greater detail, we leveraged retinol as an outcome trait in MR analyses. We utilised a diverse range of thousands of continuous and ordinal phenotypes from IEUGWASdb as exposures in a similar pipeline described above (Online Methods). Several lipid species were demonstrated to putatively influence retinol abundance after multiple-testing correction (FDR < 0.01, Figure 4A, Supplementary Table 26); for example, triglycerides were implicated to increase circulating retinol whilst cholesteryl ester related traits decreased circulating retinol. We also saw a positive effect of the frequency of solarium and sun lamp use on retinol, which may arise from behavioural related mechanisms. Furthermore, our findings suggest that increased retinol levels are associated with a susceptibility-weighted MRI measure in the left putamen, called T2*. T2*, reflecting magnetic susceptibility relative to tissue water, can be influenced by factors such as iron and calcium content. This association may also be influenced by behavioural or other pathways that warrant further investigation.
After applying the same tiering system used for retinol as an exposure, we observed Tier #1 evidence strongly supporting a causal effect of creatinine on circulating retinol. This relationship is biologically plausible and likely represents an association with kidney function50, 51. Due to the biological complexity of lipid traits, we observed significant heterogeneity between IV effects, and as a result, they were not assigned a tier in our analysis. However, the effect of triglycerides on increasing circulating retinol levels is consistent with established knowledge of retinol biology, as well as this study implicating two genes that are mechanistically confirmed to impact triglycerides (GCKR and MLXIPL). We then leveraged the CAUSE model to distinguish causal effects of creatinine on circulating retinol from correlated pleiotropy that may arise between these two traits due to the extensive polygenicity of creatinine52. We found that a model that includes a causal effect of creatinine on retinol was more parsimonious than a model of pleiotropy alone (‘sharing model’) through comparison of these models using the Bayesian expected log pointwise posterior density (ELPD) method (Supplementary Figure 7, ΔELPDSharing vs Causal = -4.34, P = 8.9 × 10-3). Given that IVs for creatinine could plausibly act through lipid species like triglycerides to influence circulating retinol, we then constructed multivariable MR (MVMR) models that estimated the creatinine to retinol relationship conditioned on high density lipoprotein (HDL), low density lipoprotein (LDL), and triglycerides (Online Methods). While there was some evidence that the effect of creatinine on retinol could arise due to triglycerides, there was also evidence to suggest an independent effect of both triglycerides and creatinine on increasing circulating retinol, depending on the modelling parameters used (Figure 4B, Supplementary Text). In addition to investigating the causal effects of exposures on retinol, we also examined the possibility of bidirectional effects. We found very weak evidence that retinol has a negative effect on creatinine (P = 0.023), but there was no significant evidence to suggest bidirectional relationships between retinol and the other implicated exposures.
Finally, we explored pharmacological agents and molecular perturbagens that may influence circulating retinol (Online Methods). Considering the novel genes prioritised in this study with an assigned direction of expression (TWAS/PWAS, pQTL MR), it was found that GSK3B is a drug-target known to be inhibited by lithium and related compounds. This may be of clinical interest as it suggests that lithium, utilised as a therapy in mood disorders, may decrease circulating retinol via its inhibition of GSK3B given that genetically predicted expression of this gene was positively associated with retinol. We then employed computational signature mapping to further characterise pharmacological agents related to retinol (Online Methods). However, these analyses did not yield any compounds for which the in vitro transcriptomic signature significantly matched or opposed genetically predicted expression associated with retinol after multiple-testing correction (Supplementary Table 27). We then considered perturbagen signatures aggregated to biological pathways or overall mechanisms of action (MOA) groups of compounds (Online Methods). After multiple-testing correction (FDR < 0.05), there were 13 gene-set based perturbagen signatures that were significantly similar to the directionality of genetically predicted expression associated with retinol (Supplementary Table 28). For example, the expression signature of compounds in the HDAC inhibitor MOA opposed expression genetically predicted to increase serum retinol. This can be interpreted as while no single HDAC inhibitor was significantly associated with retinol, there was at least some evidence for the overall relationship with this MOA. This accords with the suggested effect of HDAC inhibitors like valproic acid on downregulating expression of RBP453–55, a gene not included in the signature mapping analyses given the large effect of its encoded protein on retinol.
Genetically proxied retinol can identify individuals outside of the normative range of circulating retinol for a given age
We were also interested in evaluating the performance of a genetically proxied index of circulating retinol, that is, a circulating retinol polygenic score (PGS). The independent TwinsUK cohort was utilised to tune and evaluate retinol PGS (Online Methods). We used several methods to evaluate the performance of a retinol PGS in this cohort. Firstly, we test different retinol PGS configurations in a random selected training subset of the model (70% of cohort), using a linear mixed model to account for relatedness between the twin pairs. The best performing retinol PGS configuration in the training subset explained approximately 2.12% of the phenotypic variance of retinol when applied to the remaining 30% of the cohort (mean variance explained across three retinol measurement timepoints). We also found similar performance when applied to each subset of twin pairs (Supplementary Table 29). A limitation of this approach for using the same cohort for tuning and testing the retinol PGS is that the estimated effect sizes may not be representative. As a result, we employed an approach to tune weights for the PGS using the summary statistics alone. This was achieved through leveraging the principles of probabilistic finemapping to update variant weights by their posterior probability of association (Online Methods). Due to the modest polygenicity of retinol, this method upweights a small number of variants. Despite this, these scores were still significantly associated with circulating retinol in TwinsUK (Supplementary Table 29).
Like most micronutrients, circulating retinol has been shown previously to have a complex relationship with age56. However, population-level approaches investigating these effects do not account for inter-individual variability. We hypothesised that normative modelling could be used to characterise individual patterns of circulating retinol, and to evaluate the contribution of genetics to these individualised effects. Normative modelling, derived from the application of growth charts in paediatric medicine, aims to estimate normative reference ranges of variation in the population (e.g., of circulating retinol) based on age and/or other relevant variables. Here, we established reference ranges for circulating retinol as a function of age using generalised-additive models for location, scale, and shape (GAMLSS) frameworks in the TwinsUK dataset (Online Methods, Supplementary Text). These models were constructed in the full cohort, as well as in one twin subset only for comparison. Briefly, this involved identifying the optimal distribution for the GAMLSS model using all samples, followed by splitting the data into two partitions. One partition was utilised to estimate centiles (5th, 25th, 50th, 75th and 95th), and the left-out subjects were benchmarked against this reference chart to determine the position of their individual retinol measurement (Online Methods). This process was then repeated using the opposite subject allocation. An individual’s retinol level was classified as infra-normal if their measured retinol fell below the 5th percentile for their age, and as supra-normal if their retinol value exceeded the 95th percentile for their age.
We then investigated the extent to which retinol PGS was associated with these individual profiles of circulating retinol with respect to age (Supplementary Table 30). By way of example, we report results forthwith from the more conservative modelling approach which only used one half of the twins. Retinol PGS was at least nominally significantly associated with supra and infra-normal deviations except for supra-normal deviations at the first (youngest visit) for which there was only a trend observed (Supplementary Figures 8-9). For example, at the second visit each SD in retinol PGS was associated with an approximately 70% [95% CI: 23%, 136%] increase in the odds of exhibiting supra-normal retinol levels for a given age relative to all remaining participants, whilst conversely reducing the odds of displaying infra-normal levels by approximately 37% [95% CI: 12%, 55%]. These data suggest that genetics is a non-zero contributor to circulating retinol levels that fall outside the normative range for a given age. In future, a normative modelling approach could be utilized to examine additional factors, including dietary intake, as well as the interplay between genetics and other influences, that might contribute to individualized deviations in retinol levels relative to the population benchmarks.
DISCUSSION
We conducted the largest GWAS of circulating retinol to-date, revealing important novel insights into genetic influences on this trait. The sample sizes in this study facilitated the first published estimate of SNP heritability for circulating retinol, plausibly between 5-10%. However, the large standard errors accompanying these estimates reinforces that greater sample sizes are still needed. Moreover, we were able to uncover confident genetic signals associated with retinol at genome-wide significance outside of the RBP4:TTR transport complex. The gene prioritisation pipeline applied both within and beyond genome-wide significant loci prioritised eight genes with high-confidence for a role in retinol biology. These genes were highly expressed in the liver and overrepresented amongst biological pathways including carbohydrate metabolism. The liver is known to be the key organ responsible for retinol storage and processing1, which is represented strongly by the genetic data in this study. Further, the prioritised genes assigned as overrepresented in the regulation of carbohydrate metabolic process pathway (GCKR, GSK3B, and MLXIPL1) are all broadly known to be related to hepatic energy metabolism. As lipids are directly mechanistically linked to retinol absorption, storage, and delivery1, 3, it is plausible that the varied metabolic roles of these genes converge on changes in the abundance of different lipid species. The role of lipids in circulating retinol abundance is also highlighted by our Mendelian randomisation analyses. However, it is still likely that glycaemic homeostasis may impact circulating retinol via mechanisms not directly linked to lipid biology; for example, expression of the insulin-controlled glucose transporter GLUT4 is postulated to be related to RBP4 protein levels57. In summary, our results suggest that the most identifiable common variant influences on circulating retinol are either mediated through direct effects on transport or metabolic factors, particularly related to lipids. We also prioritised genes like FOXP2 for which a mechanistic relationship to retinol is less inherently clear. Our analyses of transcriptomic correlates of FOXP2 supported the immense biological pleiotropy associated with this transcription factor, reinforcing its significance outside of its traditionally conceived association in the literature with neurological phenotypes like language. Work is now needed to disentangle the mechanisms which specifically underlie this relationship between FOXP2 and circulating retinol that were infer from these genetic findings.
Our study also represents a significant advancement as it is the first to perform a high-throughput, hypothesis free, analysis investigating the potential causal effects of retinol across a wide range of human clinical phenotypes using Mendelian randomisation. This work recapitulated known influences of retinol on ophthalmological measures58, the innate and adaptive immune response59, and congenital heart malformations60. However, we also uncovered some more novel relationships that may be of direct clinical relevance. We highlight forthwith the example of circulating retinol being genetically predicted to impact the thickness and surface area of several brain regions, as well as indices of brain connectivity. Retinoic acid, a downstream metabolite of retinol, is considered one of the most intrinsic central nervous system signalling molecules, particularly as it exerts control over processes like neuronal differentiation in utero and adult neurogenesis, as reviewed elsewhere2. It is, therefore, logical that retinol would plausibly influence brain structure and connectivity throughout the lifespan. However, the regions implicated in this study require further examination with respect to their clinical significance. By way of example, we associated increased circulating retinol with a reduction in thickness in the right rostral anterior cingulate cortex. This cortical region has been identified by a large international mega-analysis from the ENIGMA consortium to exhibit increased thickness in individuals with the neuropsychiatric disorder schizophrenia compared to controls61, suggesting a potential protective effect of retinol in this region with respect to schizophrenia. This accords with previous evidence linking retinoids to schizophrenia2, 62. It is known clinically that both retinol deficiency and toxicity can have harmful neurological effects, highlighting the complexity of the relationship of retinol to the brain throughout the human lifespan. This complexity is also seen with synthetic retinoids. For instance, isotretinoin (13-cis retinoic acid), indicated for conditions like acne, has been shown to have opposing effects on adult neurogenesis relative to all-trans retinoic acid and putatively increases the risk of suicide63, 64, although evidence for this association is mixed65. Conversely, another synthetic retinoid, bexarotene, with different receptor affinities to isotretinoin, has demonstrated some promise as a potential adjuvant to antipsychotics in schizophrenia66. Future work should attempt to understand these relationships with greater fidelity by investigating genetic influences on other retinoids beyond retinol, as well as how tissue-specific abundance can differ from what circulates in serum67. Emerging methods for non-linear Mendelian randomisation would also be useful in this context given that retinol often exerts dose dependent effects68. The causal estimates generated in this study also need to be treated with appropriate caution due to the limitations of MR and require further validation in study designs that can enable causal inference, such as randomised control trials. As reviewed previously18, 19, 69, MR tests are only unbiased when their assumptions are plausibly satisfied, which is why we implement a suite of different methods and sensitivity analyses with quite distinct underlying assumptions in this study. Genetic estimates on circulating retinol used as IVs could also be confounded by factors including uncontrolled population stratification, selection bias, and measurement error. In summary, we provide a large resource to the literature of putative effects of circulating retinol across the human clinical phenome that will be informative for future investigation of this trait.
Finally, we make some recommendations for future GWAS of retinoid molecules. An important limitation of our analyses is that we only investigated genetic effects on circulating retinol, rather than retinol availability within target tissues. Given the complexities of retinol homeostasis, it is plausible that genetic effects on factors like retinol entry into the cell (e.g., STRA6 receptor) and esterification for storage (e.g., lecithin retinol acyltransferase) are obscured when considering only retinol present in serum or plasma. Therefore, future studies should attempt to measure retinol abundance in different tissues from genotyped samples, although this will pose a challenge in terms of obtaining sufficient sample sizes. Moreover, it would be of interest to characterise the genetic overlap between effects on retinol versus other retinoids like retinaldehyde and all-trans retinoic acid. Despite these limitations, and the need for concerted efforts to collect more data, we believe this study demonstrates the value in conducting retinol GWAS to both better characterise retinoid associated biology and its clinical significance.
ONLINE METHODS
Study cohorts
The proceeding section outlines the datasets included in the genome-wide meta-analysis of circulating retinol, as well as the replication cohort.
INTERVAL
The largest constituent cohort of the meta-analysis was drawn from the INTERVAL study, comprised of recruited blood donors from the United Kingdom70. Retinol abundance was measured from plasma using the high-throughput metabolomics platform DiscoveryHD4® (Metabolon, Inc., Durham, USA), as outlined in the supplementary text. Briefly, after adjustment for various technical/biological confounders and outlier effects, residualised plasma retinol was inverse-rank normal transformed before association testing. Whole-genome sequencing of this cohort was performed as described elsewhere71. A GWAS of the normalised residuals was performed in HAIL via multiple-linear regression adjusted for INTERVAL metabolon batch and 10 genetic PCs72. The final GWAS sample size was 11,132 European ancestry participants.
METSIM
Plasma retinol was also measured using the DiscoveryHD4® high-throughput platform in a recent metabolome-wide GWAS of the METSIM (Metabolic Syndrome in Men) study73. The METSIM cohort consists of middle-aged men recruited from Northern Finland between 2005-201074. As described by Yin et al.73, METSIM participants were genotyped using the Human OmniExpress-12v1_C BeadChip and imputed using a custom METSIM panel of whole genome-sequenced participants in the study. A linear mixed model implemented in EPACTS v.3.2.6 was then leveraged to perform GWAS on residualised retinol, subjected to inverse-rank transformation after adjustment for technical and biological confounders. The final GWAS had a sample size of 6136 METSIM participants (European ancestry).
ATBC+PLCO
The largest previous dedicated GWAS of circulating retinol from 2011 was also included in this study22. This GWAS comprised data from two studies that measured serum retinol: the Alpha-Tocopherol, Beta-Carotene Cancer Prevention (ATBC) Study, a Finnish randomised control trial of beta-carotene/alpha-tocopherol supplementation for cancer prevention75, and the Prostate, Lung, Colorectal, and Ovarian (PLCO) Cancer Screening Trial, a United States trial of cancer screening effectiveness76. The inclusion criteria, measurement of serum retinol, and genotyping have been outlined by Mondul et al.22. Briefly, ATBC/PLCO samples were genotyped using the Illumina HumanHap550/610 arrays and imputed to the HapMap Central European reference panel, whilst serum retinol concentrations were estimated using reversed-phase liquid chromatography. The retinol GWAS (N=5006) was performed in R (version 2.10.1) using multiple linear regression adjusted for age at sample collection, SNP derived PCs, cancer status, serum cholesterol, and body mass index. The retinol units for the GWAS effect sizes were in natural log transformed μg/L. A limitation of using summary statistics from this study is that only < 600,000 variants were available for inclusion in the GWAS, as was common at the time before more recent advances in imputation pipelines that result in larger post-imputation yield. Therefore, to increase the number of variants available for meta-analysis, we applied a summary statistics-based imputation procedure to boost the number of variants available for meta-analysis. After harmonisation with the 1000 genomes phase 3 reference panel, we applied Gaussian summary statistics imputation (ImpG) as implemented by the FIZI v0.7.2 python package (https://github.com/bogdanlab/fizi) with the default window size of 250 kb77. The ImpG model leverages the assumed Gaussian distribution of GWAS Z scores with mean zero and variance that arises due the LD-induced correlation between variants. As outlined elsewhere77, Z scores of unobserved (imputed) variants can be estimated given the LD correlation matrix derived from the reference panel, along with a metric of imputation accuracy (R2) using the conditional variance. We retained only confidently imputed variants (R2 > 0.8).
TwinsUK
We performed a serum retinol GWAS in the TwinsUK cohort to serve as a replication dataset. High-throughput metabolomics profiling in this cohort has been described extensively elsewhere78. TwinsUK is a prospective population-based study of mostly female twin pairs which has been profiled using a variety of multiomic technologies79. As outlined by Shin et al., genotyping was performed with a combination of Illumina arrays (HumanHap300, HumanHap610Q, 1M-Duo and 1.2MDuo 1M)80. This was followed by imputation to the 1000 genomes phase reference panel after quality control and retaining individuals of predominantly European ancestry after PCA. We retained physically genotyped variants and those with moderate imputation accuracy for GWAS (R2 > 0.3) in 5654 samples. Our strategy for GWAS in this cohort given the limited sample size with measured retinol available was to split the twin pairs into two separate cohorts and ensure individuals in each sub-cohort were unrelated. Relatedness testing in both sub-cohorts containing one of the two possible twin pairs was performed separately using KING as implemented by plink2 (PLINK v2.00a3LM AVX2 Intel), with one participant from third-degree relative or greater pairs randomly removed81, 82. Kinship estimation via KING was performed for autosomal variants physically genotyped on the array, with a MAF > 0.05, outside of regions of long-range LD like the MHC83, and in relative linkage equilibrium (r2 < 0.05). PCA was then applied in each sub-cohort using plink2 to calculate eigenvectors for use as downstream covariates. Retinol was measured from samples at three timepoints. The mean age of participants at each timepoint was 51.5 (SD = 8.41), 58.6 (SD = 8.38), and 64.7 (SD = 8.41), respectively. There were only a very small number of males in this cohort (∼ 3%), so only females were retained for further analysis (N = 1696) due to this imbalance in sex composition. After merging with the genotyped split twin cohorts, as described above, there were up to 916 and 717 genotyped participants with measured retinol at three timepoints in each subset, respectively. Six GWAS were performed: in each sub-cohort (one or two), multiple linear regression was utilised to test the additive effect of each variant on measured retinol at one of the three measured time-points covaried for age, five SNP derived PCs, and metabolomics batch. These GWAS was performed using the --glm flag in plink2, resulting a 9,051,192 by three matrix of estimated retinol effect sizes for both sub-cohorts.
Genome-wide meta-analyses
We conducted genome-wide meta-analysis of common variants (MAF ≥ 0.01) using METAL (version March 2011), followed by a rare variant (MAF < 0.01) meta-analysis also with METAL. The METSIM and INTERVAL cohorts were integrated for the primary meta-analysis as they both had expansive genome-wide coverage of common and rare variants. A sample size weighted meta-analysis of Z scores (Stouffer’s method) was utilised for this purpose. We also conducted the METSIM+INTERVAL meta-analysis via an inverse-variance weighted estimator with fixed effects to estimate the effect sizes of effect alleles in SD units of plasma retinol given this was the unit of both the METSIM and INTERVAL GWAS, as well as both studies using the same Metabolon Inc. platform for metabolite quantification. Heterogeneity between the studies was assessed using Cochran’s Q test. We then conducted a meta-analysis with fewer available variants that also included ATBC+PLCO using Stouffer’s method (as the unit for this GWAS differed from METSIM and INTERVAL). This larger sample-size meta-analysis was also restricted to common variants as there was very limited rare variant coverage in ATBC+PLCO.
The FUMA v1.4.1 (Functional Mapping and Annotation of Genome-Wide Association Studies) platform was utilised to annotate variants, define lead SNPs, and infer loci boundaries for genome-wide significant signals (P < 5 × 10-8)84. We utilised the default settings for defining independent significant SNPs (r2 ≤ 0.6) and lead SNPs (r2 ≤ 0.1). LD estimation was achieved using the 1000 genomes phase 3 European reference panel, with LD blocks within 250 kb of each other merged into a single locus. We then attempted to replicate the lead SNPs from the eight genome-wide significant loci (METSIM+INTERVAL) in TwinsUK, as described in the previous section. Given the small sample size of TwinsUK, we sought to ascertain if the lead SNPs were directionally consistent with the meta-analysis. This was achieved by taking the mean SNP Z score across the three timepoints for the two sub-cohorts of unrelated participants, with a binomial test utilised to infer whether the number of lead SNPs (mean Z) that were directionally consistent was greater than chance alone (Binomial P < 0.05).
In the rare-variant meta-analysis, we annotated variants using the Functional Annotation of Variants (FAVOR) online resource85. Phenome-wide association profiles of selected variants were also investigated using the pheweb browser collated from FinnGen release 8 (https://r8.finngen.fi/)86. Rare variants were then aggregated to genes through leveraging the characteristics of the Cauchy distribution87, 88. In this approach, gene-wise P values are summed and then transformed to approximate a Cauchy distribution, which due to its heavy tail is insensitive to correlations amongst the P values. This behaviour of the Cauchy distribution is important as covariance amongst rare variants is difficult to estimate, and therefore, this approach guards against inflated type I error due to potential unknown covariance/LD between rare variants. Code for implementing the Cauchy aggregation was adapted from https://github.com/yaowuliu/ACAT.
SNP heritability estimation
Summary statistics for the meta-analyses were ‘munged’ using the munge_sumstats.py script from the ldsc repository of scripts (https://github.com/bulik/ldsc) and only common (MAF > 0.05) HapMap3 variants outside of the MHC retained. SNP heritability was then estimated using the LDSR model and the 1000 genomes phase 3 reference panel89. By way of comparison, we then estimated SNP heritability using the LDAK model via the SumHer implemented in LDAK v.5.2 (https://dougspeed.com/)90. Pre-computed tagging files derived from 2000 white British individuals in the UKBB for HapMap3 SNPs were utilised to calculate SNP heritability using the LDAK-thin and BLD-LDAK models, as described elsewhere90, 91. We also estimated partitioned SNP heritability via LDSR using a multi-tissue and cell-type panel92.
Empirical Bayes’ modelling of the genetic architecture of retinol
We investigated the polygenicity of the genetic architecture of circulating retinol using an Empirical Bayes’ adaptive shrinkage method termed ashR93. Functions to perform this method were implemented via the ashR R package v2.2-54 (https://github.com/stephens999/ashr). Briefly, this approach models effect sizes, along with their standard error, as a mixture of zero and non-zero effects. Empirical Bayes’ inference is performed under the assumption that the distribution of these variant effects is unimodal, which is a realistic assumption for genetic effects on complex traits. The METSIM+INTERVAL IVW meta-analysis was utilised for this as it provides an interpretable effect size and standard error for each variant (plasma SD units). In line with previous work94, we annotated each HapMap3 variant with its corresponding LD score from the 1000 genomes phase 3 European reference panel, as well LD scores from the UKBB White Great British samples for comparison, and sorted these into bins of similar LD scores (NBins=1000 and NBins=5000). The ashR Empirical Bayes’ inference of the proportion of non-zero effects was undertaken in each LD score bin, followed by calculating the mean across all bins. We utilised a generalised additive model to plot a smoothed trend line of the relationship between increasing LD score bin (higher LD score) and the proportion of non-zero effects.
Gene prioritisation
Gene prioritisation was performed within genome-wide significant loci, as well as outside of loci that obtained genome-wide significance. Due to the better coverage of common variants, we focused on the METSIM+INTERVAL meta-analysis for this analysis. The pipeline for prioritising putative causal genes within the eight genome-wide significant loci was adapted from a previous GWAS performed by our group95. The following criteria were utilised in this study: 1) closest transcription start site (TSS) to the lead SNP, 2) closest gene (any) to the lead SNP, 3) gene encoding retinoid transporter or enzyme in locus, 4) non-synonymous variant in locus, 5) the most statistically significant GTEx eGene [expression quantitative trait loci (eQTL) signal] in locus, 6) most significant GTEx eGene in finemapped credible set for eQTL signal [posterior inclusion probability (PIP) > 0.1, DAP-G method]96, 97, 7) strongest plasma pGene [protein quantitative trait loci (pQTL) signal] drawn from finemapped pQTLs (PIP > 0.5)98, 8) highest CADD score99, 9) lowest RegulomeDB score100, 10) the OpenTargets V2G predicted gene for the lead SNP101, and 11) genes physically mapped to SNPs in the 95% credible set derived from probabilistic finemapping (assuming a single causal variant such that LD did not have to be modelled)102. We used a prior variance of 0.15 to approximate Bayes’ factors from variant-wise effect sizes (METSIM+INTERVAL IVW meta-analysis), in line with previous work finemapping association signals with quantitative traits103. We characterised which genes satisfied the greatest number of the above criteria on a per locus basis.
FOXP2 was one of our confidently prioritised genes but its biological significance outside of the brain is less well understood. To investigate this, we analysed RNA sequencing data from an in vitro experiment that overexpressed FOXP2 in a human osteosarcoma epithelial cell line (U2OS) via transfection of wild-type FOXP2 expressing plasmids. Raw read counts from five control cell line replicates versus five plasmid transfected FOXP2 overexpression replicates were downloaded from the Gene Expression Omnibus (GEO) resource (GEO Accession: GSE138938). Data normalisation, filtration, and differential expression analyses were performed using the edgeR package version 3.34.0104. Specifically, raw counts were firstly normalised to library size and lowly expressed genes with fewer than 10 raw counts in the smallest library removed via a counts-per-million thresholding approach. Data were inspected before and after the filtration step via coefficient of variation (BCV) and multidimensional scaling (MDS) plots (Supplementary Figure 10). Differential expression for each gene that survived quality control was then performed using exact tests for differences in the means between two groups of negative-binomially distributed counts. We defined a differentially expressed gene as those which survived multiple-testing correction using the Bonferroni method (PCorrected < 0.05), with three different absolute log2 fold change (FC) cut-offs considered: |log2FC| > 1.5, |log2FC| > 2, and |log2FC| > 5. The use of Bonferroni correction and large absolute FC thresholds is very conservative; however, given the volume of differentially expressed genes, and a relatively large number of replicates for a cell line experiment boosting power, we believe these strict parameters are warranted to priortise the most salient FOXP2 associated signals. The overrepresentation of each set of candidate genes amongst biological pathways and other ontology sets was tested using g:Profiler105.
To identify potential causal genes that have not reached genome-wide significance at our current sample size, we integrated the circulating retinol GWAS with genetic effects on mRNA and protein expression. Firstly, we conducted a transcriptome and proteome-wide association study (TWAS/PWAS) of circulating retinol using the FUSION approach106. As outlined previously107–110, FUSION leverages models of cis-acting genetically regulated expression (GReX) that exhibit statistically significant non-zero heritability. Variant weights from GReX models are integrated with the effect of those same variants on retinol to estimate the direction of genetically regulated expression associated with increasing circulating retinol. TWAS GReX (mRNA) were estimated previously using GTEx v8 (http://gusevlab.org/projects/fusion/). We selected the following biologically informative tissues to perform TWAS based on known retinol biology or tissues that exhibited at least nominally significant (P < 0.01) enrichment of SNP heritability in the partitioned-LDSR model. The selected tissues were: small intestine terminal ileum, pancreas, liver, adipose (visceral omentum), adipose (subcutaneous), breast (mammary tissue), and whole blood. It has been suggested previously that TWAS signals from tissues that are less directly trait relevant can induce spurious associations, which is why we limited our hypothesis space to these tissues111. Protein GReX were derived from plasma (ARIC study), as outlined elsewhere98. We applied Benjamini-Hochberg false discovery rate (FDR) correction across all TWAS Z, followed by all PWAS Z. Colocalisation between GReX models and retinol was performed for all TWAS/PWAS signals that were at least nominally significant via the coloc package as implemented by FUSION (single shared variant hypothesis)103. We then considered a more conservative approach to priortise proteins for whom expression could be causally linked to circulating retinol through leveraging finemapped plasma pQTLs (PIP > 0.5, ARIC) as instrumental variables (IV) for Mendelian randomisation19, 44, 98. The Wald ratio method was implemented for proteins with single IVs, whilst an inverse-variance weighted estimator with fixed effects was utilised for proteins with more than 1 IV. Fixed effects were used for the IVW rather than multiplicative-random effects in this instance as no protein had > 4 IVs. The use of IV based approaches for identifying trait-associated genes versus GReX has been discussed extensively elsewhere109. Colocalisation was also performed for proteins that survived FDR correction. Mendelian randomisation was performed using the TwoSampleMR package v0.5.6, with IVs clumped through leveraging LD from the 1000 genomes phase 3 European reference panel to retain only independent pQTLs (r2 < 0.001).
Finally, we investigated the tissue specificity of prioritised genes from GWAS loci and the TWAS/PWAS/MR approach using FUMA. To do this, we compared the expression of these prioritised genes using a t-test (one-sided and two-sided) against all other available genes in GTEx v8 on a per tissue basis (54 tissues), followed by applying Bonferroni correction to these P-values84. We also conducted pathway analyses of these genes using g:Profiler with default parameters105.
Causal inference
We developed and implemented a comprehensive pipeline to leverage this retinol GWAS to identify putative causal effects of retinol on traits across the human clinical phenome, as well as traits that causally influence retinol in the reverse direction. This was achieved using Mendelian randomisation (MR), which has been reviewed extensively elsewhere19, 45, 69. Firstly, we considered circulating retinol as the MR exposure. The lead SNP in RBP4 was first chosen as a single IV to proxy circulating retinol as out of all the genes implicated in genome-wide significant loci, RBP4 has exhibits the most specificity in terms of its relationship with serum retinol. We utilised the Wald ratio method to estimate the effect of the circulating retinol increasing rs10882283-A allele (METSIM+INTERVAL IVW effect size) on over 19,000 outcomes in the IEUGWASdb v6.9.2 resource via the ieugwasr package version 0.1.526. To correct for multiple testing, we applied the false discovery rate (FDR) method, and we retained only those retinol/outcome pairs with estimates that were significant below the 1% FDR threshold (q < 0.01). Subsequently, we investigated whether significant trait pairs colocalised using coloc v5.1.0 with default priors.
Although the single IV approach is more conservative, power to detect causal effects can be boosted by using all available independent (r2 < 0.001, 1000 genomes phase 3 European reference panel) genome-wide significant SNPs as IVs. The inverse-variance weighted estimator with multiplicative random effects (IVW-MRE) was utilised to estimate the effect of retinol on outcomes from IEUGWASdb for which at least 6 IVs were available in that GWAS112. The IVW approach has a zero percent breakdown level as it assumes all IVs are valid for use in Mendelian randomisation, which is often unrealistic in practice. We developed pipeline to prioritise the most reliable causal estimates from the IVW-MRE that survived multiple-testing correction (q < 0.01). This involved identifying retinol MR estimates on traits for which the following applied: i) no significant heterogeneity (P < 0.05) between IV exposure-outcome effects tested with Cochran’s Q113, ii) a non-significant intercept of an MR-Egger model that does not constrain the intercept to pass through the origin114, and iii) no evidence that leaving out any single IV ablates the statistical significance (at least P < 0.05) of the estimate115. Traits satisfying these criteria were then assigned a tier (Tier #1, Tier #2, Tier #3) based on the statistical significance of the retinol causal estimate using other MR methods besides the IVW-MRE that have different assumptions regarding IV validity. These were: the IVW with fixed effects (does not model heterogeneity like with MRE), the weighted median method116, the weighted mode method117, and the MR-Egger method114. The underlying assumptions and methodological considerations of using these methods have discussed extensively elsewhere118, 119. Traits for which the effect of retinol was at least nominally statistically significant using all five methods were assigned as Tier #1, whilst four tests being statistically significant was Tier #2, and three tests being statistically significant was Tier #3. The F-statistic and I2 of the IVs was also assessed to ensure they were well powered (F > 10) and suited for MR-Egger (I2 > 0.9), respectively120. To investigate the effect of body fat percentage on the retinol associated MRI indices, we used IVs from a non-overlapping GWAS of that trait (N=65,831)121. We then repeated the above process of binary outcomes with at least 1000 cases from FinnGen release 8, which is not included in the current version of IEUGWASdb at time of analysis, to increase power to detect effects on disease endpoints.
We also systematically investigated continuous outcomes that may causally impact circulating retinol. Continuous outcomes were our focus as causal estimates from binary exposures are difficult to interpret and are often less powered in the context of MR47. Exposures were filtered from phenotypes available in IEUGWASdb to retain continuous traits, with further filtering to identify traits with ≥ 5 genome-wide significant, independent (r2 < 0.001, 1000 genomes phase 3 European reference panel) IVs available in the retinol GWAS. The same pipeline as above was then applied (IVW-MRE FDR < 0.01, followed by sensitivity analyses and tier assignment). The causal estimate of creatinine on retinol was a Tier #1 trait, and due to the pervasive polygenicity of creatinine afforded by its large sample size, we followed up this relationship using the MR model “Causal Analysis Using Summary Effect Estimates” (CAUSE), as described elsewhere using the CAUSE R package v1.2.052. Briefly, this method is a more polygenic approach and seeks to distinguish casual effects from correlated pleiotropy by fitting competing models that account for these terms and comparing them using ELPD.
After using 1 million random variants to estimate nuisance parameters, LD clumping and thresholding was applied to the serum creatinine summary statistics (IEUGWASdb trait ID: met-d-creatinine) in line with the original CAUSE publication (P < 0.001, r2 < 0.01, 1000 genomes phase 3 European reference panel). The competing CAUSE models were then fit (null, sharing, and causal), ensuring that all Pareto k estimates were < 0.5 during the model comparison using ELPD.
We then performed a multivariable MR (MVMR) analysis to estimate causal effects of creatinine on retinol conditioned on three major lipid species using GWAS from the global lipids genetics consortium (LDL, HDL, and triglycerides)122. MVMR was undertaken in accordance with previous work using the R packages MVMR v0.3 and MendelianRandomization v0.6.0118. Briefly, this entailed identifying variants associated at genome-wide significance with at least one of the four exposures that are independent (r2 < 0.001), calculating a conditional F-statistic for multivariable instruments123, and applying four MVMR models (IVW, Egger regression, Weighted Median, and a LASSO based penalised regression approach for selecting the optimal IV configuration)124.
Drugs and perturbagens associated with circulating retinol
We also considered drugs that may influence circulating retinol. We searched genes prioritised from our pipeline with an assigned direction of retinol-associated expression using DGIdb v4.2.0 and DrugBank v5 to identify retinol associated genes targeted by drugs125, 126, outside of known drugs that target RBP4 and TTR. Retained drug-gene interactions were restricted to those with known mechanism of action and > 2 lines of supporting evidence. We also utilised computational signature mapping to identify pharmacological agents that may enhance or inhibit the expression of genes associated with circulating retinol. To boost power for signature mapping, we considered all gene that were nominally associated with retinol in the TWAS/PWAS (P < 0.05) that exhibited moderate colocalisation of a shared causal variant (PPH4 > 0.4). These genes were uploaded to the Connectivity Map Query online tool to quantify the similarity, termed connectivity, with drug perturbagen associated expression profiles, as outlined in detail elsewhere127.
Polygenic scoring
We used the independent TwinsUK replication cohort, described in a preceding section, to investigate retinol polygenic scores (PGS). PGS were applied to genotyped variants and high confidence imputed variants (R2 > 0.8) in TwinsUK. There were two different methodologies implemented to construct PGS: LD clumping and thresholding (LD C+T) and a probabilistic finemapping based method that scales variant effect sizes based on their posterior probability of causality, thereby upweighting signals more likely to be causal (RápidoPGS)128, 129. In the LD C+T approach, variants were clumped using the within sample LD of TwinsUK at the following P-value thresholds: 5×10-8, 1×10-5, 1×10-3, 0.01, 0.05, 0.1, 0.5, and 1. Additive PGS were then profiled using PRSice2 v2.3.5130. The RápidoPGS applies probabilistic finemapping (with Bayes’ factors approximated using Wakefield’s method) to independent LD blocks genome-wide such that variant-wise posterior probabilities of causality can be estimated. This in essence is a ‘shrinkage’ approach to PGS to account for double-counting effects that arise due to correlated effect sizes induced by LD; however, does not inherently require an independent genotyped sample from the GWAS for tuning. Approximate Bayes’ factors in each LD block were derived assuming a prior variance of 0.15, conventionally used for quantitative traits. This parameter choice was compared to a data driven approach to estimating the prior variance based on SNP heritability, as outlined elsewhere129. Variant-wise effect sizes are multiplied by their posterior probabilities before PGS calculation, as above.
As we are using a twin cohort, there were two different approaches we utilised to tune the optimal PGS configuration from the LD C+T approach. Firstly, we randomly split the cohort into a training (70% of participants) and test (30% of participants) partition. A linear mixed model was then fit in the training partition with fixed effects of PGS, age, and metabolomics batch and a random effect of family ID to account for twin relatedness. This was applied for retinol measured at each of the three visits for all the P-value thresholds. The marginal R2 from a null model with no PGS was subtracted from the full model to infer the best performing PGS in the training partition (mean marginal R2 across three visits). The variance explained of the best performing P-value threshold was then estimated using the same approach in the test partition. By way of comparison, we also split the twins into separate unrelated cohorts and used one set as training and one as testing. Fixed effects instead of mixed effects linear regression was then implemented in a similar fashion to above, additionally covaried for five SNP derived principal components. We then repeated all the above for the two probabilistic finemapping weighted PGS (prior variance = 0.15 and data driven prior variance).
Normative modelling
We built a normative model of retinol as a function of age per study visit in TwinsUK. This was achieved using a generalised additive model for location (μ), scale (σ), and shape (GAMLSS)131, 132, implemented in R v4.4.1. The GAMLSS approach is useful in this application as it is semi-parametric and able to account for factors such as heteroskedasticity and non-Gaussian distributions. Our modelling approach can be summarised as follows: we firstly fit a GAMLSS model in the full sample for a variety of GAMLSS distribution families implemented by the gamlss R package v5.4.12. The model on retinol at visit i, i ∈ {1,2,3} for each of the GAMLSS families set the μ term as the first order fractional polynomial of age, along with metabolomics batch as an additional covariate, with the term σ just the first order fractional polynomial of age to model the scale of the distribution. The model fit of each of the tested families was then assessed using the Bayesian information criterion (BIC) and the Akaike information criterion (AIC). We repeated the above also including measured body mass index (BMI) at that timepoint in the μ term by way of comparison. We chose the GAMLSS family (Box-Cox t distribution) through considering the model performance (minimum AIC and BIC) over all three visits (Supplementary Text, Supplementary Figures 11-12). We then split the cohort in half, separating the twins, and fit normative centile curves using the selected GAMLSS family to one half of each batch of retinol measurement, and computed deviations on the other independent sample half on a per batch basis. We repeated this process with the modelling and deviation subsets reversed to compute deviations for all samples. Individuals whose measured retinol was above the model derived 95th percentile were classified as having supra-normal retinol for their age, while those below the 5th percentile were classified as having infra-normal retinol. To guard against overfitting due to the relatedness of twins between the two subsets, we then performed all of the above just using one half of the twins as the full cohort for model fitting, followed by splitting this subset as described above. We tested the relationship between scaled retinol PGS (mean = 0, SD = 1) and infra-normal retinol at each visit was tested in half of the cohort (unrelated) using binomial logistic regression additionally covaried for five SNP derived PCs. The same models were also constructed for supra-normal individuals.
Software and operating systems
The primary analyses in this manuscript were performed either on a MacBook Pro (OS X: Ventura 13.3), an in-house linux cluster (Ubuntu 18.04.5 LTS), or the High-Performance Computing Research Compute Grid of the University of Newcastle [Red Hat Enterprise Linux release 8.1 (Ootpa)]. The primary R version utilised was version 4.1.1 (2021-08-10), with some additional analyses using R version 4.0.3 (2020-10-10) (linux cluster). The Python version utilised was either Python 2.7.17 or Python 3.6.9, depending on the requirements of the analyses.
DATA AVAILABILITY
Genome-wide summary statistics will be uploaded to GWAS catalog upon final publication. In the interim, summary statistics can also be found at the following: 10.5281/zenodo.7905523. TwinsUK data can be accessed by approve researchers upon application (https://twinsuk.ac.uk/resources-for-researchers/our-data/).
CODE AVAILABILITY
Code used in this study is freely available at the following GitHub repository - https://github.com/Williamreay/Retinol_GWAS_code.
ETHICS DECLARATION
P.S. is now a full-time employee of GlaxoSmithKline. The remaining authors declare no competing financial interests.
AUTHOR CONTRIBUTIONS
W.R.R designed the study and was the primary analyst. D.J.K. assisted with developing the phenome-wide Mendelian randomisation pipeline. M.A.D developed the normative modelling pipeline. Z.F.G. performed the signature mapping analyses. P.S performed quality control on the INTERVAL data prior to GWAS. K.K performed the INTERVAL GWAS. E.D.C and C.E.C provided input into the TwinsUK retinol quality control, and the interpretation of the clinical associations. L.A.G assisted with data curation and preparation of the manuscript. A.M.M and D.A provided the ATBC/PLCO data. M.J.C contributed to drafting manuscript and provided funding. All authors contributed to interpretation of the results and the final manuscript.
ACKNOWLEDGEMENTS
TwinsUK is funded by the Wellcome Trust, Medical Research Council, Versus Arthritis, European Union Horizon 2020, Chronic Disease Research Foundation (CDRF), Zoe Ltd, the National Institute for Health and Care Research (NIHR) Clinical Research Network (CRN) and Biomedical Research Centre based at Guy’s and St Thomas’ NHS Foundation Trust in partnership with King’s College London. P.S. was supported by a Rutherford Fund Fellowship from the Medical Research Council (grant no. MR/S003746/1). Participants in the INTERVAL trial were recruited with the active collaboration of NHS Blood and Transplant England (www.nhsbt.nhs.uk), which has supported field work and other elements of the trial. DNA extraction and genotyping were co-funded by the National Institute for Health and Care Research (NIHR), the NIHR BioResource (http://bioresource.nihr.ac.uk) and the NIHR Cambridge Biomedical Research Centre (BRC-1215-20014). The academic coordinating centre for INTERVAL was supported by core funding from the: NIHR Blood and Transplant Research Unit (BTRU) in Donor Health and Genomics (NIHR BTRU-2014-10024), NIHR BTRU in Donor Health and Behaviour (NIHR203337), UK Medical Research Council (MR/L003120/1), British Heart Foundation (SP/09/002; RG/13/13/30194; RG/18/13/33946) and NIHR Cambridge BRC (BRC-1215-20014; NIHR203312) [*].
Metabolon metabolomics assays in INTERVAL were funded by the: NIHR BioResource, NIHR Cambridge Biomedical Research Centre (BRC-1215-20014) [*], Wellcome Trust grant number 206194 and BioMarin Pharmaceutical, Inc. The academic coordinating centre thank blood donor centre staff and blood donors for participating in the INTERVAL trial. This work was supported by Health Data Research UK, which is funded by the UK Medical Research Council, Engineering and Physical Sciences Research Council, Economic and Social Research Council, Department of Health and Social Care (England), Chief Scientist Office of the Scottish Government Health and Social Care Directorates, Health and Social Care Research and Development Division (Welsh Government), Public Health Agency (Northern Ireland), British Heart Foundation and Wellcome. For the purpose of open access, the author(s) has applied a Creative Commons Attribution (CC BY) licence to any Author Accepted Manuscript version arising from this submission.*The views expressed are those of the author(s) and not necessarily those of the NIHR, NHSBT or the Department of Health and Social Care. Access to the TwinsUK cohort for this study was funded by a Hunter Medical Research Institute (Newcastle, Australia) Precision Medicine Research Program Pilot Grant (W.R.R). C.E.C. is supported by an investigator grant from the National Health and Medical Research Council (NHMRC). M.A.D is supported by an investigator grant from the National Health and Medical Research Council (NHMRC). P.S. was supported by a Rutherford Fund Fellowship from the Medical Research Council (grant no. MR/S003746/1). We also acknowledge and thank all participants in the genetic studies included in this paper.
REFERENCES
- 1.↵
- 2.↵
- 3.↵
- 4.↵
- 5.↵
- 6.↵
- 7.↵
- 8.↵
- 9.↵
- 10.↵
- 11.↵
- 12.↵
- 13.
- 14.
- 15.
- 16.↵
- 17.↵
- 18.↵
- 19.↵
- 20.↵
- 21.↵
- 22.↵
- 23.↵
- 24.↵
- 25.↵
- 26.↵
- 27.↵
- 28.
- 29.↵
- 30.↵
- 31.↵
- 32.↵
- 33.↵
- 34.↵
- 35.↵
- 36.↵
- 37.↵
- 38.
- 39.↵
- 40.↵
- 41.↵
- 42.↵
- 43.↵
- 44.↵
- 45.↵
- 46.↵
- 47.↵
- 48.↵
- 49.↵
- 50.↵
- 51.↵
- 52.↵
- 53.↵
- 54.
- 55.↵
- 56.↵
- 57.↵
- 58.↵
- 59.↵
- 60.↵
- 61.↵
- 62.↵
- 63.↵
- 64.↵
- 65.↵
- 66.↵
- 67.↵
- 68.↵
- 69.↵
- 70.↵
- 71.↵
- 72.↵
- 73.↵
- 74.↵
- 75.↵
- 76.↵
- 77.↵
- 78.↵
- 79.↵
- 80.↵
- 81.↵
- 82.↵
- 83.↵
- 84.↵
- 85.↵
- 86.
- 87.↵
- 88.↵
- 89.↵
- 90.↵
- 91.↵
- 92.↵
- 93.↵
- 94.↵
- 95.↵
- 96.↵
- 97.↵
- 98.↵
- 99.↵
- 100.↵
- 101.↵
- 102.↵
- 103.↵
- 104.↵
- 105.↵
- 106.↵
- 107.↵
- 108.
- 109.↵
- 110.↵
- 111.↵
- 112.↵
- 113.↵
- 114.↵
- 115.↵
- 116.↵
- 117.↵
- 118.↵
- 119.↵
- 120.↵
- 121.↵
- 122.↵
- 123.↵
- 124.↵
- 125.↵
- 126.↵
- 127.↵
- 128.↵
- 129.↵
- 130.↵
- 131.↵
- 132.↵