Abstract
Examining the downstream molecular consequences of genetic variation significantly enhances our understanding of the heritable determinants of complex traits and disease predisposition. Metabolites are key indicators of biological processes and disease states, making them central to this systematic mapping and offering opportunities to discover new biomarkers for disease diagnosis and prognosis. Here, we present a genome-wide association study for 249 circulating metabolite traits quantified by nuclear magnetic resonance spectroscopy across various genetic ancestry groups from the Estonian Biobank and the UK Biobank. We generated mixed model associations in the Estonian Biobank and six major genetic ancestry groups of the UK Biobank and performed two separate meta-analyses across the predominantly European genetic ancestry samples (n = 599,249) and across all samples (n = 619,372). In total, we identified 89,489 locus-metabolite pairs and 8,917 independent lead variants, out of which 4,184 appear to be novel associated loci. Moreover, 12.4% of the independent lead variants had a minor allele frequency of less than 1%, highlighting the importance of including low-frequency and rare variants in metabolic biomarker studies. Our publicly available results provide a valuable resource for future GWAS interpretation and drug target prioritisation studies.
Introduction
Systematic mapping of the heritable determinants underlying complex traits and disease predisposition can be greatly improved by detailed understanding of downstream molecular consequences of genetic variation. Studying metabolite traits is crucial because they serve as key indicators of various biological processes and disease states. Metabolite studies can reveal the complex interactions between genes and metabolic pathways, providing a more comprehensive understanding of molecular human biology and the potential for novel therapeutic targets. This understanding can lead to the identification of new biomarkers for disease diagnosis, prognosis, and treatment, as well as the development of personalised medical interventions.
Although genome-wide association studies (GWAS) for several traits and diseases now exceed the sample size of 1 million individuals (COVID-19 Host Genetics Initiative, 2023; Suzuki et al., 2024; Yengo et al., 2022; Zhou et al., 2022), studies of molecular traits such as gene expression (Võsa et al., 2021), plasma proteins (Sun et al., 2023) or circulating metabolites have lagged behind. Notable exceptions are five blood lipid traits where the largest meta-analysis now includes data from 1.65 million individuals (Graham et al., 2021). As a result, recent large-scale metabolite GWAS continue to uncover novel associations and biological insights (Karjalainen et al., 2024; Richardson et al., 2022; Smith et al., 2022; van der Meer et al., 2024). For example, Karjalainen et al. performed a GWAS meta-analysis of 233 circulating metabolites from the Nightingale Health nuclear magnetic resonance (NMR) platform in up to 136,016 participants from 33 cohorts, identifying 443 independent loci and revealing significant pleiotropy and polygenicity (Karjalainen et al., 2024). Additionally, using up to 115,082 samples from the phase 1 release of the UK Biobank Nightingale Health platform NMR data, two studies reported high levels of pleiotropy and genetic correlation between metabolites (Richardson et al., 2022; Smith et al., 2022). Similarly, the latest study using 207,836 unrelated White British UK Biobank participants from the phase 2 release of the UK Biobank NMR data increased the number of discovered loci to 497 (van der Meer et al., 2024).
However, for more than half of the metabolites captured by NMR, the proportion of heritability explained by genome-wide significant variants remains below 50% in the largest GWAS to date (van der Meer et al., 2024), indicating that much larger sample sizes are needed to discover the remaining effects. Furthermore, existing GWAS using the Nightingale Health NMR platform have been limited to common variants (MAF > 1%) due to limited sample sizes as well as low imputation accuracy of low-frequency variants. Therefore, less attention has been paid to low-frequency and rare variation, which, while explaining less heritability overall, could still provide important biological insights (Nag et al., 2023).
The large number of metabolites, their complexity and diversity create a challenge for their identification, compared to other omics measurements. Although the Nightingale Health NMR platform is highly reproducible and 39 of 249 inferred metabolites have now been clinically validated, there have been various changes over the years both in the number of metabolites quantified as well as their absolute quantification results (Bizzarri et al., 2023). This variability can pose additional challenges when meta-analysing metabolites across multiple cohorts. For example, although the latest NMR data release from the UK Biobank contains 249 metabolites, Karjalainen et al. analysed only 233 metabolites from the same platform, 225 of which were shared with the UK Biobank (Karjalainen et al., 2024). Thus, to reduce unwanted variability and maximise statistical power, it is essential to ensure that metabolite quantification and normalisation are performed in a uniform manner across cohorts.
Here, we present a genome-wide association study for 249 circulating metabolites quantified by nuclear magnetic resonance spectroscopy across the complete set of UK Biobank participants (n = 434,020) from diverse genetic ancestry groups and European-ancestry individuals from the Estonian Biobank (n = 185,352) (Figure 1). We also performed two separate meta-analyses across the predominantly European ancestry samples (n = 599,249) and across all samples (n = 619,372), resulting in 3-5× larger sample size compared to previous studies (Karjalainen et al., 2024; Richardson et al., 2022; Smith et al., 2022; van der Meer et al., 2024). These publicly available results will provide a valuable resource for future GWAS interpretation and target prioritisation studies.
Results
Per-cohort analyses
We performed GWAS for 249 metabolites (Table S1) in the Estonian Biobank (EstBB) and six genetic ancestry groups from the UK Biobank (UKBB) (Figure 1). The UK Biobank genetic ancestry groups were defined previously by the Pan-UKBB project (Karczewski et al., 2024). Relying on the population-specific genotype imputation panel for the EstBB (Mitt et al., 2017) and the Genomics England imputation panel (Shi et al., 2024) for the UKBB allowed us to test 10-96 million variants for all ancestry-metabolite pairs (up to 9× more than in previous studies; Karjalainen et al., 2024; van der Meer et al., 2024). The number of trait-level genome-wide significant (p < 5×10⁻⁸) locus-metabolite pairs ranged from 37 (UKBB_AMR) to 63,235 (UKBB_EUR) and the number of independent lead variants (r² < 0.8) ranged from 24 to 6,415, with most associations detected in the EstBB and UKBB_EUR subsets (Table 1). In comparison, applying the same filtering to the Karjalainen et al. summary statistics revealed 8,724 genome-wide significant locus-metabolite pairs corresponding to 669 independent loci.
To assess the similarity of the associations detected in the two biobanks, we used LD score regression to calculate genetic correlations for each metabolite between the EstBB (n = 185,352) and the EUR subset of the UKBB (n = 413,897). We observed generally high genetic correlations for the matched metabolites between the two biobanks (median rg = 0.91, mean rg = 0.89, Bonferroni-corrected p < 0.05 for all comparisons; Table S2). Motivated by the high genetic correlation between the two biobanks, we proceeded with the meta-analyses.
Discovery of novel loci via meta-analysis
In the meta-analysis of EstBB and the EUR subset of the UKBB (meta_EUR, n = 599,249), we identified 88,278 locus-metabolite pairs corresponding to 8,784 independent lead variants (r² < 0.8), representing an approximately 10-fold increase compared to the results reported by Karjalainen et al. (2024). The heritability of individual metabolites ranged from 2.8% for Acetoacetate to 19.5% for HDL_size (median 10.2%) (Figure S1) and we observed a clear linear relationship between heritability and the number of loci discovered for each metabolite (Figure S2). In the following sections, we will focus on 56 selected metabolites representing amino acids, glycolysis related metabolites, ketone bodies, fluid balance, inflammation, and major lipid subclasses (Table S3). If not mentioned otherwise, we present the results from the meta-analysis of the EUR genetic ancestry groups from EstBB and UKBB (meta_EUR). Besides these results, we have publicly released the complete GWAS summary statistics for all 249 metabolites in all genetic ancestry groups as well as the two meta-analyses via the NHGRI-EBI GWAS Catalog (see Data availability).
We compared the associations detected in our meta_EUR meta-analysis to the largest independent metabolomics GWAS meta-analysis using the same NMR platform (Karjalainen et al., 2024; n = 136,016). On average, 97% of the lead associations detected by Karjalainen et al. also replicated in our meta_EUR analysis (Figure 2) with a highly concordant direction of effect (Figure S3). An identical analysis using results from our meta_ALL analysis is presented in Figure S4. We also detected many novel associations for all tested metabolites. The fraction of novel associations ranged from 27% for 3-Hydroxybutyrate (bOHbutyrate) to 85% for Lactate (Figure 2). Altogether, we identified 4,085 novel loci not previously reported by Karjalainen et al., including 248 loci on chromosome X.
Associations with low-frequency variants
While the previous GWAS studies of NMR metabolites have focussed on common variation (MAF > 1%) (Karjalainen et al., 2024; van der Meer et al., 2024), we tested all variants with minor allele count (MAC) greater than 20. Thus, in our meta_EUR meta-analysis, 12.4% of the independent lead variants (n = 1,088 variants) had MAF < 1% (Figure 3). As expected, these low-frequency variants also had larger effect sizes than those at higher allele frequencies (Figure 3). This estimate is a lower bound as many loci with common lead variants are likely to contain secondary signals with lower allele frequencies.
In addition to the EUR genetic ancestry group, we also performed GWAS in five smaller genetic ancestry groups of the UK Biobank (AFR, AMR, CSA, EAS, MID) (Table 1). Including these summary statistics into the meta-analysis increased the number of independent lead variants from 8,784 to 8,917 (Figure 1, Table 1), 41 of which were not tested in the EstBB and UKBB_EUR cohorts due to low allele frequency (allele count < 20). This highlights the need to substantially increase the sample sizes for under-represented genetic ancestry groups to enable the discovery of ancestry-specific effects.
Genetic correlation and pleiotropy
To understand the shared genetic control of various classes of metabolites, we used LD score regression to estimate the pairwise genetic correlations between the selected set of 56 metabolites in our meta_EUR meta-analysis (Figure 4). Pairwise genetic correlations for all 249 metabolites are shown in Table S4. As expected, we observed high genetic correlations between 25 lipid traits. The branched-chain amino acids (Leu, Val, Ile and total branched-chain amino acids), ketone bodies, and a group of lipid-related metabolites (LDL and HDL size, HDL cholesterol, ratio between polyunsaturated and monounsaturated fatty acids and unsaturation) also formed three distinct highly genetically correlated clusters (Figure 4). In contrast, other amino acids and glycolysis-related metabolites exhibited moderate genetic correlations with other metabolites. Finally, metabolite ratios such as PUFA_by_MUFA and omega_6_by_omega_3 showed negative correlations with large clusters of other metabolites.
In addition to the genetic correlations, we identified clusters of lead variants that were shared (r² > 0.8) between different metabolite traits. Among the 249 metabolite traits, most lead variants were significantly associated (p < 5×10⁻⁸) with multiple metabolites (mean = 10; median = 2). When focussing on the selected 56 metabolites, we detected 880 independent lead variants that were significantly associated (p < 5×10⁻⁸) with five or more metabolites. Most prominently, a common missense variant (MAF = 40%) in the glucokinase regulatory protein (GCKR) (rs1260326, GCKR:p.Leu446Pro) was significantly associated (p < 5×10⁻⁸) with 51 (out of 56) selected metabolites (Figure S5).
As another example of a highly pleiotropic locus, a non-coding variant (rs12916) in the HMG-CoA reductase (HMGCR) gene was associated with 27 (out of 56) selected metabolites and overall, with 180 (out of 249) tested metabolite traits (Figure 5A). Genetic variation in HMGCR has been robustly linked to cardiovascular disease (Ference et al., 2016) and HMGCR is also a known drug target for statin therapy.
Nevertheless, not all associations were equally pleiotropic. For example, a low-frequency (MAF = 0.4%) missense variant in the histidine ammonia lyase (HAL) gene (rs61937878) was significantly associated (p < 5×10⁻⁸) with 2 out of 56 selected metabolites (histidine and glycine) and no additional significantly associated metabolites were detected among the full set of 249 tested (Figure 5B). HAL converts histidine to trans-urocanate (Hall, 1952), thus explaining the extremely strong association with histidine (beta = 0.864; -log10(p) = 951). In contrast, the association with glycine was much weaker (beta = 0.071; -log10(p) = 9.60) and could not be easily explained by a direct effect of the HAL enzymatic activity.
Discussion
A major advantage of our study is that we were able to use samples from two large biobanks assayed on the same metabolomics platform. All samples were processed and analysed in the same way, following identical procedures for sample processing, spectra generation, and data acquisition in the same laboratory, thus reducing technical variability and increasing statistical power for discoveries. This low technical variability was exemplified by the high genetic correlation that we observed between the two biobanks (mean rg = 0.89). Furthermore, our meta-analyses involved 3-5× more samples and up to 9× more variants than two previous studies using the same NMR platform (Karjalainen et al., 2024; van der Meer et al., 2024). As a result, we were able to replicate 97% of previously known associations while detecting more than 4,000 novel associations (Figure 2). At the level of individual metabolites, the fraction of novel associations ranged from 27% to 85%. Thus, our study provides the most comprehensive catalogue of genetic associations with these metabolites yet.
By utilising state-of-the-art population-specific genotype imputation panels (Mitt et al., 2017; Shi et al., 2024; Taliun et al., 2021), we were able to test more low-frequency variants than previous studies, leading to the discovery of numerous novel locus-metabolite associations. As a result, 12.4% of the independent lead variants detected in our analysis had a minor allele frequency (MAF) of less than 1% (Figure 3). This proportion is only likely to increase as we and others seek to identify independent low-frequency signals at established GWAS loci. Thus, while these low-frequency associations are unlikely to explain a large proportion of trait heritability, they can still be a valuable resource of genetic instruments for cis-Mendelian randomisation (cis-MR) and drug target MR studies (Richardson et al., 2022). We expect future statistical fine mapping and rare variant analysis studies (Nag et al., 2023) to uncover many novel biological insights.
While large-scale biobanks provide unprecedented power for genetic discovery, they also introduce complexities in interpreting genetic associations due to pervasive pleiotropy. Our results reinforce previous reports of extensive pleiotropy across metabolite GWASs (Karjalainen et al., 2024; Richardson et al., 2022; Smith et al., 2022). Some of this pleiotropy is readily interpretable, such as co-regulation between various lipid traits (Richardson et al., 2022) or opposing effects between substrates and products of enzymatic reactions (Smith et al., 2022). However, given our large sample size, we also detected more cryptic pleiotropic effects such as the histidine ammonia lyase (HAL) missense variant effect on glycine (Figure 5B). Such extensive overlaps exemplify that as cohorts grow larger, the detection of pleiotropic signals becomes more pronounced, complicating the disentanglement of direct and indirect genetic effects. This observation aligns with the omnigenic model, where all expressed genes in a cell contribute to complex traits through interconnected regulatory networks (Boyle et al., 2017; Sinnott-Armstrong et al., 2021; Smith et al., 2022).
Besides revealing molecular mechanisms of trait-associated genetic variation, the integration of genomic and metabolomic data from multiple biobanks has demonstrated the potential of metabolomics-based risk scores to estimate common disease risk more effectively than traditional polygenic scores (Buergel et al., 2022; Nightingale Health Biobank Collaborative Group et al., 2023). However, training these metabolomic risk scores requires longitudinal data from large biobanks, which might not be available for diseases with low prevalence in the general population. Interestingly, a recent study demonstrated that it is possible to combine GWAS data for molecular biomarkers and disease outcomes to build predictive risk models in the absence of longitudinal data (Sens et al., 2024). These findings underscore the ongoing potential of large metabolite GWAS to improve our understanding of disease risk and progression, paving the way for personalised prevention and treatment strategies (Julkunen et al., 2023).
Our study also has several limitations. First, 97% of the samples included in our analysis were of predominantly European genetic ancestries. This demographic skew severely limited our ability to detect genome-wide significant signals in other genetic ancestry groups and may influence the generalisability of our findings across genetic ancestry groups. As a result, the number of genome-wide significant signals increased by only 1.3% (Table 1) when samples from other UKBB genetic ancestry groups (AFR, AMR, CSA, EAS, MID) were included in the analysis. Secondly, due to significant computational and methodological challenges, we did not perform statistical fine mapping of the identified loci. As a result, we are likely missing many secondary signals at the genome-wide significant loci. We expect that cross-population fine mapping methods such as SuSiEx can help to resolve some of these issues in the future (Yuan et al., 2024). Lastly, we applied a global genome-wide significance threshold (p < 5×10⁻⁸) tailored for common variants. To control for false-positive associations, this threshold may need adjustment to account for the large number of metabolites tested and the low allele frequency threshold (allele count > 20) utilised in our study.
Our comprehensive, publicly available results will serve as a valuable resource for the scientific community, enabling more detailed analyses of genetic influences on circulating metabolite levels and paving the way for future studies that improve our understanding of the genetic basis of metabolic traits and complex diseases, identify novel therapeutic targets, and develop personalised intervention strategies.
Methods
Cohorts
Estonian Biobank
The Estonian Biobank (EstBB) is a volunteer-based biobank with 212,955 participants in the current data freeze (Milani et al., 2024). All biobank participants have signed a broad informed consent form and their blood sample collection was undertaken across the country between 2002 and 2021 (Leitsalu et al., 2015). The activities of EstBB are regulated by the Human Genes Research Act, which was adopted in 2000 specifically for the operations of EstBB. Individual level data analysis in EstBB was carried out under ethical approval 1.1-12/624 from the Estonian Committee on Bioethics and Human Research (Estonian Ministry of Social Affairs), using data according to release application 6-7/GI/8988 from the EstBB.
UK Biobank
The UK Biobank is a longitudinal biomedical study of approximately half a million participants between 38-71 years old from the United Kingdom (Bycroft et al., 2018). Participant recruitment was conducted on a volunteer basis and took place between 2006 and 2010. Initial data were collected in 22 different assessment centers throughout Scotland, England, and Wales. Data collection includes elaborate genotype, environmental and lifestyle data. Blood samples were drawn at baseline for all participants, with an average of four hours since the last meal, i.e. generally non-fasting. NMR metabolomic biomarkers (Nightingale Health, quantification library 2020) were measured from EDTA plasma samples (aliquot 3) during 2019–2024 from the entire cohort. Details on the NMR metabolomic measurements in UK Biobank have been described previously for the first tranche of ∼120,000 samples (Julkunen et al., 2023). The UK Biobank study was approved by the North West Multi-Centre Research Ethics Committee. This research was conducted using the UK Biobank Resource under application numbers 91233 and 30418.
Genotype imputation
Estonian Biobank
All EstBB participants have been genotyped at the Core Genotyping Lab of the Institute of Genomics, University of Tartu, using Illumina Global Screening Array v1.0, v2.0 and v3.0. Samples were genotyped and PLINK format files were created using Illumina GenomeStudio v2.0.4. Individuals were excluded from the analysis if their call-rate was < 95%, if they were outliers of the absolute value of heterozygosity (> 3 SD from the mean) or if sex defined based on heterozygosity of the X chromosome did not match sex in phenotype data (Mitt et al., 2017). Before imputation, variants were removed if they had a call-rate < 95%, HWE p-value < 1e-4 (autosomal variants only), or minor allele frequency < 1%. Genotyped variant positions were in build 37 and were lifted over to build 38 using Picard. Phasing was performed using the Beagle v5.4 software (Browning et al., 2021). Imputation was performed with the Beagle v5.4 software (beagle.22Jul22.46e.jar) and default settings. The dataset was split into batches of 5,000. A population-specific reference panel consisting of 2,695 WGS samples was utilised for imputation and standard Beagle hg38 recombination maps were used. Based on principal component analysis, samples that were not of European ancestry were removed. Duplicate and monozygous twin detection was performed with KING 2.2.7 (Manichaikul et al., 2010), and one sample from each pair of duplicates was removed.
UK Biobank autosomes
Genotype imputation for the UK Biobank (UKBB) autosomal data was conducted using a high-coverage whole-genome sequencing reference panel (342 million autosomal variants) from 78,195 individuals from the Genomics England (GEL) project. Reference panel construction and UK Biobank imputations have been described previously (UKBB data field 21008) (Shi et al., 2024).
Briefly, the UK Biobank SNP array data consisted of 784,256 autosomal variants. Initially, 113,515 sites identified by previous centralised UK Biobank analysis as failing quality control were removed, along with an additional 39,165 sites failing a Hardy–Weinberg equilibrium test on 409,703 GBR samples, with a p-value threshold of 10⁻¹⁰. The resulting SNP array data were mapped from the GRCh37 to GRCh38 genome build using the GATK Picard LiftOver tool. Alleles with mismatching strands but matching alleles were flipped. A further 495 sites were removed due to incompatibility between the two reference genomes, resulting in a final SNP array incorporating 631,081 autosomal variants used for phasing and imputation.
Haplotype estimation of the SNP array data, a prerequisite for imputation, was carried out one chromosome at a time using SHAPEIT4 v4.2.2 (Delaneau et al., 2013) without a reference panel, utilising the full set of UK Biobank samples. SHAPEIT4 was run with its default 15 Markov chain Monte Carlo iterations and 30 threads. Autosomal imputation using the GEL reference panel was conducted with IMPUTE5 (Rubinacci et al., 2020) (v.1.1.4). The SNP array data were divided into 408 consecutive and overlapping chunks of approximately 5 megabases (Mb) each, with a 2.5 Mb buffer across the genome using the Chunker program in IMPUTE5. Each chunk was further divided into 24 sample batches, each containing 20,349 samples. IMPUTE5 was run on each of the 9,792 subsets using a single thread and default settings. The resulting imputed genotype dosages are stored in BGEN format, and phasing information is stored in VCF format. More details on the imp
UK Biobank X chromosome
As the UKBB genotypes imputed by Genomics England did not include the X chromosome, we used the TOPMed r2 imputation for the X chromosome (UKBB data field 21007). Imputation was performed using the TOPMed Imputation Server (Das et al., 2016). The data were divided into 10 Mb chunks, and each chunk underwent several checks to ensure validity. These checks included verifying the inclusion of variants in the reference panel, ensuring a sufficient overlap with the reference panel, and maintaining an adequate sample call rate. Chunks that did not meet these criteria were excluded from further analysis. Overall, quality control methods employed by the TOPMed Imputation Server were slightly more conservative than those employed by GEL and thus the sample size for each sub-population decreased by roughly 0.5% (final sample sizes: AFR - 6,411; AMR - 925; CSA - 8,627; EAS - 2,595; EUR - 412,523; MID - 1,491). Genotype phasing was performed with Eagle2 (Loh et al., 2016) and imputation was conducted with Minimac4 (Das et al., 2016). After imputation, all chunks of each chromosome were merged into a single file. For chromosome X, additional checks were performed to verify ploidy and ensure the accuracy of mixed genotypes. The chromosome was split into three regions (PAR1, non-PAR, PAR2) for phasing and imputation, and these regions were later merged into a complete chromosome X file.
NMR metabolite data quality control and normalisation
NMR data generation in the EstBB and UKBB has been previously described (Nightingale Health Biobank Collaborative Group et al., 2023). During the quality control of the NMR metabolomics data, we detected differences between the distributions of several metabolites (notably Ala and His) driven primarily by spectrometer and batch effects. We removed this unwanted technical variation using the R package ’ukbnmr’ in both EstBB and UKBB data (Ritchie et al., 2023). We excluded individuals with more than 5 missing metabolite measurements from the cohort, confirmed that none of the 249 metabolites had a significant number of missing measurements (8,000 for EstBB, 24,000 for UKBB), and applied a metabolite-wise inverse normal transformation to obtain the final dataset.
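The metabolite-wise inverse normal transformation mentioned above can be illustrated with a minimal Python sketch. The text does not state which rank offset or tie-handling rule was used, so the `(rank - 0.5) / n` quantile and the averaging of tied ranks below are assumptions, and the function name `inverse_normal_transform` is hypothetical.

```python
from statistics import NormalDist


def inverse_normal_transform(values, offset=0.5):
    """Rank-based inverse normal transform of one metabolite.

    `values` is a list of floats; None marks a missing measurement and is
    passed through unchanged. The offset of 0.5 is an assumed convention,
    not one confirmed by the source text.
    """
    observed = sorted((v, i) for i, v in enumerate(values) if v is not None)
    n = len(observed)
    out = [None] * len(values)
    std_normal = NormalDist()
    pos = 0
    while pos < n:
        # Locate the run of tied values so they share one average rank.
        end = pos
        while end + 1 < n and observed[end + 1][0] == observed[pos][0]:
            end += 1
        avg_rank = (pos + end) / 2 + 1  # 1-based average rank of the run
        z = std_normal.inv_cdf((avg_rank - offset) / n)
        for _, idx in observed[pos:end + 1]:
            out[idx] = z
        pos = end + 1
    return out


# Toy example with one missing measurement (hypothetical values).
transformed = inverse_normal_transform([3.0, 1.0, None, 2.0])
```

Because the transform depends only on ranks, it is applied separately to each metabolite and removes skew and outliers at the cost of discarding the original measurement scale.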
Association testing
We conducted genome-wide association tests for each of the seven genetic ancestry groups separately using regenie v3.1.1 (Mbatchou et al., 2021), with sex, age, age squared and the top principal components used as covariates (PC1-PC10 for EstBB, PC1-PC20 for UKBB). For step 1 (whole genome model), we used genotype calls for UKBB and genotyping data for EstBB and included variants with a minor allele frequency (MAF) of at least 1%, a minor allele count (MAC) of at least 20, a Hardy-Weinberg equilibrium exact test p-value threshold of 10⁻¹⁵, and maximum per-variant and per-sample missing genotype rates of 0.1. For step 2 (association testing using a linear regression model), we used imputed genotypes and selected variants with a MAC of at least 20 and an imputation INFO score of at least 0.6.
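To make the step 2 MAC filter concrete, the short Python sketch below converts a minimum allele count into the lowest minor allele frequency that could be tested in a cohort of a given size. It assumes diploid autosomal genotypes with no missing data; the helper name `min_testable_maf` is hypothetical, and the cohort sizes are those reported in the Results.

```python
def min_testable_maf(n_samples, min_mac=20, ploidy=2):
    """Lowest MAF that still satisfies the MAC filter.

    Assumes every sample contributes `ploidy` alleles and no genotypes
    are missing, which is a simplification for imputed dosages.
    """
    return min_mac / (ploidy * n_samples)


# Sample sizes taken from the per-cohort analyses in this study.
cohorts = {"EstBB": 185_352, "UKBB_EUR": 413_897}
for name, n in cohorts.items():
    print(f"{name}: MAC >= 20 corresponds to MAF >= {min_testable_maf(n):.2e}")
```

Under these assumptions, the MAC filter admits variants several orders of magnitude rarer than the 1% MAF cut-off used in earlier studies of the same platform.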
Meta-analysis
We performed two different inverse-variance weighted fixed-effect meta-analyses: meta_EUR on individuals of predominantly European genetic ancestry (EstBB cohort and EUR genetic ancestry group of UKBB), and meta_ALL which encompasses all seven genetic ancestry groups from UKBB and EstBB.
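As an illustration of the inverse-variance weighted fixed-effect model, the following Python sketch combines per-cohort effect estimates for a single variant. It is a generic textbook formulation rather than the code used in this study, the function name `ivw_fixed_effect` is hypothetical, and the input numbers are invented for the example.

```python
import math
from statistics import NormalDist


def ivw_fixed_effect(betas, ses):
    """Combine per-cohort effect sizes with inverse-variance weights.

    Returns the pooled effect, its standard error and a two-sided p-value
    from a normal approximation. For extremely strong signals the p-value
    underflows to 0.0 in double precision.
    """
    weights = [1.0 / se ** 2 for se in ses]
    total_weight = sum(weights)
    beta = sum(w * b for w, b in zip(weights, betas)) / total_weight
    se = math.sqrt(1.0 / total_weight)
    p_value = 2.0 * NormalDist().cdf(-abs(beta / se))
    return beta, se, p_value


# Hypothetical effect estimates for one variant in two cohorts.
beta, se, p_value = ivw_fixed_effect(betas=[0.10, 0.20], ses=[0.10, 0.10])
```

Each cohort is weighted by the inverse of its squared standard error, so larger cohorts with more precise estimates dominate the pooled effect.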
Genetic correlations
We employed LD score regression (LDSC) (Bulik-Sullivan et al., 2015) to obtain pairwise genetic correlations for all 249 NMR metabolites. Correlations were calculated between biobanks for each metabolite and between all metabolites in three of the larger datasets (EstBB, UKBB_EUR and meta_EUR) using the European reference panel LD scores.
Lead variant and locus definition
We obtained the set of dataset-metabolite-variant triplets by iterating over variants that met the genome-wide significance threshold of 5×10⁻⁸. The variant with the lowest p-value was designated as the lead variant within a 2 Mb locus. In each dataset, neighbouring loci were merged into one if their lead variants were in LD with an r² of at least 0.05. To better evaluate the independence of lead variants, we utilised PLINK v1.90b6.26 to calculate pairwise LD between all lead variants in a single genetic ancestry group, assigning them into shared cross-metabolite clusters if r² was at least 0.8. The variant with the lowest p-value was considered to be the only independent lead in each cluster.
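The first step of this procedure, picking lead variants within 2 Mb loci, can be sketched in Python as a greedy pass over the significant variants of one dataset-metabolite pair. The sketch assumes the 2 Mb window is centred on the lead variant (±1 Mb), which the text does not specify, and it omits the LD-based merging and cross-metabolite clustering steps because those require genotype data. The names `Variant` and `pick_lead_variants` are hypothetical.

```python
from typing import NamedTuple


class Variant(NamedTuple):
    chrom: str
    pos: int
    pvalue: float


def pick_lead_variants(variants, p_threshold=5e-8, window_bp=2_000_000):
    """Greedily select the lowest p-value variant in each locus.

    Assumption: a locus is the window of `window_bp` centred on its lead
    variant. LD-based merging of neighbouring loci is not performed here.
    """
    half_window = window_bp // 2
    remaining = sorted(
        (v for v in variants if v.pvalue < p_threshold),
        key=lambda v: v.pvalue,
    )
    leads = []
    while remaining:
        lead = remaining[0]
        leads.append(lead)
        # Drop the lead and every variant that falls inside its window.
        remaining = [
            v for v in remaining
            if v.chrom != lead.chrom or abs(v.pos - lead.pos) > half_window
        ]
    return leads


# Toy summary statistics (hypothetical positions and p-values).
toy = [
    Variant("1", 1_000_000, 1e-20),
    Variant("1", 1_500_000, 1e-10),
    Variant("1", 5_000_000, 1e-9),
    Variant("2", 1_000_000, 1e-6),
]
leads = pick_lead_variants(toy)
```

In the toy data, the second variant is absorbed into the locus of the first, the third forms its own locus, and the fourth is dropped for not reaching genome-wide significance.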
Data and code availability
Complete genetic ancestry group-specific and meta-analysis association summary statistics from this study can be downloaded from the GWAS Catalog (Sollis et al., 2023) (accessions GCST90449363 - GCST90451603, Table S5). GWAS lead variants are available from Zenodo (https://dx.doi.org/10.5281/zenodo.13937265). The meta_EUR meta-analysis results can also be viewed in our PheWeb browser (https://nmrmeta.gi.ut.ee/). Meta-analysis code is available from https://github.com/ralf-tambets/EstBB-UKBB-metaanalysis/. The individual-level UK Biobank data are available for approved researchers through the UK Biobank data-access protocol (https://www.ukbiobank.ac.uk/enable-your-research/apply-for-access). The individual-level data from the Estonian Biobank can be accessed through a research application to the Institute of Genomics of the University of Tartu (https://genomics.ut.ee/en/content/estonian-biobank).
Author contributions
R.T. performed GWAS analysis on the EstBB and UKBB data. I.R. developed the initial GWAS workflow. N.T., A.K. and K.F. developed quality control criteria for the EstBB metabolite data. K.A., P.P., R.T., U.V., E.A. and J.K. wrote the manuscript with feedback from all authors.
Funding
K.A. and I.R. were supported by a grant from the Estonian Research Council (grant no PSG415). E.A. was supported by the European Union through Horizon 2020 and Horizon Europe research and innovation programs under grants no. 894987, 101137201 and 101137154. R.T., E.A., U.V., J.K. and T.E. were supported by the Estonian Research Council grant no PRG1291. K.F. and A.K. were supported by a grant from the Estonian Research Council no PRG1197. N.T. was supported by the Estonian Research Council grant no PRG1414. The project was supported by the European Union’s Horizon 2020 research and innovation programme under Grant Agreement No 101017802 (OPTOMICS).
Acknowledgements
We want to acknowledge the participants of the UK Biobank and Estonian Biobank for their contributions. The Estonian Genome Center analyses were partially carried out in the High Performance Computing Center, University of Tartu. This research has been conducted using the UK Biobank Resource under application numbers 91233 and 30418. Nightingale Health Plc is acknowledged for early access to the UK Biobank NMR metabolite data. The Estonian Biobank Research Team consists of Mari Nelis, Georgi Hudjasov, Reedik Mägi, Andres Metspalu, and Lili Milani.
Footnotes
* These authors jointly supervised this work.