Abstract
Coffee is one of the most widely consumed beverages. We performed a genome-wide association study (GWAS) of coffee intake in US-based 23andMe participants (N=130,153) and identified 7 significant loci, with many replicating in three multi-ancestral cohorts. We examined genetic correlations and performed a phenome-wide association study across thousands of biomarkers and health and lifestyle traits, then compared our results to the largest available GWAS of coffee intake from UK Biobank (UKB; N=334,659). The results of these two GWAS were highly discrepant. We observed positive genetic correlations between coffee intake and psychiatric illnesses, pain, and gastrointestinal traits in 23andMe that were absent or negative in UKB. Genetic correlations with cognition were negative in 23andMe but positive in UKB. The only consistent observations were positive genetic correlations with substance use and obesity. Our study shows that GWAS in different cohorts could capture cultural differences in the relationship between behavior and genetics.
Introduction
Coffee is a leading global food commodity that has psychoactive properties that are largely due to the presence of caffeine1. While rates of use and daily intake varies widely by geographic region, it is estimated that approximately 60-85% of adults in Europe and the United States consume between 0.6 to 5.5 cups of coffee daily2–4. Intake of coffee and its bioactive constituents is associated with benefits on cognitive function5 and lower risk of liver disease6,7 (but see8), Parkinson’s and other neurodegenerative diseases6,7,9, cardiovascular disease6,7, type II diabetes6,7, and certain cancers6,7,10. However, coffee intake is also associated with higher risks for some adverse outcomes, including increased risk of other substance use and misuse11–14, some cancers (e.g., lung cancer7,10,15), poor lipid profile6,7, pregnancy loss6,7, gastrointestinal maladies16, and worse cardiovascular outcomes following excessive intake17. Given the widespread and regular intake of coffee across the globe, addressing the full spectrum of correlations with health and disease is an important but challenging task.
Genetic studies offer a compelling avenue to investigate the relationships between coffee intake and other complex traits. Twin studies that calculate genetic contributions to daily coffee intake estimate it to be 36-56% heritable, suggesting that coffee intake should be amenable to genetic analysis. Whereas phenotypic correlations, which depend on measuring two or more traits in the same cohort, can arise from genetic and environmental factors, genetic correlations assess genetically driven relationships using the results from genome-wide association studies (GWAS) and can therefore examine correlations between two or more traits, even if they were measured in entirely non-overlapping cohorts. In the past decade, over a dozen GWAS (N=1,207-407,072) have examined coffee intake18–34. Several of these GWAS have found associations with single nucleotide polymorphisms (SNPs) within or near genes that metabolize caffeine (Supplementary Table 1), such as CYP1A1 and CYP1A218–20,23–26,30,32,33. Some of these loci are also associated with other complex traits, including liver disease35–37, cancers38–41, and alcohol consumption42–44. This pleiotropy could suggest that these other associations are mediated by coffee intake or that these loci also influence these traits via alternative independent mechanisms. Genetic correlations have also been reported between coffee intake with other substance use45,46, reduced gray matter volumes18, psychiatric illness45, osteoarthritis47, sleep48, body mass index (BMI)49, type II diabetes49, and migraine50. However, some genetic correlations were conducted under a priori justification (e.g., other substance use traits, sleep) and as such may fail to capture the full scope of genetic correlations between coffee with other traits. Thus, a data-driven examination of trait associations with coffee intake remains unexplored.
While coffee is the primary source of caffeine for many, other common dietary sources of caffeine include tea, soft drinks, and chocolate. Consequently, when we refer to coffee intake, we mean explicit measures of coffee intake (e.g., measured as cups/day) and not caffeine intake unless otherwise specified. Intake of other caffeine sources also varies by geographic region based on beverage sales2. For example, tea (rather than coffee) is the preferred source of caffeine in the United Kingdom (UK; tea vs. coffee: ∼50% vs. 20%) compared to the United States (US; ∼10% vs. 30%)2. As some genetic studies used data from the UK Biobank (UKB) only18,47,48,51–53 or combined cohorts across regions with different patterns of caffeinated beverage intake (Supplementary Table 1)32,33,46, this distinction may limit generalizability or introduce environmental and cultural confounds that affect the genetic associations between coffee intake and other traits.
In this study, we used survey responses from US-based 23andMe, Inc. research participants of European ancestry (N=130,153) and performed a GWAS of a single item “How many 5-ounce (cup-sized) servings of caffeinated coffee do you consume each day?”. Using genetic correlations and phenome- and laboratory-wide association studies (PheWAS, LabWAS), we explored the relationships between coffee intake and thousands of biomarkers, health features, and lifestyle traits to provide a fuller inventory of genetic correlations with coffee intake. We compared our findings from the 23andMe cohort to those from the UKB using publicly available GWAS summary statistics of coffee intake (“How many cups of coffee do you drink each day? (Include decaffeinated coffee)”, N=334,659, http://www.nealelab.is/uk-biobank/). Although we had originally intended to perform a meta-analysis, our results revealed a lower-than-expected genetic correlation between coffee intake in the two cohorts; therefore, we instead used these datasets to explore cohort differences in coffee intake across these two distinct populations.
Results
GWAS in the 23andMe US-based cohort replicated seven loci implicated in coffee intake
Participant demographics of the 23andMe cohort are described in Supplementary Table 2. The cohort was 65% male, had a mean age of 52.8 ± 16.9 years old, and an average BMI of 28.38 ± 6.54 (range: 14.0-69.1), similar to the US average of 27.5 (95% CI: 25.5-29.4)54. The average coffee intake in the cohort was 1.98 (± 2.35 SD) cups per day, similar to the coffee intake distributions in UKB (2.14 ± 2.09 SD; see Supplementary Figure 1 and Supplementary Table 3 for distributions).
We conducted a GWAS of 14,274,006 imputed genetic variants assuming an additive genetic model that included age, sex, the first five genetic principal components, and indicator variables for genotype platforms as covariates (Supplementary Table 4; Supplementary Material for additional genotyping and GWAS details). The genomic control inflation factor of the GWAS was λ=1.09, suggesting no substantial inflation due to population stratification. SNP- heritability of coffee intake via Linkage Disequilibrium score regression (LDSC) was 7.57% ± 0.59 (Supplementary Table 5).
We identified seven genome-wide significant (p<5.00E-08) independent (r2<0.1) loci that were associated with coffee intake (Figure 1, Table 1; Supplementary Figures 2-8 for locus zoom plots). These associations replicated prior coffee or caffeine GWAS findings (Supplementary Table 6). For example, rs2472297 (p=3.60E-65, chr15q24.1) is in the intergenic region between CYP1A1 and CYP1A2, and has been previously associated with coffee and caffeine intake18,20,24,26,28,30,32,33,55. CYP1A1 and CYP1A2 encode members of the cryptochrome P450 superfamily of enzymes involved in xenobiotic metabolism22. rs2472297 has also been previously associated with traits like alcohol consumption56,57, clozapine pharmacokinetics58, kidney function59–62, and the concentration of biomarkers in urine63–68. We also identified rs4410790 (p=5.20E-55, chr7p21.1), which is located upstream of the AHR gene encoding a transcription factor that regulates CYP1A1/CYP1A2 and is activated by polycyclic aromatic hydrocarbons, which are present in coffee22,69. Prior studies associated rs4410790 and caffeine intake from tea20, as well as with traits like caffeine metabolism28, bitter beverage intake26, and urine biomarkers64,66–68,70. Lastly, rs199612805 (p=1.80E-10, chr22q11.23), which is located near ADORA2A, was also implicated in coffee intake. This variant was recently associated with caffeine intake from tea and coffee in the UKB20. ADORA2A encodes an adenosine G-protein coupled receptor that is inhibited by caffeine to produce stimulating effects71. The remaining four SNPs – rs34645063, rs28634426, rs117824460, and rs11474881 – were in linkage disequilibrium (LD) with SNPs previously identified by other coffee or caffeine GWAS18,20,24,26,28,30,55. rs34645063 (p=3.30E-09, chr6q16.1) is a deletion/insertion polymorphism between MMS22L and POU3F2. rs34645063 is in LD (R2=0.74) with rs754177720 and is also associated with caffeine intake from coffee or tea20. rs28634426 (p=2.10E-10, chr7q11.23) is an intronic variant of STYXL1 in LD with rs17685 (R2=0.78) and rs1057868 (R2=0.76), which were previously implicated by coffee GWAS18,20,24,26,30,55. rs117824460 (p=1.70E-08, chr19q13.2) is an intronic variant of CYP2A6, and is in LD (R2=0.05) with rs56113850, which was implicated in coffee intake20,26 and caffeine metabolism28. CYP2A6 encodes a cryptochrome P450 superfamily enzyme member that metabolizes nicotine72; rs117824460 has also been associated with smoking traits57,73 and serum albumin74, C-reactive protein75, and liver alkaline phosphatase levels66. The final significant variant we identified, rs11474881 (chr20q13.33, p=1.10E-08), is an intronic variant of the PCMTD2 gene; rs11474881 is in LD (R2=0.98) with rs6062679, which was previously implicated in coffee and tea intake and bitter beverage consumption26. We used three additional multi- ancestral cohorts to replicate these findings (Table 1; Supplementary Table 3). Of the SNPs that passed QC, all replicated in a larger sample of 23andMe research participants of European ancestry (N=689,661), all replicated in those with African American ancestry (N=32,312), and one replicated in those of Latin American ancestry (N=124,155).
A) Manhattan plot displays seven genome-wide significant loci for coffee intake in the 23andMe cohort (N=130,153). The horizontal line represents the threshold for significance (p=5.00E-08). Nearest protein-coding genes (<1Mb) to significant loci are labeled. Quantile-quantile plot shown in upper left corner. For more details, see Table 1 and Supplementary Table 6. B) Overlap of genes identified by MAGMA, H-MAGMA, S-PrediXcan, and S-MultiXcan. Genes identified by all four methods are displayed. C) Genes predicted to affect coffee intake identified by S-MultiXcan according to the most significantly associated biological systems. For more details, see Supplementary Table 9. D) Genes implicated in coffee intake by S-PrediXcan according to brain regions. Upregulated genes are shown in red, downregulated shown in blue. For more detail, see Supplementary Table 10.
Significant (p<5.00E-08) GWAS results for coffee intake from 23andMe research participants (N=130,153) of European ancestry (EA). Replication (EA Rep) was conducted in an additional cohort of 23andMe participants of EA (N=689,661), and in those with African American ancestry (AA; N=32,312) and Latin American ancestry (LA; N=124,155); *SNPs that did not pass QC in replication. See Supplementary Table 6 for additional information.
We also report several notable nominal associations with coffee intake (p<1.00E-06, Supplementary Table 6). rs72790130 (p=5.50E-08, chr16q23.3) and rs2155645 (p=9.80E-07, chr11q23.2) are intronic variants of two cell adhesion molecule genes, CDH13 and NCAM1, respectively. Both genes have been previously implicated in substance use traits by other GWAS20,57,76–82 and candidate gene studies in humans and animal models83–88. rs2155645 is also in LD with rs2298527 (R2=0.43), which was previously implicated in daily caffeine intake from coffee20. rs11204734 (p=2.90E-07, chr1q21.3) is an intronic variant of ARNT; its protein heterodimerizes with AHR and binds to xenobiotic response elements to regulate transcription of CYP1A1 and CYP1A222. Finally, rs340019 (p=2.10E-07, chr15q22.2) is an intronic variant of RORA, which is involved in circadian rhythm and metabolic regulation, among other functions89. rs340019 is in LD with rs12591786 (R2=0.25), which was implicated in daily caffeine intake from cups of coffee and tea20.
Gene-based and tissue enrichment analyses suggest coffee intake is primarily associated with gene expression in the brain
We used gene- and transcriptome-based analyses (MAGMA, H-MAGMA, S-MultiXcan/S- PrediXcan) and identified 165 target candidate genes that may be most relevant to coffee intake. MAGMA identified 31 genes implicated in coffee intake in physical proximity to GWAS loci (Supplementary Table 7). H-MAGMA, which maps SNPs to genes via chromatin interaction from human brain tissue, implicated 143 unique gene-tissue pairs showing expression specific to cell type (75.16% neuron [31.30% cortical neuron, 33.04% iPSC derived neurons; 35.65% midbrain dopamine neurons], 24.83% astrocyte) and developmental (48.00% fetal, 52.00% adult) (Supplementary Table 8). Finally, S-MultiXcan predicted significant transcriptional regulation of 40 genes implicated in coffee intake dispersed across 20 tissues (Figure 1C; Supplementary Table 10). Of the top biological systems implicated by S-MultiXcan, nine were attributed to the nervous system (brain N=5; tibial nerve N=4), eight to the digestive system (esophagus N=6; pancreas N=1; small intestine N=1), and six to the reproductive system (testis N=4; prostate N=2; Figure 1C). Fifty percent of these genes were predicted to be downregulated in the digestive and reproductive systems, whereas 66.67% of nervous system genes were predicted to be upregulated. Cortical enrichment was further supported by S-PrediXcan (Figure 1D), showing that SNPs associated with coffee intake most frequently correlated with predicted gene expression in overall cortical and frontal cortical regions (N=4/tissue), as well as the putamen (N=5). Overall, four genes (SCAMP2, SCAMP5, MPI, and FAM219B) were identified by all four methods, and six of the 165 discovered genes (FBXO28, NEIL2, HAUS4, IGDCC4, RP11- 298I3.5, RP11-298I3.5) were not within 1Mb of SNPs identified by prior GWAS of coffee or caffeine traits (Supplementary Table 11; Figure 1B; Table 2). These novel genes have been associated with substance use (e.g., HAUS4 and smoking initiation73), educational outcomes (e.g., HAUS4 and educational attainment90), and biomarkers (e.g., FBXO28 and mean platelet volume91; IGDCC4 and mean corpuscular volume92).
Next, we used MAGMA gene-set analysis to identify biological pathways that may be most strongly associated with coffee intake. This analysis revealed significant enrichment (p=4.75E- 07) in pathways related to the metabolism of xenobiotics or foreign substances (i.e., chemicals) (Supplementary Table 12).
MAGMA tissue-based enrichment analyses suggested that coffee intake was only significantly associated with brain tissue (Supplementary Figure 9A). More specifically, differential expression by coffee intake was enriched (p<9.25E-04) in the frontal cortex, overall cortex, cerebellum, and cerebellar hemispheres (Supplementary Figure 9B; Supplementary Table 12), consistent with the S-PrediXcan findings (Supplementary Table 9).
Genetic correlation and polygenic score analyses of coffee intake in US- and UK-based cohorts reveal discrepant associations with health and psychiatric traits, but consistent positive associations with substance use and obesity
To boost statistical power and identify novel genes associated with coffee intake, we sought to meta-analyze our data (metaGWAS) with those from the UKB using METAL93. As a preliminary step to determine the appropriateness of a meta-analysis, we examined the genetic correlation between coffee intake in the 23andMe and UKB cohorts. Surprisingly, the two datasets were only moderately correlated (rg=0.63, p=3.54E-43), although all top loci (p<5.0E-05) shared direction of effect and had similar effect strengths (Supplementary Figure 10). In addition, the estimated LDSC SNPh2 heritability of coffee intake of our metaGWAS was slightly lower than for both the univariate GWAS (metaGWAS SNPh2=4.09% ± 0.26 vs. 23andMe SNPh2=7.57% ± 0.59 vs. UKB SNPh2=4.85% ± 0.33; Supplementary Table 5). We interpreted these results as an indication of cohort heterogeneity and proceeded to analyze genetic associations with coffee intake in each cohort independently.
To further understand these discrepancies, we performed a series of genetic correlation and polygenic score analyses. First, we examined the genetic architecture of coffee intake measured in 23andMe and UKB by comparing patterns of LDSC genetic correlation (rg) with 317 traits across 20 health, psychiatric, and anthropologic categories from publicly available GWAS summary statistics (Figure 2A; Supplementary Table 14). After accounting for multiple testing, 75 traits were genetically correlated with coffee intake in the 23andMe cohort and 74 traits in the UKB cohort. These associations could be underpinned by other unmeasured factors, like sugar intake from coffee sweeteners or smoking and alcohol use95. However, these patterns of genetic correlations persisted after conditioning on dietary sugar intake, cigarettes smoked per day, and alcohol consumption measured by the Alcohol Use Disorder Identification Test (AUDIT; Supplementary Tables 11-12; Supplementary Tables 15-16). Strikingly, of the traits significant in at least one cohort, only 34 (29.57%) were significant in both datasets, and only 58.82% of the traits significant in both datasets shared the same direction of correlation.
A) Comparison of genetic correlations across psychiatric (light gray), anthropologic (medium gray), and health (dark gray) traits between 23andMe (lanes 1 and 2) and UKB (lanes 3 and 4). Lanes 1 and 3 show rg values calculated by LDSC, and lanes 2 and 4 show FDR-corrected p values. Only traits for which at least one cohort was FDR-significant are displayed. For a full list of correlations and trait names, see Supplementary Table 14. Most signals persisted after conditioning for dietary sugar, cigarettes per day, and Alcohol Use Disorder Identification (AUDIT) Consumption scores using mtCOJO94 (Supplementary Tables 15-16; Supplementary Figures 11-12). Genetic correlations for traits denoted with * could not be calculated in both cohorts; ** denotes reverse coding. B) Phenomic associations (panel 1: PheWAS (p<3.62E-05), panel 2: LabWAS (p<1.57E-04)) identified from PGS of coffee intake from 23andMe and UKB summary statistics. Only traits for which at least one cohort was FDR-significant are displayed (saturated bars=FDR significant; desaturated bars=FDR non- significant). neuro.=neurological; gen.=genitourinary; neopl.=neoplasms; sense=sense organs; derma.=dermatologic; imm.=immune. For full trait names and more detail, see Supplementary Table 18-19.
Among the traits that were significant and consistent in direction for both cohorts, we observed positive genetic correlations between coffee intake and substance use phenotypes. For example, we identified positive genetic correlations with smoking initiation (23andMe: rg=0.50, p=4.74E-47; UKB: rg=0.12, p=1.89E-06), drinks per week (23andMe: rg=0.39, p=3.38E-28; UKB: rg=0.21; p=1.39E-14), and cannabis initiation (23andMe: rg=0.28, p=1.34E-08; UKB: rg=0.09, p=5.61E-03). The strength of genetic correlations for substance use and misuse traits significant in at least one cohort was stronger in 23andMe compared to the UKB (0.30±0.03 vs. 0.09±0.02; Welch’s t(51.97)=5.96, p=2.23E-07). For example, associations with substance use disorder and dependence traits were mostly observed in the 23andMe cohort and were weaker or not observed in the UKB, such as for tobacco use disorder, opioid use disorder, cannabis use disorder, nicotine dependence, and alcohol dependence (rg=0.24 to 0.44, p=6.54E-23 to 2.12E-03), as well as externalizing (23andMe: rg=0.48, p=7.21E-41; UKB: rg=0.07, p=4.37E-03), which is highly correlated with substance use and misuse96. Cluster analysis showed that genetic correlations for coffee intake in both cohorts aligned more with general substance use than misuse (Supplementary Figure 13).
Metabolic traits were largely congruent in their positive genetic correlations with coffee intake in both cohorts. For example, BMI (23andMe: rg=0.19, p=1.61E-11; UKB: rg=0.25, p=7.85E- 26) and waist-to-hip ratio (23andMe: rg=0.12, p=4.33E-04; UKB: rg=0.13, p=3.96E-07) were positively genetically correlated with coffee intake in both datasets. Also consistent across cohorts were the lack of significant genetic correlations with most cardiovascular and cancer traits.
The majority of traits were only significant in one cohort or showed discrepancies in the direction of association. For example, coffee intake measured in the 23andMe dataset was positively genetically correlated with anxiety-related traits (rg=0.22 to 0.44, p=1.41E-05 to 8.53E- 03). In contrast, all significant genetic correlations between coffee intake and anxiety-related traits in the UKB were negative (rg= -0.33 to -0.12, p=5.49E-06 to 8.12E-03), except clinically diagnosed anxiety (rg=0.17, p=1.39E-05). We also identified significant positive genetic correlations with cross-disorder, attention deficit hyperactivity disorder, schizophrenia, and anorexia (rg=0.12 to 0.27, p=1.00E-07 to 0.01) that were exclusive to the 23andMe dataset, whereas these associations were not apparent or were negatively genetically correlated in the UKB (rg= -0.13 to 0.02, p=1.01E-04 to 0.55). Significant positive correlations with cognitive variables, such as executive function and intelligence, were found in the UKB (rg= 0.13 to 0.23, p=4.55E-08 to 8.04E- 23), though these were negatively genetically correlated in 23andMe (rg= -0.17 to -0.10, p=7.83E- 08 to 2.06E-03). Certain correlations with physical health traits also differed between cohorts. While correlations with most gastrointestinal traits in the 23andMe cohort were positive, such as a positive genetic correlation with gastric ulcers (rg=0.41, p=3.58E-03), the corresponding genetic correlations observed in the UKB were either non-significant or negative (rg= -0.22 to 0.12, p=1.34E-06 to 0.88). Positive genetic correlations with chronic pain as well as back, hip, and knee pain were observed in the 23andMe dataset (rg=0.12 - 0.26, p=9.02E-08 - 3.58E-03), yet only negative genetic correlations with pain traits were reported in the UKB (rg=-0.22 to -0.12, p=6.23E- 04 - 2.54E-06). Across all health and psychiatric traits that were significant within each cohort, all traits showed a positive genetic correlation with coffee intake in 23andMe participants. Only 41.3% of correlations were positive in the UKB.
We observed similar discrepancies when we extended our results to a health-care system population (Figure 2B). We conducted PheWAS and LabWAS by testing the association between polygenic scores (PGS) for coffee intake derived from 23andMe or the UKB with 1,655 medical traits and biomarkers. We identified 31 PheWAS and LabWAS traits that met the 5% FDR significance threshold using the 23andMe PGS, and 24 using the UKB PGS (Supplementary Tables 17 and 18). Only two endocrine traits (i.e., obesity and morbid obesity) and two biomarkers related to red blood cells were consistent in significance and direction of association. Otherwise, all significant associations were observed when testing PGS generated from one cohort but not the other. For instance, when coffee intake PGS were derived from 23andMe, among the top positive PheWAS and LabWAS associations were substance use disorders and respiratory conditions (e.g., chronic airway obstruction, emphysema, and respiratory failure) and absolute monocyte count. Among the top negative associations derived from 23andMe PGS were those with sense organs, neoplasms, certain respiratory conditions (i.e., allergic rhinitis, acute and chronic tonsillitis, chronic tonsillitis and adenoiditis), and urea nitrogen serum/plasma. When coffee intake PGS were derived from UKB, among the top positive PheWAS and LabWAS associations were endocrine and musculoskeletal disorders, as well as the two metabolic biomarkers, glycated hemoglobin A1c and glucose. The only significant negative PheWAS and LabWAS associations from UKB-derived PGS were with anxiety disorders, and biomarkers related to blood (mean corpuscular hemoglobin concentration) and metabolic (cholesterol and triglycerides in serum or blood) traits.
Discussion
In this study, we contributed to the existing GWAS literature of coffee intake by analyzing a US population of 130,153 participants. We uncovered seven loci associated with coffee intake, most of which were in genes implicated in metabolic processes. Coffee related variants were significantly enriched in the central nervous system. Despite prior evidence that coffee intake confers health benefits, we found genetic correlations mostly with adverse outcomes in both cohorts, particularly substance use disorders and obesity-related traits, in both cohorts. Relationships with other medical, anthropologic, and psychiatric traits were inconsistent in the US and UK cohorts, suggesting that differences between populations may affect coffee intake GWAS results and its genetic relationships with other traits.
Our GWAS replicated prior associations with genes and variants implicated in coffee and caffeine intake as well as other metabolic and xenobiotic processes28, including rs2472297 near CYP1A1/CYP1A218,24,26,33,46 and rs4410790 near AHR18,23,24,26,27,46,97, even though our study sample was smaller compared to other GWAS (N=125,776-373,52220,24,26). Gene-based analyses uncovered 165 candidate genes, including four genes that overlapped across all four analyses: MPI, SCAMP2, SCAMP5, and FAM219B, all of which have been implicated in a prior coffee GWAS18. These overlapping genes have other associations with substance use and medical biomarkers including blood pressure, hypertension, and LDL cholesterol91,98–104. We identified gene enrichment in brain tissues across the frontal cortex, putamen, and hippocampus, consistent with prior GWAS showing enrichment for SNPs associated with coffee and caffeine in the central nervous system18,20,26. This is supported by brain imaging studies across cortical and subcortical areas showing morphological105–108 and functional109,110 differences between those who habitually drink coffee compared to those who do not.
One of the most striking observations of this study is the breadth and magnitude of positive associations between coffee intake with substance use. It is widely believed that use of one substance heightens risk for use of other substances and that there are common genetic risk factors for any substance use111,112; coffee, which is not generally considered a drug of misuse, does not appear to be exempt from this. We identified positive genetic correlations between coffee intake and other substances (i.e., tobacco, alcohol, cannabis and opioid use), as well as relevant personality traits like externalizing behavior. The genetics of coffee intake aligned with substance consumption phenotypes, corroborating prior GWAS and twin studies113–115 (but see23), but not with substance misuse. This is perhaps unsurprising because the phenotype probed by the 23andMe and UKB cohorts focuses on quantity rather than clinically-defined dependence. We and others previously demonstrated that the genetic architectures of other substance intake versus problematic use are unique43,111,116–119, and this is likely also true for coffee.
We found consistent positive genetic correlations with BMI and obesity in both 23andMe and the UKB. This is in contrast to meta-analyses of randomized control trials and epidemiological studies that found unclear effects by any coffee or decaffeinated coffee intake on waist circumference and BMI-defined obesity, and a modest inverse relationship between coffee intake and BMI120,121. Results for these studies are highly heterogeneous, likely due to interindividual variability in the inclusion of sugary coffee additives, cultivation, roasting, and brewing conditions affecting its chemical makeup9,122, and other habits surrounding coffee intake (e.g., concurrent food intake or appetite suppression by nicotine if smoking concurrently123). This contentious relationship may also be explained by the amount of coffee intake, as greater coffee intake seems to attenuate the genetic associations with BMI and obesity49, possibly due to the appetite suppressant effects of caffeine124. Alongside accounting for other dietary intake, detailed accounting of coffee preparation, and consumptive habits formed with coffee intake, future subgroup analyses may help explain discrepant associations between the genetics and prevalence of coffee intake with BMI-related traits.
We did not recapitulate the beneficial phenotypic relationships between coffee intake and a variety of health outcomes that are generally reported by health association studies6–8,10,125–137, perhaps because our study focused on the genetic relationship between coffee intake and other medical outcomes, or because our study focused on coffee intake and not caffeine intake. At the genetic level, we find no evidence of a common genetic background that could explain the beneficial effects of coffee on 29 cancers, Alzheimer’s disease/dementia/cognitive impairments, Parkinson’s disease, diabetes, cirrhosis, most cardiovascular conditions, or gout. In fact, some of these associations (e.g., cardiovascular traits and type II diabetes) were positive in the 23andMe cohort but showed no significant associations in the UKB cohort. Similarly, phenome-wide analysis did not support prior cancerous, metabolic, cardiovascular, or neurological health advantages of coffee intake6–8,10,126–136,138. Although this may seem discrepant to phenotypic associations that generally report health benefits of coffee intake, recent meta-analysis of over 100 phenotypic studies on coffee intake health outcomes suggest high levels of heterogeneity across cohorts6,7, especially across geographically separated populations6.
We found many opposing relationships with the genetics of coffee intake between 23andMe and UKB. For example, genetic correlations with pain, psychiatric illnesses, and gastrointestinal traits were positively genetically correlated with coffee intake in 23andMe, but these associations were negative in the UKB. Inversely, the UKB analysis revealed that coffee intake was positively genetically correlated with cognitive traits, such as executive function and intelligence, corroborating prior evidence139–142, yet genetic correlations with these two traits were negative in 23andMe. Multiple PheWAS associations were also discordant. When PGS were derived from 23andMe, we observed heightened odds between genetic liability for coffee intake and respiratory illnesses, ischemic heart disease, infection, and alcohol-related disorders. Higher odds for musculoskeletal and sleep conditions were mostly associated with coffee PGS generated from the UKB. Only 11 out of the 42 phenotypes associated with coffee intake PGS showed negative associations, and none of these purported health “benefits” were consistently observed in both cohorts. Whereas the coffee intake PGS from 23andMe was associated with lower odds for ear conditions, skin neoplasms, allergic rhinitis, and tonsillitis, the PGS of coffee intake from the UKB was associated with a lower risk of anxiety disorders. Also of note is that the number of positive genetic correlations and PGS associations between coffee intake and these other traits was greater when analyzed using data from the 23andMe cohort versus from the UKB, and the strength of these associations was usually stronger. Partially consistent with this, one meta- analysis of mortality found an inverse relationship between coffee intake and all-cause mortality in European but not US studies143.
Our study shows that cultural, cohort, or geographic influences could affect the inferred genetic architecture of coffee intake and its associations with other health and lifestyle outcomes. Geographic regions may have an observable influence on GWAS results144. We observed no significant differences in subtle geographic differences on coffee intake correlations using location data available in the UKB (Supplementary Figure 14), suggesting cultural differences may contribute more to the cohort variations we report here. There is considerable variation in how or with whom one may consume coffee that could be subject to cultural influence. Caffeinated beverage sales, for instance, suggest that coffee and carbonated caffeinated beverages are more preferred in the US than the UK2, whereas tea is the preferred source of caffeine in the UK and may modify coffee intake2 (Supplementary Figure 15). Higher levels of coffee intake or caffeine from high caloric beverages in the US cohort may partially explain the higher number and magnitude of negative health associations observed in the 23andMe analysis. Even across coffee beverage subtypes, the concentration of caffeine, other coffee chemical constituents, and manufacturing byproducts (e.g., plastics and metals from packaging) varies and thus may be important parameters in health associations122,145,146. A recent investigation revealed the volume of ground or instant coffee is important to the potential health effects of its intake147; instant coffee (∼60 mg of caffeine per cup) is more commonplace in the UK whereas fresh brewed coffee (∼85 mg of caffeine per cup 20) is preferred in the US2. Cultural differences in coffee intake could help explain the divergent patterns of health and lifestyle associations between UK and US participants, though the relative contributions of culture, geography, and their interactions to these differences will need further exploration.
There are multiple caveats to consider when interpreting our findings. Firstly, our study does not address causality between coffee intake and other health and lifestyle traits. Mendelian randomization (MR) studies have attempted to address the exposure-outcome relationships between two traits by using genetic instruments (i.e., SNPs identified by GWAS) as proxies for exposure and associating them with an outcome of interest. For example, MR using genetic markers associated with coffee intake suggest that coffee consumption has no causal effect on obesity and endocrine disorders, despite observational studies suggesting protective effects of coffee148. Similarly, MR studies of coffee and other substance use (e.g., tobacco, alcohol, cannabis) are also contentious149,150, with evidence that inconsistencies may be driven by gene- cohort confounds such as those we found in this study151. Secondly, the phenotype examined by 23andMe was exclusively caffeinated coffee intake, with one cup defined as 5 ounces, whereas the UKB also included decaffeinated coffee and did not explicitly define the volume of one cup. The caffeine content within coffee was also not directly measured. However, secondary analysis using summary statistics of estimated caffeine intake from any coffee subtype in the UKB20 yielded remarkably similar patterns of genetic correlations as those derived from our GWAS of cups of coffee consumed (Supplementary Figure 16, Supplemental Table 14). This analysis presumably mitigated the relative contribution of decaffeinated coffee (3mg of caffeine per cup versus 60 to 85mg per cup of caffeinated coffee20) to the revealed genetic associations, so we do not believe the cohort discrepancies are driven by the inclusion of decaffeinated coffee drinkers in the UKB. Another consideration is the possible health effects of non-caffeine coffee components, which are comparatively under-investigated9, such as other coffee bean phytochemical and drink additives. Furthermore, while it is unlikely that the discrepancies in genetic associations are driven by age, which is similar between cohorts (approximately 53 years old in 23andMe versus 5720 years old in UKB), these cohorts skew older than the population average. They are also of above average socioeconomic status152 and are of European descent, limiting generalizability of our findings to a larger population. Some studies also show sex- dependent differences in coffee and caffeine metabolism and health associations with intake138,153,154, which was not examined in our study.
Overall, we present striking differences in genetic associations of coffee intake across two large cohorts of European ancestry. While some genetic signals replicate across diverse cohorts, such as our GWAS findings and the associations between coffee intake with substance use and obesity traits, other associations may be obscured by cohort or cultural differences related to the phenotype in question. Our study provides a cautionary perspective on combining large cohort datasets gathered from unique geo-cultural populations.
Author Contributions
SSR and AAP conceived the idea. PF and SLE contributed formal analyses and curation of 23andMe data. HHAT contributed to formal analyses, investigation, and data visualization. BP, AA, and NCK contributed to formal data analysis and data visualization. JJM and LVR contributed to formal analyses. JM contributed to data visualization. HHAT and SSR wrote the manuscript. All authors reviewed and edited the manuscript.
Competing interests
PF and SLE are employees of 23andMe, Inc., and hold stock or stock options in 23andMe. AAP is on the scientific advisory board of Vivid Genomics for which he receives stock options.
Materials and Correspondence
Correspondence and material requests should be directed to Sandra Sanchez-Roige, sanchezroige{at}ucsd.edu.
Methods
Study cohorts, coffee intake and univariate GWAS
23andMe
Univariate GWAS was conducted in a sample of 130,153 male and female research participants of the genetics testing company 23andMe, Inc, as previously described155. Participants provided informed consent and volunteered to participate in the research online, under a protocol approved by the external AAHRPP-accredited IRB, Ethical & Independent (E&I) Review Services. As of 2022, E&I Review Services is part of Salus IRB (https://www.versiticlinicaltrials.org/salusirb). During 4 months in 2015 and 14 months between 2018-2020, participant responses to the question “How many 5-ounce (cup-sized) servings of caffeinated coffee do you consume each day?” were collected as part of a larger survey. Participants categorized as of European descent by genotype data were included in the univariate GWAS (see Supplementary Material)156. Participant demographics are presented in Supplementary Table 2.
DNA extraction and genotyping were performed from saliva samples by clinical laboratories CLIA-certified and CAP-accredited by the Laboratory Corporation of America. 23andMe, Inc. conducted all quality control, imputation, and univariate genome-wide analyses as previously described (see Supplementary Table 4 for SNPs analyzed following quality control and imputation)157,158. Variants were imputed based on an imputation panel combining 1000 Genomes Phase 3, UK10K and the Human Reference Consortium. The 23andMe pipeline removes variants of low genotyping quality (failed a Mendelian transmission test in trios (p<1.00E-20), failed Hardy- Weinberg test (p<1.02E-20), failed batch effects test (ANOVA p<1.00E-20), or had a call rate <90%) or imputation quality (r2<0.50 averages across genotyping arrays or a minimum r2<0.3 or variants with apparent batch effects (p<1.00E-50))157,159. The 23andMe GWAS pipeline performs linear regression and assumes an additive model for allelic effects43,155,160,161. Only unrelated participants were included in the GWAS, which was determined using a segmental identity-by-descent (IBD) estimation algorithm. Two individuals that shared more than 700 cM IBD at one or both genomic segments (∼20% of the genome) were classified as related. This is the minimum threshold expected to identify first cousins in an outbred population. Age (inverse-normal transformed), sex, the top five principal genotype components, and indicator variables for genotype platforms were applied as covariates and p-values were corrected for genomic control.
We conducted replication in three multi-ancestral cohorts (European ancestry N=689,661; African American ancestry N=32,312; Latin American ancestry N=124,155; daily mg of caffeine from coffee, transformed by log10(x+75)). Demographic information on these cohorts is shown in Supplementary Table 3. Ancestry was determined by analyzing local ancestry (see Supplementary Material)156.
UK Biobank
Summary statistics of coffee intake (N=334,659) were generated from UK Biobank (UKB) participants. Participants provided informed consent, were of White British descent, and answered the question “How many cups of coffee do you drink each day? (Include decaffeinated coffee)”. Other previously published GWAS of coffee intake with publicly available summary statistics were not included in our meta-analysis due to differences in the way that coffee intake was measured (e.g., “How often do you drink coffee?”, “How much coffee do you consume per year?”30), or differences in ascertainment (e.g., Parkinson’s disease only31). Secondary analysis was also conducted with GWAS summary statistics of caffeine intake from coffee (N=373,522) in the UKB that was calculated based on the number of cups of caffeinated and decaffeinated coffee consumed20. For further information about the UKB data collection and GWAS summary statistics for coffee intake and caffeine consumed from coffee, see http://www.nealelab.is/uk-biobank/ (field 1498, both sexes) and Said et al. 202020, respectively.
Gene-based analyses (MAGMA, H-MAGMA, SPrediXcan/S-MultiXcan)
The web-based platform Functional Mapping and Annotation of Genome-Wide Association Studies (FUMA v1.3.8) was used to further explore the functional consequences of lead SNPs and identify prior associations in the literature. GWA significant lead SNPs of coffee cups per day consumed from the UKB were identified using FUMA. SNPs were annotated based on ANNOVAR categories, Combined Annotation Dependent Depletion scores, RegulomeDB scores, expression quantitative trait loci (eQTLs), and chromatin state predicted by ChromHMM. Novel coffee intake candidate genes were identified as genes not in linkage disequilibrium or within 1Mb of GWAS-significant SNPs uncovered by other GWAS of coffee and caffeine traits (e.g., coffee/caffeine intake or caffeine metabolism). These SNPs were identified using the EBI GWAS Catalog (https://www.ebi.ac.uk/gwas/).
MAGMA gene-based and pathway analyses
We used “Multi-marker Analysis of GenoMic Annotation” (MAGMA, v1.08) to conduct gene-based associations on the 23andMe GWAS summary statistics of coffee intake. SNPs were annotated to protein-coding genes using FUMA and Ensembl build v92, which accounts for SNP LD using multiple regression methods. The default settings were used, and LD was estimated using the 1000 Genomes European reference sample. The significance of associations across 19,773 genes were adjusted using Bonferroni correction for multiple testing (one-sided p<2.53E-06; Supplementary Table 7). Gene-set analysis was subsequently conducted on 10,678 gene-sets and Gene Ontology terms curated from the Molecular Signatures Database (MsigDB v7.0). Tissue-specific gene expression profiles were also assessed in 54 tissue types and 30 general tissue types across the body with average gene expression in each tissue type used as a covariate in the analysis (Supplementary Table 13). Using Genome-Tissue Expression (GTEx, v8) RNA-seq data, gene expression values were log2 transformed values of the average Reads Per Kilobase Million (RPKM) for each tissue type (RPKM>50 were listed as 50). Significance was determined following Bonferroni correction (one- sided p<9.26E-04 for 54 tissue types; one-sided p<1.67E-03 for 30 general tissue types).
H-MAGMA
To identify neurobiologically relevant target genes, we incorporated coffee intake GWAS data with chromatin interaction profiles from human brain tissue using Hi-C coupled MAGMA (H-MAGMA)162.
S-PrediXcan and S-MultiXcan
We performed a transcriptome-wide association study (TWAS) using the MetaXcan package (ver0.7.5)163,164 consisting of S-PrediXcan and S-MultiXcan to identify specific eQTL-linked genes associated with coffee intake. eQTLs are genomic loci that contribute to heritable variation in mRNA levels that might influence the expression of a particular gene or its neighbors. This approach uses genetic information to predict gene expression levels in various brain tissues and tests whether the predicted gene expression correlates with coffee intake. S-PrediXcan uses precomputed tissue weights from the GTEx project database (https://www.gtexportal.org/) as the reference transcriptome dataset via Elastic net models. As input data, we included our GWAS summary statistics, transcriptome tissue data, and covariance matrices of the SNPs within each gene model (based on HapMap SNP set; available to download at the PredictDB Data Repository) from all available tissues (N=49). We applied a Bonferroni correction for multiple testing across all tissues (N=21,565; Supplementary Table 10).
LDSC heritability and genetic correlations
Linkage Disequilibrium Score regression (LDSC; https://github.com/bulik/ldsc) was used to calculate heritability (SNP-h2) and genetic correlations between habitual coffee intake and other phenotypes165. SNP-h2 was calculated from publicly available, pre-computed LD scores (“eur_w_ld_chr/”). LDSC was also used to calculate genetic correlations (rg) between habitual coffee intake and 317 selected traits informed by prior literature across the following categories: substance use and misuse, anxiety, bipolar disorder & depression, cancer, cardiovascular, diet, gastrointestinal, lifestyle, metabolic, neurological, pain, personality, other anthropologic, other health, and other psychiatric traits.
mtCOJO
We conditioned summary statistics on traits that are correlated with coffee intake using mtCOJO94 to determine if genetic associations are retained after controlling for their effects. Conditioning on GWAS summary statistics for dietary sugar intake166, Alcohol Use Disorder Identification Test (AUDIT) consumption43, and cigarettes smoked per day73 were examined at a p<1.0E-5 and clump-r2 threshold of 0.10.
Phenome and laboratory-wide association studies
We conducted phenome-wide association analyses (PheWAS) and Laboratory-wide association analyses (LabWAS) to test the association between polygenic scores (PGS) for coffee intake and liability across thousands of medical conditions from hospital-based cohorts. These analyses were conducted using data from the Vanderbilt University Medical Center (VUMC). The project was approved by the VUMC Institutional Review Board (IRB #160302, #172020, #190418). VUMC is an integrated health system with individual-level health data from electronic health record (EHR) data for about 3.2 million patients. The VUMC biobank contains clinical data from EHR as well as biomarkers obtained from laboratory assessments. A portion of the individuals from VUMC also have accompanying array genotyping data. This cohort, with over 72,821 patients, is called BioVU167,168.
For each of the unrelated genotyped individuals of European ancestry from BioVU, we computed polygenic scores for coffee intake using the PRS-CS “auto” version167. Genotyping and quality control for this cohort have been extensively described168,169.
Phenome-wide association analyses (PheWAS)
To identify associations between the PGS for coffee and clinical phenotypes, we performed a PheWAS. We fitted a logistic regression model to each of 1,338 case/control disease phenotypes (“phecodes”) to estimate the odds of each diagnosis given the coffee PGS, while adjusting for sex, median age of the longitudinal EHR, and the first 10 PCs. Analyses were conducted using the PheWAS v0.12 R package170. We required the presence of at least two International Disease Classification codes mapped to a PheWAS disease category (Phecode Map 1.2; https://phewascatalog.org/phecodes) and a minimum of 100 cases for inclusion of a phecode. The disease phenotypes included 145 circulatory system, 120 genitourinary, 119 endocrine/metabolic, 125 digestive, 117 neoplasms, 91 musculoskeletal, 85 sense organs, 76 injuries & poisonings, 65 dermatological, 76 respiratory, 69 neurological, 64 mental disorders, 42 infectious diseases, 42 hematopoietic, 34 congenital anomalies, 37 symptoms, and 31 pregnancy complications.
Laboratory-wide association analyses (LabWAS)
We also examined laboratory results in BioVU, which we refer to as LabWAS. We implemented the pipeline already established by Dennis, et al. 169. Broadly, LabWAS uses the median, inverse normal quantile transformed age- adjusted values from the QualityLab pipeline in a linear regression to determine the association with the input coffee intake PGS variable. We controlled for the same covariates as for the PheWAS analyses, excluding median age because the pipeline corrects for age using cubic splines with 4 knots.
Cluster analysis
Previous studies have shown that consumption and misuse/dependence phenotypes have a distinct genetic architecture43,111,116–119. To explore whether the coffee intake analysis clustered closer to substance intake or misuse/dependence phenotypes, we used an unsupervised machine learning hierarchical clustering algorithm known as agglomerative nesting (AGNES)167 on a genetic correlation matrix of all traits. AGNES initially forms single-item clusters that are fused together into intermediate groups until all traits are included in a single cluster171. Clusters are formed with Ward’s method such that the total within cluster variance is minimized while maintaining the fewest number of clusters based on cluster dissimilarity. Dissimilarity is assessed through Euclidean Distance of each pairwise genetic correlation with another trait. The product of AGNES is a dendrogram, formed with multiple brackets called “branches”. AGNES was implemented in R using the cluster package (ver2.1.4)167.
Clustering was conducted with summary statistics of cigarettes per day73, former smoker73, smoking initiation73, problematic opioid use161, ICD10 F17 nicotine dependence172, alcohol dependence173, AUDIT consumption116, AUDIT problems116, cannabis initiation79, cannabis use disorder118, drinks per week73, externalizing psychopathology96, Fagerström Test for Nicotine Dependence (FTND)174, general risk tolerance175, age of smoking initiation73, and opioid use disorder161. The genetic correlations of cigarettes per day, former smoker, and smoking initiation were reverse coded to show the intuitive effects against the other traits in the dendrogram.
Acknowledgements
We would like to thank the research participants and employees of 23andMe for making this work possible. The following members of the 23andMe Research Team contributed to this study: Stella Aslibekyan, Adam Auton, Elizabeth Babalola, Robert K. Bell, Jessica Bielenberg, Katarzyna Bryc, Emily Bullis, Daniella Coker, Gabriel Cuellar Partida, Devika Dhamija, Sayantan Das, Teresa Filshtein, Kipper Fletez-Brant, Will Freyman, Karl Heilbron, Pooja M. Gandhi, Karl Heilbron, Barry Hicks, David A. Hinds, Ethan M. Jewett, Yunxuan Jiang, Katelyn Kukar, Keng- Han Lin, Maya Lowe, Jey C. McCreight, Matthew H. McIntyre, Steven J. Micheletti, Meghan E. Moreno, Joanna L. Mountain, Priyanka Nandakumar, Elizabeth S. Noblin, Jared O’Connell, Aaron A. Petrakovitz, G. David Poznik, Morgan Schumacher, Anjali J. Shastri, Janie F. Shelton, Jingchunzi Shi, Suyash Shringarpure, Vinh Tran, Joyce Y. Tung, Xin Wang, Wei Wang, Catherine H. Weldon, Peter Wilton, Alejandro Hernandez, Corinna Wong, Christophe Toukam Tchakouté.
We would also like to thank The Externalizing Consortium for sharing the GWAS summary statistics of externalizing. The Externalizing Consortium: Principal Investigators: Danielle M. Dick, Philipp Koellinger, K. Paige Harden, Abraham A. Palmer. Lead Analysts: Richard Karlsson Linnér, Travis T. Mallard, Peter B. Barr, Sandra Sanchez-Roige. Significant Contributors: Irwin D. Waldman. The Externalizing Consortium has been supported by the National Institute on Alcohol Abuse and Alcoholism (R01AA015416 -administrative supplement), and the National Institute on Drug Abuse (R01DA050721). Additional funding for investigator effort has been provided by K02AA018755, U10AA008401, P50AA022537, as well as a European Research Council Consolidator Grant (647648 EdGe to Koellinger). The content is solely the responsibility of the authors and does not necessarily represent the official views of the above funding bodies. The Externalizing Consortium would like to thank the following groups for making the research possible: 23andMe, Add Health, Vanderbilt University Medical Center’s BioVU, Collaborative Study on the Genetics of Alcoholism (COGA), the Psychiatric Genomics Consortium’s Substance Use Disorders working group, UK10K Consortium, UK Biobank, and Philadelphia Neurodevelopmental Cohort.
MVJ, SBB, and SSR are supported by funds from the California Tobacco-Related Disease Research Program (TRDRP; Grant Number T29KT0526 and T32IR5226). SBB and AAP were also supported by P50DA037844. BP, JJM, and SSR are supported by NIH/NIDA DP1DA054394. HHAT is funded through a Natural Science and Engineering Research Council PGS-D scholarship and Canadian Institutes of Health Research (CIHR) Fellowship. JYK is supported by a CIHR Canada Research Chair in Translational Neuropsychopharmacology. LKD is supported by R01 MH113362. NSC is funded through an Interdisciplinary Research Fellowship in NeuroAIDs (Grant Number R25MH081482). JM is funded by the National Institute for Health and Care Research (NIHR) Maudsley Biomedical Research Centre at South London and Maudsley NHS Foundation Trust and King’s College London. The content is solely the responsibility of the authors and does not necessarily represent the official views of the National Institutes of Health.
The datasets used for the PheWAS and LabWAS analyses described were obtained from Vanderbilt University Medical Center’s BioVU which is supported by numerous sources: institutional funding, private agencies, and federal grants. These include the NIH funded Shared Instrumentation Grant S10RR025141; and CTSA grants UL1TR002243, UL1TR000445, and UL1RR024975. Genomic data are also supported by investigator-led projects that include U01HG004798, R01NS032830, RC2GM092618, P50GM115305, U01HG006378, U19HL065962, R01HD074711; and additional funding sources listed at https://victr.vumc.org/biovu-funding/. PheWAS and LabWAS analyses used CTSA (SD, Vanderbilt Resouces). This project was supported by the National Center for Research Resources, Grant UL1 RR024975-01, and is now at the National Center for Advancing Translational Sciences, Grant 2 UL1 TR000445-06.