Abstract
Bipolar disorder (BD) is a heritable mental illness with complex etiology. While the largest published genome-wide association study identified 64 BD risk loci, the causal SNPs and genes within these loci remain unknown. We applied a suite of statistical and functional fine-mapping methods to these loci, and prioritized 22 likely causal SNPs for BD. We mapped these SNPs to genes, and investigated their likely functional consequences by integrating variant annotations, brain cell-type epigenomic annotations, brain quantitative trait loci, and results from rare variant exome sequencing in BD. Convergent lines of evidence supported the roles of SCN2A, TRANK1, DCLK3, INSYN2B, SYNE1, THSD7A, CACNA1B, TUBBP5, PLCB3, PRDX5, KCNK4, AP001453.3, TRPT1, FKBP2, DNAJC4, RASGRP1, FURIN, FES, YWHAE, DPH1, GSDMB, MED24, THRA, EEF1A2, and KCNQ2 in BD. These represent promising candidates for functional experiments to understand biological mechanisms and therapeutic potential. Additionally, we demonstrated that fine-mapping effect sizes can improve performance and transferability of BD polygenic risk scores across ancestrally diverse populations, and present a high-throughput fine-mapping pipeline (https://github.com/mkoromina/SAFFARI).
Introduction
Bipolar disorder (BD) is a heritable mental illness with complex etiology1. Heritability estimates from twin studies range between 60% and 90%2–4, while SNP-based heritability (h²SNP) calculations suggest that common genetic variants can explain up to 20% of the phenotypic variance of BD5. Genome-wide association studies (GWAS) of common variants have been successful in identifying associated genetic risk loci for BD5–15. For example, the largest published BD GWAS to date, conducted by the Psychiatric Genomics Consortium (PGC), comprised more than 40,000 BD cases and 370,000 controls from 57 cohorts of European ancestries, and identified 64 genome-wide significant (GWS) risk loci16. However, identifying the causal SNPs within these loci (i.e., SNPs responsible for the association signal at a locus and with a biological effect on the phenotype, as opposed to those associated due to linkage disequilibrium (LD) with a causal variant) is a major challenge.
Computational fine-mapping methods aim to identify independent causal variants within a genomic locus by modeling LD structure, SNP association statistics, number of causal variants, and/or prior probabilities of causality based on functional annotations. There are a variety of fine-mapping models ranging from regression to Bayesian methods, with different strengths and limitations17–19. For example, the Sum of Single Effects (SuSiE) model uses iterative Bayesian selection with posterior probabilities20, FINEMAP employs a stochastic search algorithm for SNP combinations21, and POLYgenic FUNctionally-informed fine-mapping (PolyFun) computes functional priors to improve fine-mapping accuracy18. Bayesian fine-mapping methods typically generate a posterior inclusion probability (PIP) of causality per SNP, and “credible sets” of SNPs, which represent the minimum set of SNPs with a specified probability of including the causal variant(s). Many methods can assume one or multiple causal variants per locus, and can now be applied to GWAS summary statistics from large and well-powered studies. This is highly advantageous for fine-mapping GWAS meta-analyses; however, the specification of appropriate LD structure is crucial for accurate fine-mapping. When LD cannot be obtained from the original cohort(s) (e.g. due to data access restrictions), it should instead be obtained from a sufficiently large sample that is ancestrally similar to the GWAS population22.
Fine-mapping methods have recently been applied to GWAS of psychiatric disorders. For example, a recent study using FINEMAP and integrating functional genomic data identified more than 100 genes likely to underpin associations in risk loci for schizophrenia23. Several fine-mapped candidates had particularly strong support for their pathogenic role in schizophrenia, due to convergence with rare variant associations23. Here, we use a suite of tools to conduct statistical and functional fine-mapping of 64 GWS risk loci for BD16 and assess the impact of the LD reference panel and fine-mapping window specifications. We link the likely causal SNPs to their relevant genes and investigate their potential functional consequences, by integrating functional genomic data, including brain cell-type-specific epigenomic annotations, and quantitative trait loci data. We also fine-mapped the major histocompatibility complex (MHC) separately by imputing human leukocyte antigen (HLA) variants, and assessed the impact of fine-mapping on polygenic risk score (PRS) predictions. Finally, we present a comprehensive fine-mapping pipeline implemented via Snakemake24 as a rapid, scalable, and cost-effective approach to prioritize likely variants from GWS risk loci. This strategy yielded promising candidate genes for future experiments to understand the mechanisms by which they increase risk of BD.
Methods
GWAS summary statistics and BD risk loci
Summary statistics from the latest published BD GWAS by the Psychiatric Genomics Consortium (PGC) were used as input to the fine-mapping pipeline16. Briefly, this GWAS comprised 41,917 BD cases, and 371,549 controls of European (EUR) ancestries, from 57 cohorts (Supplementary Table S1). Most genomic data were imputed using the Haplotype Reference Consortium (HRC) EUR ancestry reference panel v1.025, leading to a total of 7,608,183 SNPs that were well-imputed and well-represented across cohorts in the GWAS meta-analysis. Each GWS locus window was established around the GWS significant “top lead” SNP (P < 5 × 10−8), with boundaries defined by the positions of the 3’-most and 5’-most SNPs, requiring an LD r2 > 0.1 with the top lead SNP within a 3 Mb range, according to the LD structure of the HRC EUR reference panel16. The GWAS meta-analysis identified 64 independent loci associated with BD at GWS, which were selected for fine-mapping. Due to the complexity and long-range LD of the MHC/HLA region, this locus was analyzed separately (see section ‘Fine-mapping the MHC locus’). Supplementary Table S2 shows the top lead SNP from each GWS locus, association statistics, locus boundaries, locus size, and locus names (as defined in the original GWAS)16.
Conditional analysis
Figure 1 shows an overview of the fine-mapping pipeline. First, stepwise conditional analyses were conducted using GCTA-COJO26 to identify potential independent association signals within each locus, according to the LD structure of the HRC EUR reference panel. Association tests were performed for all SNPs in each locus, conditioning sequentially on the top lead SNP, until no conditional tests were significant (conditional P > 5 x 10-6), to calculate the number of independent association signals per locus. A less stringent P value threshold (P < 5 x 10-6) for significance was selected for the conditional analysis, consistent with the recommendations of Yang et al26.
LD reference panels
Statistical and functional fine-mapping methods require information on LD between variant and selection of a genomic region (“window”) to fine-map. To examine the impact of LD on fine-mapping, analyses were performed using LD information from the HRC EUR reference panel, published LD matrices based on EUR ancestry individuals in the UK Biobank (UKB)18, and “in-sample” LD calculated from a subset of 48 BD cohorts in the PGC BD GWAS for which individual-level genetic data were available within the PGC (33,827 cases, 53,953 controls, all of EUR ancestries), representing 73% of the total effective sample size of the GWAS. Briefly, HRC-imputed dosage data were converted to hard calls with a genotype call probability cut-off of 0.8 and PLINK binary files were merged across cohorts, restricting to the set of unrelated individuals included in the GWAS, using PLINK v1.9028. Missingness rates per SNP were calculated in each cohort, and SNPs absent in all individuals from any one cohort were excluded from the merged dataset, yielding 7,603,435 SNPs overlapping with the GWAS summary statistics. Individual-level genetic data per chromosome were used as an “in-sample” LD reference panel for fine-mapping.
Two fine-mapping “windows” were selected: 1) a 3 Mb window and 2) the GWS locus window defined in the PGC BD GWAS16. The 3 Mb window is a 3 Mb-long region which includes the top lead SNP of the GWS locus, predefined according to published LD matrices of 3 Mb blocks covering the entire genome calculated in the UKB18. The GWS locus window is established around a GWS significant “top lead” SNP (P < 5 × 10−8). Its boundaries are defined by the positions of the 3’-most and 5’-most SNPs, requiring an LD r2 greater than 0.1 with the top lead SNP within a 3 Mb range, in accordance with the LD structure of the HRC EUR reference panel. Excluding the MHC, GWS locus windows ranged between 14,960 - 3,730,000 bp in size (Supplementary Table S2).
Statistical and functional fine-mapping
GWS loci were fine-mapped using a suite of Bayesian fine-mapping methods that can be applied to GWAS summary statistics: SuSiE, FINEMAP, PolyFun+SuSiE, PolyFun+FINEMAP (Figure 1). SuSiE and FINEMAP are statistical fine-mapping methods, while PolyFun incorporates functional annotations as prior probabilities to improve subsequent fine-mapping accuracy18,20,21. Since these methods have different underlying assumptions, strengths and limitations, results were compared to examine convergence of evidence across methods. Briefly, each Bayesian method generates SNP-wise posterior inclusion probabilities of causality (PIP), and a 95% credible set (95% CS), defined as the minimum subset of SNPs that cumulatively have at least 95% probability of containing the causal SNP(s). PIP refers to the marginal probability that a SNP is included in any causal model, conditional on the observed data, hence providing weight of evidence that a SNP should be considered potentially causal.
First, single variant fine-mapping, which makes the simple assumption of one causal variant per locus (K = 1) and does not require LD information18,20,21, was performed using each of the four methods and both the 3 Mb and GWS locus fine-mapping windows. FINEMAP and SuSiE can assume multiple causal variants per locus, modeling the LD structure between them. Fine-mapping was additionally performed within each of the two windows, assuming the default maximum of five causal variants per locus (K = 5) and separately using the HRC, UKB and “in-sample” LD structures. Finally, PolyFun was used to incorporate 187 published functional annotations from the baseline-LF 2.2.UKB model27 to compute prior causal probabilities (priors) via an L2-regularized extension of stratified LD-score regression (S-LDSC)29, and subsequently perform fine-mapping using FINEMAP and SuSiE18. Briefly, functional annotations included epigenomic and genomic annotations, minor allele frequency (MAF) bins, binary or continuous functional annotations, LD-related annotations such as LD level, predicted allele age, recombination rate, and CpG content27. Functionally-informed fine-mapping was also performed using the three LD reference panels and two fine-mapping windows.
In total, 32 fine-mapping analyses were conducted (24 multi-variant analyses using four fine-mapping methods, three LD reference panels and two fine-mapping windows, and eight LD-independent single-variant fine-mapping analyses), varying parameters to examine their impact and the convergence of results. “Consensus SNPs” were defined as those in the 95% CS from at least two methods that used the same LD structure and fine-mapping window, and with a PIP >0.95 or >0.50 (Table 1) (36 opportunities for a SNP to be a consensus SNP). When single-variant fine-mapping was performed, consensus SNPs were defined as those with a PIP >0.95 or PIP >0.50 and found in at least two methods using the same fine-mapping window. The “union consensus” set of SNPs was defined as all consensus SNPs across LD structure and fine-mapping windows with PIP >0.50, excluding SNPs identified only with the UKB LD reference panel and with a high GWAS P value.
All steps of the statistical and functional fine-mapping analyses have been compiled into a high-throughput pipeline named SAFFARI (Statistical And Functional Fine-mapping Applied to GWAS Risk LocI). SAFFARI is implemented through Snakemake in a Linux environment24, with options to provide sets of GWAS summary statistics, lists of fine-mapping windows, and to specify LD reference panels, in the form of LD matrices or individual-level genetic data (GitHub: https://github.com/mkoromina/SAFFARI).
Annotation of union consensus SNPs
Union consensus SNPs were characterized using the Variant Effect Predictor (VEP) (GRCh37) Ensembl release 10930. When SNPs were mapped to multiple transcripts, the most severe variant consequence was retained for annotation, and when SNPs fell within intergenic or regulatory regions, no genes were annotated30. If annotated genes overlapped and the SNP had the same severity consequence, then both genes were annotated. Additional annotations included the CADD scores (https://cadd.gs.washington.edu/), which denote the likelihood of the variant being deleterious or disease-causing (CADD >= 20) and ClinVar annotations (https://www.ncbi.nlm.nih.gov/clinvar/) describing the association of variants with diseases (i.e., benign, pathogenic, etc). Union consensus SNPs were further annotated with RegulomeDB (v.2.2) to determine whether they have functional consequences and lie in non-coding regions and to annotate them to the relevant regulatory elements31. RegulomeDB probability and ranking scores are positively correlated and predict functional variants in regulatory elements. Probability scores closer to 1 and ranking scores below 2 provide increased evidence of a variant to be in a functional region31. Probability of being loss-of-function intolerant (pLI) and loss-of-function observed/expected upper bound fraction (LOEUF) scores were retrieved from the Genome Aggregation Database (gnomAD) v4.0.0. Genes were classified as intolerant to loss of function (LoF) variants if LOEUF< 0.6 or pLI ≥0.9. We also used DGIdb v.5.0.132 to detect any druggable genes amongst our set of high confidence genes for BD risk.
QTL integrative analyses
Union consensus SNPs were investigated for putative causal relationships with BD via brain gene expression, splicing or methylation, using Summary data-based Mendelian randomization (SMR) (v1.03)33,34. Data on expression quantitative trait loci (eQTLs) and splicing quantitative trait loci (sQTLs) were obtained from the BrainMeta study (v2), which comprised RNA-seq data of 2,865 brain cortex samples from 2,443 unrelated individuals of EUR ancestries with genome-wide SNP data35. Data on methylation quantitative trait loci (mQTLs) were obtained from the Brain-mMeta study36, a meta-analysis of adult cortex or fetal brain samples, comprising 1,160 individuals with methylation levels measured using the Illumina HumanMethylation450K array. We analyzed cis-QTLs, which were defined as those within 2 Mb of each gene35. Of the 22 union consensus SNPs, 10 were present in the BrainMeta QTL data, and 10 were present in the Brain-mMeta data. Using the BD GWAS16 and QTL summary statistics35, each union consensus SNP was analyzed as the target SNP for probes within a 2 Mb window on either side using the -- extract-target-snp-probe option in SMR. Only probes for which the union consensus SNP was a genome-wide significant QTL (P < 5 x 10-8) were analyzed, to ensure robustly associated instruments for the SMR analysis33,34. A Bonferroni correction was applied for 579 probes tested in the eQTL (PSMR < 8.64 x 10-5), 2,257 tests in the sQTL (PSMR < 2.21 x 10-5) and 45 tests in the mQTL analyses (PSMR < 1.11 x 10-3). The significance threshold for the HEIDI test (heterogeneity in dependent instruments) was PHEIDI ≥ 0.0134. The HEIDI test is used to identify potential violations of the Mendelian Randomization assumption, specifically the assumption of no horizontal pleiotropy. A SNP with passing the Bonferroni-corrected PSMR and the PHEIDI thresholds indicates either a direct causal role or a pleiotropic effect of the BD-associated SNPs on gene expression, splicing or methylation level.
Overlap with epigenomic peaks and rare variant association signal
Union consensus SNPs were examined for physical overlap with promoters or enhancers of gene expression in human brain cell-types. Data on epigenomic peaks were obtained from purified bulk, H3K27ac and H3K4me3 ChIP-seq of neurons and astrocytes previously published and used to detect active promoters and enhancers37. Physical overlap was visually examined via locus plots using R (R version 4.1.2). For SNPs located in promoters, we assigned the corresponding gene name. For active enhancers, the target gene was assigned based on PLAC-Seq data37 on enhancer-promoter interactions. Genes linked to union consensus SNPs via overlap with epigenomic peaks, SMR, or missense annotations, were further assessed for convergence with findings from an exome sequencing study of BD published by the Bipolar Exome (BipEx) Collaboration38. Using the BipEx browser38, genes annotated to union consensus SNPs were compared for an overlap against BipEx genes characterized by a significant (P < 0.05) burden of either damaging missense or LoF variants.
Fine-mapping the MHC locus
The major histocompatibility complex (MHC) locus was fine-mapped separately due to its complex genetic variation and long-range LD structure39. The human leukocyte antigen (HLA) alleles and amino acid variants were imputed in the PGC BD data, using the 1000 Genomes phase 3 reference panel comprising 503 EUR individuals40 with HLA alleles determined via sequencing. This reference was obtained from the CookHLA GitHub repository41 (CookHLA v.1.0.1) and included 151 HLA alleles (65 2-digit and 86 4-digit) with a MAF >0.01 and <0.99, 1,213 amino acid variants, and 1,268 SNPs within the MHC region (chromosome 6, 29-34 Mb).
Variation in the MHC was imputed for 48 BD cohorts where individual-level genotyped SNP data were available within the PGC (33,827 BD, 53,953 controls), using IMPUTE2, implemented via the Rapid Imputation and COmputational PIpeLIne for GWAS (RICOPILI)42. RICOPILI was used to perform association analysis, under an additive logistic regression model in PLINK v1.9028, covarying for the first 5 principal components (PCs) of genetic ancestry and any others associated with case-control status within each cohort, as per the BD GWAS16. To control test statistic inflation at variants with low MAF in small cohorts, variants were retained only if cohort MAF was greater than 1% and minor allele count was greater than 10 in either cases or controls (whichever had smaller N). Meta-analysis of the filtered association statistics was conducted using an inverse-variance-weighted fixed-effects model in METAL (version 2011-03-25) via RICOPILI43.
Conditional analysis of the MHC-association results was performed to identify whether there are any additional independent associations, by conditioning on the top lead variant within the locus. In brief, the dosage data for the top lead variant in the meta-analysis were extracted for each cohort, converted into a single value representing the dosage of the A1 allele (range 0-2), and this was added as a covariate in the analysis. Association testing, filtering of results per cohort, and the meta-analysis were carried out as described above.
Polygenic risk scoring
Fine-mapping results were further evaluated by testing whether fine-mapping effect sizes could improve the performance of PRS in independent cohorts using PolyPred44, a method which combines effect sizes from fine-mapping with those from a standard PRS approach, such as PRS-CS45. PRS were calculated for individuals in 12 testing cohorts of BD cases and controls that were independent of the BD GWAS: three new PGC cohorts of EUR ancestries, two cohorts of East Asian ancestries, four cohorts of admixed-African American ancestries, and three cohorts of Latino ancestries, some of which have been described previously16 (Supplementary Note).
An analytical workflow outlining the steps of the PolyPred pipeline that we followed is shown in Supplementary Figure S1. First, the standard approach used was PRS-CS, which uses a Bayesian regression framework to place continuous shrinkage priors on effect sizes of SNPs in the PRS, adaptive to the strength of their association signal in the BD GWAS16, and the LD structure from an external reference panel45. The UKB EUR ancestry reference panel was used to estimate LD between SNPs, matching the ancestry of the discovery GWAS16. PRS-CS yielded weights for approximately 1 million SNPs to be included in the PRS. Second, genome-wide fine-mapping was performed on the BD GWAS summary statistics16, using both SuSiE and PolyFun-SuSie as previously described with LD information obtained from the HRC reference panel, to derive causal effect sizes for all SNPs across the genome. Third, PolyPred was used to combine the SNP weights from PRS-CS with SuSie effect sizes (“SuSie+PRS-CS”) and SNP weights from PRS-CS with PolyFun-SuSiE effect sizes (“Polypred-P”). Briefly, Polypred “mixes” the effect sizes from the two predictors via the non-negative least squares method, assigning a weight to each predictor that yields the optimally performing PRS in a specific testing cohort. Each testing cohort was used to tune the optimal PolyPred weights. Fourth, three PRS were calculated for each individual in the testing cohorts, using PLINK v1.9028 to weight SNPs by their effect sizes from PRS-CS, SuSiE+PRS-CS and Polypred-P respectively, and sum across all SNPs in each PRS. Finally, PRS were tested for association with case versus control status in each testing cohort using a logistic regression model including PCs 1-5 and any others as necessary to control for genetic ancestry46. In each testing cohort, the amount of phenotypic variance explained by the PRS (R2) and the 95% confidence intervals were calculated on the liability scale47, using the r2redux R package48, assuming a lifetime prevalence of BD in the general population of 2%. The R2 of each fine-mapping-informed PRS was statistically compared against the R2 of PRS-CS using the r2redux package (r2_diff function)48.
Results
Fine-mapping identifies likely causal BD variants
Stepwise conditional analyses using GCTA-COJO were performed in each of the 64 BD GWS loci (Supplementary Table S2), conditioning associations on their top lead SNP and any subsequent conditionally independent associations, to identify loci that contained independent signals (conditional Pl<l5x10-6). This analysis supported the existence of one association signal at 62 loci (Supplementary Table S3), and two independent association signals within the MSRA locus on chromosome 8 and the RP1-84O15.2 locus on chromosome 8 (Supplementary Table S3).
Excluding the MHC, GWS loci were fine-mapped via a suite of Bayesian fine-mapping tools (SuSiE, FINEMAP, PolyFun+SuSiE, PolyFun+FINEMAP) to prioritize SNPs likely to be causal for BD, and examine the impact of different LD reference panels and fine-mapping windows (see Methods). Table 1 shows the number of SNPs with a PIP >0.95 and PIP >0.50 in each fine-mapping analysis. Generally, incorporating functional priors using PolyFun yielded greater numbers of SNPs passing these thresholds (mean 52% increase). Typically, more SNPs exceeded these PIP thresholds when using the larger 3 Mb fine-mapping window versus the GWS locus window (mean 25% increase). As anticipated, making the simple assumption of a single-causal variant per locus to comply with the conditional analysis and not leveraging LD information prioritized the fewest SNPs. Incorporating an LD reference panel to allow for multiple causal SNPs within a locus prioritized more SNPs, with the UKB reference panel yielding the most SNPs with PIPs exceeding our specified thresholds.
Approximately a third of GWS loci (N= 20) had high PIP SNPs (>0.50). Employing different fine-mapping methods and LD reference panels revealed a substantial number of consensus SNPs with PIP >0.50, particularly with a 3 Mb window, but fewer met the stricter threshold of PIP >0.95. The number of 95% credible sets per locus varied based on the fine-mapping method (Supplemental Note, Supplemental Figure S2, Supplemental Figure S3).
The union consensus set (PIP >0.5) comprised 22 SNPs (from 20 GWS loci), indicating that many of the same SNPs were prioritized regardless of which LD reference panel or fine-mapping window was used (Figure 2). There were 9 SNPs consistently prioritized as the likely causal variant across all LD reference panels and fine-mapping windows (Figure 2, Supplementary Figure S3). The distribution of SNPs with PIP >0.95 for each GWS locus across different methods, LD reference panels and fine-mapping windows is provided in Supplemental Table S4.
Characterization of union consensus SNPs
Union consensus SNPs were initially characterized using standard metrics from online databases. Variant annotation of the union consensus SNPs via VEP30 indicated that 7 of the 22 fall in intronic regions (Supplementary Table S5). There were 20 SNPs annotated to genes using VEP, and 16 of which were the closest gene to the index SNP (Supplementary Table S5). Three of the union consensus SNPs are missense variants: rs17183814 in SCN2A (CADD: 20, ClinVar benign), rs4672 in FKBP2 (CADD: 22.5, not in ClinVar) and rs11549690 in TRPT1 (CADD: 19.8, not in ClinVar). According to RegulomeDB annotations, two union consensus SNPs likely have regulatory functions: rs4331993 (intronic in SYNE1) and rs10867108 (3’ untranslated region of CACNA1B) (Supplementary Table S5). Of the protein coding genes prioritized by VEP annotations, 10 of the 18 had high pLI scores (≥0.9) and 12 had low LOEUF scores (<0.65), indicating intolerance to LoF variants (Figure 3).
QTL integrative analyses and overlap with epigenomic peaks
Summary data-based Mendelian randomization (SMR)33,34 was used to identify putative causal relationships between union consensus SNPs and BD via gene expression, splicing or methylation, by integrating the BD GWAS association statistics with brain eQTL, sQTL and mQTL summary statistics. eQTL and sQTL data were based on the BrainMeta study (2,865 brain cortex samples from 2,443 unrelated individuals of EUR ancestries)35 and mQTL data were from the Brain-mMeta study (adult cortex or fetal brain samples in 1,160 individuals)36. Union consensus SNPs with genome-wide significant cis-QTL P values (P < 5x10-8) and their corresponding gene expression, slicing or methylation probes were selected as target SNP-probe pairs for SMR, yielding 579, 2,257 and 45 SNP-probe pairs for eQTL, sQTL and mQTL analyses respectively. In the eQTL analyses, there were 6 union consensus SNPs with significant PSMR that passed the HEIDI (heterogeneity in dependent instruments) test for 10 different genes, suggesting that their effect on BD is mediated via gene expression in the brain (Figure 3, Supplementary Table S6). Four of the union consensus SNPs showed evidence of causal effects on BD via expression of more than one gene in their cis-region. In the sQTL analyses, there were 8 union consensus SNPs with significant PSMR results, and passing the HEIDI test, implicating 12 genes (Figure 3, Supplementary Table S6). In the mQTL analyses, there were 24 SNP-probe pairs passing the PSMR and PHEIDI thresholds; of which two methylation probes were annotated to specific genes (FKBP2 and PLCB3) (Figure 3, Supplementary Table S6).
There were 12 union consensus SNPs that physically overlapped with active enhancers or promoters of gene expression in brain cell-types37, particularly neurons (Figure 3). Four union consensus SNPs were located in active promoters of the SCN2A, THSD7A, FKBP2 and THRA genes. Through the utilization of PLAC-seq data, we explored enhancer-promoter interactions, specifically for enhancers in which there is a physical overlap with the union consensus SNPs, and prioritized their genes (Figure 3). Amongst the implicated target genes through enhancer-promoter interactions are INSYN2B, SYNE1, RASGRP1, CRTC3, DPH1 and KCNQ2.
Candidate risk genes based on convergence of evidence
By aggregating multiple lines of fine-mapping validation evidence, we present results for high-confidence genes for BD. Specifically, a gene was characterized as high-confidence if it was linked to a fine-mapped SNP via active promoters or enhancers, brain gene expression, splicing or methylation, or if the fine-mapped SNP was a missense variant (Figure 3, Supplementary Figure S5). Taken together, the data support the roles of the following 25 genes in BD: SCN2A, TRANK1, DCLK3, INSYN2B, SYNE1, THSD7A, CACNA1B, TUBBP5, PLCB3, PRDX5, KCNK4, AP001453.3, TRPT1, FKBP2, DNAJC4, RASGRP1, FURIN, FES, YWHAE, DPH1, GSDMB, MED24, THRA, EEF1A2, and KCNQ2. Supplementary Figure S5 provides multi-track locus plots depicting GWAS association statistics, fine-mapping results, overlap with epigenomic peaks from neurons or astrocytes and gene tracks for 11 loci as examples. We assessed the high-confidence genes for evidence of rare variant associations with BD, using data from the BipEx exome sequencing study38. Amongst the 25 genes examined, THSD7A, CACNA1B, SCN2A and TRANK1 had a significant burden (p < 0.05) of damaging missense or LoF variants in BD versus controls. Many high-confidence genes were classified as druggable based on DGIdb v5.0.3 (SCN2A, CACNA1B, PRDX5, THRA, EEF1A2, KCNQ2 and FES).
Dissecting the MHC locus
We performed association analyses of variants in the MHC region (chromosome 6, 29-34 Mb) including HLA alleles, amino acids, SNPs and insertion/ deletion variants, in a sample of 33,781 BD cases and 53,869 controls. The most significant variant in the MHC was rs1541269, located at 30.1 Mb (OR A allele = 1.12, 95% CI = 1.08-1.15, P = 6.71x10-12). This SNP is in moderate LD with the top lead MHC SNP in the full BD GWAS (rs13195402, 26.5 Mb, LDlink EUR r2 = 0.55)16, which was not present in the imputation reference panel used here. After conditioning on the top lead variant in the MHC from this association analysis (rs1541269), no variants in the MHC remained GWS, suggesting a single association signal across the region in BD (Supplementary Figure S4B, Supplementary Table S8). Prior to conditioning, there were 10 variants in HLA genes reaching GWS (P < 5x10-8), including 3 classical HLA alleles, 2 amino acid variants and 5 intragenic HLA SNPs (Supplementary Figure S4A, Supplementary Table S7). HLA-C*0701 showed a protective effect on BD (OR = 0.91, 95% CI = 0.88-0.94, P = 3.40x10-10) as did HLA-B*0801 (OR = 0.91, P= 4.08x10-8) (Supplementary Figure S4A, Supplementary Table S7). An amino acid change from lysine to asparagine at position 66 in HLA-C and the presence of aspartic acid at position 9 in HLA-A also had protective effects on BD and were GWS (Supplementary Table S7). However none of the HLA variants remained GWS after conditioning on the top lead SNP, rs1541269.
Leveraging fine-mapping to improve polygenic risk scoring
We assessed whether fine-mapping results could be used to improve the performance of BD PRS in 12 testing cohorts: three EUR cohorts that were independent of the BD GWAS, two East Asian cohorts, four admixed African American cohorts, and three Latino cohorts46,49,50. Standard PRS were calculated using the PRS-CS method, and fine-mapping informed PRS were calculated via PolyPred, to integrate statistical fine-mapping results (SuSiE+PRS-CS) or functional fine-mapping results (Polypred-P). Across PRS methods, PRS were significantly higher in BD cases versus controls in all EUR target cohorts and most non-EUR cohorts (Figure 4, Supplementary Table S9). Using PRS-CS, the mean phenotypic variance explained on the liability scale was 11.42% in EUR ancestries, 2.39% in East Asian ancestries, 0.14% in African American ancestries and 0.30% in Latino ancestries (Figure 4, Supplementary Table S9). Examining fine-mapping-informed PRS, SuSiE+PRS-CS or Polypred-P explained more phenotypic variance than PRS-CS in 9/12 cohorts, with PolyPred-P typically showing the best performance (Figure 4). However, increased variance explained by SuSiE+PRS-CS or Polypred-P compared with PRS-CS, was only statistically significant in the Japanese BD cohort (P = 1.22x10 and P = 2.29x10 respectively), one African American (P = 0.035 and P = 0.044 respectively) and one Latino cohort (P = 0.046 and P = 0.002 respectively) (Supplementary Table S9, Figure 4).
Discussion
In the most comprehensive fine-mapping study of BD GWAS risk loci to date, we applied a suite of statistical and functional fine-mapping methods to prioritize 22 likely causal SNPs for BD in 20 genomic loci. We linked these SNPs to genes and investigated their likely functional consequences, by integrating variant annotations, brain cell-type epigenomic annotations, brain QTLs, and results from exome sequencing in BD. Convergence of evidence across these analyses prioritized 25 high-confidence genes, which are strong candidates for functional validation experiments to understand the mechanisms by which they increase risk of BD.
We defined a union consensus set of SNPs representing those likely causal for BD based on the convergence between fine-mapping methods, LD reference panels and fine-mapping windows. This comprised 22 SNPs (from 20 GWS loci), indicating that many of the same SNPs were prioritized across fine-mapping analyses (Figure 2). Linking these SNPs to genes and investigating their likely functional consequences using computational approaches and relevant datasets, prioritized 25 high-confidence genes (Figure 3). The SCN2A (Sodium channel protein type 2 subunit alpha) gene had two union consensus SNPs, one intronic and another a missense variant located in a neuronal promoter. SCN2A has been implicated in epilepsy and neurodevelopmental disorders51. THSD7A encodes N-glycoprotein thrombospondin type 1 domain containing 7A, which mediates endothelial cell migration, tube formation and in neuroangiogenesis52. The gene is highly expressed in inhibitory and excitatory neurons and has been previously associated with treatment response to antiepileptic drug mood stabilizers amongst BD patients53. CACNA1B (Calcium Voltage-Gated Channel Subunit Alpha1 B) was prioritized through fine-mapping of rs10867108 which is located in the 3’ UTR of the gene and SMR analysis suggested this SNP increases risk of BD through increased cortical expression of CACNA1B. SCN2A, THSD7A and CACNA1B were also all supported by the BipEx exome sequencing study, with a significant burden (p < 0.05) of damaging missense or LoF variants in BD versus controls38. Our findings align with the results of drug target enrichment analyses performed on the BD GWAS results, which found significant enrichment of associations amongst genes encoding targets of calcium channel blockers and antiepileptics16. Taken together, these genes and pathways warrant functional investigation to understand biological mechanisms and potential for therapeutic modulation.
The FURIN and TRANK1 genes had strong evidence for a causal role in BD, and both have been studied previously in relevant functional experiments. Here, FURIN was prioritized through fine-mapping of rs4702, which overlaps a neuronal enhancer, with SMR suggesting that the BD risk allele acts via decreasing cortical expression of FURIN. This mechanism has been studied via CRISPR editing in human induced pluripotent stem cell (iPSC)-derived neurons, where allelic conversion to the BD risk allele in excitatory neurons resulted in decreased FURIN expression, neurite length, and firing duration54. In our study, the nearby FES gene was also prioritized, as SMR indicated that rs4702 plays a role in cortical expression and alternative splicing of FES. This gene encodes a tyrosine kinase which is involved in many aspects of cellular differentiation. The prioritization of two genes in the same locus, underscores that one fine-mapped SNP may have multiple functional consequences. TRANK1 was implicated as a high confidence gene through the role of the fine-mapped SNP (rs9834970) in cortical expression, and the burden of damaging rare variants identified in BipEx. iPSC-derived neural progenitor cells carrying the BD risk allele at this SNP, have been shown to exhibit lower TRANK1 expression than homozygotes for the non-risk allele55. Furthermore, this experiment showed that TRANK1 expression levels could be rescued by chronic treatment with therapeutic doses of valproic acid, an antiepileptic which is also used to treat mania. Other high-confidence genes include FKBP2, which plays an important role in immunoregulation, protein folding and trafficking56 and SYNE1, with one of its transcripts encoding CPG2, a brain-specific protein localized to excitatory postsynaptic sites, which regulates glutamate receptor internalization57. PLAC-seq data also prioritized genes with relevant biological functions, such as INSYN2B (inhibitory synaptic factor family member 2b), KCNQ2 (potassium voltage-gated channel subfamily Q member 2) and DPH1 (diphthamide biosynthesis 1) with variants in this gene associated with developmental delay58. Of the 25 high-confidence genes, CACNA1B, TRPT1, YWHAE and EEF1A2 were not the closest gene to the GWAS index SNP, and of the 22 union consensus SNPs, 8 were not the top lead SNP from the GWAS, illustrating the utility of fine-mapping. We observed that only two genes (FURIN, DCLK3) overlapped with genes associated with schizophrenia. This could be explained by the polygenic nature of both disorders and the limited and partial overlap (N = 15) of the 64 BD GWS loci against the 287 SCZ GWS loci23.
In the MHC, there were several polymorphic alleles and amino acid variants in the HLA-C and HLA-B genes associated with BD at GWS (chromosome 6, 31.2-31.3 Mb). The HLA-C*07:01 and HLA-B*08:01 alleles were negatively associated with BD, in line with previous studies reporting their protective effects on SCZ59,60. However, these associations were removed after conditioning on the top lead variant in the MHC (rs1541269, 30 Mb), suggesting the effects were driven by LD with more strongly associated variants located upstream. This is consistent with published findings in the PGC BD data, showing no association between the structural variants in the complement component 4 genes (C4A/C4B) (∼31.9 Mb) and BD, either before or after conditioning on the most associated MHC SNP (rs13195402, 26.4 Mb)16. Overall, this analysis of HLA variation in BD again suggests a single association signal across the MHC, and that the causal variants and genes are outside the classical MHC locus, in contrast to findings in schizophrenia61.
Fine-mapping-informed PRS, developed by combining GWAS effect sizes and genome-wide fine-mapping effect sizes using PolyPred, generally explained a greater proportion of phenotypic variance compared with PRS based on GWAS effect sizes alone. This was true in all EUR and East Asian ancestry testing cohorts, and some admixed African American and Latino cohorts (Figure 4, Supplementary Table S9). The increase in phenotypic variance explained by fine-mapping-informed PRS adds support to our fine-mapping results, as leveraging information on causal effect sizes rather than relying solely on association statistics is expected to improve genetic risk prediction. However, the increase in the phenotypic variance explained by fine-mapping-informed PRS versus PRS-CS was only statistically significant in 3 of 12 tested cohorts. For some cohorts, this may be explained by their modest sample sizes, but in one admixed African American cohort and three Latino cohorts, fine-mapping-informed PRS explained less phenotypic variance than PRS-CS. These results suggest that PolyPred does not perform well in admixed ancestries. Despite incorporating causal effect sizes from genome-wide fine-mapping, with the expectation that causal variants are shared across ancestries, the performance of all PRS in non-European cohorts still lagged greatly behind that in Europeans. Nevertheless, we anticipate that the sharing of genome-wide fine-mapping results in a manner analogous to GWAS summary statistics will facilitate calculation of fine-mapping-informed PRS in many cohorts, and the development of improved PRS methods, particularly for admixed populations.
Our strategy of applying a suite of fine-mapping methods and examining the convergence of the results was driven by the variety of the underlying fine-mapping algorithms, and their corresponding strengths and limitations. Consistent with previous literature, we detected more SNPs with high PIPs when incorporating functional priors using PolyFun18. As expected, the specification of LD structure, fine-mapping window, and number of causal variants impacted fine-mapping results. Considering “in-sample” LD from the PGC BD data (albeit a subset of cohorts that were available) as the gold-standard, using the HRC reference panel yielded the most similar fine-mapping results. This observation may be explained by the HRC being used as an imputation reference panel for almost all cohorts in the GWAS. Results suggest that a large and well-matched LD reference panel to the GWAS sample can be used to achieve high-quality fine-mapping results. This has advantageous implications in scenarios when calculating in-sample LD is not possible due to data sharing restrictions, or when obtaining LD information from many cohorts becomes increasingly challenging as GWAS meta-analyses grow. Moreover, although conditional analysis indicated 1 causal variant per GWS locus, our results are highly consistent when using LD reference panels and allowing up to 5 causal variants per GWS locus. The latter analyses also yielded more likely causal SNPs. To facilitate rapid and scalable fine-mapping of GWAS loci, we developed a fine-mapping pipeline (GitHub: https://github.com/mkoromina/SAFFARI) with options to specify multiple fine-mapping methods, GWAS summary statistics, fine-mapping windows, and LD reference panels.
Several limitations of this study and future directions must be noted. First, our fine-mapping focused exclusively on EUR ancestry data, owing to the composition of the PGC BD GWAS. However, this enabled us to investigate the impact of LD reference panels on fine-mapping, which would be challenging for diverse ancestry data, given the limited availability of such panels at present. Increasing ancestral diversity in BD GWAS is an active area of research46, and in future the differences in LD structure between populations could be leveraged to aid fine-mapping62 and PRS predictions44. Second, we approximated “in-sample LD” of the GWAS as we only had access to a subset of the individual-level data (73% of the total effective sample size), we used best guess genotypes to represent imputed dosages, and we merged genotypes across cohorts and calculated LD, in contrast to the GWAS, which was a meta-analysis between cohorts. Third, this study prioritized likely causal variants or genes at 20 of the 64 GWS loci, meaning that many loci were not robustly fine-mapped. Our approach was conservative in that we focused on SNPs with high PIPs (>0.50), that were part of credible sets, and were supported by different fine-mapping models. The improvements in PRS performance after integrating genome-wide fine-mapping results, suggest that our analyses capture meaningful information on causality in other genomic regions that did not meet the stringent criteria we applied to fine-map GWS loci. Fourth, these statistical analyses prioritize variants and genes with high-probabilities of being causal risk factors for BD, however computational approaches fall short of proving causality, and have limited capacity to uncover mechanisms. Finally, the enhancer, promoter and QTL data used may be incomplete due to cell-type or context-specific effects, or incomplete mapping of active enhancers to their target genes, and therefore some union consensus SNP effects may not have been detected in our analysis.
In summary, we conducted a comprehensive statistical and functional fine-mapping analysis of BD genomic loci, yielding a resource of likely causal genes and variants for the disorder. These genes and variants now require investigation in functional laboratory experiments to validate their roles, understand mechanisms of risk, and examine opportunities for therapeutic intervention in BD.
Data availability
The PGC’s policy is to make genome-wide summary results public. Genome-wide fine-mapping results will be made available through the PGC website upon publication (https://www.med.unc.edu/pgc/results-and-downloads). Individual-level genetic data are accessible via Secondary Analysis Proposals to the Bipolar Disorder Working Group of the PGC (https://www.med.unc.edu/pgc/shared-methods/how-to/). This study included some publicly available datasets accessed through dbGaP - PGC bundle phs001254.v1.p1.
Code availability
Analysis scripts are available online at [Github: https://github.com/mkoromina/SAFFARI]. All software used is publicly available at the URLs or references cited.
Competing interests
OAA has served as a speaker for Janssen, Lundbeck, and Sunovion and as a consultant for Cortechs.ai. SKS has served as speaker for Janssen, Takeda and Medice Arzneimittel Puetter GmbH & CoKG. EV has received grants and served as consultant, advisor or CME speaker for the following entities (unrelated to the present work): AB-Biotics, Abbott, Abbvie, Adamed, Angelini, Biogen, Biohaven, Boehringer Ingelheim, Casen-Recordati, Celon, Compass, Dainippon Sumitomo Pharma, Ethypharm, Ferrer, Gedeon Richter, GH Research, Glaxo Smith-Kline, Idorsia, Janssen, Johnson & Johnson, Lundbeck, Newron, Novartis, Organon, Otsuka, Rovi, Sage, Sanofi-Aventis, Sunovion, Takeda, and Viatris. PBM has received remuneration from Janssen (Australia) and Sanofi (Hangzhou) for lectures, and Janssen (Australia) for advisory board membership. MOD and MJO have received grants from Akrivia Health and Takeda Pharmaceuticals for work unrelated to this project.
Acknowledgements
For the purposes of open access, the author has applied a Creative Commons Attribution (CC BY) license to any Accepted Author Manuscript version arising from this submission. We thank the participants who donated their time, life experiences and DNA to this research and the clinical and scientific teams that worked with them. This project was funded by the Baszucki Brain Research Fund via the Milken Institute Center for Strategic Philanthropy. We are deeply indebted to the investigators who comprise the PGC. The PGC has received major funding from the US National Institute of Mental Health (PGC4: R01MH124839, PGC3: U01 MH109528; PGC2: U01 MH094421; PGC1: U01 MH085520). Statistical analyses were carried out on the NL Genetic Cluster Computer (http://www.geneticcluster.org) hosted by SURFsara and the Mount Sinai high performance computing cluster (http://hpc.mssm.edu), which is supported by the Office of Research Infrastructure of the National Institutes of Health under award numbers S10OD018522 and S10OD026880. The content is solely the responsibility of the authors and does not necessarily represent the official views of the National Institutes of Health. Full acknowledgements are included in the Supplementary Note. JRIC is supported by a grant from the Medical Research Foundation (MRF-001-0012-RG-COLE-C0930) and by the NIHR Maudsley Biomedical Research Centre at South London and Maudsley NHS Foundation Trust and King’s College London. The views expressed are those of the authors and not necessarily those of the NIHR or the UK Department of Health and Social Care. EV thanks the support by CIBER - Consorcio Centro de Investigación Biomédica en Red-(CB07/09/0004), Instituto de Salud Carlos III, Spanish Ministry of Science and Innovation and grants PI18/00805 and PI21/00787, integrated into the Plan Nacional de I+D+I and co-financed by the ISCIII-Subdirección General de Evaluación and the Fondo Europeo de Desarrollo Regional (FEDER); the Instituto de Salud Carlos III; the Secretaria d’Universitats i Recerca del Departament d’Economia i Coneixement (2021 SGR 01358), the CERCA Programme, and the Departament de Salut de la Generalitat de Catalunya for the PERIS grant SLT006/17/00357. Thanks also for the support of the European Union Horizon 2020 research and innovation program (EU.3.1.1. Understanding health, wellbeing and disease: Grant No 754907 and EU.3.1.3. Treating and managing disease: Grant No 945151). PBM has been funded through an Australian National Health and Medical Research Council Investigator Grant (1177991). Work for the Japanese cohort was supported by Japan Agency for Medical Research and Development (AMED) grants 22wm0425008, 21ek0109555, 21tm0424220, 21ck0106642, 23ek0410114 and 23tm0424225, Japan Society for the Promotion of Science (JSPS) KAKENHI grant 21H02854 and JP20H00462. Work for the ‘for 2107’ cohort was supported through funds by the German Research Foundation (DFG grants FOR2107 KI588/14-1, and KI588/14-2, and KI588/20-1, KI588/22-1 to Tilo Kircher, Marburg, Germany). Biosamples and corresponding data were sampled, processed and stored in the Marburg Biobank CBBMR. Ethics approval was obtained from the ethics committees of the Medical Schools of the Universities of Marburg (approval identifier Studie 07/2014) and Münster, respectively, in accordance with the Declaration of Helsinki. All subjects volunteered to participate in the study and provided written informed consent.