Abstract
Genome-wide association studies (GWAS) have underrepresented individuals from non-European populations, impeding progress in characterizing the genetic architecture and consequences of health and disease traits. To address this, we present a population-stratified phenome-wide GWAS followed by a multi-population meta-analysis for 2,068 traits derived from electronic health records of 635,969 participants in the Million Veteran Program (MVP), a longitudinal cohort study of diverse U.S. Veterans genetically similar to the respective African (121,177), Admixed American (59,048), East Asian (6,702), and European (449,042) superpopulations defined by the 1000 Genomes Project. We identified 38,270 independent variants associating with one or more traits at experiment-wide (P < 4.6x10-11) significance; fine-mapping 6,318 signals identified from 613 traits to single-variant resolution. Among these, a third (2,069) of the associations were found only among participants genetically similar to non-European reference populations, demonstrating the importance of expanding diversity in genetic studies. Our work provides a comprehensive atlas of phenome-wide genetic associations for future studies dissecting the architecture of complex traits in diverse populations.
One Sentence Summary To address the underrepresentation of non-European individuals in genome-wide association studies (GWAS), we conducted a population-stratified phenome-wide GWAS across 2,068 traits in 635,969 participants from the diverse U.S. Department of Veterans Affairs Million Veteran Program, with results expanding our knowledge of variant-trait associations and highlighting the importance of genetic diversity in understanding the architecture of complex health and disease traits.
Main text
Among published GWAS, 95% of participants are genetically similar to individuals from European reference populations (1). The lack of diversity in genetic research has limited the generalizability of GWAS discoveries (2), creating significant concerns about compounding existing disparities in healthcare. This concern can be addressed partly by expanding the representation of diverse populations in large-scale genomic studies. When embraced, such efforts can identify new disease variants (3), elucidate disease pathobiology (4), and inform treatments that include personalized disease prevention strategies that are broadly applicable to individuals from diverse backgrounds (5, 6).
Electronic health record (EHR)-linked biobanks have played a pivotal role in genomic discovery by providing information on the genome and the phenome in large numbers of individuals (7–12). Historically, genetic association studies have traditionally been performed on cohorts assembled by recruiting participants with specific traits of interest, often from European populations. In contrast, genetic association studies from biobanks with diverse representation enable multi-population analyses that can identify population-specific genetic associations across multiple diseases or traits, assess their clinical significance, reveal pleiotropic effects of genetic variants, and validate previously identified genetic associations across populations (13, 14). The growth of increasingly diverse EHR biobanks are poised to begin to address inequities in genetic research (8, 15–18).
The Department of Veteran Affairs (VA) Million Veteran Program (MVP) is a longitudinal health, genomic, and precision medicine cohort established in 2011. MVP’s goal is to enroll at least one million Veterans from diverse populations: as of this writing, over 950,000 Veterans have volunteered for the program. The MVP database contains array-based genome-wide genotyping as well as longitudinal EHR data and questionnaire data. Here, we present a multi-population phenome-wide set of GWAS in 635,969 Veterans, of whom 29% are genetically similar to non-European population groups inferred from the 1000 Genomes Project reference panel. We characterize the genetic architecture of 2,068 traits across a comprehensive catalog of phenotypes representing EHR-derived diagnosis codes, clinical laboratory tests, and vital signs, as well as survey responses. We identify independently associated signals using contemporary fine-mapping methods, characterizing pleiotropy across traits and populations, highlighting signals found in non-European population groups (fig. S1).
Results
Study Design, Population Groups, and Phenotypic Definitions
In this study, we analyzed data from 635,969 participants (MVP Genomics Release 4 (19)), aggregated into four population groups based on genetic similarity to the 1000 Genomes Project (20) African (AFR, n = 121,177), Admixed Americans (AMR, n = 59,048), East Asian (EAS, n = 6,702), and European (EUR, n = 449,042) superpopulations (Fig. 1, table S1). After imputation and quality control (QC) filtering, > 44.3M variants (with minor allele count (MAC) > 20) were included for analysis (Methods). The frequency and imputation quality scores of single nucleotide polymorphisms (SNPs) among the population groups are provided (See Data availability).
We extracted phenotypic trait data from the VA EHR, which consisted of diagnosis codes, laboratory measures, and vital signs. Additionally, we included responses to survey questions on health and behavior administered at MVP enrollment. After trait QC, 1,854 binary and 214 quantitative traits were included in the downstream analysis in at least one population group (n=2,068, Fig. 1, Methods, Supplementary Note).
Biobank-scale genomic analysis across populations identifies tens of thousands of trait associations
We next turned to the substantial computational task of generating the >350 billion variant-trait associations across population groups. Unfortunately, the current implementation of the Scalable and Accurate Implementation of GEneralized mixture model (SAIGE) algorithm (21) – ideal for our design to address case/control imbalances – was not analytically tractable at this scale of computation and would have required ∼251 compute years to complete. As such, we enhanced the computational efficiency of this algorithm with baseline improvements, implemented a graphics processing unit (GPU) optimization for performing matrix operations, and completed analyses on the US Department of Energy (DOE)’s Oak Ridge Leadership Computing Facility Summit and Andes systems. Using this framework, we conducted a total of 4,045 independent GWAS for traits that met QC criteria in each population group (table S2), followed by a multi-population meta-analysis using GWAMA (22) for traits present in two or more populations (Methods). The actual analysis took 14,286 GPU hours (14 days of wall time), leading to an overall 160-fold reduction in the core hours required.
After multi-population meta-analysis, we identified 98,485 associations (38,269 lead variants at 23,584 loci from 1,274 traits) with a study-wide significance of P_<_4.6 × 10−11 (Methods, Fig. 2, and table S3); 1,096 binary traits (on average, 21 mean associations per trait, Fig. 2A) and 178 quantitative traits (on average, 421 mean associations per trait, Fig. 2B) exhibited significant associations. The mean genomic inflation factor across all traits was 1.01 (range from 0.85 to 1.19), indicating that the test statistics error rates were relatively controlled (fig. S2). Consistent with previous reports (23), we noted that the most common variant associations (minor allele frequency (MAF) > 1%) were located in non-coding regions (fig. S3); 85.6% (20,158) of variant-binary trait associations (Fig. 2A, 2B, and tables S4, S5) and 63.3% (47,496) of variant-quantitative trait associations (Fig. 2C, Fig. 2D, table S4, S5) have not been previously identified as lead variants or proxies (r2 > 0.6 within 500 kb, Methods) in the NHGRI-EBI GWAS (24) and Open Target Genetics catalogs (25). In fact, 92% (41,979) of the variants associated with quantitative traits and 83% (16,161) with binary traits have not previously been associated with any other trait. Approximately 57% of novel variants had MAF < 1% (Fig. 2C and 2D). Further, enrichment analysis to determine whether trait categories were over- or under-represented among novel SNP-trait associations indicated significant over-representation in the 8/22 binary and 6/17 quantitative trait categories (fig. S4 and table S6). In particular, this included disease categories such as hematopoietic and infectious disease (OR: 13.08, 95% CI: [7.88-23.49], P < 6.58 x 10-56; OR: 36.78, 95% CI: [10.12 – 305.07], P < 2.03 x 10-26), while more well-studied disease categories like neoplasms and circulatory system disorders were less enriched (OR=0.70, 95% CI: [0.62 – 0.79], P < 2.87x 10-09; OR=0.46, 95% CI: [0.41 – 0.52], P < 1.03x 10-37) in novel SNP-trait associations.
Multi-population meta-analysis improves the power to detect significant associations not detected in the EUR population alone
To quantify the suite of discoveries made through expanding representation in genetic analysis, we compared the results of the multi-population meta-analysis to those of the EUR-only GWAS. Over half of the variants analyzed were not included in the EUR population GWAS due to MAF or imputation quality, and a quarter (10M) were only available in AFR (Supplementary Note). The inclusion of AFR, AMR, and EAS individuals identified 1,608 additional genomic loci, which were not significant (P > 4.6 x 10-11) in the EUR-only analysis (Table S7). This led to a total of 3,477 variant-trait associations across 893 traits, 76% of which were with binary traits. The most significant of these results was a rare intronic variant, rs72725854, located near the long non-coding RNA (lncRNA), PCAT2, associated with prostate cancer (table S7). This single nucleotide polymorphism (SNP) is enriched in African populations (MAFAFR = 0.06, MAFAMR = 0.0068, MAFEUR =0.0006). It has been previously reported to increase the risk of prostate cancer two-fold, aligning with our study findings (26–28). We also replicated other previously reported African population-specific associations, such as ACKR1 for neutropenia and reduced white blood count levels (29) and a missense variant, rs73885319 in APOL1 with kidney-related conditions such as end-stage renal disease (table S7) (30).
Moreover, we identified 834 variant-trait associations primarily driven by non-EUR populations, as these associations were not even nominally significant in EUR populations (P > 0.05, table S7). We discovered an AFR-specific non-coding index variant in FAM234A associated with iron deficiency anemias (PAFR = 2.37 x 10-37, PAMR = 0.05, PEUR=0.42, table S7) and hereditary hemolytic anemias (PAFR = 5.32 x 10-33, PAMR = 0.28, PEUR=0.25, table S7). We also identified a novel association of rs3104394 in MTCO3P1 with alopecia areata reached experiment-wide significance only in AMR (PAFR = 0.01, PAMR = 1.27 x 10-11, PEUR= 7.66 x 10-6). Although there is no information available about the relationship between the MTCO3P1 gene and alopecia, a cross-sectional analysis of the NIH ’All of Us’ cohort found that alopecia areata is more prevalent in the Hispanic/Latinx community, suggesting potential genetic factors contributing to the development of this condition (31).
Lastly, we found 265 variant-trait associations that were significant in AFR or AMR populations and absent in EUR populations. As a positive control, one of the lead variants was rs334 (HBB), a well-established variant at increased frequency in African populations due to balancing selection as it is protective against malaria but causes sickle cell anemia risk allele (table S8).
Fine mapping of multi-population associations reveals single-variant credible sets
To create a catalog of putative causal genetic variants at our association signals, we performed signal-based fine-mapping using the Sum of Single Effects model implemented in SuSiE (32, 33). For the fine-mapping analysis, we defined 25,953 locus-trait pairs, corresponding to 1,257 traits with one or more study-wide significant variants outside the major histocompatibility complex (MHC) (fig. S5). We fine-mapped 99% of these pairs within each population group using exact, in-sample matched linkage disequilibrium (LD) matrices for the trait and identified signals at 22,866 (88%) of the pairs (fig. S5, Methods). The 1% of locus-trait pairs that failed to map were primarily due to computational constraints (table S9). The fine-mapped signals included 15,045 unique variant-trait pairs (6,318 variants and 613 traits) that mapped with high confidence, i.e. a posterior inclusion probability (PIP) > 0.95 in one or more populations. We merged signals across populations based on their Jaccard similarity indices (34) and identified 57,601 multi-population signals across 936 phenotypes (fig. S5, tables S10-11); 53,669 (93.1%) of the signals were mapped in a single population including 44,516 (77.3%) that were fine-mapped in only the EUR population (Fig. 3A). However, >75% of the signals that were fine mapped in only a single population segregated with one or more unmapped populations (directionally consistent effect estimate and with P < 10-3) or the underlying GWAS was underpowered to detect a suggestive association in the unmapped population (80% power or less). A larger effective meta-analysis sample size, and thus greater power, was correlated with a larger number of mapped signals (Fig. 3B). Among the 15,045 high-confidence pairs, 2,069 variant-trait associations were fine-mapped with high-confidence only in the non-EUR populations (table S12). These associations correspond to 974 unique variants and 271 traits (Fig. 3A).
To quantify the precision of fine mapping for the multi-population results, we conservatively defined an ‘approximate’ credible set for each Jaccard-similarity population-aligned signal as the union of variants in each population-level credible set. Despite this conservative definition, we observed that >54% of the merged signals identified by the fine-mapping pipeline contained ≤5 variants and 14,405 (25%) contained a single variant (Fig. 3C). To compare the relative precision of fine mapping between populations, we determined whether there was a difference in the size of our approximate credible sets for signals that mapped in multiple groups. Signals identified in both the AFR and EUR populations generally had slightly but significantly smaller sets in the AFR mapping than in the EUR mapping (Wilcoxon signed-rank P = 2.26 × 10-10; fig. S6). In contrast, our approximate credible sets in AMR mapping were larger than their AFR (P = 1.30 × 10-84, fig. S6) and EUR counterparts (P = 7.36 × 10-162, fig. S6). Overall, we observed an enrichment of annotations in coding regions for variants in signals that fine-mapped to fewer than ten variants, which increased as a function of fine-mapping precision (Fig. 3D, table S13). However, even among the single-variant signals, ∼90% of the fine-mapped variants were in non-coding regions. Among the non-coding fine-mapped variants, we observed a slight enrichment of higher functional prediction scores from RegulomeDB (35) relative to all significantly-associated variants (fig. S7).
We next analyzed the distribution of effect sizes and allele frequencies for lead variants and fine-mapped signals for the 15,822 variant-trait-population combinations with high confidence (PIP > 0.95) fine-mapped signals. Consistent with previous reports (36, 37), we observed an inverse relationship across all four population groups between the minor allele frequency of a variant and its effect size for both lead variants (fig. S8) and high-confidence signals (Fig. 3E). For the high-confidence signals, we additionally examined the relationship between frequencies and effect sizes for alleles derived in the human lineage since the last common ancestor of chimpanzees and bonobos (Fig. S8). Because 86.9% of derived alleles were the minor allele, it was not surprising that we saw strong effects for variants with allele frequencies close to 0. Large effect sizes were also observed for several variants with whose derived allele was high-frequency, some of which map to previously reported targets of positive selection in human populations (38–40). We observed this relationship between allele frequency and effect size for both newly observed variant-trait associations and those previously reported in the GWAS Catalog (24), with similar relative proportions in the three well-powered populations (AFR, AMR, EUR). Interestingly, the distribution of effects in binary and quantitative phenotypes was different. Although it was equally common for minor and derived alleles at high-confidence signals to associate with an increase or decrease in a quantitative trait, e.g., higher white blood cell count (WBC) or lower WBC, (49.6% of minor and derived alleles were associated with a higher value of the quantitative trait), the majority, (71%) of these alleles conveyed increased risk for binary traits. The increased-risk effect among minor alleles was also observed for lead variants; however, only 64% of lead-SNP minor alleles increased the risk (fig. S8).
Cross-trait genetic architecture identifies pleiotropic loci
Next, we investigated the associations of high-confidence fine-mapped variants (PIP > 0.95) with two or more traits. Filtering the 15,045 high-confidence fine-mapped variant-trait pairs, fine-mapped above to include only traits with a phenotypic correlation coefficient > |0.2| resulted in 3,955 variant-trait pairs across 3,201 variants and 377 traits (table S14, fig. S9). Among these, 87 variants were associated with three or more traits resulting in 360 associations (Fig. 4). Although a total of 172 out of these 360 associations reported were previously observed in prior studies (24, 25), we detected 188 novel variants associated with multiple traits across 30 trait categories (Fig. 4). In particular, the missense variant rs429358 in the APOE gene is linked to 24 different traits, including those previously identified conditions such as macular degeneration, abdominal aortic aneurysm, dementia, Alzheimer’s disease, and HDL levels. We observed new associations between this variant and chronic liver disease and cirrhosis.
Population-specific heritability and genetic correlation patterns for complex traits
To characterize phenotypic variation attributable to common genetic variants across the four major population groups, SNP heritability was calculated using linkage disequilibrium score regression (LDSC) with population-specific GWAS results and in-sample LD reference panels (Supplementary Note). This analysis identified significant (P < 9 × 10-6) SNP heritability for 233 traits (n = 1,525, mean h2 = 20.45%) in the AFR, 199 traits (n = 1,226, mean h2 = 22.12%) in the AMR, three traits (n = 353, mean h2 = 50.85%) in the EAS, and 816 traits (n = 1,898, mean h2 = 12.22%) in the EUR groups (fig. S10, table S15). Height was the most heritable trait across all four population groups, consistent with a previous report (41).
We then examined the genetic correlation between pairs of traits with significant heritability within each population group using LDSC and MVP-derived LD reference panels. We identified significant correlations (false discovery rate < 1%) in all populations: 6,424 pairs (671 traits) in AFR, 3,031 pairs (494 traits) in AMR, and 154,503 pairs (1,342 traits) in EUR populations. Although there was considerable overlap in trait pairs across populations, 1,267 unique trait pairs were identified among non-EUR participants. Classifying variables according to genetic correlations and seeking groups with robust correlations (rg > 0.60) revealed numerous clusters of attributes with shared genetic influences (table S16). Despite significant correlations, no trait exhibited an opposite genetic correlation between the two population groups. However, genetic correlations for many trait pairs varied by population. For instance, mean BMI and type 2 diabetes had a genetic correlation of 0.47 (0.41-0.52) in the EUR group, while only 0.28 (0.18-0.39) in the AFR group.
We finally analyzed the genetic correlation of each trait between the population groups using Popcorn (42). With EUR as the reference population, 212 traits exhibited a significant genetic correlation between EUR and AFR (P<3.6 x 10-5, 0.05 / 1380 traits), 13 between EUR and AMR (P<5.61 x 10-5, 0.05 / 890 traits), and six between EUR and EAS (P<1.4 x 10-4, 0.05 / 336 traits, table S17). Specifically, we found the strongest genetic correlation for height (pgi= 0.66) among quantitative traits and type 2 diabetes (pgi= 0.65) among binary traits between EUR and AFR populations. These correlations were not significant when compared with AMR and EAS, likely due to low sample sizes in those populations (table S17). We also noted that some traits, including skin cancer (pgi = 0.05), anemia of chronic disease (pgi =0.08), iron levels (pgi = 0.18), and white blood cell counts (pgi = 0.20), had weaker correlations between different populations, in keeping with population-specific variant-trait associations we identified above. Overall, the large number of observed genetic correlations between traits across populations underscores the role of shared genetic factors contributing to trait variance across our population groups.
Fine-mapped variant-trait associations specific to non-EUR populations
Of the 25,953 high-confidence variant-trait pairs identified by fine-mapping, 2,069 (974 unique variants and 271 phenotypes) were unique in the non-EUR population groups (table S12). Although most of the signals were from low frequency or rare variants, 15 novel signals (10 AFR, 3 AMR, 2 EAS) occurred in coding variants and had a MAF>0.05. Among these was a missense variant, rs73382631, associated with lower WBC and neutrophil counts in the AFR population (MAFAFR = 0.10, MAFAMR = 0.01, not present in EAS, EUR). Another example was a missense coding variant in ABCG2 (rs35965584, MAFAFR = 0.002, not present in AMR, EAS, EUR), for which our analysis identified a novel association with gout. Previous reports have identified an association between ABCG2 and hyperuricemia (43) and susceptibility to gout (44), with another known ABCG2 missense variant of the ABCG2 gene (rs2231142) (45). In MVP, the previously identified missense variant, rs2231142, was within the 95% credible set of a distinct gout signal (n=8 variants) mapped in EAS and EUR but was not in linkage disequilibrium with rs35965584 (r2 = 0.0001).
We noted that the majority of population-specific signals were in non-coding regions. To gather insights into these variants, we used functional prediction scores from RegulomeDB (table S12), identifying 43 previously known associations and 20 novel associations with SNPs that had strong evidence of being regulatory (RegulomeDB score>0.9). The previously reported loci were associated with factors such as hemoglobin A1c, cholesterol measures, heart rate, red blood cell count, and type 2 diabetes (46, 47). All other newly identified associations were related to quantitative traits, such as rs574674363 and lower high-density lipoprotein (HDL) cholesterol levels.
Heterogeneity of effect size was screened across common signals (MAF > 0.05) at 1,888 fine-mapped loci (representing 1,329 separate traits) with overlapping credible sets in multiple populations. This identified 16 heterogeneous variant-trait associations when comparing the AFR to the EUR population and 11 when comparing the AMR to the EUR population (table S18). Focusing on coding variants that mapped to the same trait with high confidence (PIP > 0.95) in multiple populations, we observed six associations with significant heterogeneity in effect size between the estimates in the AFR and the EUR populations, and 2 when comparing the AMR to the EUR populations (table S19). All variant-trait pairs had the same direction of effect. The majority of differences across signals mapped to rs429358 (MAFAFR: AFR= 0.22, MAFAMR=0.12, MAFEUR=0.14), the coding variant tag for APOE-e4 associated with a 30% lower risk of dementia in the AFR compared to the EUR population (48); there was no heterogeneity in the effect estimate between the AMR and the EUR population. The EAS population was underpowered for this heterogeneity analysis (fig. S11).
Discussion
Here, we present a comprehensive, phenome-wide atlas of GWAS analyses in the largest multi-population biobank to date. Analyzing approximately 44.3 million variants across 2,068 traits in 635,969 US Veterans representing four diverse populations, we identified 98,485 independent variant-trait associations. The summary statistics from this effort comprise a freely accessible resource for the community to dissect the genetic basis of health and disease and foundational data to support discovery, bridging the knowledge gap in the underrepresented populations. Additionally, we highlight the essential role of efficient software operating at exascale supercomputing capabilities to perform analyses at this scale, made possible through a collaboration between the VA and DOE.
Using fine-mapping with SuSiE, we identified 57,601 independent signals across 936 of the traits, including 14,405 signals that were fine-mapped to a single variant, enriched for annotations (e.g., coding variants) that a priori are likely causal. Despite this, 90.8% of approximate credible sets that mapped to a single variant carried only “‘non-coding” functional annotation, with the majority mapping to “introns” (52.7%, i.e., “within a gene”) or “non-gene” (26.2%, i.e., “outside of gene”) annotations. This suggests the presence of large numbers of biologically important variants within proximal or distal elements acting on genes to regulate their expression, where high-throughput functional assays are well positioned to systematically test this set of variants for function (49).
Our analysis expands on previous, large-scale fine-mapping experiments (34, 50, 51) aimed at determining the candidate causal variant(s), substantially increasing the number of fine-mapped traits and identified signals, particularly for the AFR and AMR populations. The increased representation of diverse participants in our genetic studies also facilitated improved precision of fine-mapping between groups. Notably, the AFR population yielded the most precise approximate credible sets of our three well-powered populations, followed by the EUR and the AMR groups. This finding was anticipated because haplotype blocks are smaller in populations that are genetically similar to African reference populations (52, 53). However, the increased precision of the AFR credible set sizes observed in the paired Wilcoxon test did not translate to anticipated differences in the median credible set sizes for signals mapped in the AFR and EUR populations; the median difference was zero. This lack of difference is likely due to the larger sample size in the EUR population increasing statistical power. Thus, we expect that the inclusion of increasing numbers of non-EUR individuals should continue to improve the precision of signal fine-mapping efforts.
The population diversity of MVP enabled a large-scale comparison of the similarities and differences between variants and traits across populations. As anticipated, fine-mapping results identified overwhelmingly more similarities than differences in the genetic associations between groups. The differences observed were largely attributable to variations in allele frequency or the presence of genetic variants in one population that were monomorphic in other populations. This study highlighted a few population-specific variants that may contribute to long-appreciated differences in trait biology. The novel association between the intergenic variant rs7338263 informs longstanding observations of lower WBC and neutrophil counts in individuals genetically similar to the AFR compared to EUR reference populations (54), adding to previously identified candidate loci explaining this difference (29, 55). The novel signal in the AFR population for rs35965584 with gout in ABCG2 may represent an independent signal for gout risk, in addition to the known AFR-specific variant rs2231142 in ABCG2 (46). As many risk factors are associated with gout, whether this novel signal contributes to the observed higher prevalence of gout in self-reported Black compared to self-reported White populations requires further study (56). Similar trends were seen when considering genome-wide architecture. Genetic heritability of individual traits and genetic correlation between trait pairs supports similar, rather than divergent, cross-population architecture.
Among the most common variants mapped with high precision, there was minimal evidence of heterogeneity in effect estimates. The APOE locus was a notable exception, where we observed an association between the high-confidence fine-mapped signal rs429358 and increased risk for dementia across all four populations examined. However, the risk was 30% lower in the AFR compared to EUR groups, consistent with prior studies that observed differential risk between APOE alleles and dementia in non-EUR compared to EUR populations (48, 57).
Our work must be interpreted within the context of its limitations. To efficiently conduct large-scale GWAS analyses across the phenome, we used an automated approach for phenotyping. This approach involved using Phecodes for collating clinical diagnosis codes and while efficient, could be more precise for most phenotypes. Our regression models also had to be standardized, accounting only for age, sex, and principal components, and performing inverse-normal transformations to quantitative traits prior to analysis. We also encountered challenges in deploying the fine-mapping method at this scale. In particular, the LD matrices used did not ideally synchronize with the SAIGE methodology due to our reliance on hard-called genotypes and not accounting for covariates. This could lead to minor LD mismatches, which may influence sensitive loci, resulting in inaccuracies or spurious results in the fine-mapping stage. Our method of defining loci has potential pitfalls as well. It is based on meta-analyzed data, not considering whether the population-specific GWAS peak was present. Consequently, we fine-mapped some regions with insignificant signals, especially within the EAS population, the smallest population group for which we had limited power. Additionally, we double-counted 115 population-specific signals that overlap between two adjacent loci. Our approach may have also overlooked population-specific peaks eclipsed in the meta-analysis and certain loci too vast to be completely mapped under our scheme. Our conservative approach, adhering to a minimum threshold of significance and purity for signals to maintain precision (positive predictive value), could result in missing true signals. Similarly, our preference for precision over recall (sensitivity) meant we limited the fine-mapping to a maximum of five signals per locus. This approach can lead to an underestimation of the number of signals at certain highly significant loci. Future research may consider these constraints and propose alternative approaches to further enhance the validity and comprehensiveness of the results.
Diversity is a critical feature in advancing precision medicine important in genomic studies because it helps to ensure that the results are generalizable across different populations. Despite large population biobank efforts such as UK Biobank, FinnGen, and Biobank Japan, the limited diversity in the published GWAS literature is quite evident. In recent years, data from more diverse biobanks are becoming become available, such as the All of Us Study Research Program, America Latino Research Biobank, and Human Hereditary and Health in Africa (H3Africa), as well as hospital and institutional biobanks (8, 10, 58). Since its inception, the MVP has sought to include a diverse population representative of the diverse United States Veteran population. As of this writing, MVP has enrolled the largest single population of participants genetically similar to African reference populations among current biobanks, expanding our power to study the genetic underpinning of complex traits in a more equitable manner. Our phenome-wide GWAS of health and disease traits among participants in the VA MVP provides a comprehensive atlas of genetic associations across diverse populations. Our overall results highlight the increased power of discovery that the inclusion of individuals from diverse populations brings to the understanding of the genetics of complex health and disease traits.
Data Availability
The complete summary statistics for all GWAS will soon be accessible to the public. These stats are currently being processed for transfer to dbGAP, and the study's accession number is phs002453.
Funding
The work was supported by the Million Veteran Program award #MVP000. This research used resources from the Knowledge Discovery Infrastructure at the Oak Ridge National Laboratory, supported by the Office of Science of the U.S. Department of Energy under Contract No. DE-AC05-00OR22725_and the Department of Veterans Affairs Office of Information Technology Inter-Agency Agreement with the Department of Energy under IAA No. VA118-16-M-1062. Other support by the National Institute of General Medical Sciences grant R01GM138597 (AV); National Institute Health grant T32 AA028259 (JDD); National Library of Medicine Grant 5R01LM010685 (RJC); National Human Genome Research Institute grant K99HG012222 (WZ); National Institute of Arthritis and Musculoskeletal and Skin Diseases grant P30AR072577 (KPL); National Institute of Diabetes and Digestive and Kidney Diseases grant DK126194 (BFV); National Institute of Health grants NIR01AG067025, K08MH122911 (GV); National Institute of Health grants BX004189, R01AG065582, R01AG067025 (PR); Office of Research and Development, Veterans Health Administration award I01CX001849-01 (JG); Office of Research and Development, Veterans Health Administration awards BX004821, CX001737, BX005831 (YSV); Veterans Health Administration awards IK2-CX001780 (SMD).
Author contributions
Conceptualization
EB, RR, GT, PST, CJO, SM, KC, MJG, RKM, SD, KPL; Methodology: AV, JEH, AR, MC, ML, YLH, YK, DAH, VAP, IG, RT, DCP, RS, MM, XW, DRD, PD, TNN, YS, YVS, SP, AB, WZ, TC, BFV, RKM, SD, KPL; Investigation: AV, JEH, AR, MC, ML, TA, CAB, LG, RJC, RC, SD, JG, AH, SKI, JJ, RK, HK, DL, SWL, VCM, CO, JDD, RP, PR, SV, GV, AJ, CJO, AB, TC, BFV, KC, MJG, RKM, SD, KPL; Visualization: AV, JEH, AR, MC, LG, VAP, JH, TC, BFV, RKM, SD, KPL; Funding acquisition: EB, RR, GT, CJO, SM, MJG, RKM; Project administration: YLH, HG, FL, LC, IG, RT, JH, LD, SW, JC, SM, JM, JPC, KC, RKM, SD, KPL; Supervision: EB, RR, GT, PST, CJO, TC, BFV, KC, MJG, RKM, SD, KPL; Writing – original draft: AV, JEH, AR, MC, ML, BFV, SD, KPL; Writing – review & editing: AV, JEH, AR, MC, ML, YLH, YK, DAH, LG, VAP, HG, FL, LC, IG, RT, JH, LD, SW, JC, DCP, RS, MM, XW, DRD, PD, YS, TNN, TA, CAB, RJC, RC, SD, JG, AH, SKI, JJ, RK, HK, DL, SWL, VCM, CO, JDD, SG, RP, PR, YVS, SV, GV, AJ, EB, RR, GT, SP, PST, CJO, SM, JM, JPC, AB, WZ, TC, BFV, KC, MJG, RKM, SD, KPL.
Competing interests
CJD and JPC are employed full-time by the Novartis Institute of Biomedical Interest (their major contributions to this project were while employed at VA Boston Healthcare System). H.K. is a member of advisory boards for Dicerna Pharmaceuticals, Sophrosyne Pharmaceuticals; Enthion Pharmaceuticals; and Clearmind Medicine; a consultant to Sobrera Pharmaceuticals; the recipient of research funding and medication supplies for an investigator-initiated study from Alkermes member of the American Society of Clinical Psychopharmacology’s Alcohol Clinical Trials Initiative; which was supported in the last three years by Alkermes, Dicerna, Ethypharm, Lundbeck, Mitsubishi, Otsuka, and Pear Therapeutics; and holder of U.S. patent 10,900,082 titled: "Genotype-guided dosing of opioid agonists," issued 26 January 2021. JG and RP are paid for their editorial work in the journal Complex Psychiatry. RP reports a research grant from Alkermes. SD reports grants from Alnylam Pharmaceuticals, Inc, grants from Astellas Pharma, Inc; grants from AstraZeneca Pharmaceuticals LP; grants from Biodesix, grants from Celgene Corporation; grants from Cerner Enviza; grants from GlaxoSmithKline PLC, grants from Janssen Pharmaceuticals, Inc.; grants from Kantar Health; grants from Myriad Genetic Laboratories, Inc.; grants from Novartis International AG; grants from Parexel International Corporation through the University of Utah or Western Institute for Veteran Research outside the submitted work. SMD receives research support from RenalytixAI and Novo Nordisk, outside the scope of the current research, and is named as a co-inventor on a Government-owned US Patent application related to the use of genetic risk prediction for venous thromboembolic disease and for the use of PDE3B inhibition for preventing cardiovascular disease, both filed by the US Department of Veterans Affairs in accordance with Federal regulatory requirements.
Data and materials availability
Full summary statistics of all the GWAS are publicly available and can be requested through dbGAP. The dbGAP accession number is phs002453.
Supplementary Materials
All the supplementary tables and figures can be downloaded from https://tinyurl.com/mww86dn8
Materials and Methods
Million Veteran Program study cohort
The VA Million Veteran Program (MVP) is a national cohort launched in 2011 to determine the contributions of genetics, lifestyle, and military exposures to health and disease among US Veterans (16). Blood biospecimens were collected for DNA isolation and genotyping. The biorepository was linked with the VA EHR, which includes diagnosis codes (International Classification of Diseases ninth revision [ICD-9] and tenth revision [ICD-10]), laboratory measures, and detailed survey questionnaires collected at the time of enrollment for all veterans and followed in the healthcare system until September 2020.
Genotyping and imputation
Genotyping and imputation methods for the MVP were described previously (19). In brief, the single nucleotide polymorphism (SNP) data in the MVP cohort were generated using a custom ThermoFisher Axiom MVP 1.0 genotyping platform, and imputation was performed to a hybrid reference panel comprised of the African Genome Resources panel (59) and 1000 Genomes Project (p3v5) (60). Following imputation, variant level quality control (QC) was performed, and genetic variants with 1) imputation quality < 0.3, 2) minor allele count (MAC) < 20, 3) call rate < 97.5% for common variants (minor allele frequency [MAF] > 1%), and 4) call rate < 99% for rare variants (MAF < 1%) were excluded. Additionally, variants were also excluded if they deviated > 10% from their expected allele frequency based on reference data from the 1000 Genomes Project.
Population group assignment
We used the reference dataset from the 1000 Genomes Project for genetically inferred population estimation with the smartpca module in the EIGENSOFT (61) package to project the principal component (PC) loadings from the group of unrelated individuals from the 1000 Genomes Project reference dataset. We merged the 1000 Genomes dataset with the MVP dataset before running smartpca to project the principal component analysis (PCA) loadings from the reference dataset (fig. S13). To define the genetically similar population, we trained a random forest classifier using cross-population meta-data based on the top 10 PCs from the reference training data. We used a random forest classifier on the predicted MVP PCA data to assign individuals to one of five 1000 Genomes superpopulation groups. Individuals with random forest probability were over 50% for a population were assigned to that population. Those who could not be assigned to a population were excluded from the study.
Phenotype data
Clinical outcomes from EHR were defined by Phecodes (62), which are curated groupings of ICD codes. Each Phecode represents ICD codes grouped into clinically relevant phenotypes for clinical studies. Case status for binary Phecodes were defined as participants with ≥ 2 instances of the corresponding Phecode-mapped ICD-9 CM or ICD-10 CM codes in the EHR, while control status was the presence of zero instances of mapped ICD codes. Based on our previous studies of ICD EHR data, populations where phenotypes comprise <200 cases or controls are most likely to result in spurious results; this threshold was applied to population-specific analyses. Additionally, the study excluded certain conditions considered protected at the VA, such as sickle cell anemia and HIV status, and their data cannot be reported broadly. Laboratory measures for each participant were summarized as mean, minimum, and maximum across all visits. We filtered values greater than six standard deviations from the mean to remove outliers. Only laboratory traits with more than 1,000 individuals within each population group were included in the analyses. Sixty-nine laboratory traits were included and normalized using rank-based inverse-normal transformation. Survey outcomes were collected from the VA baseline and lifestyle surveys administered at MVP enrollment (63) (Supplementary Note).
Genetic association analyses
Within each genetically-inferred population group, we performed a GWAS for each trait of interest to determine the association with each imputed DNA variant. We used the generalized linear mixed model framework to account for participant relatedness and unbalanced case-control ratios with a GPU-optimized version of the SAIGE package implemented on the US Department of Energy’s Summit supercomputer. Directly genotyped variants were used for step 1 of SAIGE. Imputed genetic dosages were used for step 2 of SAIGE. Only variants with an imputation quality > 0.3 and MAC > 20 within the relevant genetically-inferred population groups were included in the GWAS. Analyses were adjusted for age, sex, and ten population-specific genetic PCs.
Post-GWAS QC
GWAS results were filtered using a custom R script based on EasyQC (64). Quality control was implemented to remove variants with missing values for major summary statistics (effect size, standard error, etc.) or with unreasonable values (P-values or allele frequencies > 1 or < 0). Variants that were monomorphic, poorly imputed (r2 < 0.3), or very rare (MAF < 0.0001) in only the subset of individuals included in the GWAS were also removed. The post-GWAS QC, and subsequent meta-analysis and QC were completed using OLCF’s Andes supercomputing cluster.
Meta-analysis
Multi-population meta-analysis was performed using the fixed-effect, inverse-variance weighted method implemented in GWAMA (22). Meta-analysis results were subjected to the same QC procedures as the GWAS results. Imputation quality filters were not implemented; however, an additional filter excluded variants specific to only one population group.
Determination of the number of independent traits tested and the genome-wide significance threshold
We calculated Pearson pairwise correlations for the phenotype residuals derived during step 1 of SAIGE to determine the number of independent traits tested. Calculations were made for each population separately and only for the meta-analyzed traits (i.e., passed phenotype QC in more than one population). PCs were calculated from the correlations as well as the variance explained by each of the PCs. The number of independent traits for each population was determined by the number of PCs required to explain a variance of 0.99. For the meta-analysis, population-specific residuals were combined before calculating PCs. This approach resulted in 1,083 independent traits. The population-specific and meta-analysis P-value significance thresholds were then determined by dividing the traditional genome-wide threshold of 5 x 10-8 by the number of independent traits: 5 × 10-8 / 1083, resulting in our study-wide significance of 4.6 × 10-11.
Derivation of independent signals and lead SNPs
For each full GWAS summary statistics dataset, we used PLINK for clumping and thresholding (65). A cross-population reference panel was created by selecting 5,000 individuals randomly from each population group. We estimated pairwise correlation within each haplotype block to report independent signals selected those with the smallest P-values and counted the number of independent SNPs. The first clumping round grouped SNPs significant at a study-wide P-value threshold (i.e., P < 4.6 × 10-11) and independent at r2 < 0.6. This clumping function reported significant, independent SNPs. The second clumping of significant independent SNPs at r2 < 0.1 reported the lead SNPs. For both steps, we used a 500-kb distance to compute pairwise LD.
Defining loci for fine mapping
We tiled the genome into adjoining, non-overlapping 250kb segments and, for each phenotype, we identified all segments that contained one or more variants with a meta-analyzed P < 5 × 10-8. Due to its long-range LD complexities, we excluded the major histocompatibility complex (MHC) (chr6: 25–36 Mb) from the tiling. Next, within each phenotype, we joined all adjacent significant segments into loci and padded the loci with 250kb on both sides. When significant loci overlapped telomere ends, the telomere-side boundaries of the loci were trimmed to coincide with the chromosomal boundaries. Significant loci overlapping the boundaries of the MHC were similarly trimmed. Finally, we mapped only the loci containing at least one variant significant at the study-wide threshold, P = 4.6 × 10-11. These parameters for defining loci differed from those used to identify lead variants due to the exclusion of the MHC and the retaining of loci that contained significant short insertions or deletions but not SNPs. Insertions and deletions were later removed from the variants used for fine-mapping to avoid strand-flip issues though the loci were still mapped. Twenty-one traits had no lead SNPs outside the MHC, and 4 had no significant variants besides insertions and deletions.
Fine-mapping
We statistically fine-mapped the significant phenotype-locus pairs using the Sum of the Single Effects framework (SuSiE) (32, 65) with the population-specific summary statistics and computed LD matrices. The LD matrices were matched to each population group and trait to eliminate any potential artifacts that could arise from an LD mismatch between the meta-analyzed GWAS outcomes and the reference panel. We carried out this process independently for each population group, which encompassed AFR, AMR, EAS, and EUR ancestries (Supplementary Note). We allowed up to five signals per locus and ran SuSiE through the coloc (66) R package with the z-scores as inputs, the “estimate_residual_variance” flag set to true, and the default uniform prior probability of causality. We calculated 95% credible sets for each identified signal representing the fewest number of variants whose posterior inclusion probabilities (PIP) for the signal summed to ≥ 0.95. We discarded credible sets in which the variants had an absolute minimum correlation < 0.1 and/or a minimum P > 5 × 10-8 for the population group in which the signal was mapped.
We set the maximum number of signals per locus to five instead of the default setting of ten because, during testing, we observed multiple instances of loci with suspicious credible sets at the higher setting. We tested by calculating the residual association for each signal after removing the effects of the other signals according to the following equation:
Here ρl corresponds to the expected residuals from the SuSiE algorithm after disregarding the effect of the l-th signal (see line 4 of the algorithm for Iterative Bayesian stepwise selection using sufficient or summary statistics (32)), N is the sample size, and a is the standard deviation for the phenotype under the SuSiE model. Under the null model, the residual association ∼ N (0,1).
The suspicious credible sets were non-primary signals at loci for phenotypes with suspected highly polygenic architecture including, most notably, height-related traits. These signals generally manifested as single-variant credible sets whose signal-level residual associations appeared as strong outliers in what would have otherwise been loci with the null residual association for the given signals. Reducing the number of signals mapped per locus from ten to five reduced the frequency of these suspicious credible sets.
Merging signals across population groups
To identify fine-mapped signals in multiple populations, we merged the retained 95% credible sets using an approach first reported by Kanai et al. (34). This method involves calculating the PIP-weighted Jaccard similarity indices between all pairs of signals identified for each unique phenotype-locus pair. For each pair of signals, we computed the similarity index as Σi min(xi,yi) / Σi max(xi, yi), where xi and yi are PIP values for the two signals for each variant i that was retained in both populations after making the LD reference panels. We converted each Jaccard similarity index into a distance by taking one minus the similarity index. Then we used the distances to hierarchically cluster the signals using the complete linkage method. We cut the resulting dendrogram tree at the height of 0.9, thereby merging any two credible sets with PIP-weighted Jaccard indices above 0.1 into a single credible set. For any analyses that examined the number of variants in a merged credible set, we defined the variants in each merged approximate credible set as the union of those comprising the component sets.
Computational environment and constraints
We defined loci, constructed LD matrices, fine-mapping, and signal merging on OLCF’s supercomputing infrastructure. LD matrix construction and fine mapping were parallelized across unique combinations of population, phenotype, and locus, and each combination was given 48 hours of wall time and 1 TB of RAM to complete each step. Any analyses that could not be completed within those constraints or that resulted in an error due to SuSiE estimating a negative residual variance were excluded (table S8). All phenotype-locus pairs that appeared in at least one population were put through the signal margining process for the completed populations, and any of their merged signals that passed the absolute minimum correlation and significance thresholds were retained in the fine-mapping results.
Suggestive associations and power calculations
To investigate the cause of a large number of signals mapped in a single population group, we looked for signals that had suggestive evidence of an association or lacked the power to detect a suggestive association in one or more unmapped populations. Signals were considered to show suggestive evidence of association in an unmapped population if any variants in the merged approximate credible set had directionally consistent effects between an unmapped and mapped population and population-specific P < 0.001 in the unmapped population. We also calculated the approximate power to identify associations in the unmapped populations using a two-step approach based on the t-distribution. In the first step, we calculated the critical effect size (absolute beta) for each variant within each unmapped population above which we would detect a suggestive association, given the standard errors and sample sizes for the unmapped population groups. In the second step, we assessed the proportion of an identically shaped t-distribution centered at the “true” effect size for each variant that lay above the critical value or below its negative opposite. For the “true” effect size of a given variant, we assumed the smallest absolute beta across the populations in which the variant was mapped in the credible set. If the proportion outside the critical values was < 80% in any of the unmapped populations and the signal did not show suggestive association, we considered that the signal lacked power.
Effective sample size calculations
To jointly examine the relationship between sample size and the number of fine-mapped signals for quantitative and binary traits, we calculated effective sample sizes for binary traits as the harmonic mean of the cross-population case and control counts (67) for the trait. For quantitative traits, we similarly used the cross-population sample sizes.
Functional annotations and coding enrichment
We used the Variant Effect Predictor (VEP) (68) to determine the most severe functional consequence of each associated and fine-mapped variant. The 39 annotations assigned by VEP were grouped into ten larger categories, including three comprising coding variants: splice/start/stop gain/loss, missense, and synonymous (table S12). For each variant annotated as non-coding, we used RegulomeDB v2.2 (35) to assign a probability score and category indicating the variant’s probability of being functional. To assess enrichment in coding variation and higher RegulomeDB scores across different categories of variants, we respectively used Fisher’s exact and Wilcoxon rank sum tests.
Derived and ancestral allele identification for high-confidence, fine-mapped variants
We determined the identities of the ancestral and derived alleles for all variants that were fine-mapped with high confidence in any population (PIP > 0.95) by referencing the parsimony-derived ancestral allele identities in the 1000 Genomes (1000G) Phase III variant call format files (60). Given our imputation scheme, 6,150 of 6,318 (97.3%) of the high-confidence variants were included in the 1000G callsets. Of the included variants, only those with consistently annotated ancestral alleles across the inferred orangutan-chimp-human progenitor, inferred chimp-human progenitor, and chimp lineages were retained for analyses involving ancestral/derived alleles. These remaining variants comprise 12,613 high-confidence SNP-phenotype pairs.
Defining known and novel associations
We compared the lead and fine-mapped variants and their proxies (with an r2 > 0.1) to determine their previously reported association in both the GWAS Catalog and Open Target Genetics database. These databases employ Experimental Factor Ontology (EFO) terms as the principal vocabulary for standardizing traits and phenotypes, prompting us to initially map all traits analyzed in our study to EFO terms. We implemented semi-automated processes to map specific trait descriptions to EFO terms to bridge the gap between disparate data sources. This process involves processing various mapping files and EFO terms from the GWAS catalog and Open Target Genetics, which provide Phecode, labs, and vitals to EFO term mappings. Initially, we map these Phecodes to their corresponding EFO terms using pre-existing mappings. For traits not categorized as Phecodes, we employed a “text descriptive fuzzy mapping” technique to assign these remaining traits to EFO terms. Subsequently, we carried out a manual review of trait to EFO term mappings since there were instances where different EFO terms were used for the same term by NHGRI EMBL GWAS (24) and Open Target Genetics catalogs (69). Upon a thorough review, the final table was composed, including traits and their corresponding EFO terms, ready to cross-reference with the GWAS catalog and Open Target Genetics.
For each trait, we identified the tag SNPs associated with each lead SNP (r2 > 0.1 and within a 500kb window). We then proceeded to cross-reference the lead SNP and the tag SNP in the GWAS catalog and Open Target Genetics to determine if they were associated with the trait of interest. We categorized the novelty of each association into one of three groups: a) Known Association: when either the lead SNP or tag SNPs are associated with the same trait as recorded in the GWAS or Open Target Genetics catalogs; b) Novel Association with Known Signal: when the lead SNP or tag SNPs are not associated with the same trait but do have associations with other traits; c) Novel Signal: when the lead SNP or tag SNPs are not associated with any trait listed in the GWAS catalog or Open Target Genetics.
Cross-population comparisons of credible set sizes
We compared the number of variants in the population-specific credible sets for signals successfully merged across more than one population (excluding the EAS population, as it lacked sufficient power). We used a paired Wilcoxon sign-rank test to identify significant differences in the credible set sizes for signals mapped in AFR/AMR, AFR/EUR, and AMR/EUR populations.
Identifying variant-trait associations specific to the non-EUR populations
We focused on characterizing novel SNPs in the non-EUR population fine-mapped to a trait by identifying variants mapped with high confidence (PIP>0.95) in the non-EUR populations and either unmapped or mapped with low confidence in the EUR population (table S11). We further restricted to coding variants as defined by VEP and with a MAF>0.05 in each of the non-EUR populations in the main results.
Estimating heterogeneous effects
Screening for heterogeneous effects across populations was performed on variants fine-mapped to a trait, specifically for Phecode only, from any population with a MAF>0.05. The quantitative traits were not tested for heterogeneity due to the potential discrepancy in their scales across different population groups. The variant-trait pair must have been fine-mapped with high confidence (PIP>0.95) in both populations being compared (AFR vs. EUR, AMR vs. EUR; the sample size for EAS was underpowered for the multiple testing among ultra-high-dimensional hypotheses) with MAF>0.05. To adequately control for the false discovery rate (FDR), the heterogeneity analysis performs an adaptive multiple testing procedure on the heterogeneity effect (69) (Supplementary Note) on the full set of variant-phenotype pairs. The adaptive heterogeneity multiple testing procedure improves power by reweighting the heterogeneity test statistics according to the level of evidence for the presence of an overall mean effect. In the broad heterogeneity screen, we examined all fine-mapped variants with MAF>0.05 in both populations being compared, i.e., AFR vs. EUR and AMR vs. EUR, and required overlapping credible sets. This allowed us to potentially detect variants with effects only present in the population or with small effect sizes that may be underpowered for AFR or AMR populations. The primary heterogeneity analysis further required that the variant be amino acid changing with PIP>0.95 in both populations and that the fine-mapping result showed the same variant-phenotype pair in both populations.
Selection of pleiotropic associations
To assess pleiotropy for SNPs with more than one trait association and PIP > 0.95, we needed to determine which traits were correlated. For each SNP, the following procedure was used to determine the number of independent traits. First, a trait (trait #1) was selected for comparison with all others based on the largest PIP value (0.95–1). If there was a tie, we selected trait #1 as (a) the trait with the most significant meta-analysis P-value, (b) the most significant population-specific P-value starting with the largest population size (EUR, AFR, AMR, then EAS), or (c) the trait with the largest absolute meta-analysis effect estimate. Next, all associated traits with an absolute phenotypic correlation coefficient >0.2 with trait #1 were flagged as correlated. We iterated over these steps until only uncorrelated traits remained. (fig. S12).
Supplementary Note
Figs. S1 to S14
Tables S1 to S19
Acknowledgments
We thank the Million Veteran Program, Office of Research and Development, and Veterans Health Administration for supporting this work. A complete acknowledgment of contribution to MVP is provided in Supplementary Note. We would like to sincerely thank Dr. Thomas Zacharia for providing access to the supercomputers at the Oak Ridge National Laboratory Leadership Computing Facility and Dr. Dimitri Kusenov, the previous DOE Headquarters lead for the VA-DOE partnership, for his invaluable guidance and support. Their contributions have been instrumental in the successful completion of this study. We want to thank NHGRI’s dbGAP team – particularly Mike Feolo, Neha Gupta, Jack Wang, and Anne Sturcke for all their hard work enabling the public release of this large data resource; Gao Wang and Yuxin Zou for their help deriving the formula for residual associations used to tune the parameters for fine-mapping; Colleen Morse for her assistance with visualization. Last but not least, we thank former staff members, and volunteers, who have contributed to MVP and, most of all, MVP participants for their service and their continued contributions to our nation through participation in this study. This publication does not represent the views of the Department of Veteran Affairs or the United States Government.
Footnotes
Joint Authorship
↵‡These authors jointly supervised this work