SUMMARY
Infectious diseases (ID) represent a significant proportion of morbidity and mortality across the world. Host genetic variation is likely to contribute to ID risk and downstream clinical outcomes, but there is a need for a genetics-anchored framework to decipher molecular mechanisms of disease risk, infer causal effect on potential complications, and identify instruments for drug target discovery. Here we perform transcriptome-wide association studies (TWAS) of 35 clinical ID traits in a cohort of 23,294 individuals, identifying 70 gene-level associations with 26 ID traits. Replication in two large-scale biobanks provides additional support for the identified associations. A phenome-scale scan of the 70 gene-level associations across hematologic, respiratory, cardiovascular, and neurologic traits proposes a molecular basis for known complications of the ID traits. Using Mendelian Randomization, we then provide causal support for the effect of the ID traits on adverse outcomes. The rich resource of genetic information linked to serologic tests and pathogen cultures from bronchoalveolar lavage, sputum, sinus/nasopharyngeal, tracheal, and blood samples (up to 7,699 positive pathogen cultures across 92 unique genera) and a large catalog of genome-wide associations of microbiome variation generated from phylogenetic analysis of 16S rRNA gene sequences are developed here into a platform to interrogate the genetic basis of compartment-specific infection and colonization. To accelerate insights into cellular mechanisms, we develop a TWAS repository of gene-level associations in a broad collection of human tissues with 79 pathogen-exposure induced cellular phenotypes as a discovery and replication platform. Cellular phenotypes of infection by 8 pathogens included pathogen invasion, intercellular spread, cytokine production, and pyroptosis. 
These rich datasets will facilitate mechanistic insights into the role of host genetic variation on ID risk and pathophysiology, with important implications for our molecular understanding of potentially severe phenotypic outcomes.
HIGHLIGHTS
Atlas of genome-wide association studies (GWAS) and transcriptome-wide association studies (TWAS) results for 35 clinical infectious disease (ID) phenotypes, with genome-wide and transcriptome-wide significant results for 13 and 26 clinical ID traits, respectively
Phenome-scale scan of ID-associated genes across 197 hematologic, respiratory, cardiovascular, and neurologic traits, facilitating identification of genes associated with known complications of the ID traits
Mendelian Randomization analysis, leveraging naturally occurring DNA sequence variation to perform “randomized controlled trials” to test the causal effect of ID traits on potential outcomes and complications
A genomic resource of TWAS associations for 79 pathogen-induced cellular traits from High-throughput Human in vitrO Susceptibility Testing (Hi-HOST) across 44 tissues as a discovery and replication platform to enable in silico cellular microbiology and functional genomic experiments
INTRODUCTION
Genome-wide association studies (GWAS) and large-scale DNA biobanks with phenome-scale information are making it possible to identify the genetic basis of a wide range of complex traits in humans (Bycroft et al., 2018; Roden et al., 2008). A parallel development is the increasing availability of GWAS summary statistics, facilitating genetic analyses of entire disease classes and promising considerably improved resolution of genetic effects on human disease (Cotsapas et al., 2011; Gamazon et al., 2019). Recent analysis involving 558 well-powered GWAS results found that trait-associated loci cover ~50% of the genome, enriched in both coding and regulatory regions, and of these, ~90% are implicated in multiple traits (Watanabe et al., 2019). However, the breadth of clinical and biological information in these datasets will require new methodologies and additional high-dimensional data to advance our understanding of the genetic architecture of complex traits and relevant molecular mechanisms (Bulik-Sullivan et al., 2015; Gamazon et al., 2018; Shi et al., 2016). Approaches to understanding the functional consequences of implicated loci and genes are needed to determine causal pathways and potential mechanisms for pharmacological intervention.
The genetic basis of infectious disease (ID) risk and severity has been relatively understudied, and its implications for etiological understanding of human disease and drug target discovery may be investigated using phenome-scale information increasingly available in these biobanks. ID risk and pathogenesis are likely to be multifactorial, resulting from a complex interplay of host genetic variation, environmental exposure, and pathogen-specific molecular mechanisms. With few exceptions, the extent to which susceptibility to ID is correlated with host genetic variation remains poorly understood (de Bakker and Telenti, 2010). However, for at least some ID traits, including poliomyelitis, hepatitis, and Helicobacter pylori infection (Burgner et al., 2006; Herndon and Jennings, 1951; Hohler et al., 2002; Malaty et al., 1994), disease risk is heritable, based on twin studies. Although monogenic mechanisms of ID risk have been demonstrated (Casanova, 2015a, b), the contribution of variants across the entire allele frequency spectrum to interindividual variability in ID risk remains largely unexplored.
Here we conduct genome-wide association studies (GWAS) and transcriptome-wide association studies (TWAS) of 35 ID traits. To implement the latter, we apply PrediXcan (Gamazon et al., 2018; Gamazon et al., 2015), which exploits the genetic component of gene expression to probe the molecular basis of disease risk. We combine information across a broad collection of tissues to determine gene-level associations using a multi-tissue approach, which displays markedly improved statistical power over a single-tissue approach (Barbeira et al., 2019; Gamazon et al., 2018; Gamazon et al., 2015). Notably, we identify 70 gene-level associations for 26 of 35 ID traits (the implicated genes are hereafter referred to as ID-associated genes) and conduct replication using the corresponding traits in the UK Biobank and FinnGen consortia data (Bycroft et al., 2018; Locke et al., 2019). The rich resource of genetic information linked to clinical microbiology information (serology and culture data) across bacterial, fungal, and viral genera that we leverage provides a platform to interrogate the genetic basis of compartment-specific infection and colonization. Linking high-resolution taxonomic classification from 16S ribosomal RNA (rRNA) sequencing to host GWAS information has been used to investigate the contribution of host genetic variation to microbiome composition (Hughes et al., 2020). We therefore exploit host GWAS for 155 pathogens in the microbiome (Hughes et al., 2020) to gain further insights into identified genetic risk factors for an ID trait. To determine the phenotypic consequences of ID-associated genes, including adverse outcomes and complications, we perform a phenome-scale scan across hematologic, respiratory, cardiovascular, and neurologic traits. To extend these findings, we use a Mendelian Randomization framework (Lawlor et al., 2008) to conduct causal inference on the effect of a clinical ID trait on an adverse clinical outcome.
To elucidate the cellular mechanisms through which host genetic variation influences disease risk, we generate an atlas of gene-level associations with 79 pathogen-induced cellular phenotypes determined by High-throughput Human in vitrO Susceptibility Testing (Hi-HOST) (Wang et al., 2018) as a discovery and replication platform. The rich genomic resource we generate and the methodology we develop promise to accelerate discoveries on the molecular mechanisms of infection, improve our understanding of adverse outcomes and complications, and enable prioritization of new therapeutic targets.
RESULTS
A schematic diagram illustrating our study design and the reference resource we provide can be found in Figure 1. Here we analyzed 35 clinical ID traits, 79 pathogen-exposure-induced cellular traits, and 197 (cardiovascular, hematologic, neurologic, and respiratory) traits. We performed GWAS and TWAS (Gamazon et al., 2015; Gusev et al., 2016) to investigate the genetic basis of the ID traits and their potential adverse outcomes and complications. We exploited serology and culture data linked to genetic information and genome-wide associations with microbiome traits to investigate compartment-specific patterns of infection. We conducted causal inference within a Mendelian Randomization framework (Davey Smith and Hemani, 2014), exploiting genetic instruments as natural “randomized controlled trials” to evaluate the causality of an observed association between a modifiable exposure or risk factor and a clinical phenotype. Together, these analyses yield a rich resource for understanding the genetic and molecular basis of infection and potential adverse effects and complications.
GWAS and TWAS of 35 infectious disease clinical phenotypes implicate a broad range of molecular mechanisms
We sought to characterize the genetic determinants of 35 ID traits, including many that have never been investigated using a genome-wide approach. First, we performed GWAS of each of these phenotypes using a cohort of 23,294 patients of European ancestry with extensive EHR information from BioVU (Roden et al., 2008). We identified genome-wide significant associations (p < 5×10^-8) for 13 ID traits (Figure 2A and Supplementary Table 1). The most significant association across all traits was that of the SNP rs17139584 on chromosome 7 with bacterial pneumonia (p = 1.21×10^-36). A LocusZoom plot shows several additional genome-wide significant variants in the locus (Figure 2B), in low linkage disequilibrium (r^2 < 0.20) with the sentinel variant rs17139584, including variants in the MET gene and in CFTR. MET acts as a receptor for the Listeria monocytogenes internalin InlB, mediating entry into host cells; interestingly, listeriosis, a bacterial infection caused by this pathogen, can lead to pneumonia (García-Montero et al., 1995). Given the observed associations in the cystic fibrosis gene CFTR (~650 Kb downstream of MET), we also asked whether the rs17139584 association was driven by cystic fibrosis. Notably, the SNP remained nominally significant, though its significance was substantially reduced, after adjusting for cystic fibrosis status (p = 0.007; see Methods) or excluding the cystic fibrosis cases (p = 0.02). The LD profile of the genome-wide significant results in this locus (Figure 2B) is consistent with the involvement of multiple gene mechanisms (e.g., MET and CFTR) underlying bacterial pneumonia risk. The rs17139584 association replicated (p = 5.3×10^-3) in the UK Biobank (Bycroft et al., 2018). Eighty to ninety percent of patients with cystic fibrosis suffer from respiratory failure due to chronic bacterial infection (with Pseudomonas aeruginosa) (Lyczak et al., 2002).
Thus, future studies on the role of this locus in lung infection associated with cystic fibrosis may provide germline predictors of this complication; alternatively, the locus may confer susceptibility to lung inflammation, regardless of cystic fibrosis status. Collectively, our analysis shows strong support for allelic heterogeneity, with likely multiple independent variants in the locus contributing to interindividual variability in bacterial pneumonia susceptibility.
Genome-wide significant associations were also identified for other ID traits. For example, rs192146294 on chromosome 1 was significantly associated (p = 1.23×10^-9) with Staphylococcus infection. In addition, 10 variants on chromosome 8 were significantly associated (p < 1.17×10^-8) with Mycoses infection.
Next, to improve statistical power, we performed multi-tissue PrediXcan (Barbeira et al., 2019; Gamazon et al., 2018; Gamazon et al., 2015). We constructed an atlas of TWAS associations with these ID traits in separate European and African American ancestry cohorts (Supplementary Data File 1) as a resource to facilitate mechanistic studies. Notably, 70 genes reached experiment-wide or individual ID-trait significance for 26 of the 35 clinical ID traits (Figure 3A and Table 1). Sepsis, the clinical ID trait with the largest sample size in our data (Figure 3B; Phecode 994; number of cases 2,921; number of controls 22,874), was significantly associated (p = 8.16×10^-7) with IKZF5 after Bonferroni correction for the number of genes tested. The significant genes (Table 1) were independent of the sentinel variants from the GWAS (Supplementary Table 1), indicating that the gene-based test was identifying additional signals.
Our analysis identified previously implicated genes for the specific ID traits but also proposes novel genes and mechanisms. ID-associated genes include NDUFA4 for intestinal infection, a component of the cytochrome oxidase and regulator of the electron transport chain (Balsa et al., 2012); AKIRIN2 for candidiasis, an evolutionarily conserved regulator of inflammatory genes in mammalian innate immune cells (Tartey et al., 2015; Tartey et al., 2014); ZNF577 for viral hepatitis C, a gene previously shown to be significantly hypermethylated in hepatitis C related hepatocellular carcinoma (Revill et al., 2013); and epithelial cell adhesion molecule (EPCAM) for tuberculosis, a known marker for differentiating malignant tuberculous pleurisy (Sun et al., 2014), among many others. These examples of ID-associated genes highlight the enormous range of molecular mechanisms that may contribute to susceptibility and complication phenotypes.
Replication of gene-level associations with infectious diseases in the UK Biobank and FinnGen
To bolster our genetic findings and show that our results were not driven by biobank-specific confounding, we performed replication analysis for a subset of ID traits available in the independent UK Biobank and FinnGen consortia datasets (see Methods and Supplementary Data File 2). Notably, the genes associated with intestinal infection (p < 0.05) in BioVU – the ID trait with the largest sample size in BioVU and with a replication dataset in the independent FinnGen biobank – showed a significantly greater level of enrichment for gene-level associations with the same trait in FinnGen compared to the remaining set of genes (Figure 3C). Thus, higher significance (i.e., lower p-value) was observed in FinnGen for the intestinal infection associated genes identified in BioVU, which included the top association NDUFA4 (discovery p = 1.83×10^-9, replication p = 0.044). These results illustrate the value of exploiting large-scale biobank resources for genetic studies of ID traits, despite well-known caveats (Ko and Urban, 2013; Power et al., 2017).
Replication of tissue-level PrediXcan results enables identification of robust tissue-specific associations, lending insights into putative molecular mechanisms. Here we provide tissue-level replication for ID traits available in the UK Biobank and FinnGen: bacterial pneumonia (lung, Phecode 480.1), influenza (lung, Phecode 481), viral pneumonia (lung, Phecode 480.2), meningitis (10 brain regions, Phecode 320), and encephalitis (10 brain regions, Phecode 323) (Supplementary Data File 2). For example, replication analysis in the UK Biobank (see Methods) for influenza (Figure 3D, Supplementary Data File 2) and bacterial pneumonia (Figure S1A, Supplementary Data File 2) associated genes in lung tissue revealed many robust tissue-level associations. We also replicated the top genes (p < 0.05) associated with influenza (Supplementary Data File 2) and viral pneumonia (Supplementary Data File 2) in FinnGen (see Methods), providing additional support for the role of these genes in pulmonary disease. We found substantial replication in the UK Biobank for individual brain tissue gene expression for meningitis in hippocampus, cerebellar hemisphere, and hypothalamus (Figure S1B-D, Supplementary Data File 2). Similar trends were observed for the remaining brain tissues (Supplementary Data File 2). In addition, we found substantial replication in the UK Biobank in caudate, cortex, and cerebellum tissue for encephalitis (Figure S1E-G, Supplementary Data File 2). Again, similar trends were observed across all brain regions (Supplementary Data File 2). Intriguingly, the cerebellar gene expression traits nominally associated with meningitis in BioVU improved the signal-to-noise ratio for identifying cerebellar associations in FinnGen (Figure 3E, Supplementary Data File 1-2). These data highlight reproducible genetically-determined mechanisms underlying ID risk.
Tissue expression profile of infectious disease associated genes suggests tissue-dependent mechanisms
The ID-associated genes tend to be less tissue-specific (i.e., more ubiquitously expressed) than the remaining genes (Figure S2A, Mann Whitney U test on the τ statistic, p = 7.5×10^-4), possibly reflecting the multi-tissue PrediXcan approach we implemented, which prioritizes genes with multi-tissue support to improve statistical power, but also the genes’ pleiotropic potential. We hypothesized that tissue expression profiling of ID-associated genes can provide additional insights into disease etiologies and mechanisms. For example, the intestinal infection associated gene NDUFA4 is expressed in a broad set of tissues, including the alimentary canal, but displays relatively low expression in whole blood (Figure S2B). In addition, TOR4A, the most significant association with bacterial pneumonia (Table 1), is most abundantly expressed in lung, consistent with the tissue of pathology, but also in spleen (Figure S2C), whose rupture is a lethal complication of the disease (Domingo et al., 1996; Gerstein et al., 1967). These examples illustrate the diversity of tissue-dependent mechanisms that may contribute in complex and dynamic ways to interindividual variability in ID susceptibility and progression. We therefore provide a resource of single-tissue gene-level associations with the ID traits to facilitate molecular or clinical follow-up studies.
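The τ statistic is not defined in this section. As a minimal sketch, assuming it is the commonly used tissue-specificity index τ = Σ(1 − x_i/x_max)/(n − 1) computed over a gene's expression values x_i in n tissues (0 for uniformly expressed genes, 1 for genes expressed in a single tissue), it could be computed as follows. The function name and the input layout are illustrative and are not taken from the study's code.

```python
def tau_specificity(expression):
    """Tissue-specificity index tau for one gene (assumed definition).

    expression: non-negative expression values, one per tissue.
    Returns a value in [0, 1]; lower means more ubiquitously expressed.
    """
    values = [float(v) for v in expression]
    if len(values) < 2:
        raise ValueError("tau needs expression in at least two tissues")
    if min(values) < 0:
        raise ValueError("expression values must be non-negative")
    peak = max(values)
    if peak == 0:
        # Gene not expressed anywhere: tau is undefined.
        return float("nan")
    # Sum of each tissue's shortfall relative to the peak tissue,
    # normalized by the number of non-peak tissues.
    return sum(1.0 - v / peak for v in values) / (len(values) - 1)
```

The per-gene values for the ID-associated genes and for the remaining genes can then be compared with a rank-based two-sample test such as the Mann Whitney U test named in the text.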
Genetic overlap reveals host gene expression programs and common pathways as targets for pathogenicity
We hypothesized that ID-associated genes implicate shared functions and pathways, which may reflect common targeted host transcriptional programs. Among the 70 gene-level associations with the 35 clinical ID traits, 40 proteins are post-translationally modified by phosphorylation (Supplementary Table 2), a significant enrichment (Benjamini-Hochberg adjusted p < 0.10 on DAVID annotations (Huang da et al., 2009)) relative to the rest of the genome, indicating that phosphoproteomic profiling can shed substantial light on activated host factors and perturbed signal transduction pathways during infection (Soderholm et al., 2016; Stahl et al., 2013). In addition, 16 proteins are acetylated, consistent with emerging evidence supporting this mechanism in the host antiviral response (Murray et al., 2018) (Supplementary Table 3). These data identify specific molecular mechanisms across ID traits with critical regulatory roles (e.g., protein modifications) in host response among the ID-associated genes.
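For readers unfamiliar with the Benjamini-Hochberg adjustment used for the annotation enrichment above, the following is a minimal sketch of the standard step-up procedure. It illustrates the general method only; the enrichment p-values themselves come from DAVID, and the function name is hypothetical.

```python
def benjamini_hochberg(p_values):
    """Return Benjamini-Hochberg adjusted p-values in the input order."""
    m = len(p_values)
    # Indices of the p-values sorted from smallest to largest.
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down, scaling by m / rank and
    # enforcing that adjusted values never increase as p decreases.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, p_values[i] * m / rank)
        adjusted[i] = running_min
    return adjusted
```

An annotation term would then be reported as enriched when its adjusted value falls below the chosen threshold (0.10 in the analysis above).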
We tested the hypothesis that distinct infectious agents exploit common pathways to find a compatible intracellular niche in the host, potentially implicating shared genetic risk factors. Notably, 64 of the 70 ID-associated genes (Table 1) were nominally associated (p < 0.05) with multiple ID traits (Supplementary Table 4). These genes warrant further functional study as broadly exploited mechanisms targeted by pathogens or as broadly critical to pathogen-elicited immune response. Gene Set Enrichment Analysis (GSEA) of these genes implicated a number of significant (FDR < 0.05) gene sets (Figure 4A), including those involved in actin-based processes and cytoskeletal protein binding, processes previously demonstrated to mediate host response to pathogen infection (Taylor et al., 2011). Since diverse bacterial and viral pathogens target host regulators that control the cytoskeleton (which plays a key role in the biology of infection) or modify actin in order to increase virulence, intracellular motility, or intercellular spread (Aktories and Barbieri, 2005; Yu et al., 2011; Zahm et al., 2013), these results reassuringly lend support to the involvement of the genes in infectious pathogenesis.
Notably, we identified an enrichment (FDR = 9.68×10^-3) for a highly conserved motif (“TCCCRNNRTGC”), within 4 kb of the transcription start site (TSS) of multi-ID associated genes (Figure 4A-B), that does not match any known transcription factor binding site (Xie et al., 2005) and may be pivotal for host-pathogen interaction for the diversity of infectious agents included in our study. In addition, we found that several of the multi-ID associated genes (with the sequence motif near the TSS) have been observed in host-pathogen protein complexes (by both coimmunoprecipitation and affinity chromatography approaches) for the specific pathogens responsible for the ID traits (Ammari et al., 2016). See Supplementary Data File 3 for the complete list of host-pathogen interactions for these genes/proteins. One example is CDK5, a gene significantly associated with Gram-positive septicemia (Table 1) and nominally associated with multiple ID traits, including herpes simplex. CDK5 is activated by p35, whose cleaved form p25 results in subcellular relocalization of CDK5. The CDK5-p25 complex regulates inflammation (Na et al., 2015) (whose large-scale disruption is characteristic of septicemia) and induces cytoskeletal disruption in neurons (Patrick et al., 1999) (where the herpes virus is responsible for lifelong latent infection). The A and B chains of the CDK5-p25 complex (Figure 4C for structure diagram (Tarricone et al., 2001)) are required for cytoskeletal protein binding (CDK5), whereas the D and E chains (p25) are involved in actin regulation and kinase function, all molecular processes implicated in our pathway analysis.
Intriguingly, blocking CDK5 can have a substantial impact on the outcome of inflammatory diseases including sepsis (Pfänder et al., 2019), enhancing the anti-inflammatory potential of immunosuppressive treatments, and has been shown to attenuate herpes virus replication (Man et al., 2019), suggesting that modulation of this complex is important for viral pathogenesis.
CDK5 is also altered by several other viruses, identified using unbiased mass spectrometry analysis (Davis et al., 2015) (Figure 4D), indicating a broadly exploited mechanism (across pathogens) that is consistent with the gene’s multi-ID genetic associations in our TWAS data (Figure 4D). The CDK5-interacting proteins include: 1) M2_134A1 (matrix protein 2, influenza A virus), a component of the proton-selective ion channel that is required for viral genome release during cellular entry and is targeted by the anti-viral drug amantadine (Hay et al., 1985); 2) VE7_HPV16, a component of human papillomavirus (HPV) required for cellular transformation and trans-activation through disassembly of E2F1 transcription factor from RB1 leading to impaired production of type I interferons (Barnard et al., 2000; Chellappan et al., 1992; Phelps et al., 1988); 3) VE7_HPV31, which has been shown to engage histone deacetylases 1 and 2 to promote HPV31 genome maintenance (Longworth and Laimins, 2004); 4) VCYCL_HHV8P (cyclin homolog within the human herpesvirus 8 genome), which has been shown to control cell cycle through CDK6 and induce apoptosis through Bcl2 (Duro et al., 1999; Ojala et al., 1999; Ojala et al., 2000); and 5) F5HC81_HHV8, predicted to act as a viral cyclin homolog. Overall, these data underscore the strategies that pathogens have evolved to promote infection, including the hijacking of the host transcriptional machinery and the biochemical alterations of the host proteome.
Serology and culture data reveal insights into clinical infection and pathogen colonization
We exploited extensive clinical microbiological laboratory analysis of blood (Figure 5A), bronchoalveolar lavage, sputum, sinus/nasopharyngeal, and tracheal cultures for bacterial and fungal pathogen genus identification (Figure S3A-D), as well as respiratory viral genus identification (Figure S3E) (see Methods), to evaluate phenotype resolution and the phenotyping algorithm. For example, we found that Staphylococcus infection (Phecode 041.1) performed well in classifying Staphylococcus aureus infection based on blood culture data. The area under the Receiver Operating Characteristic (ROC) curve was 0.938 (Figure 5B), with a standard error of 0.008 generated from bootstrapping (see Methods). The area under the curve (AUC) quantifies the probability that the Phecode classifier ranks a randomly chosen positive instance of Staphylococcus aureus infection in blood higher than a randomly chosen negative one. In comparison, the first principal component (PC) in our European ancestry samples showed an AUC of 0.514 (Figure 5B), while sex and age performed even more poorly (AUC ≈ 0.50). We then tested a logistic model with the Phecode classifier, age, sex, and the first 5 PCs in the model. The Phecode classifier was significantly associated (p < 2.2×10^-16) after conditioning on the remaining covariates. The fitted value from the joint model consisting of the remaining covariates showed an AUC of 0.568 (Figure 5B). Collectively, culture data for improved resolution of clinical infection and pathogen colonization provide validation of our approach.
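The probabilistic reading of the AUC given above, together with a bootstrap standard error, can be illustrated with a short sketch. The resampling scheme (sampling individuals with replacement, and redrawing any resample that lacks a class) and the function names are assumptions made for illustration; the procedure actually used is described in the Methods. The pairwise loop is written for clarity rather than speed.

```python
import random

def auc(labels, scores):
    """Probability that a random positive outranks a random negative.

    labels: 1 for culture-positive, 0 for culture-negative.
    scores: classifier output (ties between a positive and a negative count 1/2).
    """
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))

def bootstrap_se(labels, scores, n_boot=200, seed=0):
    """Standard deviation of the AUC across bootstrap resamples."""
    auc(labels, scores)  # validates that both classes are present
    rng = random.Random(seed)
    n = len(labels)
    estimates = []
    while len(estimates) < n_boot:
        idx = [rng.randrange(n) for _ in range(n)]
        ys = [labels[i] for i in idx]
        if 0 < sum(ys) < n:  # keep only resamples containing both classes
            estimates.append(auc(ys, [scores[i] for i in idx]))
    mean = sum(estimates) / n_boot
    var = sum((e - mean) ** 2 for e in estimates) / (n_boot - 1)
    return var ** 0.5
```

Because a Phecode is a binary classifier, many positive-negative pairs are tied, which is why ties are counted as one half.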
To expand these findings and further dissect the complex pathogen-colonization patterns in humans, we utilized host genome-wide associations with human gut microbiome variation for 155 pathogens (Hughes et al., 2020) identified through 16S rRNA sequencing. For example, among the top SNP associations with Desulfovibrionaceae (p < 0.0001), a family of sulfate-reducing bacteria associated with intraabdominal infections and inflammatory bowel disease (Goldstein et al., 2003; Loubinoux et al., 2002), we observed an enrichment for SNPs associated with intestinal infection in BioVU (Figure 5C). These data provide a reference resource to elucidate how genetically determined microbiome variation influences ID trait susceptibility.
Phenome scan of clinical ID-associated genes identifies adverse outcomes and complications
Electronic Health Records (EHR) linked to genetic data may reveal insights into associated clinical sequelae (Bastarache et al., 2018; Denny et al., 2013; Unlu et al., 2020). To assess the phenomic impact of ID-associated genes (Table 1), we performed a phenome-scale scan across 197 hematologic, respiratory, cardiovascular, and neurologic traits available in BioVU (Figure 6A and Supplementary Data File 4). Correcting for total number of genes and phenotypes tested, we identified four gene-phenotype pairs reaching experiment-wide significance: 1) WFDC12, our most significant (p = 4.23×10^-6) association with meningitis and a known anti-bacterial gene (Hagiwara et al., 2003), is also associated with cerebral edema and compression of brain (p = 1.35×10^-6), a feared clinical complication of meningitis (Niemöller and Täuber, 1989); 2) TM7SF3, the most significant gene for Gram-negative sepsis (p = 1.37×10^-6) and a gene known to play a role in cell stress and the unfolded protein response (Isaac et al., 2017), is also associated with acidosis (p = 1.95×10^-6), a known metabolic derangement associated with severe sepsis (Suetrong and Walley, 2016); 3) TXLNB, the most significant gene associated with viral warts and human papillomavirus infection (p = 4.35×10^-6), is also associated with abnormal involuntary movements (p = 1.39×10^-6); and 4) RAD18, the most significant gene associated with Streptococcus infection (p = 2.01×10^-6), is also associated with anemia in neoplastic disease (p = 3.10×10^-6). Thus, coupling genetic analysis to EHR data with their characteristic breadth of clinical traits offers the possibility of determining the phenotypic consequences of ID-associated genes, including known (in the case of WFDC12 and TM7SF3) potentially adverse health outcomes and complications.
Mendelian Randomization provides causal support for the effect of infectious disease trait on identified adverse phenotypic outcomes/complications
Since our gene-level associations with clinical ID diagnoses implicated known adverse complications, we sought to explicitly evaluate the causal relation between the ID traits and the adverse outcomes/complications. We utilized the Mendelian Randomization paradigm (Lawlor et al., 2008) (Figure 6B), which exploits genetic instruments to make causal inferences in observational data, in effect performing randomized controlled trials to evaluate the causal effect of “exposure” (i.e., ID trait) on “outcome” (e.g., the complication). Specifically, we conducted multiple-instrumental-variable causal inference using GWAS (Davey Smith and Hemani, 2014) and PrediXcan summary results. First, we used independent SNPs (r^2 = 0.01) that pass a certain threshold for significance with the ID trait (p < 1.0×10^-5) as genetic instruments. To control for horizontal pleiotropy and account for the presence of invalid genetic instruments, we utilized MR-Egger regression and weighted-median Mendelian Randomization (see Methods) (Bowden et al., 2015; Bowden et al., 2016).
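The two estimators named above can be sketched from summary statistics as follows. This is an illustration of the general methods under stated assumptions, not the pipeline used in the study: each instrument is assumed to have a nonzero SNP-exposure effect (bx), a SNP-outcome effect (by) on the same allele coding, and a standard error for the outcome effect (se_by); the weights are the usual first-order inverse-variance weights; and the function names are hypothetical. Standard errors and p-values for the estimates are omitted.

```python
def weighted_median(bx, by, se_by):
    """Weighted-median causal estimate from per-instrument ratio estimates."""
    if len(bx) < 2:
        raise ValueError("need at least two instruments")
    # Ratio (Wald) estimate by/bx per instrument, weighted by the inverse
    # of its approximate variance (se_by / bx)^2; sorted by estimate.
    rows = sorted((y / x, (x / s) ** 2) for x, y, s in zip(bx, by, se_by))
    total = sum(w for _, w in rows)
    # mids[j]: normalized cumulative weight up to the midpoint of instrument j.
    running, mids = 0.0, []
    for _, w in rows:
        mids.append((running + 0.5 * w) / total)
        running += w
    # Interpolate linearly between the two estimates that bracket 50%.
    below = max(j for j, s in enumerate(mids) if s < 0.5)
    b0, b1 = rows[below][0], rows[below + 1][0]
    frac = (0.5 - mids[below]) / (mids[below + 1] - mids[below])
    return b0 + (b1 - b0) * frac

def mr_egger(bx, by, se_by):
    """Return (intercept, slope) of the MR-Egger weighted regression.

    The slope is the causal estimate; an intercept away from zero
    suggests directional (horizontal) pleiotropy.
    """
    # Orient every instrument so that its exposure effect is positive.
    pairs = [(x, y) if x > 0 else (-x, -y) for x, y in zip(bx, by)]
    w = [1.0 / s ** 2 for s in se_by]
    total = sum(w)
    x_bar = sum(wi * x for wi, (x, _) in zip(w, pairs)) / total
    y_bar = sum(wi * y for wi, (_, y) in zip(w, pairs)) / total
    sxx = sum(wi * (x - x_bar) ** 2 for wi, (x, _) in zip(w, pairs))
    sxy = sum(wi * (x - x_bar) * (y - y_bar) for wi, (x, y) in zip(w, pairs))
    slope = sxy / sxx
    return y_bar - slope * x_bar, slope
```

The weighted median remains consistent when instruments carrying less than half of the total weight are invalid, which is why a low-weight outlying instrument does not move the estimate.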
Here, we performed Mendelian Randomization on the ID trait and a complication trait identified through the unbiased phenome scan. This analysis yields causal support for the effect of 1) Gram-negative sepsis on acidosis (Figure 6C, weighted-median estimator p = 2.0×10^-7); and 2) meningitis on cerebral edema and compression of brain (Figure 5C, weighted-median estimator p = 2.7×10^-3). Our resource establishes a framework to elucidate the genetic component of an ID trait and its impact on the human disease phenome, enabling causal inference on the effect of an ID trait on potential complications.
TWAS of 79 pathogen-exposure induced cellular traits highlights cellular mechanisms and enables validation of ID gene-level associations
Elucidating how the genes influence infection-related cellular trait variation may provide a mechanistic link to ID susceptibility. We thus performed TWAS of 79 pathogen-induced cellular traits, including infectivity and replication, cytokine levels, and host cell death, among others (Wang et al., 2018) (Supplementary Data File 5). We identified 38 gene-level associations reaching trait-level significance (p < 2.87×10^-6, correcting for number of statistical tests; Figure 7A). In addition, we replicated SNP associations with the cellular traits using the genetic associations with the ID traits (Supplementary Data File 6), which map to cellular phenotypes (Supplementary Data File 7).
Integration of EHR data into Hi-HOST (Wang et al., 2018) may enable replication of gene-level associations with a clinical ID trait. Indeed, we observed a marked enrichment for genes associated with direct Staphylococcus toxin exposure cellular response in Hi-HOST among the human Gram-positive septicemia associated genes from BioVU (see Supplementary Data File 5 for genes with FDR < 0.05) (Figure 7B). In addition, integration of EHR data into Hi-HOST may improve the signal-to-noise ratio in Hi-HOST TWAS data. Indeed, the top 300 genes nominally associated (p < 0.016) with Staphylococcus infection (Phecode 041.1) in BioVU departed from null expectation for their associations with Staphylococcus toxin exposure in Hi-HOST compared to the full set of genes, which did not (Figure 7C), as perhaps expected due to the modest sample size. Collectively, these results demonstrate that integrating the EHR-derived TWAS results into TWAS of the cellular trait can greatly improve identification of potentially relevant pathogenic mechanisms.
Phenome scan of TWAS findings from Hi-HOST
To identify potential adverse effects of direct pathogen exposure, we performed a phenome-scan across the 197 cardiovascular, hematologic, neurologic, and respiratory traits as described above. Our top gene-phenotype pairs include: 1) FAM171B, our most significant association with interleukin 13 (IL-13) levels, which is also associated with alveolar and parietoalveolar pneumonopathy (p = 4.04×10⁻⁵), a phenotype known to be modulated by IL-13 dependent signaling (Zheng et al., 2008); and 2) OSBPL10, the most significant gene associated with cell death caused by Salmonella enterica serovar Typhimurium, which is also associated with intracerebral hemorrhage (p = 4.99×10⁻⁵), a known complication of S. Typhimurium endocarditis (Gómez-Moreno et al., 2000). These data highlight the utility of joint genetic analysis of pathogen-exposure-induced phenotypes and clinical ID traits to gain insights into the molecular and cellular basis of complications and adverse outcomes. However, more definitive conclusions will require larger sample sizes and functional studies.
DISCUSSION
ID susceptibility is a complex interplay between host genetic variation and pathogen-exposure induced mechanisms. While GWAS has begun to identify population-specific loci conferring ID risk (Tian et al., 2017), the underlying function of identified variants, predominantly in non-coding regulatory regions, remains poorly understood. Molecular characterization of infectious processes has been, in general, agnostic to the genetic architecture of clinical infection. Although pathogen exposure is requisite to display clinical ID traits, the role of host genetic variation remains largely unexplored.
Our study provides a reference atlas of genetic variants and genetically-determined expression traits associated with 35 clinical ID traits from BioVU. We identified 70 gene-level associations, with replication for a subset of ID traits in the UK Biobank and FinnGen. To provide additional support to our findings, we leveraged a rich resource of genetic information linked to serologic tests and pathogen cultures from five clinical sample sites and exploited a large catalog of genome-wide associations of microbiome variation generated from 16S rRNA based taxonomic classification. A phenome scan across 197 hematologic, respiratory, cardiovascular, and neurologic traits proposes a molecular basis for the link between certain ID traits and outcomes. Using Mendelian Randomization, we determined the ID traits which, as exposure, show significant causal effect on outcomes. Finally, we developed a TWAS catalog of 79 pathogen-exposure induced cellular traits (Hi-HOST) in a broad collection of tissues, which provides a platform to interrogate mediating cellular and molecular mechanisms.
Genetic predisposition to ID onset and progression is likely to be complex (Casanova, 2015a). Monogenic mechanisms conferring ID risk have been proposed, but these mechanisms are unlikely to explain the broad contribution of host genetic influence on ID risk (Casanova, 2015b). Thus, a function-centric methodology is necessary to disentangle potentially causal pathways. Our approach builds on PrediXcan, which estimates the genetically-determined component of gene expression (Gamazon et al., 2015). The genetic component of gene expression can then be tested for association with the trait, enabling insights into potential pathogenic mechanisms (Gamazon et al., 2019) and novel therapeutic strategies (So et al., 2017).
Our study identified genes with diverse functions, including roles in mitochondrial bioenergetics (Balsa et al., 2012; El-Bacha and Da Poian, 2013), regulation of cell death (Labbé and Saleh, 2008), and of course links to host immune response (Brouwer et al., 2019; Liang et al., 2019; Pan et al., 2017; Saitoh et al., 2009; Sharfe et al., 1997; Tsuboi and Meerloo, 2007; Walenna et al., 2018; Willis et al., 2009; Yu et al., 2017; Zhang et al., 2015). These diverse functions may therefore contribute to pleiotropic effects on clinical outcomes and complications.
In addition, we identified genes implicated in Mendelian diseases, for which susceptibility to infection is a predominant feature, including WIPF1 (OMIM #614493; recurrent infections and reduced natural killer cell activity (Lanzi et al., 2012)), IL2RA (OMIM #606367; recurrent bacterial infections, recurrent viral infections, and recurrent fungal infections (Sharfe et al., 1997)), and TBK1 (OMIM #617900; herpes simplex encephalitis (HSE), acute infection, and episodic HSE (Herman et al., 2012)). These examples show that the identified genes may also confer predisposition, with near-complete penetrance, to an infectious disease related trait displaying true Mendelian segregation.
Enrichment analysis of 64 of the 70 ID-associated genes with nominal support for associations with other clinical ID traits identified modulation of the actin cytoskeleton as a potential shared mechanism of host susceptibility to infection (Figure 4). While manipulation of the actin cytoskeleton by pathogens is hardly a new concept, our study identified specific host genetic variation in actin regulatory genes that is potentially causative of clinical ID manifestations. In addition to pathogen interaction with the cytoskeletal transport machinery, efficient exploitation of host gene expression program is crucial for successful invasion and colonization, and here we mapped several pathogenicity-relevant targets. Notably, we observed a significant enrichment for a highly conserved sequence motif, within 4 kb of a multi-ID-associated gene’s TSS, that is not a known transcription factor binding site. The motif’s presence near multi-ID associated genes suggests a broad regulatory role in host-pathogen interaction, involving the diversity of pathogens examined here, towards successful reprogramming of host gene expression. Furthermore, we identified a significant enrichment for phosphorylated host proteins, suggesting the value of global phosphoproteomic profiling, which has recently been used to prioritize pharmacological targets for the novel SARS-CoV-2 virus (Bouhaddou et al., 2020). These data provide several potential avenues by which host susceptibility can be breached by a pathogen’s requirement to maintain a niche through manipulation of host cellular machinery.
To obtain additional support for our gene-level associations, we leveraged two genomic resources with rich phenotypic information (UK Biobank (Bycroft et al., 2018) and FinnGen (Locke et al., 2019)). These data will prove increasingly useful for characterizing the genetic basis of the ID-associated adverse outcomes and complications. Despite the caveats for the use of EHR in genetic analyses of ID traits (Ko and Urban, 2013; Power et al., 2017), the growing availability of such independent datasets will facilitate identification of robust genetic associations. Perhaps more importantly, the breadth of clinical phenotypes in these EHR datasets should enable identification of associated adverse outcomes and complications for the ID-associated genes.
The primary challenges in conducting GWAS of ID traits include phenotype definition and case-control misclassification. Obstacles to accurate phenotype definition include the requirement of specialized laboratory testing to identify specific pathogens and administration of prophylactic therapeutics complicating identification of potentially causative pathogens. Seropositivity may result from the complex genetic properties of the pathogen and the particular mechanisms governing host-pathogen interaction. However, seropositivity may not indicate clinical manifestations of the disease. On the other hand, seronegativity may imply lack of exposure to the pathogen, the absence of infection even in the presence of exposure, or host resistance to infection. Anchoring the analysis to host genetic information (as in our use of genetically-determined expression) and replication of discovered associations may address some aspects of this challenge. Here we exploit an extensive resource of culture data (for identification of pathogens from clinical specimens) linked to whole-genome genetic information to provide additional support to our gene-level associations. One of the ubiquitous problems in diagnosis is that culture recovery is often for multiple organisms, or a contaminant not relevant to the actual pathogen. Similarly, molecular diagnostics of pathogen identification is often a curation of multiple statistically relevant putative pathogens. The mapping of pathogen genome identification to transcriptional response (molecular seropositivity) is a valuable validation of a finding that a given pathogen is associated with a particular infectious syndrome, and our approach to identification of genetically determined expression changes may facilitate this mapping. Future studies may also implement more complex GWAS models, including incorporating the pathogen genome.
Our catalog of TWAS associations with microbiome composition may facilitate insights into molecular mechanisms of infectious disease risk and complications, inform studies of host-pathogen interactions, and improve anti-microbial pharmacologic strategies. Improved characterization of pathogen colonization and taxonomic classification at species and strain level through 16S rRNA sequencing-based approaches may lead to greater resolution of causative infectious processes. Disruption of pathogen equilibrium in the microbiome by environmental or genetic variation may determine susceptibility to human disease (Goodrich et al., 2016; Hall et al., 2017). However, critical challenges to understanding the patterns of host colonization include identification of rare pathogen populations as well as environmental pressures (e.g., medication use and dietary alterations) acutely altering the microbiome landscape (Kurilshikov et al., 2017). Thus, linking microbiome traits to host genetic variation promises to improve resolution of causative mechanisms for ID traits and potentially adverse outcomes.
Mendelian Randomization provides a framework to perform causal inference on the effect of the exposure on the outcome (Davey Smith and Hemani, 2014; Lawlor et al., 2008). We leveraged a summary statistics based approach to test the causal effect of an ID trait on potential adverse outcomes, using genetic instruments. Mendelian Randomization requires three assumptions: 1) the genetic instrument is associated with the exposure (i.e., ID trait); 2) the genetic instrument is associated with the outcome (i.e., adverse outcome or complication) only through the exposure of interest; and 3) the genetic instrument is independent of other factors (i.e., confounders) that affect both the exposure and the outcome. Violations of these assumptions can have critical implications for the interpretation of the results. Thus, several approaches have been developed that are robust to these violations. In the case of ID traits, a methodology that distinguishes causality from comorbidity is critical. While many phenotypes are highly comorbid and suspected to have a causal relationship (e.g., smoking and depression/anxiety), Mendelian Randomization does not necessarily support the causal hypothesis (Taylor et al., 2014). Furthermore, since randomized controlled trials (RCTs) cannot be ethically conducted for ID traits and adverse outcomes, the methodology offers an approach for elucidating the role of an infection phenotype or pathogen exposure in disease causation using an observational study design. Here, we found strong causal support for the effect of certain clinical ID traits on potential adverse complications identified through a phenome scan of the ID-associated genes: 1) meningitis on cerebral edema and compression of brain; and 2) Gram-negative sepsis on acidosis. These data indicate that genetic risk factors for select adverse outcomes and complications exert their phenotypic effect through the relevant ID traits.
To enable investigations into mediating cellular and molecular traits for the ID-associated genes, we provide a functional genomics resource built on a high-throughput in vitro pathogen infection screen (Hi-HOST) (Wang et al., 2018). Integration of EHR data into Hi-HOST facilitates replication of gene-level associations with clinical ID traits and greatly improves the signal-to-noise ratio. This discovery and replication platform, encompassing human phenomics and cellular microbiology, provides a high-throughput approach to linking host cellular processes to clinical ID traits and adverse outcomes.
Although additional mechanistic studies are warranted, our study lays the foundation for anchoring targeted molecular studies in human genetic variation. Elucidation of host mechanisms exploited by pathogens requires multi-disciplinary approaches. Here, we show the broader role of host genetic variation, implicating diverse disease mechanisms. Our study generates a rich resource and a genetics-anchored methodology to facilitate investigations of ID-associated clinical outcomes and complications, with important implications for the development of preventive strategies and more effective therapeutics. Causal inference on the clinical ID traits and potential complications promises to expand our understanding of the molecular basis for the link and, crucially, enable prediction and prevention of serious adverse events.
Data Availability
All results and code are available at the links below.
AUTHOR CONTRIBUTIONS
Conceptualization, A.T.H. and E.R.G.; Methodology, A.T.H., D.Z., R.L.S., L.B., S.J.S., D.C.K., and E.R.G.; Investigation, A.T.H., D.Z., R.L.S., L.B., L.W., S.S.Z., S.J.S., D.C.K., and E.R.G.; Writing – Original Draft, A.T.H. and E.R.G.; Writing – Review and Editing, A.T.H., D.Z., R.L.S., L.B., L.W., S.S.Z., S.J.S., D.C.K., and E.R.G.; Funding Acquisition, A.T.H. and E.R.G.; Supervision, E.R.G.
DECLARATION OF INTERESTS
E.R.G. receives an honorarium from the journal Circulation Research of the American Heart Association, as a member of the Editorial Board. The other authors declare no competing interests.
STAR★METHODS
CONTACT FOR REAGENT AND RESOURCE SHARING
Further information and requests for resources and reagents should be directed to and will be fulfilled by the Lead Contact, Eric R. Gamazon (eric.gamazon{at}vanderbilt.edu).
EXPERIMENTAL MODEL AND SUBJECT DETAILS
BioVU
BioVU, one of the largest DNA biobanks tied to an EHR database, is a subset of the synthetic derivative (SD), a deidentified electronic health record, consisting of individuals with whole-genome genetic information. Detailed information on the construction, utilization, ethics, and policies of the BioVU resource is described elsewhere (Roden et al., 2008). ID traits were defined based on a hierarchical grouping of International Classification of Diseases, Ninth Revision (ICD-9) codes into phenotype codes (Phecodes) representing clinical traits, as previously described (Denny et al., 2013; Denny et al., 2010). (See below for a description of pathogen culture and viral test data in the BioVU individuals, including genera detected from different types of cultures.) We used version 1.2 of the Phecode Map containing 1,965 Phecodes based on 20,203 ICD-9 codes, which substantially improves signal-to-noise and more accurately reflects the clinical trait. Phecodes may exclude related phenotypes (e.g., in the case of Gram negative septicemia (Phecode = 038.1), the range of Phecodes given by 010-041.99, involving bacterial infection) and, importantly, include the definition of the appropriate control group (Wei et al., 2017). Detailed description of Phecode trait maps can be found at phewascatalog.org. As an efficient and viable model for human genetics research, the Phecode system has been used to perform phenome-wide association studies (PheWAS) for validation of known genetic associations and discovery of new genetic disorders (Denny et al., 2013; Unlu et al., 2020).
Pathogen culture and virology data linked to whole-genome genetic information
For individuals with whole-genome genetic information, we analyzed pathogen (bacterial, mycobacterial, and fungal) culture data derived from the following positive cultures for the indicated clinical samples: 1) blood (n = 7,699), 2) sputum (n = 2,478), 3) sinus/nasopharyngeal (n = 1,820), 4) bronchial-alveolar lavage (n = 1,265), and 5) tracheal sampling (n = 422). Furthermore, we analyzed a respiratory panel containing 28 viral strains from 2,890 individuals with whole-genome genetic information. Viral strains included the following: 1) Adenovirus, 2) Bocavirus, 3) Bordetella parapertussis, 4) Bordetella pertussis, 5) Chlamydia pneumoniae, 6) Coronavirus 229E, 7) Coronavirus HKU1, 8) Coronavirus NL63, 9) Coronavirus NOS, 10) Coronavirus OC43, 11) Enterovirus/Rhinovirus, 12) Human Metapneumovirus, 13) Influenza A, 14) Influenza A, H1, 15) Influenza A, H1N1, 16) Influenza A, H3, 17) Influenza B, 18) Mycoplasma pneumoniae, 19) Parainfluenza, 20) Parainfluenza 1, 21) Parainfluenza 2, 22) Parainfluenza 3, 23) Parainfluenza 4, 24) Respiratory syncytial virus (RSV), 25) RSV, A, 26) RSV, B, and 27) Rhinovirus. The pathogen information for each individual in our study included: 1) Total number of cultures; 2) Number of negative cultures (i.e., no pathogen growth); 3) Number of ambiguous cultures (i.e., normal upper respiratory bacteria or low level contamination); 4) Number of positive cultures (i.e., the number of cultures with growth consistent with clinical infection); 5) Genus or genera isolated (up to 96 unique genera per sample site), which ranged from zero to 10 per sample.
METHODS DETAILS
GWAS of ID traits
GWAS of the ID traits were performed on the 23,294 and 4,321 BioVU individuals of European and African ancestry, respectively. Quality control pre-processing and SNP-level imputation were conducted, as previously described (Unlu et al., 2020). Genomic ancestry was quantified using principal components analysis of the genotype data (Derks et al., 2017; Price et al., 2006). The association analysis was performed using age, gender, batch, and the first five principal components as covariates.
Conditional SNP-level analysis
We performed conditional analysis on the top GWAS association with the ID trait (in this case, bacterial pneumonia) to determine whether it was driven by a related covariate (in this case, cystic fibrosis status). We used logistic regression to model the conditional probability of the infectious disease:

$$\operatorname{logit} P(Y = 1 \mid s, CF) = \beta_0 + \beta_1 s + \beta_2 \, CF$$

where s is the genotype at the sentinel variant, Y is the disease (i.e., bacterial pneumonia) status, CF is the covariate of interest (i.e., cystic fibrosis), and β₀, β₁, and β₂ are the regression coefficients.
Transcriptome-wide association studies (TWAS) using PrediXcan
We performed multi-tissue PrediXcan (Barbeira et al., 2019; Gamazon et al., 2018; Gamazon et al., 2015) in the 23,294 BioVU subjects. Experiment-wide significance was determined using Bonferroni correction for the total number of genes tested (n = 9,868) across 35 phenotypes (i.e., p < 1.4×10⁻⁷). Trait-specific significance was determined using Bonferroni correction for the total number of genes tested (n = 9,868, p < 5.07×10⁻⁶). Genomic ancestry was quantified using principal components analysis (Derks et al., 2017; Price et al., 2006). TWAS results were visualized using PhenoGram (Wolfe et al., 2013).
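The two thresholds quoted above follow from simple division. The sketch below reproduces the arithmetic; the family-wise error rate of 0.05 is an assumption (it is not stated in the text), chosen because it recovers the reported values.

```python
# Bonferroni thresholds implied by the counts reported in the text.
ALPHA = 0.05      # ASSUMPTION: family-wise error rate, not stated in the text
N_GENES = 9_868   # genes tested (from the text)
N_TRAITS = 35     # clinical ID traits (from the text)


def bonferroni_threshold(alpha: float, n_tests: int) -> float:
    """Per-test p-value threshold that controls the family-wise error rate at alpha."""
    return alpha / n_tests


# Correct for the genes tested within a single trait.
trait_specific = bonferroni_threshold(ALPHA, N_GENES)
# Correct for every gene-trait pair tested in the experiment.
experiment_wide = bonferroni_threshold(ALPHA, N_GENES * N_TRAITS)

print(f"trait-specific:  p < {trait_specific:.2e}")
print(f"experiment-wide: p < {experiment_wide:.2e}")
```

Under this assumption the trait-specific value rounds to 5.07×10⁻⁶ and the experiment-wide value falls between 1.4×10⁻⁷ and 1.5×10⁻⁷, consistent with the thresholds reported above.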
GWAS and TWAS Replication in the UK Biobank and FinnGen consortia
Replication of GWAS and TWAS was performed in the UK Biobank (Bycroft et al., 2018) and FinnGen consortia (Locke et al., 2019). We used the UK Biobank (http://www.nealelab.is/uk-biobank) and the FinnGen (https://www.finngen.fi/en/access_results) summary results to generate the gene-level associations. GTEx v6p models were used to generate tissue-level results.
Classification of pathogen infection based on serology and culture data using several classifiers
Let X be a classifier (e.g., the Phecode or a logistic regression classifier) of serology and culture data based infection for a given pathogen, with probability density φ+(x) for positive instances and probability density φ−(x) for negative instances. The ROC curve plots the sensitivity (SN) against one minus the specificity (SP) at various thresholds T:

$$SN(T) = \int_{T}^{\infty} \varphi_{+}(x)\,dx, \qquad 1 - SP(T) = \int_{T}^{\infty} \varphi_{-}(x)\,dx$$

The area Ω under the curve (AUC) is given by:

$$\Omega = \int_{-\infty}^{\infty} SN(T)\,\varphi_{-}(T)\,dT = \int_{-\infty}^{\infty}\int_{-\infty}^{\infty} I(A)\,\varphi_{+}(x)\,\varphi_{-}(T)\,dx\,dT, \qquad A = \{(x,T) : x > T\}$$

where I(A) is the indicator function, i.e., equal to one if (x,T) ∈ A and zero otherwise. The last expression equals the probability that the classifier X ranks a randomly chosen positive instance (of culture data based infection) higher than a randomly chosen negative instance. We note that the expression for Ω suggests other metrics of interest, for example:
Here (c, SN(c)) is the point on the ROC curve closest to true positive rate of 1 and false positive rate of 0. We estimated the sampling distribution of Ω (including standard error), using bootstrapping (n = 100) (Efron, 1979). We used the pROC package for visualization.
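The probabilistic reading of the AUC (the chance that a randomly chosen positive instance is ranked above a randomly chosen negative one) and its bootstrap standard error can be sketched in a few lines. This is a minimal illustration rather than the analysis code: the classifier scores are made up, the function names are hypothetical, and resampling positives and negatives separately (a stratified bootstrap) is an assumption about the resampling scheme.

```python
import random
from statistics import stdev


def auc(pos_scores, neg_scores):
    """Empirical AUC: share of (positive, negative) pairs ranked correctly; ties count 1/2."""
    wins = 0.0
    for p in pos_scores:
        for n in neg_scores:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos_scores) * len(neg_scores))


def bootstrap_auc_se(pos_scores, neg_scores, n_boot=100, seed=0):
    """Standard error of the AUC from n_boot stratified bootstrap resamples."""
    rng = random.Random(seed)
    estimates = []
    for _ in range(n_boot):
        # ASSUMPTION: resample with replacement within each class separately.
        pos_b = rng.choices(pos_scores, k=len(pos_scores))
        neg_b = rng.choices(neg_scores, k=len(neg_scores))
        estimates.append(auc(pos_b, neg_b))
    return stdev(estimates)


# Hypothetical classifier scores (illustrative values only).
pos = [0.9, 0.8, 0.7, 0.4]       # culture-positive instances
neg = [0.6, 0.5, 0.3, 0.2, 0.1]  # culture-negative instances

omega = auc(pos, neg)
se = bootstrap_auc_se(pos, neg, n_boot=100)
print(f"AUC = {omega:.2f}, bootstrap SE = {se:.3f}")
```

In this toy example 18 of the 20 positive-negative pairs are ordered correctly, so the empirical AUC is 0.9.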
Leveraging GWAS of human microbiome traits to extend GWAS of ID traits
We leveraged genome-wide associations of microbiome traits, involving 155 pathogens derived from phylogenetic analysis of 16S rRNA gene sequences (Hughes et al., 2020).
Causal inference by Mendelian Randomization
To infer causality between the infectious diseases and potential complications, we performed Mendelian Randomization (MR; Davey Smith and Hemani, 2014; Lawlor et al., 2008) in 23,294 individuals of European ancestry in BioVU. To define instrumental variables (IVs), we clumped the exposure-associated SNPs with high linkage disequilibrium (LD) using PLINK 1.9 (p < 1×10⁻⁵, r² = 0.01). Only biallelic non-palindromic variants were considered as IVs. Considering the pervasive horizontal pleiotropy in human genetic variation (Jordan et al., 2019), we applied summary statistics based MR-Egger regression (Bowden et al., 2015). MR-Egger regression generalizes the inverse-variance weighted method, where the intercept is assumed to be zero. We also used the weighted-median estimator (Bowden et al., 2016) to test the causal effect of the exposure trait on the outcome. We leveraged the R package ‘MendelianRandomization’.
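To make the relationship between the three estimators concrete, the following sketch computes point estimates from per-variant summary statistics. It is an illustrative re-implementation under stated assumptions, not the ‘MendelianRandomization’ package itself: the function names and the toy effect sizes are hypothetical, first-order weights (bx² / se_y²) are assumed, and standard errors and p-values are omitted.

```python
def ivw(bx, by, se_y):
    """Inverse-variance weighted estimate: weighted regression of by on bx through the origin."""
    num = sum(x * y / s**2 for x, y, s in zip(bx, by, se_y))
    den = sum(x * x / s**2 for x, s in zip(bx, se_y))
    return num / den


def mr_egger(bx, by, se_y):
    """MR-Egger: the same weighted regression with a free intercept.
    Returns (intercept, slope); a non-zero intercept suggests directional pleiotropy."""
    # Orient every variant so that its effect on the exposure is non-negative.
    x = [abs(b) for b in bx]
    y = [v if b >= 0 else -v for b, v in zip(bx, by)]
    w = [1.0 / s**2 for s in se_y]
    sw = sum(w)
    x_bar = sum(wi * xi for wi, xi in zip(w, x)) / sw
    y_bar = sum(wi * yi for wi, yi in zip(w, y)) / sw
    sxy = sum(wi * (xi - x_bar) * (yi - y_bar) for wi, xi, yi in zip(w, x, y))
    sxx = sum(wi * (xi - x_bar) ** 2 for wi, xi in zip(w, x))
    slope = sxy / sxx
    return y_bar - slope * x_bar, slope


def weighted_median(bx, by, se_y):
    """Weighted median of the per-variant ratio estimates (needs at least two variants)."""
    ratios = [y / x for x, y in zip(bx, by)]
    weights = [x * x / s**2 for x, s in zip(bx, se_y)]  # ASSUMPTION: first-order weights
    order = sorted(range(len(ratios)), key=lambda j: ratios[j])
    r = [ratios[j] for j in order]
    w = [weights[j] for j in order]
    total, cum, s = sum(w), 0.0, []
    for wj in w:
        cum += wj
        s.append((cum - 0.5 * wj) / total)  # mid-point cumulative weight
    below = max(j for j, sj in enumerate(s) if sj < 0.5)
    frac = (0.5 - s[below]) / (s[below + 1] - s[below])
    return r[below] + (r[below + 1] - r[below]) * frac  # linear interpolation


# Hypothetical summary statistics: four valid instruments with ratio 0.5
# and one pleiotropic outlier (last entry).
bx = [0.10, 0.12, 0.08, 0.15, 0.11]    # SNP effects on the exposure (ID trait)
by = [0.05, 0.06, 0.04, 0.075, 0.20]   # SNP effects on the outcome
se_y = [0.01] * 5                      # standard errors of by

print("IVW:", ivw(bx, by, se_y))
print("MR-Egger (intercept, slope):", mr_egger(bx, by, se_y))
print("Weighted median:", weighted_median(bx, by, se_y))
```

With these toy values the single outlier pulls the IVW estimate above 0.7, while the weighted median stays at 0.5, which illustrates why a median-based estimator is attractive when some instruments may be invalid.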
High-throughput Human in vitrO Susceptibility Testing (Hi-HOST)
We generated an atlas of TWAS associations with 79 pathogen-induced cellular traits – including infectivity and replication, cytokine levels, and host cell death (Wang et al., 2018) using the Hi-HOST platform (Ko et al., 2012; Ko et al., 2009). A list of populations, pathogens and project description may be found at http://h2p2.oit.duke.edu/About/, and phenotype definitions and family-based GWAS of the Hi-HOST Phenome Project were previously described (Wang et al., 2018). Briefly, lymphoblastoid cell lines (LCLs) from the 1000 Genomes Consortium (Auton et al., 2015) were obtained from the Coriell Institute. The LCLs represented diverse populations, including ESN (Esan in Nigeria), GWD (Gambians in Western Divisions in the Gambia), IBS (Iberian Population in Spain), and KHV (Kinh in Ho Chi Minh City, Vietnam). LCLs were cultured in RPMI 1640 media containing 10% fetal bovine serum, 2 mM glutamine, 100 U/ml of penicillin-G, and 100 mg/ml streptomycin for 8 days prior to experimental use, as previously described (Wang et al., 2018). Chlamydia trachomatis infection of LCLs was performed using C. trachomatis LGV-L2 RifR pGFP::SW2 (Saka et al., 2011). Salmonella infection was performed using pMMB67GFP (Pujol and Bliska, 2003), and sifA deletion was constructed using lambda red and validated using PCR (Datsenko and Wanner, 2000; Ko et al., 2009). Candida albicans SC5314 infection was performed as previously described (Odds et al., 2004) and levels of fibroblast growth factor 2 were measured using enzyme linked immunosorbent assays. Staphylococcus aureus toxin (alpha-hemolysin) was obtained from Sigma and applied to LCLs at a concentration of 1 μg/ml for 23 hours. Cell death was measured using 7-AAD staining and flow cytometry. Additional experimental details can be found at http://h2p2.oit.duke.edu/About/.
We estimated the gene-level effect size on the Hi-HOST phenotypes, using GWAS summary statistics (Barbeira et al., 2018) in each of the 44 GTEx tissues (version 6p) (Battle et al., 2017). The gene expression prediction model was trained using GTEx as the reference dataset (https://zenodo.org/record/3572842/files/GTEx-V6p-HapMap-2016-09-08.tar.gz). The gene-level effect size was estimated using S-PrediXcan after allele harmonization (Barbeira et al., 2018). We also applied MultiXcan to improve the ability to identify potential target genes (Barbeira et al., 2019). In brief, MultiXcan regresses the cellular trait on the principal components of the predicted expression data across all the available tissues. For each gene, MultiXcan yields a joint effect estimate across the 44 tissues. We applied the summary-statistic based version (S-MultiXcan) and followed the guides from the tool’s webpage https://github.com/hakyimlab/MetaXcan.
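As a rough illustration of how a gene-level statistic is assembled from GWAS summary statistics, the sketch below implements the S-PrediXcan approximation described by Barbeira et al. (2018), in which the gene z-score is a weighted sum of SNP z-scores, with weights from the expression prediction model scaled by the ratio of SNP to predicted-expression standard deviations. It is not the MetaXcan code: the function name, SNP identifiers, weights, and covariance values are hypothetical, and alleles are assumed to be already harmonized.

```python
import math


def spredixcan_z(weights, gwas_z, snp_cov):
    """Approximate gene-level z-score from SNP-level GWAS z-scores.

    weights : dict SNP -> prediction-model weight for this gene in one tissue
    gwas_z  : dict SNP -> GWAS z-score (effect size / standard error)
    snp_cov : dict SNP -> dict SNP -> genotype covariance in the reference panel
    """
    snps = list(weights)
    # Variance of predicted expression: w' * Gamma * w.
    var_g = sum(weights[a] * weights[b] * snp_cov[a][b] for a in snps for b in snps)
    sigma_g = math.sqrt(var_g)
    # Z_g ~ sum_l w_l * (sigma_l / sigma_g) * z_l
    return sum(
        weights[l] * math.sqrt(snp_cov[l][l]) / sigma_g * gwas_z[l] for l in snps
    )


# Hypothetical two-SNP model with uncorrelated variants (illustrative values only).
weights = {"rs_a": 0.3, "rs_b": -0.4}
gwas_z = {"rs_a": 2.0, "rs_b": -1.0}
snp_cov = {
    "rs_a": {"rs_a": 0.5, "rs_b": 0.0},
    "rs_b": {"rs_a": 0.0, "rs_b": 0.5},
}

z_gene = spredixcan_z(weights, gwas_z, snp_cov)
print(f"gene-level z-score: {z_gene:.2f}")
```

In practice the weights and covariances come from the trained prediction models and the reference panel, and the computation is repeated for each gene in each of the 44 tissues before the tissue-level results are combined.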
DATA AND SOFTWARE AVAILABILITY
All code is available at the project’s GitHub page: https://github.com/gamazonlab/infectiousDiseaseResource. All trait-level GWAS, PrediXcan, and Hi-HOST TWAS results are available at www.phewascatalog.org.
KEY RESOURCES TABLE
ACKNOWLEDGEMENTS
A.T.H. is supported by the National Institutes of Health (F30HL143826) and Vanderbilt University Medical Scientist Training Program (T32GM007347). E.R.G. is supported by the National Human Genome Research Institute of the National Institutes of Health under Award Numbers R35HG010718 and R01HG011138. E.R.G. and S.S.Z. are funded by the National Heart, Lung, & Blood Institute of the National Institutes of Health under Award Number R01HL133559. The content is solely the responsibility of the authors and does not necessarily represent the official views of the National Institutes of Health. E.R.G. has also significantly benefitted from a Fellowship at Clare Hall, University of Cambridge (UK) and is grateful to the President and Fellows of the college for a stimulating intellectual home. Genomic data are also supported by individual investigator-led projects including U01-HG004798, R01-NS032830, RC2-GM092618, P50-GM115305, U01-HG006378, U19-HL065962, and R01-HD074711. Additional funding sources for BioVU are listed at https://victr.vanderbilt.edu/pub/biovu/. L.B. is supported by R01-LM010685. S.J.S. is supported by NIH Director’s Pioneer and Transformative Awards DP1-HD086071 and R01-AI145057. D.C.K. is supported by R01-AI118903, R21-AI144586, and R21-AI146520. D.C.K. and L.W. are supported by R21-AI133305.