Abstract
Background Oral cavity cancer (OCC) is one of the most common carcinoma diseases. Recent genome-wide association studies (GWAS) have reported numerous genetic variants associated with OCC susceptibility. However, the regulatory mechanisms of these genetic variants underlying OCC remain largely unclear.
Objective This study aimed to identify OCC-related genetics risk genes contributing to the prognosis of OCC.
Methods By combining GWAS summary statistics (N = 4,151) with expression quantitative trait loci (eQTL) across 49 different tissues from the GTEx database, we performed an integrative genomics analysis to uncover novel risk genes associated with OCC. By leveraging various computational methods based on multi-omics data, risk genes were prioritized as promising candidate genes for drug repurposing in OCC.
Results Using two independent computational algorithms, we found that 14 risk genes whose genetics-modulated expressions showed a notable association with OCC. Among them, nine genes were newly identified, such as IRF4 (P = 2.5×10-9 and P = 1.06×10-4), TNS3 (P = 1.44×10-6 and P = 4.45×10-3), ZFP90 (P = 2.37×10-6 and P = 2.93×10-4), and DRD2 (P = 2.0×10-5 and P = 6.12×10-3).
These 14 genes were significantly overrepresented in several cancer-related terms, and 10 of 14 genes were enriched in 10 potential druggable gene categories. Based on differential gene expression analysis, the majority of these genes (71.43%) showed remarkable differential expressions between OCC patients and paracancerous controls. Integration of multi-omics-based evidence from genetics, eQTL, and gene expression, we identified that the novel risk gene of IRF4 exhibited the highest ranked risk score for OCC. Survival analysis showed that dysregulation of IRF4 expression was significantly associated with cancer patients outcomes (P = 8.1×10-5).
Conclusions In summary, we prioritized 14 OCC-associated genes with nine novel risk genes, especially the IRF4 gene, which provides a drug repurposing resource to develop therapeutic drugs for oral cancer.
1. Introduction
Oral cavity cancer (OCC) is one of the most common malignancy diseases [1, 2]. There are approximately 405,000 new OCC patients anticipated each year worldwide [3]. Although the rapid development of cancer diagnoses and therapies, the 5-year survival rate of OCC patients has remained at a dismal 50%[4]. Genetic components have been reported to largely influence OCC, especially variants located in alcohol-relevant genes, including alcohol dehydrogenase 7 (ADH7) and alcohol dehydrogenase 1B (ADH1B) [5, 6]. Hence, understanding the genetic underpinnings of OCC is of considerable interests, which may promote the advancement of individualized treatment strategies to maximize oncologic control and minimize the adverse influence of therapy.
With the development of high-throughput microarray and sequencing technologies, genome-wide association study (GWAS) as an effective method is widely applied to simultaneously examine the associations of millions of single nucleotide polymorphisms (SNPs) with complex diseases [7, 8]. The GWAS approach leveraged large-scale sample sizes to enhance the statistical power for discovering disease-associated risk SNPs and genes, such as the UK-Biobank project [9, 10] and the COVID-19 Host Genetics Initiative project [11–13]. These large GWAS consortia have identified hundreds of thousands of genetic variants associated with complex diseases [14]. In recent years, numerous GWASs have uncovered many genetic loci associated with OCC [15–19]. Due to GWAS applies a strict genome-wide significant threshold at P < 5×10-8, many genetic variants with small or moderate effect sizes were hard to be found in single GWAS analysis. This posed an intractable question that how to effectively pinpoint genuine variants from current existing GWAS summary statistics.
In addition, the vast majority of GWAS-reported SNPs were mapped within non-coding genomic regions[20, 21], indicating these variants potentially have cis-/trans-regulatory effects on modulating the expression level of a given gene rather than changing the function of its protein. Mounting genomics analyses [22–25] have been carried out to examine the regulatory mechanisms of GWAS-derived genetic variants, and determine whether GWAS-nominated genes whose aberrant alterations of expression contributing to disease pathogenesis due to genetic pleiotropy. For example, He et al. [22] reported a Sherlock integrative genomics tool based on Bayesian inference algorithm, which could incorporate genetic statistical values from GWAS summary data and expression quantitative trait loci (eQTL) data to systematically uncover the regulatory effects of genetic variants on gene expression for complex diseases [23, 26–28]. Recently, Barbeira and coworkers [29] introduced an efficient GWAS summary result-based extension statistical method termed S-MultiXcan, which could utilize the substantial sharing of eQTL across multiple tissues to enhance the capability to pinpoint potential target genes.
The primary goal of the current comprehensive genomics study is to identify genuine risk genes for OCC susceptibility and extend our understanding of genetic architecture implicated in the etiology of OCC. We first combined the converging effects of SNPs around a given gene, and then performed an integrative genomics analysis by incorporating GWAS summary statistics with eQTL to identify novel OCC-relevant genetic variants and susceptible genes. Moreover, by leveraging various computational methods based on multi-omics data, we prioritized risk genes as promising candidates for drug repurposing in OCC.
2. Materials and Methods
2.1. GWAS summary statistics on OCC
To yield genetic statistical information concerning OCC, we downloaded a GWAS summary statistics on oral cavity cancer from the MRC Integrative Epidemiology Unit (IEU) database (https://gwas.mrcieu.ac.uk/, accession ID: ieu-b-94) [16]. This resource, which is a manually curated collection of complete GWAS summary datasets on a brand of complex phenotypes, could be queried a database of the complete data. This dataset [16] comprised 1,223 OCC patients and 2,928 controls, part of the International Head and Neck Cancer Epidemiology Consortium (INHANCE). Written informed consent was signed for all subjects, and the ethical approval was approved by the respective institutional review boards. The PLINK 1.934 [30] was used to carry out systematic quality controls on genotype calling. After stringent quality control, a total of 7,510,833 common SNPs were included in the current analysis. More detailed clinical information was described in the original paper [16]. The web-access tool of LocusZoom (http://locuszoom.sph.umich.edu/) was used to visualize the regional genetic association signals for SNPs within risk genes. cis-eQTL and cis-sQTL analysis was conducted based on the GTEx online tool (https://gtexportal.org/home/).
2.2. Integrative genomics analysis of combining GWAS data with eQTL data
To identify risk genes whose genetically-regulated expression, we applied a comprehensive integrative genomics analysis to integrate GWAS summary statistics on OCC with expression quantitative trait loci (eQTL) data for 49 tissues from the GTEx database (vserion 8) [31]. First, based on the MASHR statistical method, we leveraged a linear regression model in S-PrediXcan to evaluate gene expression weights based on samples who having both genotypes and gene expression data. Through combining the variance and co-variance information of SNPs based on the reference panel of the 1,000 Genomes Project European Phase 3 [32], these estimated expression weights were assessed with a log odds ratio (OR) and standard errors from the GWAS summary data to calculate the levels of gene expression. For effective utilizing the information of significant genes whose expression has similar functions across different tissues, we further used the S-MultiXcan method to meta-analyze these MASHR results from S-PrediXcan to improve the statistical power. By integrating all available genes with convergent evidence across 49 GTEx tissues, the S-MultiXcan tool conducted a multivariate regression model for 22,331 genes, and FDR < 0.05 was considered to be of significance.
2.3. Gene-level genetic association analysis
To pinpoint risk genes associated with OCC, we conducted a gene-level genetic association analysis by incorporating the converging genetic signals around a given gene based on the Multi-marker Analysis of GenoMic Annotation (MAGMA) [7, 33]. We first extracted the SNP information including P values and chromosome position from GWAS summary statistics, and defined an extended genomic region for a specific gene containing the analyzed SNPs located within downstream or upstream 20kb of the gene. To calculate the linkage disequilibrium (LD) between SNPs, we used the 1,000 Genome Project Phase3 European Panel for reference [32]. To adjust the raw P values for multiple testing correction, we leveraged the widely-used method of Benjamini-Hochberg false discovery rate (FDR) [34], and FDR < 0.05 was considered to be of significance.
2.4. Pathway-based enrichment analysis
To understand the biological functions for these identified risk genes, we conducted a pathway-based enrichment analysis according to the authorized KEGG pathway resource [35] through utilizing the online-access tool of WEB-based Gene SeT AnaLysis Toolkit (WebGestalt; http://www.webgestalt.org) [36]. There were three well-documented methods used for enrichment analysis, including gene set enrichment analysis, network topology-based analysis, and over-representation analysis. By using the over-representation approach, we inputted these identified significant or suggestive gene lists for establishing the functional relationships of these genes with biological pathways. In addition, we also performed a gene-ontology (GO) enrichment analysis using the WebGestalt tool based on three widely-used functional categories: cellular component (CC), molecular function (MF), and biological process (BP). The number of genes in each pathway or GO-term were limited to a range between 5 and 2,000. The hypergeometric test was adopted to assess the significance level of each enrichment analysis, and the Benjamini-Hochberg FDR method was used for multiple testing correction.
2.5. Computer-based permutation analysis
To determine whether there exist prominently consistent results between genes from MAGMA analysis (N = 1,324, P < 0.05) and genes from S-MultiXcan analysis (N = 1,366, P < 0.05), we used a permutation method [11] to carry out a in silico permutation analysis with 100,000 times of random selections. In the first step, we calculated the overlapped genes between MAGMA and S-MultiXcan (N observation = 472 genes). In the second step, we leveraged all examined genes from S-MultiXcan analysis as background genes (N background =22,331 genes). By randomly selecting the same number of genes as those identified from S-MultiXcan analysis from background genes to overlap with genes identified from MAGMA for 100,000 times, we counted the number of overlapped genes in each random time (N random). The empirically permuted P value = and P value < 0.05 is of significance.
2.6. Protein-protein interaction network analysis
To uncover the functional interaction patterns of these identified OCC-risk genes, we performed a protein-protein interaction (PPI) network analysis based on two extensively-used databases of STRING (v11.0, https://string-db.org/) [37] and GeneMANIA (http://www.gene-mania.org)[38]. There were 14 risk genes as an input list for searching interaction relationships based on current existing proteomics and genomics data, including genetic interactions, co-expression links, physical interactions, pathway links, and co-localizations. The Cytoscape platform [39] was adopted for visualizing the subnetworks.
2.7. Drug-gene interaction analysis
To find effective gene-targeted drugs for OCC, we submitted these 14 significant OCC-risk genes into the widely-applied Drug Gene Interaction Database (DGIdb v.3.0.2, http://www.dgidb.org/) based on 20 databases with 51 known interaction types. Based on the druggable categories in the DGIdb database, we attempted to search genes with potential drug abilities for these 14 OCC-risk genes. Furthermore, based on the GLAD4U database [40], we applied the hypergeometric test to conduct a drug-based enrichment analysis of these identified common genes between MAGMA analysis and S-MultiXcan analysis (Supplemental Table S5) for obtaining a drug repurposing resource.
2.9. Gene expression profiles between OCC and matched controls
To further validate the functionality of these 14 identified OCC-risk genes, we downloaded two independent RNA expression datasets from NCBI GEO database (Accession Nos. GSE160042 and GSE139869). The first analyzed dataset, GSE160042, contained 10 fresh human oral squamous cell carcinoma and their paired adjacent non-cancerous counterparts. All these enrolled patients received radical surgery without any form of pre-surgical adjuvant therapy. The RNA expression profiles were detected by the Agilent-045142 Human LncRNA v4 4X180K Microarray. The second RNA expression dataset (i.e., GSE139869) contained five paired tongue squamous cell carcinoma and paracancerous tissues, detected the transcriptomes with the use of the Agilent-062918 OE Human lncRNA Microarray. The paired Student’s t-test was used to calculate the significance between cancer patients and controls, and the Corrplot R package was used to visualizing the co-expression patterns among these 14 risk genes. Based on the expression dataset on head and neck squamous cell carcinoma (N = 518) from Cancer Genome Atlas (TCGA) database, we applied the GEPIA v2 online tool (http://gepia2.cancer-pku.cn/#survival) to perform survival analysis for identified genes and plot Kaplan-Meier curves.
3. Results
3.1. The workflow of current integrative genomics analysis
Current study is a two-stage designed integrative genomics analysis to identify novel risk
variants and genes for OCC in the discovery stage, and replicate these results in the validation stage (Figure 1). In the discovery stage, we collected a GWAS summary dataset on OCC with 1,223 patients and 2,928 matched controls [16], and collected eQTL data across 49 tissues from the GTEx database [31]. Among these eQTL datasets, RNA sequencing expression samples with < 10 million mapped reads were excluded, and normalized across samples using the edgeR R package [41]. There were 838 donors having RNA sequencing data and genotype data. Combining the GWAS summary data with eQTL data, we highlighted risk genes whose genetically-regulated abnormal expression associated with OCC. In the validation stage, we further conducted subsequent comprehensive computational analyses by using multiple bioinformatics tools for replication.
3.2. Gene-based genetic association analysis for OCC
In view of the original study [16] did not perform a gene-level genetic association analysis, we conducted a MAGMA gene-based association analysis to incorporate the converging effects of SNPs in a given gene. There were 14 genes showing notable associations with OCC susceptibility (FDR < 0.05, Figure 2 and Supplemental Table S1); including IRF4 (P = 2.5×10-9), TNS3 (P = 1.44×10-6), NOTCH1 (P = 2.13×10-6), ZFP90 (P = 2.37×10-6), ATRN (P = 3.28×10-6), HLA-DQA1 (P = 6.45×10-6), DRD2 (P = 2.0×10-5), EPHX2 (P = 2.03×10-5), CLPTM1L (P = 2.13×10-5), SCL6A2 (P = 2.76×10-5), FGF7 (P = 2.78×10-5), GTF2IRD1 (P = 3.15×10-5), NLRP12 (P = 3.26×10-5), and CDH16 (P = 3.46×10-5). Quantile-quantile plot exhibited a low-level genomic inflation (λ = 1.02) in the gene-based association analysis (Supplemental Figure S1). There were two loci of 5p15.33 and 9q34 mapped by significant genes of CLPTM1L and NOTCH1 were consistent with the previous study [16].
Among these 14 genes with multiple converging genetic association signals, there were 15 index SNPs showing notable associations with OCC (Supplemental Figures S2-S16), such as rs10775305 in ZFP90, rs6499103 in CDH16, rs4586205 in DRD2, rs10283378 in EPHX2, and rs12916839 in FGF7. For the most significantly MAGMA-identified gene of IRF4, the top lead SNP rs7773324 (P = 7.8×10-7) has high CADD score and regulomeDB score with strong activating chromatin state (Figure 3 and Supplemental Figure S17). In addition, both top lead SNPs of rs3125011 in NOTCH1 (P = 7.9×10-7) and rs12916839 (P = 6.28×10-7) in FGF7 have high CADD score and regulomeDB score with strong activating chromatin state (Supplemental Figures S18-S19). We observed that rs12916839 exhibited prominently allelic specific associations with FGF7 among multiple tissues, and the T allele of rs1296839 showed down-regulated effects on the expression of FGF7 (Supplemental Figure S20). Furthermore, the rs12916839 represents a splicing quantitative trait locus (sQTL) for FGF7 (Supplemental Figures S21-S22).
Moreover, a number of 1,310 genes showed suggestive associations with OCC susceptibility from MAGMA analysis (P < 0.05, Supplemental Table S1). Among these suggestive genes, 191 genes have been documented to be implicated in other cancers in the GWAS Catalog database (Supplemental Table S1), and 11 genes of HLA-DQB1, ZNF512, CCDC121, C2orf16, GPN1, GALNT14, AIF1, CTLA4, MICB, FAM175A, and ADH7 have been reported to be associated with OCC in previous studies (Supplemental Table S1). To examine their biological functions, we performed a pathway-based enrichment analysis of these suggestive genes, and found that six biological pathways were significantly associated with OCC (FDR < 0.05, Figure 4A-B, Supplemental Table S2). For example, staphylococcus aureus infection (P = 8.04×10-6), antigen processing and presentation (P = 9.18×10-5), graft-versus-host disease (P = 2.01×10-4), and cell adhesion molecules (P = 8.40×10-4). By performing a GO enrichment analysis, we found that there were 4 cellular component (CC) terms and 11 molecular function (MF) terms significantly enriched, including MHC protein complex, synaptic membrane, endonuclease complex, and GABA receptor activity (Figure 4C-D, and Supplemental Tables S3-S4). Overall, these results suggest that these observed genes may be a good resource for pinpointing causal risk genes for OCC.
3.3. Integrative genomics analysis identifies OCC-risk genes
By integrating GWAS summary statistics with eQTL data across 49 GTEx tissues using the S-MultiXcan method, we identified that 1,366 genes whose aberrant gene expressions were significantly or suggestively associated with OCC (P < 0.05, Supplemental Table S5). Such as ATRN (P = 5.9×10-6), VWA7 (P = 1.64×10-5), FGD4 (P = 3.75×10-5), TERT (P = 6.61×10-5), HLA-DRB1 (P = 8.51×10-5), CDH3 (P = 8.84×10-5), and BTNL2 (P = 8.89×10-5). Among them, 33 genes have been published to be significantly associated with OCC, such as HLA-DQA1, CLPTM1L, ZNF512, HLA-DQB1, GALNT14, CCDC121, CTLA4, GPN1, C2orf16, TENM3, LAMC3, ADH7, MICB, CCDC192, AIF1, RERGL, OR52N2, METAP1, WDFY3, ALDH2, MUC21, IL1A, KCNC4, STK31, BTBD11, and ADCY2 [6, 15, 16, 42]. By comparing these identified genes between S-MultiXcan and MAGMA, we observed that 472 genes have shared significant evidence between two methods (Figure 5A and Supplemental Table S5). Based on 100,000 times of in silico permutation analysis, we found that there existed a prominent concordance of results from between S-MultiXcan and MAGMA analysis (Empirical P < 1.0×10-5, Figure 5B), suggesting that these findings were relatively reliable and could be used for subsequent analyses.
Furthermore, we performed two pathway enrichment analyses for these 472 common genes based on two independent pathways resources of KEGG and Reactome. We observed that 18 biological pathways were significantly overrepresented (FDR < 0.05, Figure 5C-D and Supplemental Tables S6-S7). For example, staphylococcus aureus infection (P = 5.09×10-8), allograft rejection (P = 2.35×10-5), autoimmune thyroid disease (P = 2.79×10-5), graft-versus-host disease (P = 3.94×10-5), type I diabetes mellitus (P = 5.43×10-5), and activation of C3 and C5 (P = 2.78×10-5), reminiscing that these pathways showed significant enrichments in aforementioned pathway analysis.
To explore the druggable actions of these 472 common genes, we carried out a drug-based enrichment analysis based on the GLAD4U database. We found that seven drug-terms were significantly enriched (FDR < 0.05, Figure 5E and Supplemental Table S8), including mumps vaccines (P = 5.73×10-6), oxcarbazepine (P = 1.67×10-5), trichloroethylene (P = 3.16 ×10-5), rubella vaccines (P = 4.32×10-5), desflurane (P = 9.79×10-5), and lumiracoxib (P = 9.79×10-5). In addition, there were 45 suggestively-enriched drug-terms by these OCC-related common genes (P < 0.05, Supplemental Table S8). These observed drug terms and risk genes may provide a repurposing resource for searching drug targets for OCC.
3.4. Prioritization of OCC-risk genes
Among the 14 MAGMA-identified significant genes (FDR < 0.05), 13 genes (13/14=92.86%) were significantly replicated by using S-MultiXcan integrative analysis (Figure 6A). There exist a moderate correlation of these top-ranked significant genes from two independent tools (r = 0.386, Figure 6B). Five genes of HLA-DQA1, NOTCH1, FGF7, SLC6A2, and CLPTM1L have been reported to be associated with oral and pharynx cancer [15–19]. By performing the S-PrediXcan integrative analysis, we found that several genes showed significant associations with oral cancer for each specific tissue (Supplemental Figures S23-S26). By using the DisGeNET algorithm [43] in the Metascape tool [44], these 14 risk genes were significantly enriched in cancer-related categories, including malignant carcinoid syndrome, small lymphocytic lymphoma, basal cell neoplasm, and basal cell cancer (FDR < 0.05, Supplemental Figure S27 and Table S9).
To explore whether there exist collective functions of these 14 genes on OCC, we conducted a PPI network enrichment analysis based on two databases of STRING and GeneMAINA, and these genes were remarkably enriched in a functional subnetwork (P < 0.05, Figure 6C), suggesting that these risks have highly biological interactions implicated in oral cancer. Compared with patients with non-oral squamous cell carcinoma or other squamous cell carcinomas, mutations in NOTCH1 have been frequently identified in patients with oral squamous cell carcinomas [17]. The genes of IRF4, GTF2IRD1, FGF7, and SLC6A2 showed functional connections with NOTCH1. An earlier study [45] showed that there existed a genetic interaction of NOTCH1 with SLC6A2 and GTF2IRD1. By performing an interlaboratory comparability study, Dobbin et al. [46] reported that the cancer-related gene of ARTN was highly co-expressed with EPHX2.
Based on a drug-gene interaction analysis, we observed that 10 of 14 OCC-associated genes (71.43%) were overrepresented in 10 potential “druggable” gene categories, such as druggable genome, cell surface, drug resistance, transcription factor, growth factor, and G-protein coupled receptor (Supplemental Figure S28). There were five genes, namely NOTCH1, SLC6A2, DRD2, EPHX2, and HLA-DQA1, found to be targeted at least four drugs (Figure 6D). For example, the gene of NOTCH1 is targeted by 10 drugs, including brontictuzumab, nirogacestat, crenigacestat, wnt-974, rg-4733, asparaginase, ribociclib, mercaptopurine, temozolomide, and prednisolone. DRD2 gene shows molecular interactions with 10 drugs, including nemonapride, pipotiazine, bupropion, pramipexole, ropinirole, pridopidine, bromperidol, acetophenazine, and aripiprazole. These results suggest that these 14 risk genes predisposed to have collective functions on OCC.
3.5. Differential expression of 14 risk genes between OCC patients and control samples
By using the expression dataset of GSE139869, we conducted a DGE analysis of these 14 risk genes. Three genes not expressed in this dataset. We found that six of 11 genes were prominently expressed between OCC patients and paracancerous controls (Figure 7A). ATRN (P = 0.027), CLPTM1L (P = 0.007), IFR4 (P = 0.0054), and TNS3 (P = 7.6×10-4) showed significantly higher expressions among OCC patients than that among controls, and EPHX2 (P = 4.0×10-5) and GTF2IRD1 (P = 0.017) exhibited significantly lower expressions among OCC patients than that among controls (P < 0.05, Figure 7A). Consistently, by analyzing the expression dataset of GSE160042, we observed that 10 of 14 genes were significantly differentially expressed between OCC patients and matched controls (Figure 7B). ATRN (P = 0.014), CLPTM1L (P = 8.7×10-5), HLA-DQA1 (P = 0.003), IRF4 (P = 4.6×10-4), TNS3 (P = 1.2×10-6), and ZFP90 (P = 0.043) were prominently highly expressed in OCC patients, and CDH16 (P = 9.6×10-5), EPHX2 (P = 2.1×10-5), GTF2IRD1 (P = 3.3×10-4), NOTCH1 (P = 4.6×10-4) were remarkably down-expressed in OCC patients (Figure 7B).
Furthermore, to determine whether the co-expression patterns among these 14 genetics-risk genes were changed by OCC disease status, we conducted a Pearson correlation analysis for the expression data from three datasets. Interestingly, we observed that an obvious change of the co-expression patterns among 14 genes categorized by OCC status (Supplemental Figures S29-S30). For example, the correlation coefficient between ATRN and CLPTM1L was fundamentally increased from -0.356 in controls to 0.779 in OCC patients among the GSE160042 dataset.
Integrating multi-omic evidence across different datasets, we calculated the ranked scores of these 14 risk genes by summing five pieces of evidence from genetic, eQTL, and gene expression, and found that the novel risk gene of IRF4 exhibited the highest rank score (Figure 7C). Based on a large-scale TCGA dataset on head and neck squamous cell carcinoma (N = 518), we found the expression level of IRF4 gene was significantly associated with patients’ survive (P = 8.1×10-5, Figure 7D).
4. Discussion
Previous genetic studies have indicated that there existed a crucial role of genetic susceptibility in the etiology of OCC [47, 48]. Many genes have been reported to be associated with OCC, such as ADH1B and ADH7 [5, 6, 49]. Lessuer and coworkers [16] found four genetic loci, including 2p23.3, 5p15.33, 9q15.3 and 9q34.12, were prominently associated with OCC. Another study [50] identified six loci including 5q14.3, 6p21.33, 6q16.1, 11q12.2, 12q24.21 and 16p13.2 were remarkably associated with the laryngeal cancer risk. Recently, Shete et al. [15] conducted a two-phase GWAS based on non-Hispanic white for identifying oral cancer-associated genetic loci, and replicated four known cancer loci (namely 2p23.1, 5p15.33, 6p21.32, and 6p21.33) associated with both oral cancer and oropharyngeal cancer risk. However, these reported genetic risk loci are still limited and only explain a small proportion of heritability. More comprehensive genomic studies are warranted.
In the current study, by performing an integrative genomics analysis, we found that 14 risk genes, including IRF4, TNS3, NOTCH1, ZFP90, ATRN, HLA-DQA1, DRD2, EPHX2, CLPTM1L, SLC6A2, FGF7, GTF2IRD1, NLRP12, and CDH16, contribute to OCC susceptibility. By collecting cancer-related genes from the GWAS Catalog database, seven genes that are IRF4, TNS3, HLA-DQA1, EPHX2, CLPTM1L, FGF7, and NLRP12 have been documented to be associated with the susceptibility of other cancer types [51–58], and five genes that are NOTCH1, FGF7, SLC6A2, HLA-DQA1, and CLPTM1L have been reported to be associated with oral and pharynx cancer [15–19]. Based on the DisGeNET method, we observed that 14 genes were significantly enriched in cancer-related terms, including malignant carcinoid syndrome, small lymphocytic lymphoma, basal cell neoplasm, and basal cell cancer. These results suggest the 14 identified risk genes may have important functions in OCC. Additionally, we also found that 71.43% of these genes were significantly enriched in 10 druggable gene categories, and five genes that are NOTCH1, SLC6A2, DRD2, HLA-DQA1, and EPHX2 were targeted by at least one known drug, indicating that these identified genes have druggable functions and may as good candidates for drug repurposing in OCC.
The novel risk gene IRF4, which is a tumor suppressor gene, encodes a protein that belongs to the interferon regulatory factor family of transcription factors, which have important roles in modulating interferon-inducible genes and activating the innate and adaptive immune systems. The IRF4 gene was initially found as the transcription factors responsible for the generation of regulatory CD4+T cells specifically controlling Th2 response [59, 60]. Zhang and coworkers [60] demonstrated that mice with specific ablation of Irf4 in T regulatory cells develop multi-organ autoimmunity because of exacerbated Th1, Tfh, and Th17 responses and plasma cell infiltration. Multiple lines of evidence have demonstrated that IRF4 gene was significantly associated with numerous human cancers, including myeloma [61], non-small-cell lung cancer [62], breast cancer [63], and large B-cell lymphomas [64]. By performing a large-scale bisulfite genomic sequencing analysis, Vernier et al. [63] reported that reversal of promoter hyper-methylation and consequently de-repression of IRF4 gene is a factor contributing to the suppression of breast cancer.
The NOTCH1 gene, which encodes a member of the NOTCH family of proteins, has an important role in the development of numerous cells and tissue types. Mutations in the NOTCH1 gene have been reported to be associated with chronic lymphocytic leukemia [65–68], pancreatic cancer metastasis [69], head and neck squamous cell carcinoma [70–73], T-cell acute lymphoblastic leukemia [74], and oral cavity or tongue cancer [17, 75, 76]. For example, based on whole exome sequencing and gene copy number analyses, Agrawal et al. [73] reported approximate 40% of the 28 mutation in the NOTCH1 gene were predicted to truncate the gene product, which indicated that NOTCH1 may play as a tumor suppressor gene instead of an oncogene. Meanwhile, Stransky and colleagues [70] performed whole exome-wide sequencing analysis and found at least 30% of patients harbored mutations in genes including NOTCH1, which is a major driver of head and neck squamous cell carcinoma for modulating squamous differentiation. Using the literature-mining analysis by combining 26 studies with 17,675 samples, Ma et al. [77] found 48 smoking-related methylation genes including NOTCH1 in the 11 oncogenic pathways, which were associated with lung, oral, and other cancers. Furthermore, we identified that there were 10 drugs targeted with NOTCH1. For example, previous studies have demonstrated that brontictuzumab is a humanized monoclonal antibody that targets NOTCH1 and inhibits pathway activation and is designed to treat cancer [78], and nirogacestat, which has an inhibitory interaction with NOTCH1, is an oral small-molecule gamma secretase inhibitor in phase 3 clinical development for patients with cancer [79].
The druggable gene of SLC6A2, which encodes a member of the sodium neurotransmitter symporter family that is responsible for the reuptake of norepinephrine into presynaptic nerve terminals, was identified to be targeted by ten drugs, including reboxetine, dexmethylphenidate, and guanethidine. Previous studies have demonstrated that SLC6A2 is associated with neuroblastoma [80], colorectal cancer [81], and head and neck cancer [18]. Multiple lines of evidence have shown that a number of neurotransmitters have roles in the development and maintenance of fatigue and energy levels in breast cancer [82]. With regard to the HLA-DQA1 druggable gene, it belongs to the HLA class II alpha chain paralogues and plays a crucial role in the immune system by presenting extracellular peptides. Multiple earlier studies have demonstrated that mutations in HLA-DQA1 are significantly associated with oral cancer [16, 83], hepatocellular carcinoma [84], cervical cancer[85], and gastric cancer [86].
With regard to the druggable gene of DRD2, it encodes the D2 subtype of the dopamine receptor, which belongs to the G-protein coupled receptor that inhibits adenylylcyclase activity. DRD2 as a G-protein coupled receptor showed over-expressions in glioblastoma and controls growth factor signaling via cross-talk mechanisms involving beta-arrestin and scaffold protein [87]. By blockading DRD2 in a preclinical model of glioblastoma, it enables to inactivate growth factor signaling and induce tumor cell death [88]. Based on results from epidemiological statistical analysis, patients with schizophrenia, who inherently have elevated DRD2 signaling, have an elevated risk of cancer, whereas this risk could be recovered to normal by treating with DRD2 antagonists [89]. Furthermore, DRD2 has been reported to be involved in other cancer types, including breast cancer, colorectal cancer, and prostate cancer [90–92]. Mounting evidence has shown that DRD2 was significantly associated with smoking behaviors [93–98] and alcohol consumption [99–101] among various ethnicities. For example, by performing a meta-analysis based on 11,075 participants, Ma et al. [93] reported that the Taq1A A1/* genotypes were remarkably associated with smoking cessation under both the fixed- and random-effects models. Based on a GWAS of self-reported alcohol consumption in 112,117 subjects from the UKBiobank database, Clarke et al. [100] demonstrated that DRD2 was found to implicate in the neurobiology of alcohol use. Taken together, these identified risk genes potentially have important roles in the carcinogenesis of oral cavity cancer.
There exist some limitations of the present study needed to be cautious. Although we collected multiple layers of omics data, there were other useful omics data not adopted in the current investigation, such as histone marks [102–105] and DNAase I-hypersensitive sites (DHS) [106, 107]. It should be noted that the sample size used in the GWAS summary statistics was not very large. Furthermore, due to the heterogeneity across different datasets, we applied distinct correction methods for multiple testing for each omics dataset. The current study was based on the European population, and we did not validate these OCC-risk genes in other ethnicities. Further studies are needed to evaluate the inference by using genotype and gene expression data from other ancestries.
Conclusion
In summary, our integrative genomics analysis provides an effective method for identifying risk genes that convey susceptibility to OCC, and highlights 14 important OCC-risk genes with 9 novel risk genes, such as IRF4, TNS3, and DRD2. These identified genes targeted with drugs may be promising candidates for drug repurposing in the future pharmaceutical studies. Furthermore, we connected risk variants with susceptible genes and functional processes, providing a good clue for explaining the effects of genetic variants on OCC. More relevant studies are warranted to study the molecular functions and therapeutic efficacy of these identified risk genes.
Data Availability
All the GWAS summary statistics on oral cancer applied in the present analysis can be obtained from the MRC Integrative Epidemiology Unit (IEU) database (https://gwas.mrcieu.ac.uk/, accession ID: ieu-b-94). The GTEx eQTL data (version 8) were gained from the Zenodo repository (https://zenodo.org/record/3518299#.Xv6Z6igzbgl). The 1,000 Genomes Project European Phase 3 reference panel was downloaded from the IGSR: The International Genome Sample Resource (https://www.internationalgenome.org/category/phase-3/).
Authors’ contributions
Y.L., X.X., Z.W., M.W., Y.M., and Y.H. assimilated the data, performed integrative genomics analysis, and wrote the manuscript. Y.L., Y.M., Y.Y., and M.W. were involved in data collection and reviewed the manuscript. Y.M. and M.W. conceived the current study and reviewed the manuscript. All authors have reviewed the current version, and approved the finial manuscript.
Conflicts of Interests
The authors declare no conflicts of interest.
Funding
This research was supported by National Natural Science Foundation of China (No. 81970956 to M.W.), and the Scientific Research Foundation for Talents of Wenzhou Medical University (KYQD20201001 to Y.M.). The funders had no role in the designing and conducting of this study and collection, analysis, and interpretation of data and in writing the manuscript.
Data Availability Statement
All the GWAS summary statistics on oral cancer applied in the present analysis can be obtained from the MRC Integrative Epidemiology Unit (IEU) database (https://gwas.mrcieu.ac.uk/, accession ID: ieu-b-94). The GTEx eQTL data (version 8) were gained from the Zenodo repository (https://zenodo.org/record/3518299#.Xv6Z6igzbgl). The 1,000 Genomes Project European Phase 3 reference panel was downloaded from the IGSR: The International Genome Sample Resource (https://www.internationalgenome.org/category/phase-3/).
Supplemental Figures
Supplemental Figure S1. Quantile-quantile (QQ) plot of the MAGMA-based gene-level association analysis on GWAS summary statistics.
Supplemental Figure S2. The 15 lead SNPs with 14 genomic loci associated with OCC. Left panel shows the ID, P value, risk allele, and chromosome of each index SNP, and right panel shows the odds ratio with 95% confidence interval.
Supplemental Figure S3. Regional plot for the OCC-associated risk gene of ZFP90 identified from MAGMA analysis. The purple diamond marks the most strongly associated SNP of rs10775305 in each gene with OCC. The color illustrates LD information with the given SNP, as shown in the color legend.
Supplemental Figure S4. Regional plot for the OCC-associated risk gene of CLPTM1L identified from MAGMA analysis. The purple diamond marks the most strongly associated SNP of rs10462706 in each gene with OCC. The color illustrates LD information with the given SNP, as shown in the color legend.
Supplemental Figure S5. Regional plot for the OCC-associated risk gene of CDH16 identified from MAGMA analysis. The purple diamond marks the most strongly associated SNP of rs6499103 in each gene with OCC. The color illustrates LD information with the given SNP, as shown in the color legend.
Supplemental Figure S6. Regional plot for the OCC-associated risk gene of DRD2 identified from MAGMA analysis. The purple diamond marks the most strongly associated SNP of rs4586205 in each gene with OCC. The color illustrates LD information with the given SNP, as shown in the color legend.
Supplemental Figure S7. Regional plot for the OCC-associated risk gene of EPHX2 identified from MAGMA analysis. The purple diamond marks the most strongly associated SNP of rs10283378 in each gene with OCC. The color illustrates LD information with the given SNP, as shown in the color legend.
Supplemental Figure S8. Regional plot for the OCC-associated risk gene of HLA-DQA1 identified from MAGMA analysis. The purple diamond marks the most strongly associated SNP of rs9271378 in each gene with OCC. The color illustrates LD information with the given SNP, as shown in the color legend.
Supplemental Figure S9. Regional plot for the OCC-associated risk gene of TNS3 identified from MAGMA analysis. The purple diamond marks the most strongly associated SNP of rs35894636 in each gene with OCC. The color illustrates LD information with the given SNP, as shown in the color legend.
Supplemental Figure S10. Regional plot for the OCC-associated risk gene of GTF2IRD1 identified from MAGMA analysis. The purple diamond marks the most strongly associated SNP of rs4717897 in each gene with OCC. The color illustrates LD information with the given SNP, as shown in the color legend.
Supplemental Figure S11. Regional plot for the OCC-associated risk gene of NOTCH1 identified from MAGMA analysis. The purple diamond marks the most strongly associated SNP of rs3125011 in each gene with OCC. The color illustrates LD information with the given SNP, as shown in the color legend.
Supplemental Figure S12. Regional plot for the OCC-associated risk gene of FGF7 identified from MAGMA analysis. The purple diamond marks the most strongly associated SNP of rs12916839 in each gene with OCC. The color illustrates LD information with the given SNP, as shown in the color legend.
Supplemental Figure S13. Regional plot for the OCC-associated risk gene of SLC6A2 identified from MAGMA analysis. The purple diamond marks the most strongly associated SNP of rs2397776 in each gene with OCC. The color illustrates LD information with the given SNP, as shown in the color legend.
Supplemental Figure S14. Regional plot for the OCC-associated risk gene of NLRP12 identified from MAGMA analysis. The purple diamond marks the most strongly associated SNP of rs4806514 in each gene with OCC. The color illustrates LD information with the given SNP, as shown in the color legend.
Supplemental Figure S15. Regional plot for the OCC-associated risk gene of ATRN identified from MAGMA analysis. The purple diamond marks the most strongly associated SNP of rs6051873 in each gene with OCC. The color illustrates LD information with the given SNP, as shown in the color legend.
Supplemental Figure S16. Regional plot for the OCC-associated risk gene of ATRN identified from MAGMA analysis. The purple diamond marks the most strongly associated SNP of rs17711842 in each gene with OCC. The color illustrates LD information with the given SNP, as shown in the color legend.
Supplemental Figure S17. Multiple layers of evidence supporting the functional role of rs7773324 in IRF4. There were three distinct data used for assessing the variant functionality, including CADD score, regulomeDB, and chromatin state. This plot was generated by using the FUMA online tool (https://fuma.ctglab.nl/).
Supplemental Figure S18. Multiple layers of evidence supporting the functional role of rs3125011 in NOTCH1. There were three distinct data used for assessing the variant functionality, including CADD score, regulomeDB, and chromatin state. This plot was generated by using the FUMA online tool (https://fuma.ctglab.nl/).
Supplemental Figure S19. Multiple layers of evidence supporting the functional role of rs12916839 in FGF7. There were three distinct data used for assessing the variant functionality, including CADD score, regulomeDB, and chromatin state. This plot was generated by using the FUMA online tool (https://fuma.ctglab.nl/).
Supplemental Figure S20. Violin plots showing rs12916839 influences the expression of FGF7 across multiple tissues. A)-B) Violin plots showing rs12916839 represents an eQTL quantitative trait locus for FGF7 across multiple GTEx tissues. C)-I) Violin plots showing rs12916839 represents a splicing quantitative trait locus for FGF7 across multiple GTEx tissues.
Supplemental Figure S21. Violin plots showing rs12916839 influences the expression of FGF7 across multiple tissues. For A)-I), there were 9 tissues showing a significant sQTL for rs12916839.
Supplemental Figure S22. Violin plots showing rs12916839 represents a splicing quantitative trait locus for FGF7 across multiple GTEx tissues. For A)-L), there were 12 tissues showing a significant sQTL for rs12916839.
Supplemental Figure S23. S-PrediXcan-based integrative genomics analysis identified risk genes associated with oral cancer in each tissue. A) Whole blood, B) Thyroid, C) Skin sun exposed lower leg, D) Skin not sun exposed suprapubic, E) Muscle skeletal, and F) Minor salivary gland.
Supplemental Figure S24. S-PrediXcan-based integrative genomics analysis identified risk genes associated with oral cancer in each tissue. A) Liver, B) Lung, C) Esophagus Mucosa, D) Adrenal Gland, E) Adipose Visceral Omentum, and F) Adipose Subcutaneous.
Supplemental Figure S25. S-PrediXcan-based integrative genomics analysis identified risk genes associated with oral cancer in each tissue. A) Esophagus_Gastroesophageal_Junction, B) Esophagus Muscularis, C) Artery Aorta, D) Artery Coronary, E) Cells- EBV-transformed lymphocytes, and F) Cells-Cultured fibroblasts.
Supplemental Figure S26. S-PrediXcan-based integrative genomics analysis identified risk genes associated with oral cancer in each tissue. A) Colon Transverse , B) Pancreas, C) Colon Sigmoid, D) Pituitary, E) Small intestine terminal ileum, and F) Stomach.
Supplemental Figure S27. Bar plot shows the enrichment results of these 14 common risk genes identified from MAGMA and S-MultiXcan by using the DisGeNET algorithm in the Metascape.
Supplemental Figure S28. Drug-gene enrichment analysis shows 10 of 14 risk genes enriched in 10 potential druggable gene categories based on the DGIdb database.
Supplemental Figure S29. Co-expression patterns of 14 genes between OCC patients and paracancerous controls based on the GSE139869 dataset.
Supplemental Figure S30. Co-expression patterns of 14 genes between OCC patients and paracancerous controls based on the GSE160042 dataset.
Acknowledgements
Not applicable.
Abbreviations
- OCC
- Oral cavity cancer
- GWAS
- Genome-wide association study
- eQTL
- expression quantitative loci trait
- sQTL
- splicing quantitative loci trait
- SNP
- Single nucleotide polymorphism
- CC
- Cellular component
- MF
- Molecular function
- BP
- Biological process
- IEU
- Integrative Epidemiology Unit
- INHANCE
- International Head and Neck Cancer Epidemiology Consortium
- OR
- Odds ratio
- MAGMA
- the Multi-marker Analysis of GenoMic Annotation
- LD
- Linkage disequilibrium
- FDR
- False discovery rate
- WebGestalt
- WEB-based Gene SeT AnaLysis Toolkit
- GO
- Gene Ontology
- PPI
- Protein-protein interaction
- GEO
- the database of Gene Expression Omnibus
- TCGA
- The Cancer Genome Atlas.
References
- 1.↵
- 2.↵
- 3.↵
- 4.↵
- 5.↵
- 6.↵
- 7.↵
- 8.↵
- 9.↵
- 10.↵
- 11.↵
- 12.
- 13.↵
- 14.↵
- 15.↵
- 16.↵
- 17.↵
- 18.↵
- 19.↵
- 20.↵
- 21.↵
- 22.↵
- 23.↵
- 24.
- 25.↵
- 26.↵
- 27.
- 28.↵
- 29.↵
- 30.↵
- 31.↵
- 32.↵
- 33.↵
- 34.↵
- 35.↵
- 36.↵
- 37.↵
- 38.↵
- 39.↵
- 40.↵
- 41.↵
- 42.↵
- 43.↵
- 44.↵
- 45.↵
- 46.↵
- 47.↵
- 48.↵
- 49.↵
- 50.↵
- 51.↵
- 52.
- 53.
- 54.
- 55.
- 56.
- 57.
- 58.↵
- 59.↵
- 60.↵
- 61.↵
- 62.↵
- 63.↵
- 64.↵
- 65.↵
- 66.
- 67.
- 68.↵
- 69.↵
- 70.↵
- 71.
- 72.
- 73.↵
- 74.↵
- 75.↵
- 76.↵
- 77.↵
- 78.↵
- 79.↵
- 80.↵
- 81.↵
- 82.↵
- 83.↵
- 84.↵
- 85.↵
- 86.↵
- 87.↵
- 88.↵
- 89.↵
- 90.↵
- 91.
- 92.↵
- 93.↵
- 94.
- 95.
- 96.
- 97.
- 98.↵
- 99.↵
- 100.↵
- 101.↵
- 102.↵
- 103.
- 104.
- 105.↵
- 106.↵
- 107.↵