Abstract
Background Ovarian cancer (OC) is a significant gynecological malignancy characterized by a high mortality rate, poor long-term survival, and late-stage diagnosis. OC is the 5th leading cause of cancer death among women and accounts for 2.1% of all cancer deaths. OC survival rates are much lower than those of other cancers that affect women: the 5-year survival rate is less than 50%. Only ∼17% of OC patients are diagnosed at an early stage. The majority are diagnosed at an advanced stage, making early detection and effective treatment critical challenges. Currently, the identified OC predictive genes are still very sparse, resulting in poor prognostic performance. There is an unmet need to identify novel prognostic gene biomarkers for OC occurrence, survival, and clinical stage in order to improve the likelihood of survival and to enable optimal treatment or therapeutic strategies at the earliest stage possible.
Methods Previous RNAseq analyses of OC focused on detecting differentially expressed (DE) genes only. Many genes, although having weak marginal differential effects, may still exert strong predictive effects on disease outcomes through regulating other DE genes. In this work, we employed a new machine learning method, netLDA, to detect such predictive coregulating genes with weak marginal DE effects for predicting OC occurrence, 5-year survival, and clinical stage. netLDA detects predictive gene networks (PGNs) containing strong DE genes as hub genes and detects coregulating weak genes within the PGNs. The network structures of the detected PGNs, along with the strong and weak genes therein, are then used in outcome prediction on test datasets.
Results We identified different sets of signature genes for OC occurrence, survival, and clinical stage. Previously identified prognostic genes, such as EPCAM, UBE2C, CHD1L, TP53, CD24, WFDC2, and FANCI, were confirmed. We also identified novel predictive coregulating weak genes including GIGYF2, GNPAT, RAD54L, and ELL. Many of the detected predictive gene networks and coregulating weak genes therein overlapped with OC-related biological pathways such as KEGG tight junction, ribosome, and cell cycle pathways. The detection and incorporation of the gene networks and weak genes significantly improved the prediction performance. Cellular mapping of selected feature genes using single-cell RNAseq data further revealed the heterogeneous expression distributions of the signature genes on different cell types.
Conclusions We established a transcriptomic gene network profile for OC prediction. The novel genes detected provide new targets for early diagnostics and new drug development for OC.
Introduction
Ovarian cancer (OC) is a significant gynecological malignancy characterized by its high incidence rate, poor survival, late-stage diagnosis, and limited treatment options [1,2]. It ranks as one of the most lethal cancers and is the 5th leading cause of cancer death among women [3]. It accounts for 2.1% of all cancer deaths. Ovarian cancer survival rates are much lower than those of other cancers that affect women: the 5-year survival rate is less than 50%. Survivors often face physical and psychological challenges, including long-term side effects of treatment, infertility, and anxiety about cancer recurrence. Women diagnosed before the cancer has spread have a much higher five-year survival rate than those diagnosed at a later stage. However, only ∼17% of ovarian cancer patients are diagnosed at an early stage. The majority of OC cases are diagnosed at an advanced stage, making early detection and effective treatment critical challenges [4].
Advances in transcriptomic and genomic research have provided new insights into the discovery of OC oncogenes. Dozens of OC susceptibility genes, such as BRCA1 [5–8], BRCA2 [5–7], EPCAM [9,10], TP53 [11,12], and CHD1L [13–15], have been identified over the past decades. However, the sensitivity and specificity achieved using only these genes in prognostics remain suboptimal. Efforts to use machine learning (ML) methods to identify novel gene biomarkers for predicting OC occurrence are still ongoing.
Compared to cancer occurrence, less research has been conducted on the prediction of OC survival [15,16] and clinical stages [17,18]. It is of particular scientific interest to investigate whether the same or different sets of genes contribute to OC development and progression (including survival and clinical stages). Identification of novel oncogenes for predicting OC patients' survival and clinical stages would help to improve the likelihood of survival and to enable optimal treatment or therapeutic strategies at the earliest stage possible, even after the patients have been diagnosed with OC.
Through a series of studies, Li et al. [19–23] have shown that in cancer-genomic studies, some genes, even though they have weak marginal differential expression (DE) effects, may still exert strong predictive effects on disease outcomes through regulating other strong DE genes. These weak DE genes (or weak genes), together with their coregulated strong genes and the coregulations between them, form predictive gene networks (PGNs). Detecting such PGNs and the weak genes therein and integrating them into disease outcome prediction could significantly improve the prediction accuracy. In this project, we used a novel cancer-genomics analytical tool that we recently developed, netLDA (network-based linear discriminant analysis; https://github.com/lyqglyqg/netLDA), to detect predictive gene networks and strong/weak signature genes for predicting OC occurrence, 5-year survival, and clinical stages, using both bulk and single-cell RNAseq (scRNAseq) data.
By examining the gene-gene coregulation networks and weak genes, we identified novel signature genes and significantly increased the outcome prediction accuracy. The results contribute to a better understanding of the underlying dynamic mechanisms of OC development and progression. They may shed light on the advancement of precision medicine and new gene therapy development.
Materials
Data acquisition and processing
Bulk RNAseq and clinical data of 419 OC patients from The Cancer Genome Atlas (TCGA) program and 88 non-disease controls from The Genotype-Tissue Expression (GTEx) project were combined and used as the training data for OC occurrence prediction. Bulk RNAseq data in GSE18521 for 53 OC tumor samples and 10 normal ovary tissue samples from the Gene Expression Omnibus (GEO) database were used as an independent test dataset in the case-control study. There were 11,069 mapped genes on both training and test datasets.
Bulk RNAseq and survival data from GSE26712 for 195 OC patients were downloaded from GEO and used as the training dataset in the survival prediction. The same types of data for 53 OC patients were downloaded from GSE18521 and used as a test dataset. There were 12,645 common genes mapped on both datasets. We did not use TCGA data as the training data because TCGA subjects span a wide range of OC stages, which are heavily confounded with survival. We did not find a GEO dataset that contains both survival and clinical stage outcomes. Therefore, we used two GEO datasets (both of which contained only late-stage OC patients) as the training and test datasets to alleviate the confounding effect of clinical stage.
Bulk RNAseq and clinical data from 419 TCGA OC patients and from 77 GSE63885 OC patients were used as the training and test data, respectively, in the OC clinical stage prediction. There were 17,490 mapped genes on both training and test datasets.
Single-cell RNAseq data of 22,153 cells and 47,913 transcripts from GSE229343 were used in the scRNAseq data analysis and cellular mapping for the signature genes selected in each study.
The RNAseq data went through quality control before analysis using R package edgeR [24] or Seurat [25]. For the bulk RNAseq data, genes with counts less than 10 for more than 70% of the samples were removed from analyses. For the scRNAseq data, cells with UMI numbers below 500, gene numbers below 300 or greater than 6,000, or mitochondrial-derived UMI counts of more than 15% were considered low-quality and were removed [103].
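The filtering thresholds above can be illustrated with a minimal sketch. It is not the edgeR or Seurat code used in the study; the function names, gene names, and toy values are hypothetical, and only the cut-offs (count below 10 in more than 70% of samples; UMI below 500, genes below 300 or above 6,000, mitochondrial fraction above 15%) are taken from the text.

```python
def keep_gene(counts, min_count=10, max_low_fraction=0.70):
    """Keep a bulk RNAseq gene unless its count is below min_count
    in more than max_low_fraction of the samples."""
    low = sum(1 for c in counts if c < min_count)
    return low / len(counts) <= max_low_fraction


def keep_cell(n_umi, n_genes, mito_fraction):
    """Keep a cell that passes the scRNAseq thresholds quoted in the text."""
    return n_umi >= 500 and 300 <= n_genes <= 6000 and mito_fraction <= 0.15


# Toy bulk matrix (hypothetical): gene name -> counts across 10 samples.
bulk = {
    "GENE_A": [0, 2, 5, 1, 0, 3, 8, 9, 50, 60],   # low in 8/10 samples -> removed
    "GENE_B": [0, 2, 5, 1, 0, 3, 8, 12, 50, 60],  # low in 7/10 samples -> kept
}
kept_genes = [g for g, counts in bulk.items() if keep_gene(counts)]

# Toy cells (hypothetical): (UMI count, detected genes, mitochondrial UMI fraction).
cells = {
    "cell_1": (1200, 800, 0.05),   # passes all filters
    "cell_2": (450, 350, 0.02),    # too few UMIs
    "cell_3": (9000, 6500, 0.03),  # too many detected genes
    "cell_4": (3000, 1500, 0.22),  # too much mitochondrial signal
}
kept_cells = [c for c, stats in cells.items() if keep_cell(*stats)]
print(kept_genes, kept_cells)
```

Note that a gene whose count is low in exactly 70% of the samples is retained, since the text removes genes only when the fraction exceeds 70%.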
Methods
Three studies were conducted for prediction of different OC outcomes: i) occurrence prediction of OC vs. healthy, ii) 5-year survival prediction of survival longer than 5 years vs. shorter than 5 years, and iii) severity prediction of clinical stage ≤ III vs. IV. The following methods were used in each study.
PGN and network-based weak gene selections
We used netLDA [20] for both feature selection and outcome prediction. Figure 1 depicts the major steps of netLDA. First, netLDA selects the top strong DE genes as hub genes according to their marginal DE effects. Then, for each strong DE gene, netLDA selects its coregulated gene network containing its highly correlated genes (those with a Pearson correlation coefficient $\rho$ satisfying $|\rho| > 0.8$). Next, netLDA assigns the following predictive score, or network-adjusted DE effect, to each gene in a selected coregulating network, $s_i = \sum_{j \in C_i} \Omega_i(i,j)\,(\bar{x}_{j1} - \bar{x}_{j2})$, where $i$ and $j$ are gene indices, $C_i$ is the set of genes connected to gene $i$ through a coregulation path (including gene $i$ itself), $\Omega_i$ is the precision matrix (inverse of the covariance matrix) that characterizes the coregulation information (directions and strengths) between genes in $C_i$, and $\bar{x}_{j1}$ and $\bar{x}_{j2}$ are the average expression levels of gene $j$ in outcome groups 1 and 2, respectively. The predictive score integrates, for each targeted gene, how many other genes it coregulates, the strengths and directions of those regulations, and the expression levels of its coregulated genes, as well as the expression levels of the targeted gene itself. The most predictive genes are selected according to the strengths of their predictive scores.
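The scoring step can be sketched as follows. This is an illustration under stated assumptions, not the netLDA package code: it assumes the score of a gene is the corresponding entry of the local precision matrix multiplied by the vector of group mean differences, it estimates correlations from within-group centered data, and all function names and simulated data are hypothetical.

```python
import numpy as np


def connected_component(adjacency, start):
    """Return the sorted gene indices reachable from `start`."""
    seen, stack = {start}, [start]
    while stack:
        node = stack.pop()
        for nxt in map(int, np.flatnonzero(adjacency[node])):
            if nxt not in seen:
                seen.add(nxt)
                stack.append(nxt)
    return sorted(seen)


def network_adjusted_scores(x1, x2, hub, rho=0.8):
    """Score every gene in the hub's coregulation network.

    x1, x2: samples-by-genes expression arrays for outcome groups 1 and 2.
    Assumed score: entries of Omega @ (mean1 - mean2) within the network.
    """
    # Center each group so that a mean shift does not distort correlations.
    pooled = np.vstack([x1 - x1.mean(axis=0), x2 - x2.mean(axis=0)])
    corr = np.corrcoef(pooled, rowvar=False)
    adjacency = np.abs(corr) > rho
    np.fill_diagonal(adjacency, False)
    genes = connected_component(adjacency, hub)
    cov = np.atleast_2d(np.cov(pooled[:, genes], rowvar=False))
    omega = np.linalg.inv(cov)  # local precision matrix
    delta = x1[:, genes].mean(axis=0) - x2[:, genes].mean(axis=0)
    return dict(zip(genes, omega @ delta))


rng = np.random.default_rng(0)
n = 200


def simulate(shift):
    """Three toy genes: a shifted hub, a correlated weak gene, an unrelated gene."""
    shared = rng.normal(size=n)
    hub_gene = shared + 0.3 * rng.normal(size=n) + shift  # strong marginal DE
    weak_gene = shared + 0.3 * rng.normal(size=n)         # no mean shift
    unrelated = rng.normal(size=n)
    return np.column_stack([hub_gene, weak_gene, unrelated])


x1, x2 = simulate(1.0), simulate(0.0)
marginal = x1.mean(axis=0) - x2.mean(axis=0)
scores = network_adjusted_scores(x1, x2, hub=0)
print("marginal differences:", marginal.round(2))
print("network-adjusted scores:", {g: round(float(s), 2) for g, s in scores.items()})
```

In this toy setting the weak gene has almost no marginal difference between groups, yet it receives a large network-adjusted score because it is strongly correlated with the hub gene; the unrelated gene is never pulled into the network.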
Selected predictive genes with small marginal DE effects are weak coregulating genes. For prediction, netLDA uses only the selected predictive genes and the coregulation network structures between them to predict outcomes on the test data. Figure 2 explains the calculation of the predictive scores using toy example data. We also developed permutation tests to evaluate the significance of selected individual genes and PGNs.
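The text does not spell out the permutation scheme, so the following is only a generic label-shuffling sketch of how a permutation p-value can be obtained for any gene-level statistic. The function names, the mean-difference statistic, and the toy values are assumptions made for illustration.

```python
import random
from statistics import mean


def permutation_p_value(values, labels, statistic, n_perm=2000, seed=0):
    """Share of label shuffles whose |statistic| is at least the observed one."""
    rng = random.Random(seed)
    observed = abs(statistic(values, labels))
    shuffled = list(labels)
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(shuffled)
        if abs(statistic(values, shuffled)) >= observed:
            hits += 1
    # Add-one correction keeps the estimate strictly positive.
    return (hits + 1) / (n_perm + 1)


def mean_difference(values, labels):
    """Difference in mean expression between outcome groups 1 and 2."""
    g1 = [v for v, lab in zip(values, labels) if lab == 1]
    g2 = [v for v, lab in zip(values, labels) if lab == 2]
    return mean(g1) - mean(g2)


# Hypothetical expression values for one gene in 6 + 6 samples.
values = [5.1, 4.8, 5.5, 5.0, 5.3, 4.9, 3.0, 3.2, 2.9, 3.4, 3.1, 2.8]
labels = [1] * 6 + [2] * 6
p_value = permutation_p_value(values, labels, mean_difference)
print(p_value)
```

With this estimator the smallest attainable p-value is 1 / (n_perm + 1), so resolving p-values below 10⁻⁴ requires at least 10,000 permutations.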
Prediction performance comparison with competing ML methods
We compared the prediction performance of netLDA with other commonly used cancer-genomics ML methods, including Lasso [26], Ridge [27], ElasticNet [28], XGBoost [29], and linear discriminant analysis (LDA) using only the strong genes and ignoring the coregulatory network structures between them. Prediction sensitivities, specificities, and the areas under the receiver operating characteristic (ROC) curves (AUC) were evaluated to assess the prediction performance.
Kaplan-Meier analysis
Kaplan-Meier (KM) analysis is a commonly used biomarker validation approach in cancer genomics studies. It compares the survival (KM) curves between the high- and low-expression groups of a DE gene [30]. Here we generated KM curves for the long- and short-term survival groups predicted using the selected strong and weak genes and their PGN structures. We compared these KM curves with those generated using only the expression levels of the top strong genes.
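For reference, the three evaluation metrics can be computed as in the sketch below. The labels, scores, and the 0.5 decision threshold are hypothetical toy values; AUC is computed through its rank interpretation, the probability that a randomly chosen case receives a higher score than a randomly chosen control.

```python
def sensitivity_specificity(y_true, y_pred):
    """Return (sensitivity, specificity) for binary labels coded 1 = case, 0 = control."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    return tp / (tp + fn), tn / (tn + fp)


def auc(y_true, scores):
    """Area under the ROC curve; tied case/control scores count as one half."""
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Hypothetical test-set labels and classifier scores.
y_true = [1, 1, 1, 1, 0, 0, 0, 0]
scores = [0.9, 0.8, 0.6, 0.4, 0.7, 0.3, 0.2, 0.1]
y_pred = [1 if s >= 0.5 else 0 for s in scores]

sens, spec = sensitivity_specificity(y_true, y_pred)
print(sens, spec, auc(y_true, scores))  # 0.75 0.75 0.875
```

An AUC of 0.5 corresponds to scores that carry no information about the outcome, which is the "random guess" baseline used when comparing methods.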
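The survival curve underlying this comparison is the product-limit estimator, sketched below for a single group. The function name and the toy follow-up data are hypothetical; in the setting described above it would be applied separately to each predicted (or expression-defined) group before comparing the curves.

```python
def kaplan_meier(times, events):
    """Return [(time, survival probability)] at each distinct event time.

    times: follow-up times; events: 1 if death was observed, 0 if censored.
    """
    curve, surv = [], 1.0
    at_risk = len(times)
    for t in sorted(set(times)):
        deaths = sum(1 for ti, e in zip(times, events) if ti == t and e == 1)
        leaving = sum(1 for ti in times if ti == t)  # deaths plus censored at t
        if deaths:
            surv *= 1 - deaths / at_risk
            curve.append((t, surv))
        at_risk -= leaving
    return curve


# Hypothetical follow-up times (years) and event indicators for one group.
times = [1, 2, 2, 3, 4, 5]
events = [1, 1, 0, 1, 0, 1]
curve = kaplan_meier(times, events)
for t, s in curve:
    print(t, round(s, 3))
```

Censored patients lower the number at risk without producing a drop in the curve, which is why the step at year 3 is computed out of 3 patients rather than 4.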
Gene set enrichment analysis
To validate our identified genes and PGNs from a biological perspective, we conducted gene set enrichment analysis (GSEA) [31], a knowledge-based approach for interpreting transcriptome profiles, using Gene Ontology (GO) [32] and Kyoto Encyclopedia of Genes and Genomes (KEGG) [33] pathways. The selected strong/weak genes and PGNs were mapped to the top enriched KEGG pathways to confirm their oncological functions.
Cellular mapping for the selected genes
To reveal the cellular expression heterogeneity of the selected signature genes, we also performed a scRNAseq analysis for cell type profiling and cellular mapping of the selected genes.
Results
Predictive gene network and network-based gene selections
Top selected genes in the three studies are listed in Table 1. Marginal expression patterns for the selected genes are depicted in Figure 3. Top selected PGNs harboring the selected genes are listed in Table 2. Topological structures (illustrating the connection topologies) and connection matrices (illustrating the connection strengths) of the PGNs, along with marginal and network-adjusted DE effects of the genes within, are depicted in Figure 4. Most of the selected strong genes have both a significant marginal p-value (from marginal tests) and a significant permutation test p-value (< 10⁻⁴, from a network-based test), whereas the majority of the selected weak genes have only a significant permutation test p-value. This demonstrates that integrating the coregulation between genes helps to promote the significance of weak genes in their empirical distributions. For many of the top selected genes, we found literature evidence supporting their associations with OC (last column in Table 1). Most of the selected predictive gene networks have significant permutation test p-values (< 0.05). Full lists of the selected genes and PGNs are given in the Supplemental Materials.
We confirmed previously reported strong genes. SMPDL3B (marginal p-value = 8.8×10⁻¹⁸², network-adjusted permutation p-value < 10⁻⁴) and SLC34A2 (marginal p-value = 1.9×10⁻¹⁹⁰, network-adjusted permutation p-value < 10⁻⁴) were confirmed in the OC occurrence study, TRAFD1 (marginal p-value = 0.0013, network-adjusted permutation p-value < 10⁻⁴) and CHD1L (marginal p-value = 0.0021, network-adjusted permutation p-value < 10⁻⁴) were confirmed in the 5-year survival study, and FANCI (marginal p-value = 0.0016, network-adjusted permutation p-value < 10⁻⁴) was confirmed in the clinical stage study. Expression of SMPDL3B was found to be related to specific aptamers for ovarian tumors, such as AptaC2 and AptaC4, through molecular docking [34]. SLC34A2 overexpression was reported to be related to the development and progression of OC, brain cancer, and pancreatic cancer [35]. TRAFD1 suppression was observed in ovarian, colon, brain, and renal cancers [36]. CHD1L overexpression was reported to augment ovarian carcinoma metastasis [15]. FANCI has recently been identified as a new ovarian cancer predisposing gene [37,38]. We also discovered some strong genes that have not previously been reported to be associated with OC, for example, ILDR1 (marginal p-value = 2.1×10⁻²⁰⁴, network-adjusted permutation p-value < 10⁻⁴) in the OC occurrence study, and TRAPPC14 (marginal p-value = 5.6×10⁻⁴, network-adjusted permutation p-value = 2×10⁻⁴), RRP1 (marginal p-value = 0.0019, network-adjusted permutation p-value = 6×10⁻⁴), and ZSWIM8 (marginal p-value = 0.0045, network-adjusted permutation p-value = 6×10⁻⁴) in the 5-year survival study.
We also identified weak genes coregulated with the strong genes, such as PPP1CA (marginal p-value = 3.3×10⁻⁹, network-adjusted permutation p-value < 10⁻⁴) and HMGA1 (marginal p-value = 1.0×10⁻⁵⁷, network-adjusted permutation p-value < 10⁻⁴) in the occurrence study, RPS8 (marginal p-value = 0.044, network-adjusted permutation p-value = 0.016), RPL28 (marginal p-value = 0.36, network-adjusted permutation p-value = 0.024), and RPL31 (marginal p-value = 0.042, network-adjusted permutation p-value = 0.18) in the 5-year survival study, and MAPKAPK5 (marginal p-value = 0.0017, network-adjusted permutation p-value < 10⁻⁴) and BYSL (marginal p-value = 0.042, network-adjusted permutation p-value < 10⁻⁴) in the clinical stage study. PPP1CA is a catalytic subunit gene and plays an essential role in the growth of cancer cells [39]. HMGA1 plays a crucial role in the self-renewal and drug resistance of ovarian cancer stem cells [40]. Ribosomal genes, including RPS8, RPL28, and RPL31, have recently been identified as a novel therapeutic target against high-grade OC [41]. The long noncoding RNA MAPKAPK5-AS1 promotes cancer cell proliferation by cis-regulating the nearby gene MK5 [42]. BYSL expression was reported to be elevated and to promote tumor cell growth [43].
Several novel OC-associated weak genes that have not been reported in the literature before were identified in our study, such as GIGYF2 (marginal p-value = 0.35, network-adjusted permutation p-value < 10⁻⁴), GNPAT (marginal p-value = 0.21, network-adjusted permutation p-value < 10⁻⁴), and RAD54L (marginal p-value = 0.066, network-adjusted permutation p-value < 10⁻⁴) in the clinical stage study.
Table 2 lists the top detected PGNs from each of the three studies. Many of these PGNs overlap with the top enriched KEGG and/or GO pathways (also see GSEA results). Genes in a KEGG/GO pathway are biologically validated to be related to systems biology or oncology. Links between genes in a KEGG/GO pathway are lab-confirmed molecular interactions, reactions, and regulations. Overlap between our selected PGNs and KEGG/GO pathways can therefore serve as biological evidence for our findings. In the OC occurrence study, one of the two netLDA-detected gene networks contains the genes CLDN7 (weak), CLDN4 (weak), and CLDN3 (strong), which are also in the tight junction pathway (enrichment p-value = 1.86×10⁻³), the leukocyte transendothelial migration pathway (enrichment p-value = 1.76×10⁻³), and the cell adhesion molecules (CAMs) pathway (enrichment p-value = 2.68×10⁻³). Weak genes TJP3 and CRB3, in the same network, are also in the tight junction pathway. The other predictive gene network selected in the OC occurrence study contains two weak genes, PTTG1 and CDC20, that overlap with the cell cycle pathway (enrichment p-value = 1.85×10⁻⁵). In the survival study, one detected predictive gene network largely overlaps with the ribosome pathway (enrichment p-value = 1.97×10⁻³⁰). Twenty-eight out of thirty-one genes in the network are in the ribosome pathway, which accounts for 31.8% of the 88 leading genes in the ribosome pathway. In the clinical stage study, multiple detected gene networks overlap with the KEGG cell cycle pathway (enrichment p-value = 4.40×10⁻⁴), KEGG pathways in cancer (enrichment p-value = 8.12×10⁻³), the GO DNA replication pathway (enrichment p-value = 6.91×10⁻⁹), the GO DNA recombination pathway (enrichment p-value = 2.09×10⁻⁸), and the GO chromosome segregation pathway (enrichment p-value = 1.49×10⁻⁷).
Genes overlapping with KEGG cell cycle pathway include MCM2 (strong), CREBBP (weak), ABL1 (weak), PLK1 (weak), MCM4 (weak), MCM6 (weak), BUB1 (weak), CDC20 (weak), CCNB1 (weak), ORC3 (weak), SMAD2 (weak), GSK3B (weak), and PCNA (weak). Genes overlapping with KEGG pathways in cancer include PTGS2 (strong), KRAS (strong), WNT6 (strong), ABL1 (weak), MTOR (weak), SMAD2 (weak), STK4 (weak), CREBBP (weak), MSH3 (weak), RXRB (weak), and GSK3B (weak). Genes overlapping with GO DNA replication pathway include PRIM1 (strong), MCM6 (weak), DDX23 (weak), and WDHD1 (weak). Genes overlapping with GO DNA recombination pathway include MCM2 (strong), MCM4 (weak), HMCES (weak), and RUVBL1 (weak). Genes overlapping with GO chromosome segregation pathway include BUB1 (weak), PRC1 (weak), KIF2C (weak), CDC20 (weak), PLK1 (weak), and RMI2 (weak).
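To give a feel for how unlikely such network-pathway overlaps are by chance, the ribosome overlap reported above (28 of 31 network genes among 88 pathway genes) can be plugged into a simple hypergeometric over-representation test. This is a different and cruder technique than the GSEA used in the study, so it is not expected to reproduce the reported enrichment p-values. Using the 12,645 common genes of the survival datasets as the background is an assumption, and the function name is hypothetical.

```python
from math import comb


def hypergeom_upper_tail(n_background, n_pathway, n_network, n_overlap):
    """P(overlap >= n_overlap) when n_network genes are drawn at random
    from n_background genes, of which n_pathway belong to the pathway."""
    total = comb(n_background, n_network)
    upper = min(n_pathway, n_network)
    tail = sum(
        comb(n_pathway, k) * comb(n_background - n_pathway, n_network - k)
        for k in range(n_overlap, upper + 1)
    )
    return tail / total


# Numbers quoted in the text; the background size is an assumption.
n_background, n_pathway, n_network, n_overlap = 12645, 88, 31, 28

coverage = 100 * n_overlap / n_pathway  # share of pathway genes in the network
p_overlap = hypergeom_upper_tail(n_background, n_pathway, n_network, n_overlap)
print(round(coverage, 1), p_overlap)
```

Exact integer arithmetic with `math.comb` avoids the floating-point underflow that can occur when such small tail probabilities are built up from individual probability terms.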
Prediction performance comparison with competing ML methods
ROC curves in each study are depicted in Figure 5. In the occurrence study, all methods gave almost perfect prediction results, with an area under the ROC curve (AUC) equal to 1, as all top genes (strong and weak) have much more significant differentiating effects than the top genes in the survival and clinical stage studies. In the survival study, netLDA gave an AUC of 0.91, higher than LDA using only the strong genes and Lasso/Ridge/ElasticNet (0.85-0.87). XGBoost gave a comparable AUC of 0.90. In the clinical stage study, netLDA also gave the highest AUC of 0.65, XGBoost gave an AUC of 0.61, and Lasso/Ridge/ElasticNet and LDA using only the strong genes gave AUCs around 0.5, similar to a random guess.
KM analysis
Figure 6 shows the Kaplan-Meier curves and log-rank test results in the 5-year survival study. Figure 6 (a) shows the KM curves and log-rank test between the two netLDA-predicted groups using the selected strong/weak genes and the PGN structures. Figure 6 (b-d) shows the KM curves and log-rank tests between the high- and low-expression groups (above and below the median expression value) of the top three selected strong genes. The two KM curves were more separated, and the log-rank test p-value was more significant, between the netLDA-predicted groups than between the expression-level groups from a single strong gene, demonstrating the effect of weak genes and PGNs in improving the classification results.
GSEA
Figure 7 shows KEGG and GO pathway enrichment analysis results. The left panels are examples of the top enriched KEGG pathways for each of the three studies. The tight junction pathway was enriched in the OC occurrence study (enrichment p-value = 1.86×10⁻³), the ribosome pathway was enriched in the 5-year survival study (enrichment p-value = 1.97×10⁻³⁰), and the cell cycle pathway was enriched in the clinical stage study (enrichment p-value = 4.40×10⁻⁴). Many of the weak genes (highlighted in yellow) overlap with these top enriched pathways, confirming that the weak genes play a biological role in the development and progression of OC. A complete list of enriched KEGG pathways is given in the appendix. The right panels in Figure 7 list the top enriched GO pathways for each study. Top GO pathways enriched in the occurrence study include cell-cell junction organization (enrichment p-value = 4.26×10⁻¹⁰), tight junction assembly pathway (enrichment p-value = 3.55×10⁻⁹), and epidermis development pathway (enrichment p-value = 7.00×10⁻⁹). Top GO pathways enriched in the survival study include SRP-dependent cotranslational protein targeting to membrane (enrichment p-value = 2.54×10⁻⁵⁵) and nuclear-transcribed mRNA catabolic process (enrichment p-value = 2.16×10⁻⁵³). Top GO pathways enriched in the clinical stage study include DNA replication (enrichment p-value = 6.91×10⁻⁹) and recombinational repair (enrichment p-value = 1.59×10⁻⁷). Table 3 lists the top enriched KEGG pathways. Lists of top enriched GO pathways are provided in the Supplemental Materials.
Cellular mapping for the selected genes using scRNAseq data
Figure 8 shows the cellular distribution of the GSE229343 scRNAseq data and the expression maps of the selected feature genes. In Figure 8 (a), Seurat was first used to identify 28 cell subtype clusters using resolution = 0.2. For the OC occurrence study, many genes are expressed on epithelial cells (including strong genes: CLDN3, SLC34A2, SMIM22, FOLR1, and weak genes: CLDN7, MAL2, SPINT1, PRSS8, EHF, ELF3, KRT8, SLPI, KRT18, EPCAM, VAMP8, KRT7, CLDN4, WFDC2, CD24, MSLN); on fore/mid/hindgut epithelial cells (including strong genes: CLDN3, SMIM22, and weak genes: CLDN7, MAL2, SPINT1, PRSS8, ELF3, KRT8, SLPI, KRT18, EPCAM, KRT7, CLDN4); on cycling neural program/mesenchymal stem cells (including weak genes: UBE2C, CDC20, PTTG1, UBE2T); on airway/retinal epi/ciliated cells (including strong genes: CLDN3, SMIM22, FOLR1, and weak genes: ELF3, KRT8, UCP2, SLPI, EPCAM, KRT7, CLDN4, WFDC2, CD24); on myeloid/T cells (including weak genes: UCP2, VAMP8); and on immature neuron cells (including weak gene CD24). For the clinical stage study, many genes are overexpressed on cycling neural program/mesenchymal stem cells (including weak genes: CDC20, UBE2T, NUSAP1, CCNB1, PLK1, ASPM, PRC1, KIF20A); on immature neuron cells (including strong gene TUBB2B); and on myeloid cells (including weak gene CD163). For the survival study, as very few genes were mapped to the scRNAseq GSE229343 gene set, we did not observe a particular cellular expression pattern for the selected genes.
Summary and Discussion
We confirmed previously identified prognostic genes such as EPCAM, UBE2C, CHD1L, TP53, CD24 [53], WFDC2 [53], and FANCI associated with OC occurrence, survival, or clinical stage. We identified novel susceptibility strong genes, including ILDR1 in the occurrence study and TRAPPC14, RRP1, and ZSWIM8 in the survival study, as well as novel coregulating weak genes, including GIGYF2, GNPAT, RAD54L, and ELL in the clinical stage study. Our identified gene networks overlapped with the KEGG tight junction, leukocyte transendothelial migration, and cell cycle pathways in the occurrence study; with the ribosome pathway in the survival study; and with the cell cycle pathway and pathways in cancer in the clinical stage study. We found that many identified genes are particularly expressed on epithelial cells in the occurrence study and on cycling neural program/mesenchymal stem cells in the clinical stage study. By incorporating gene network structures and weak genes, netLDA improved the prediction performance compared to other ML methods such as Lasso, Ridge, ElasticNet, XGBoost, and LDA with only strong genes.
A major contribution of this work is the identification of prognostic oncogenes, especially weak genes in OC-related pathways. The CLDN genes (CLDN7 and CLDN4, detected in the OC occurrence study) in the tight junction pathway and the cell adhesion molecules (CAMs) pathway, which function as one of the protective barriers in epithelial and endothelial cells, were also observed to be overexpressed on fore/mid/hindgut epithelial cells in the single-cell analysis. The weak genes detected in the survival study, mainly ribosomal genes such as RPS8 and RPL28, overlap with the ribosome pathway, which is known to promote protein homeostasis in cancers by fine-tuning protein synthesis and preventing toxic protein aggregation [102]. The CCN gene family (CCNB1, detected in the clinical stage study) and CDK genes (CDK2, within the detected network indexed by PRIM1 in the clinical stage study, see Table 2, even though it is not in the final selected weak gene set) are coregulating genes in the cell cycle pathway. These regulating weak genes had not been reported in the literature, as they are difficult to detect on their own.
Although netLDA incorporates gene coregulation network structures into the calculation of gene differential expression, it assumes the same network structures (topology, coregulation direction and strength, etc.) across outcome groups. In real applications, the network structures can differ between groups, in which case a quadratic discriminant analysis might be a better approach. However, accurate and robust inference of group-specific network structures requires large sample sizes per group. Moreover, gene coregulations are usually more stable and robust than individual gene expression levels. That is, even though individual gene expression levels can vary substantially between disease groups, cell types, and environments, the gene-gene coregulation remains rather stable across these heterogeneous situations. Such robustness in gene networks is critical for assembling a dynamic biological system. Essentially, when the assumption of a common network structure is violated, the network structures inferred by netLDA are the average over the different groups.
Most of the weak genes selected in the OC occurrence study also have strong marginal DE effects. In that sense, they may also be considered strong genes. The prediction performance is dominated by the top strong genes; therefore, the prediction performance (Figure 5) did not differ much between netLDA, LDA using only strong genes, and the other competing methods. The predictive effects of weak genes were manifested in the survival and clinical stage studies, where the marginal DE effects of genes are much weaker than those in the occurrence study (Figure 3 and Figure 5). In the clinical stage study in particular, as the strong genes explained only a small portion of the outcome variation in the training data, netLDA dug deeper, selecting more weak genes residing in more predictive networks than in the occurrence and survival studies, in order to accumulate sufficient information to optimize the prediction accuracy on the test data.
The selected gene sets from the occurrence, survival, and stage studies are largely non-overlapping. This is mainly because the input gene sets are different for the three studies. It may also indicate that OC development and progression have different underlying molecular mechanisms.
Since gene expression in bulk RNAseq data is averaged over different types of cells, the DE effects of some genes might be washed out in the averaging. For example, a gene that is significantly differentially expressed only in a particular cell type but not in other cell types might exhibit only a weak marginal DE effect. A gene differentially expressed in two cell types but in opposite directions may show no marginal DE effect due to signal cancellation. netLDA is well suited to detecting such genes. An optimal validation approach would be to confirm the DE effects of the selected genes using scRNAseq data from both cohorts. Our feature mapping of the selected genes using scRNAseq data helped identify the cell types on which the genes are particularly expressed. We recommend that investigators conduct DE analyses using scRNAseq data on the identified cell types to further validate the findings.
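The signal-cancellation argument can be made concrete with a small numerical example. All values are hypothetical, and the sketch assumes equal numbers of cells per cell type, so that the bulk measurement is a plain average over all cells.

```python
from statistics import mean

# Hypothetical per-cell expression of one gene in two cell types.
control = {"epithelial": [2.0, 2.2, 1.8, 2.0], "myeloid": [6.0, 6.1, 5.9, 6.0]}
tumor = {"epithelial": [4.0, 4.1, 3.9, 4.0], "myeloid": [4.0, 4.2, 3.8, 4.0]}


def bulk(sample):
    """Bulk-like expression: average over all cells, ignoring cell type."""
    return mean(v for cells in sample.values() for v in cells)


# Cell-type-specific shifts: up in epithelial cells, down in myeloid cells.
per_type_shift = {ct: mean(tumor[ct]) - mean(control[ct]) for ct in control}
# The bulk shift averages the two opposite effects away.
bulk_shift = bulk(tumor) - bulk(control)

print(per_type_shift, bulk_shift)
```

Here the gene changes by about two units in each cell type, yet the bulk difference is essentially zero, so a purely marginal bulk DE test would miss it.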
Data Availability
All data produced are available online at: The Cancer Genome Atlas (TCGA) (https://xenabrowser.net/); The Genotype-Tissue Expression (GTEx) (https://www.gtexportal.org/); Gene Expression Omnibus (GEO) (https://www.ncbi.nlm.nih.gov/geo/)
Abbreviations
- DE: Differentially expressed
- GEO: Gene Expression Omnibus
- GO: Gene Ontology
- GSEA: Gene set enrichment analysis
- GTEx: The Genotype-Tissue Expression project
- KEGG: Kyoto Encyclopedia of Genes and Genomes
- KM: Kaplan-Meier
- LDA: Linear discriminant analysis
- OC: Ovarian cancer
- PGN: Predictive gene network
- RNAseq: RNA sequencing
- scRNAseq: Single-cell RNA sequencing
- TCGA: The Cancer Genome Atlas