Abstract
Defining the human factors associated with severe vs mild SARS-CoV-2 infection has become of increasing interest. Mining large numbers of public gene expression datasets is an effective way to identify genes that contribute to a given phenotype. Combining RNA-sequencing data with the associated clinical metadata describing disease severity can enable earlier identification of patients who are at higher risk of developing severe COVID-19 disease. We consequently identified 358 public RNA-seq human transcriptome samples from the Gene Expression Omnibus database that had disease severity metadata. We then subjected these samples to a robust RNA-seq data processing workflow to quantify gene expression in each patient. This process involved using Salmon to map the reads to the reference transcriptomes, edgeR to calculate significant differential expression levels, and gene ontology enrichment using Camera. We then applied a machine learning algorithm to the read counts data to identify features that best differentiated samples based on COVID-19 severity phenotype. Ultimately, we produced a ranked list of genes based on their Gini importance values that includes GIMAP7 and S1PR2, which are associated with immunity and inflammation (respectively). Our results show that these two genes can potentially predict people with severe COVID-19 at up to ∼90% accuracy. We expect that our findings can help contribute to the development of improved prognostics for severe COVID-19.
Introduction
Human infections with severe Acute Respiratory Syndrome Coronavirus 2 (SARS-CoV-2) has resulted in hundreds of millions of confirmed cases and millions of deaths globally. In addition, countless others have been hospitalized and a subset of the infected population has experienced severe health consequences--particularly those who are elderly, immunocompromised, or have other underlying conditions. The genetic material for this pathogen consists of a monopartite positive-sense single-stranded RNA molecule that is approximately 30 kb in length that contains multiple open reading frames [1]. Since the virus was first detected in late 2019, the scientific community has performed multiple studies to better understand the underlying mechanism(s) of entry and pathogenesis [2–6].
Human pathogenesis studies performed early in the SARS-CoV-2 pandemic showed that the virus induces the interferon response and interleukin-6, as well as other cytokines and chemokines that contribute to COVID-19 [7–9]. Interestingly, multiple studies have shown that the majority of infections are either mild or asymptomatic [10–13]. The large diversity in the human response to infection, combined with large numbers of infections, contributed to strained hospital capacity [14–16]. Various factors contribute to these observed differences in disease severity, and the demand for robust biomarkers associated with COVID-19 disease severity has continually grown. Some studies have identified associations between acute infection and the host response [17,18]. Other studies have evaluated associations between disease severity and aspects of the adaptive immune system [19–23] and quantified viral RNA [24–29]. A recent study has used neural networks to predict patient survival outcomes with high accuracy [30], which can be useful when whole transcriptome data are available. However, to our knowledge, a meta-analysis on transcriptional biomarkers associated with mild versus severe infection has not been previously reported.
The aim of the current study is to perform a meta-analysis of existing human transcriptomics data from collected blood samples to predict transcriptional prognostic markers. Since presence of the virus is possible with existing diagnostics, such markers of disease severity could then contribute to making informed decisions concerning the care of infected patients who seek treatment at the hospital.
Methods
Identification of Relevant Datasets
An established process that combines automated and manual methods was used to identify and analyze the samples from published SARS-CoV-2 human transcriptomics studies with metadata specifying the severity of infection in May, 2021. Samples from patients having either “mild” or “asymptomatic” infections were manually labeled by one reviewer as “mild”, while those recorded as “hospitalized”, “ICU”, or “death” were labeled as “severe”. Records (GSE152418 (PBMC), GSE157103 (whole blood), GSE166424 (whole blood)) in the Gene Expression Omnibus (GEO) database [31–33], within the National Center for Biotechnology Information (NCBI), were queried and manually reviewed for relevance (Figure 1). Specifically, studies were selected based on the presence of three predefined criteria: 1) the host organism was human, 2) the data were generated as part of a RNA-sequencing experiment, and 3) the study included samples collected during acute SARS-CoV-2 infection together with disease severity metadata. In total, we identified and processed 358 relevant samples across three independent RNA-sequencing studies.
Data Pre-Processing
The fastq files containing the RNA-sequencing data were obtained from the Sequence Read Archive (SRA) at NCBI [34] using the sratools software. These files were divided into “case” and “control” categories based on whether they had “severe” or “mild” disease. The Automated Reproducible MOdular Workflow for Preprocessing and Differential Analysis of RNA-seq Data (ARMOR) was then used to preprocess and analyze the RNA-seq data [35]. Briefly, this ARMOR workflow uses the python-based snakemake workflow language [36] to perform steps including: trimming of sequencing adapters and low-quality regions from the originally-generated RNA-sequencing reads with TrimGalore (https://www.bioinformatics.babraham.ac.uk/projects/trim_galore/), calculate quality control metrics with FastQC (www.bioinformatics.babraham.ac.uk/projects/fastqc/), as well as map and quantify reads to the human GRCh38 transcriptome with Salmon [37]. Differential gene expression was calculated from the read counts with edgeR [38]. Gene ontology terms were calculated from the list of significant gene identifiers produced by edgeR by using the DAVID resource [39].
Machine Learning
The salmon counts for each gene in each GEO sample identifier were compiled into a table. The counts for each sample were then normalized using a z-score transformation. These samples into test and training sets, with 70% of samples assigned to a training set and the remaining 30% of samples assigned to a test set. A random forest classification method (using the R randomforest package) was then trained and used to identify the genes that were most useful in classifying samples as coming from patients with either “severe” or “mild” COVID-19 disease. This random forest algorithm calculated the Gini impurity values for each feature, which were then sorted to rank the importance of genes based on the original training set. This process was repeated for subsets of the best-scoring features to quantify their accuracy, specificity, and sensitivity.
Results
Transcriptomics Meta-Analysis Identifies Significant Genes, Enriched Terms, and Signaling Pathways
We began by identifying publicly available RNA-sequencing data generated from either whole blood samples or peripheral blood mononuclear cells (PBMCs) that had been previously collected from patients infected with SARS-CoV-2 and had associated disease severity metadata. We then assigned these samples to either “high severity” or “low severity”. We processed these RNA-seq files using an automated computational workflow that performed quality control, trimmed reads, mapped them to the human transcriptome, and calculated significant differentially-expressed genes. We then used these genes to identify enriched Gene Ontology (GO) terms.
Overall, we identified 8176 significant differentially expressed genes after applying a multiple hypothesis correction with log2 fold-change values ranging from -4.2 to 3.78 (Figure 2). We found that the most significant differentially expressed genes included ASPH, C5orf30, DGKH, SLC26A6. We then subjected this list of significant DEGs to Gene Ontology enrichment. This analysis produced 90 significant GO terms including immune response, apoptosis, and I-kappaB kinase/NF-kappaB signaling (Table 1). It is possible that a subset of these results may contain bias due to the heterogeneity of the samples. However, we expect that the robust statistical analysis of these samples minimized such effects.
We then used the signaling pathway impact analysis algorithm to calculate which intracellular signaling pathways were best represented by the list of DEGs. This analysis identified nine pathways that were significantly affected by severe COVID-19 (Table 2). We observed that five of these significant pathways dealt directly with T-cell receptor (TCR) signaling, while a sixth described a Zap70 immunological synapse. Interestingly, all six of these immune-related pathways were predicted to be inhibited during severe COVID-19.
Machine Learning Identifies GIMAP7 and S1PR2 as Robust Biomarkers
Prior to predicting relevant biomarkers, we first wanted to confirm whether the intracellular transcriptional response in blood was strongly associated with the disease severity metadata that was recorded with each record. We consequently constructed a table with all transcripts from each gene represented as columns and the read mapping data represented as rows. We then used this table to generate a receiver-operator characteristic (ROC) curve (Figure 3). In this case, the area under the curve (AUC) represents the percent specificity and sensitivity for the host transcriptomic data to predict disease severity. We calculated the AUC of the curve across all transcripts to be 96.6%, which indicates that the host transcriptional response strongly contributes to disease severity.
We then subjected the same tabular data to a machine learning algorithm to enable us to predict the features (e.g. expressed genes) that were most associated with the disease severity phenotype (Table 3). We ranked our random forest output by descending order of the Mean Decrease in Gini Impurity value, which is a measure of entropy. Transcripts from genes with larger Gini Impurity values represented those that could be used to most accurately predict the recorded disease phenotype.
We next wanted to reduce the selected markers to the smallest number that would provide the best performance. To do so, we generated a ROC curve for the six expressed genes having the highest Gini Impurity values and calculated its AUC to be 94.3%. We repeated this process for only the top two expressed genes (GIMAP7 and S1PR2) and only the top expressed gene (GIMAP7). This analysis quantified the AUC to be 89.8% for the two combined genes and 84.4% for only the top gene. Specifically, the mean and median read counts for each of these two genes were approximately three times higher in the samples with low disease severity than in the samples with high disease severity.
Discussion
Previous work has shown that combining multiple datasets in meta-analyses can augment the signal-to-noise ratio in order to gain new biological and mechanistic insight(s) [9,40–43]. As such, the goal of our study was to predict prognostic markers of SARS-CoV-2 disease severity through a transcriptomics meta-analysis of human blood samples. We found statistically significant differentially expressed genes, enriched Gene Ontology terms, modulated signaling pathways, and a ranked list of biomarkers that could potentially be combined to predict which patients are at risk of severe disease.
At least one prior study has generated single-cell RNA-sequencing data to better understand the host response to SARS-CoV-2 infection [44]. Their findings showed adaptive immune components play a role in disease severity. Interestingly, our data confirm results from some prior experiments that show certain aspects of the T-cell response may be downregulated during SARS-CoV-2 infection [45–47]. A modified distribution of the ZAP70 kinase on the plasma membrane of T-cells contributes to the signal transduction and amplification of the TCR [48]. Interestingly, CD3/ZAP70 protein has been shown to interact with TREM-2 in the T-cells of patients infected with SARS-CoV-2 [49]. The ASPH and C5orf30 genes, which were identified as significant DEGs in our study, have been identified previously as markers of severe infection [50].
These prior studies support our finding that GIMAP7 could be important in severity given its presence on the surface of T-lymphocytes [51]. In the time since GIMAP7 was initially identified [52], it has subsequently been shown to be a potential marker for various cancer types, which further supports its immune-related role [53–56]. Although the underlying mechanism for this protein is yet to be characterized, we are not surprised by its role in other immune-related diseases or in SARS-CoV-2 disease severity. Information about the S1PR2 gene is more sparse, although it appears to have a role in inflammation and other immune-related functions [57–61]. Specifically, S1PR2 on T-cells recruits lymphocytes to damaged tissues and may contribute to the recirculation of cells in the adaptive immune system to the lymphatic system [62].
The GIMAP7 gene, together with a subset of the enriched GO terms, had been found in an earlier study on transcriptional biomarkers for SARS-CoV-2 [63]; however, it was not highly ranked. This is logical since DEGs are often identified as biomarkers. In addition, as the sample size increases among a more diverse patient population, we expect that the genes most capable of differentiating disease severity should become more accurate.
It is important to note that these samples were taken while the early variants were circulating. As such, further testing would be required to confirm whether these biomarkers are still consistent predictors of infection severity in patients infected with more recent variants. We anticipate that qRT-PCR (quantitative Reverse Transcriptase Polymerase Chain reaction), and possibly flow cytometry could be used to quantify these biomarkers in relevant samples. Additional experiments are needed to confirm whether these findings can be replicated in samples across different age and/or risk groups.
These results present a potential predictor for the severity levels of patients infected with SARS-CoV-2. Considering the diversity of reactions that patients have to the virus, incorporating these biomarkers as additional data points to assess patient risk of severe disease could be pivotal in augmenting both personal and public health decision making processes [64–66]. This is especially relevant when resources may be limited and priority must be given to those with the greatest risk of severe infection. These results in particular are useful because they illustrate both up and down regulation association that best differentiate each patient in a population, rather than just identifying genes with statistically significant changes across the populations being compared. This approach allows us to detect the directionality of biomarkers such as GIMAP7 and S1PR2 that were not as highly ranked by edgeR.
In conclusion, the findings from this study could contribute to ongoing efforts relating to triaging patients when hospitals are approaching their capacity limit. We envision that creating an assay to quantify the presence of a subset of the features identified in this study could be useful in identifying patients who are at higher risk of developing severe disease. The results of such a prognostic assay could contribute to triage efforts and to treatment decisions being made in the clinic.
Data Availability
Publicly available RNA-seq fastq files were obtained from the NCBI Gene Expression Omnibus. The relevant studies include: GSE152418, GSE157103, and GSE166424.
https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE152418
https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE157103
https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE166424
Acknowledgements
We are grateful to the BYU Office of Research Computing for providing the computational resources needed to complete this study. We also thank the original clinicials, patients, and others who provided the RNA-sequencing data.