Abstract
Understanding who is at risk of progression to severe COVID-19 is key to effective treatment. We studied correlates of disease severity in the COMET-ICE clinical trial that randomized 1:1 to placebo or to sotrovimab, a monoclonal antibody for the treatment of SARS-CoV-2 infection. Several laboratory parameters identified study participants at greater risk of severe disease, including a high neutrophil-lymphocyte ratio (NLR), a negative SARS-CoV-2 serologic test and whole blood transcriptome profiles. Sotrovimab treatment in these groups was associated with normalization of NLR and the transcriptomic profile, and with a decrease of viral RNA in nasopharyngeal samples. Transcriptomics provided the most sensitive detection of participants who would go on to be hospitalized or die. To facilitate timely measurement, we identified a 10-gene signature with similar predictive accuracy. In summary, we identified markers of risk for disease progression and demonstrated that normalization of these parameters occurs with antibody treatment of established infection.
Introduction
Sotrovimab is a human monoclonal antibody (mAb) derived from an antibody isolated from a person recovered from SARS-CoV infection. This mAb broadly neutralizes SARS-CoV-2, SARS-CoV and other related animal sarbecoviruses1-3. Sotrovimab targets a highly conserved epitope in the SARS-CoV-2 Spike protein located in a site outside the receptor-binding motif (RBM), thus retaining in vitro activity against current SARS-CoV-2 variants of concern (VOC) (Alpha, Beta, Gamma, Delta, Omicron) and variants of interest (VOI)4,5. The majority of mAbs developed for COVID-19 bind to the RBM, which engages the angiotensin-converting enzyme 2 (ACE2) receptor. The RBM is one of the most mutable and immunogenic regions of the virus6. As a result, RBM antibodies as single agents and even when used in combination have not retained activity against some VOC/VOI7-10, particularly against Omicron VOC 5,11,12.
Sotrovimab was recently tested in a multicenter, double-blind, phase 3 clinical trial (COMET-ICE, ClinicalTrials.gov NCT04545060) that recruited non-hospitalized participants with symptomatic COVID-19, and at least one known risk factor (age and/or comorbidities) for severe disease progression. Participants were randomized to a single intravenous infusion of sotrovimab 500 mg or placebo. In the interim analysis of the trial, sotrovimab significantly reduced the risk of all-cause hospitalization (>24 hours) or death from COVID-191. The final data, now presented in preprint2, shows that among 1057 participants randomized (sotrovimab, 528; placebo, 529), all-cause hospitalization longer than 24 hours or death was significantly reduced with sotrovimab (6/528 [1%]) vs placebo (30/529 [6%]) by 79% (95% CI, 50% to 91%; p<0.001).
While the impact of sotrovimab was profound, the relatively low rate of hospitalization or death amongst participants considered at risk for poor disease outcomes in the placebo arm led us to investigate if additional biomarkers or biomarker profiles beyond the known demographic and comorbid conditions could be identified. The setting of a randomized, controlled clinical trial presented a unique opportunity to identify signals of disease progression that resolved in response to treatment and could thus be used both to provide insights into COVID-19 pathogenesis, and as potential surrogate endpoints in the design of future trials. Thus, the present study aimed at identifying baseline correlates of hospitalization and severe disease/death in an at-risk population based on routine laboratory parameters, SARS-CoV-2 serology and on transcriptome analysis. It then sought to assess the impact of antibody treatment on these parameters. This approach enabled assessment of the impact of antibody treatment on populations with different intrinsic risks of disease progression, and the identification and testing of surrogates of treatment response.
Results
Identification of study participants at high risk of progression of COVID-19 using clinical laboratory values
COMET-ICE2 included 1057 adults with a positive local polymerase-chain-reaction or antigen SARS-CoV-2 test result and onset of symptoms within the prior 5 days (Table 1). We analyzed 63 available central laboratory parameters for their association to hospitalization or death (Suppl. File 1). On Day 1, pre-dose, white blood cell proportions were most predictive of eventual hospitalization or death (Table 2). White blood cell proportions were also quantified by the neutrophil to lymphocyte ratio (NLR; area under a receiver operating characteristic curve, AUC=0.81). We found that an NLR greater or equal 6 provided an optimal cutoff for the highest enrichment for disease progression and is hereafter defined as ‘high NLR’ (Suppl. Fig. S1). Of the 36 hospitalizations or deaths that occurred in the COMET-ICE study, 29 of these were observed among the 903 participants with available NLR at Day 1, the day of dosing. NLR had a sensitivity of 45% [28%-63%] and a specificity of 95% [93%-96%] for the prediction of all-cause hospitalization or death (Fisher’s exact p<0.001). NLR normalized more rapidly in participants receiving sotrovimab (Figure 1).
The role of SARS-CoV-2 serostatus in defining risk of progression of COVID-19
Having SARS-CoV-2 anti-nucleocapsid antibodies provides protection against SARS-CoV-2 re-infection13. There is more limited information, however, on how serostatus may associate with severity of disease during acute infection. In the current study, seropositivity at baseline may indicate prior infection by SARS-CoV-2 or that a participant is already seroconverting during an acute infection episode. Seropositivity rates varied significantly by race/ethnicity, Table 3. Seropositivity was also associated with lower viral RNA in nasopharyngeal swabs at baseline: mean 4.2 vs 6.4 log10 viral RNA in seropositive versus seronegative participants (Mann-Whitney U p=1e-36); Table 3. Of the participants who received placebo and the serology results were available, 4/97 (4%) of seropositive participants progressed to hospitalization and/or death compared with 25/375 (7%) of the seronegative who progressed to hospitalization and/or death before Day 29. In participants who received placebo, there were no deaths or ICU admissions in those who were seropositive at baseline compared to 4 deaths (2 deaths before Day 29 and 2 additional deaths that occurred after Day 29) and 9 ICU admissions (2.4%) in those who were seronegative at baseline.
Of the 202 seropositive participants at baseline, 6 (3%) were hospitalized or died: 4/96 (4.2%) received placebo and 2/106 (1.9%) received sotrovimab. Notably, as the COMET-ICE study captured all-cause hospitalizations or death, all 4 seropositive participants in the placebo arm were hospitalized with COVID-19 diagnoses, while the two seropositive participants in the sotrovimab arm were hospitalized with events potentially unrelated to COVID-19 (one instance of diabetic foot, and one instance of non-small cell lung cancer). No sotrovimab-treated participants died or were admitted to the ICU.
Identification of high-risk cluster using transcriptomics
We used whole blood transcriptomics to define additional laboratory-based predictors of disease progression and response to treatment. In theory, such transcriptome signatures could provide complementary insight into the biology of risk and recovery. The sub-study included samples collected prior to treatment on Day 1 and at Day 8 from 303 patients. Among these 303 patients, 6/152 (3.9%) participants were hospitalized in the placebo group and 2/151 (1.3%) participants were hospitalized in the sotrovimab group.
We visualized the transcriptomes of each patient using Uniform Manifold Approximation and Projection (UMAP). We noted that from Day 1 to Day 8, the distribution of all transcriptome profiles tended to shift towards higher values of UMAP component 2 (Figure 2a). We defined a putative risk cluster based on the differences in the distributions of Day 1 and Day 8 samples in the UMAP (see Methods). The described risk cluster includes Day 1 and Day 8 transcriptomics profiles for 6 of 8 hospitalized participants (Figure 2b). Participants in the high-risk cluster were significantly older, white, with a higher NLR, and higher viral RNA levels in nasopharyngeal samples (Suppl. Table S1). The cluster analysis also highlighted that baseline seropositive participants were less likely to be associated with the high-risk transcriptome cluster on Day 1 and no seropositive patient remained in the high-risk cluster by Day 8 (Figure 2c).
The two hospitalized participants mis-identified by the baseline transcriptome analysis were in the sotrovimab arm. One of the two participants had undetectable viral RNA in nasopharyngeal swabs at enrollment, and through 8 days post-enrollment when blood was drawn for the transcriptome analysis. This patient was then hospitalized by Day 21 with elevated viral load, suggestive of a unique clinical course. The second misidentified patient treated with sotrovimab was hospitalized due to a small intestinal obstruction deemed unrelated to COVID-19. Therefore, we found support for the hypothesis that the area outlined in red in Figure 2 corresponded to an UMAP-defined high-risk cluster for COVID-19 progression where protective responses had failed to engage appropriately between Day 1 and Day 8. Although statistical power was limited due to only 8 hospitalizations in the transcriptomic sub-study, the transcriptome high risk group demonstrated a strong association to all-cause hospitalization and death (Fisher’s exact p=0.004) with a sensitivity of 75% [41%-94%] and a specificity of 76% [71%-80%].
Response to treatment identified by transcriptomics
Given the effect of sotrovimab demonstrated in COMET-ICE, we determined whether treatment altered the probability of remaining in the transcriptome-defined high-risk cluster. To perform this analysis, we compared the rate of exiting the high-risk cluster for participants receiving sotrovimab versus placebo (Figure 2d). Among those who were high risk on Day 1, on Day 8, 29% of placebo-treated (n=11, including the hospitalized participants) versus 10% of sotrovimab-treated participants (n=4) remained high risk as defined by the transcriptome analysis. This corresponds to a 2.8-fold lower prevalence of risk-correlated transcriptional signatures for sotrovimab relative to placebo (Fisher’s exact p=0.045). Receipt of sotrovimab was also associated with a more rapid decline in viral RNA in nasopharyngeal samples by Day 8 (Figure 3).
Examining the biology of the transcriptome-defined high-risk cluster
For both Day 1 and Day 8 visits, we scored genes for differential expression between high versus low-risk clusters. We found a widespread transcriptional shift with thousands of genes identified as differentially expressed after adjusting for multiple comparisons (Figure 4a, Suppl. File 2). We characterized differentially expressed genes via gene-set enrichment analysis using the MSigDB Hallmark Gene Set annotation14. The most enriched Hallmark Gene Sets were associated with innate immune responses, in particular the complement system set, inflammatory response set, as well as the interferon alpha and gamma response gene expression modules (Figure 4b). Overexpression of genes in these pathways agrees well with previous work showing strong associations between increased innate immune system activation and disease severity15. In summary, whole transcriptome analysis is consistent with a significant inflammatory response and identifies participants on Day 1 that have a high risk of disease progression, a finding that is further supported by the lack of normalization of the high-risk transcriptome profile in participants who were subsequently hospitalized.
Identifying a set of genes whose expression captures the risk-defining elements of the overall transcriptome
Having established a transcriptomic profile associated with risk of COVID-19 progression, recovery, and treatment response, we next determined whether a smaller number of mRNAs that might practically be measured by RT-PCR captured the predictive power of the overall transcriptome. Such an approach is preferable due to lower cost and greatly reduced turnaround time relative to whole transcriptome sequencing. To select a gene panel, we clustered genes into 10 groups according to their co-expression patterns across participants (Suppl. Fig. S2). This was accomplished using UMAP and K-means clustering. We then selected the top gene from each group as candidate for identification of a risk-predictive set of 10 genes. The 10 gene panel (CD38, DAB2, EFHC2, EIF2D, EIF4B, MYO18A, NUDT3, OAS2, RPL10, TADA3) accurately recapitulated the whole transcriptome risk clusters at both Day 1 (AUC=0.96) and Day 8 (AUC=0.99; Figure 5a). The expression of each gene in the panel is shown in Figure 5b. Expression of the 10-gene-panel was highly associated to viral load and hospitalization, and strongly affected by sotrovimab (Suppl. Fig. S3). On the transcriptomic subset, the 10-gene risk stratification had a sensitivity of 75% [41%-94%] and a specificity of 76% [72%-82%] for the prediction of all-cause hospitalization and death (Fisher’s exact p-value=0.003). The sensitivity increased to 83% and specificity to 80% when the analysis was limited to the placebo arm, reflecting the real case scenario where there is no modification of the outcome by sotrovimab. A comparison of this performance across risk predictors is presented in Supp Table S2.
Discussion
We defined clinical laboratory and molecular biomarkers that can potentially identify participants with mild to moderate COVID-19 who are at highest risk of progression to severe disease and hospitalization or death using data collected in the prospective, phase 3 pivotal study COMET-ICE. Baseline NLR and a 10-gene transcriptomic signature associated with all-cause hospitalization or death with respective sensitivity and specificity of 33% and 93% (NLR) and of 87% and 64% (10-gene panel) on the transcriptomics sub-cohort. Changes in these biomarkers were also associated with response to treatment with the monoclonal antibody sotrovimab.
Currently, the risk of developing severe COVID-19 has been associated with a number of demographic factors such as age and specific comorbidities. However, there is considerable heterogeneity in disease outcome that would benefit from additional stratification of risk. NLR, the simple ratio of neutrophil over lymphocyte counts could be informative and easy to implement, an observation supported by other studies16-18 Though NLR sensitivity is low, the high specificity suggests that high NLR could be used as a triage test for persons at high risk of progressing and could prioritize those individuals for closer monitoring. Serostatus, defined here as IgG antibody response to nucleocapsid, is also a predictor of disease severity. None of the participants that were seropositive at baseline, whether because of previous infection or because of ongoing seroconversion19 died or were admitted to the ICU. Seropositive participants had lower levels of viral RNA in nasopharyngeal samples and were less likely to present or maintain a risk transcriptome profile. One caveat of this analysis is that VOCs that have decreased sensitivity to antibodies induced by vaccination (e.g., Delta and Omicron) were not circulating at the time of enrollment in the study. Therefore, additional analysis may be needed to confirm the protective effect of seropositivity in the current phase of the pandemic.
Whole blood transcriptome analysis revealed a signature of disease severity that encompassed overexpression of genes involved in interferon response, inflammation and the complement system. We showed that a full transcriptome signature can be captured faithfully with a 10-gene panel. Use of a simple expression signature lowers the bar for an eventual implementation, as recently shown by work to make a three gene tuberculosis signature using point-of-care rapid testing20.
An important effort of the present work was to define whether the set of predictive parameters of hospitalization and disease severity was also modified by treatment with sotrovimab; i.e., whether these parameters could serve as surrogate markers of sotrovimab response because they are modified by treatment and strongly associated to the study clinical endpoints of interest. Indeed, sotrovimab accelerated the normalization of NLR and the transcriptome profiles in a statistically significant manner. In particular, hospitalized participants in the placebo group retained the transcriptome profile associated with risk by Day 8 at a time when the majority of study participants normalized their peripheral blood gene expression profiles.
One topic of debate in the field is the value of measuring the levels of viral RNA in nasopharyngeal samples. In the present study, viral RNA levels measured by RT-PCR were of modest value as a baseline predictor. However, there was an association between viral RNA levels and the predictors of risk that we explored: NLR, serology and transcriptome profiles. Participants at risk of severe disease and hospitalization present higher levels of baseline nasopharyngeal viral RNA21. Weinreich et al.22 reported that mAb therapy had a significant effect on participants with a high viral load at baseline. Chen et al.23 reported a decreased viral load at day 11 did not appear to be a clinically meaningful end point, since the viral load was substantially reduced from baseline for the majority of patients, including those in the placebo group, a finding that is consistent with the natural course of the disease. Gottlieb et al.24 reported that treatment with mAb combination therapy, but not monotherapy, resulted in a reduction in SARS-CoV-2 log viral load at day 11 in participants with mild to moderate COVID-19. In the present work, resolution of disease was associated with decrease of viral load, in particular for participants receiving sotrovimab. However, taken together, these data indicate that viral load decline in upper airways is not a strong surrogate for clinical efficacy.
The strengths of the current study reside on the well-characterized, geographically diverse prospective pivotal clinical trial participant cohort with whole blood/RNA collected at multiple time points to evaluate response to treatment and disease characteristics over time. This dataset enabled the implementation of machine learning methods for predicting disease severity from both clinical and transcriptomic markers. A limitation of the study is that it included a pre-defined ‘risk population” of adults based on demographic and comorbid factors. A second limitation is the low number of study endpoints in the transcriptomic sub-study. However, if validated in additional studies, this approach could expand the definition of risk to include participants that might otherwise not be considered for treatment based on risk criteria currently in use25. In conclusion, this study identifies laboratory parameters associated with COVID-19 disease progression and hospitalization and shows that sotrovimab treatment effectively contributes to normalization of these parameters.
Online Methods
Characteristics of clinical trial population
COMET-ICE2 included 1057 adults with a positive local polymerase-chain-reaction or antigen SARS-CoV-2 test result and onset of symptoms within the prior 5 days (Table 1). Recruitment was between August 2020 and March 2021. Screening occurred within 24 hours prior to drug administration. Participants were required to be at risk for COVID-19 progression based on previously identified clinical parameters: age ≥55 years or adults with at least one of the following comorbid conditions: diabetes requiring medication, obesity (body-mass index >30 kg/m2), chronic kidney disease (estimated glomerular filtration rate <60 mL/min/1.73 m2), congestive heart failure (New York Heart Association class II or higher), chronic obstructive pulmonary disease, or moderate to severe asthma. Participants with already severe COVID-19, defined by shortness of breath at rest, oxygen saturation less than 94%, or requiring supplemental oxygen, were excluded. Participants were randomized 1:1 to receive either a single 500-mg infusion of sotrovimab or equal volume saline placebo. A subset of participants (n=303) consented for peripheral whole blood transcriptome analysis. Participants who opted-in to the transcriptome sub-study had similar demographic, clinical and laboratory characteristics to those in the overall study. They were evenly divided between placebo and sotrovimab arms (Table 1). In-person study visits occurred on days 1, 5, 8, 11, 15, 22 (W3), and 29 (W4) to assess adverse events and worsening of COVID-19. During study visits, blood samples and nasopharygeal swabs were collected for routine laboratory assessments and viral load, respectively. Samples for transcriptome analysis were collected twice: at the time of treatment (referred to as Day 1 herein) and a week later at the Day 8 visit. There was no analysis plan for this work in COMET-ICE; this was post-hoc analysis of the trial data.
Clinical data analysis
The associations between laboratory values, and treatment response and hospitalization were measured using the area under a receiver operating characteristic curve (AUC). For single variable analyses, this metric was computed by directly ranking participants with no model fitting step, to avoid overfitting. Significance of AUC was assessed by the Mann-Whitney U test, relying on the equivalence between the Mann-Whitney U statistic and AUC. All reported AUCs were significant after a Bonferroni adjustment for multiple comparisons. The significance threshold was calculated as 0.05 divided by the number of clinical variables tested. For binary variables such as baseline risk factors, significance was assessed by Fisher’s exact test. Assessing complementarity of features was complicated by varying missingness patterns, leading to sample size loss. This was minimized by looking at only pairs of variables, and by median imputation of missing values26. Neither approach significantly improved on the single most predictive variable, indicating that single variable predictors are sufficient for accurate risk prediction.
SARS-CoV-2 viral load and serology
Nasopharyngeal swabs were collected in universal transport media, and viral load was measured using the CDC 2019-nCoV Real-Time RT-PCR method run at central lab (https://www.fda.gov/media/134922/download). Serum samples were analyzed for anti-SARS-CoV-2 antibody by the Abbott SARS-CoV-2 IgG assay run on the Architect i2000SR immunoassay analyzer (https://www.fda.gov/media/137383/download). This assay qualitatively measures IgG anti-SARS-CoV-2 antibodies against the nucleocapsid protein. Due to the potential for cross-reaction of sotrovimab with anti-spike antibody assays, only analysis of anti-nucleocapsid serostatus was conducted.
RNA isolation and sequencing
Peripheral whole blood was collected into PAXgene Blood RNA tubes (PreAnalytiX), identified by a sample accession number, and stored according to manufacturer recommendations. Day 1 and Day 8 samples for the same patient were sent for processing in the same batch. RNA purification, library preparation and sequencing were performed by Q2 Solutions – EA Genomics (Morrisville, NC). Total RNA was isolated (>1.25 ug required) and was depleted of globin mRNA using the GLOBINclear kit (Invitrogen). RNA quantity and quality was assessed using an Agilent Bioanalyzer (RIN score > 7.0 required). The globin-depleted RNA was used to generate a sequencing library using the TruSeq stranded mRNA method (Illumina). Briefly, poly(T) oligonucleotides are used to select poly-adenylated RNAs from the total RNA after globin mRNA reduction which are then fragmented and converted to cDNA using random primers in two steps to maintain strand specific information. Libraries were sequenced on Illumina NovaSeq 6000 (multiple sequencing runs, multiple instruments) to a target sequencing depth of 25 million paired-end reads per sample at a minimum read length of 50 bp. Samples were automatically selected by Q2 Solutions for repeat library preparation and sequencing based on pre-defined quality control metrics for ribosomal RNA fraction (>10% rRNA aligned reads threshold for repeat).
RNA-seq analysis
701 sequenced libraries from 638 whole blood samples were delivered from Q2 solutions by processing batch and sequencing run. Reads were cropped to a common read length of 50 bp and low-quality bases and adapters were further trimmed using Trimmomatic (v. 0.39)27. Trimmed reads less than 31bp were discarded. Trimmed sequenced reads per library were then aligned to a custom reference genome using STAR (v 2.7.3a)28 and to a custom reference transcriptome using Salmon (v. 1.0.0)29. The custom reference genome and transcriptome annotation was based on combining the human reference genome and annotation from Gencode (GRCh38, release 30) with the SARS-CoV-2 reference genome and annotation from Ensembl (ASM985889v3 version). Libraries were assessed for total reads (minimum, maximum, and median of 23.4, 94.4, and 33.2 million read pairs), average read length (49 bp), and adapter content, post-trimming with FASTQC (v. 0.11.8). Alignment metrics, such as total aligned reads, aligned reads by feature type, gene body coverage and 3’ bias, were assessed using Picard CollectRnaSeqMetrics (v 2.20.2—0). We profiled duplication rate versus reads per kbp and verified that low read counts were not associated with high duplication at the library level using DupRadar (v 1.12.1). Known junction saturation rate as a function of sequencing depth was also profiled for all libraries using RSeQC (v 3.0.1). Transcript-level counts from Salmon were converted to gene-level counts and gene-level transcripts-per-million (TPM) using the R package tximport (v. 1.20.0). As genes with consistently low supporting read counts across libraries are unlikely to be called differentially expressed, a filtering step to remove genes with few to no supporting read counts across libraries was performed30. Conservatively, only genes with a minimum of 10 read counts in at least 4% (n=24) of the libraries were considered for further analysis (n = 23,540 genes). When multiple libraries (due to repeated library preparation and sequencing) were available for the same sample accession number (whole blood sample), a representative library with the higher median TPM value was selected as libraries with outlier values for alignment quality metrics were associated with (low) outlier median TPM values. Only libraries representing matched Day 1 and Day 8 whole blood samples for a patient were included for downstream analysis. No other library exclusion criteria were applied.
Data analysis transcriptome
Using the R package DESeq2 (v 1.32.0)31, variance-stabilizing transformation was applied to gene-level counts.30 For exploration of transcriptome signatures, UMAP was run on the variance-stabilizing transformed RNA-seq count data. Prior to UMAP projection, data were pre-conditioned and de-noised using principal component a analysis (PCA). The first 20 PCs were selected based on the point at which explained variance tended towards zero (Suppl. Fig. S4). For this baseline analysis, UMAP (from the umap-learn python package; https://umap-learn.readthedocs.io/en/latest/) was run with default parameters.
To test the robustness of the embedding, this analysis was repeated for the full transcriptome (without PCA) for pathogen-associated transfer genes identified by di Iulio et al.32, and for immune-related pathway gene sets (Hallmark Gene Sets annotation).14 In each case, a variety of nearest neighbor values were tried, and the embedding was run multiple times to ensure repeatability. In all cases, the embeddings were similar. For example, the observed gradient between Day 1 and Day 8 samples, as well as the relative placement of the hospitalized was always consistent.
High risk due to laboratory parameters vs low risk categorizations were derived as follows. Two-dimensional kernel density estimation with a bandwidth of 1 was applied to Day 1 and Day 8 UMAP values separately. High risk participants were defined as those within an area where the Day 1 density exceeded the Day 8 density by 0.005. This cutoff was derived by choosing a round positive number near the beginning of the tail of the distribution (Suppl. Fig. S5). As a validation of this approach, we also performed a line search on this cutoff to optimize the separation between Day 1 and Day 8, as measured by Fisher’s exact p-value. This yielded an optimal cutoff of 0.006. To be conservative, we did not use this optimized value since the selection of cutoffs for the line search could be influenced by information beyond Day 1 vs Day 8 status.
Differentially expressed genes associated with the putative high-risk cluster were scored using a model accounting for gender and visit day (DESeq2, v. 1.32.0)31. Differentially expressed genes were characterized via gene set enrichment analysis (fgsea, v 1.18.0)33 using the Hallmark Gene Sets annotation (msigdbr, v. 7.4.1)14. For selection of a gene signature, we conducted diversity-based selection according to top ANOVA F-scores within 10 empirically identified gene clusters. Gene clusters were derived by performing UMAP dimensionality reduction on the transpose of the transcriptome matrix (with genes as rows instead of patients). This created a similarity map of genes based on their co-expression patterns across patients. We then defined 10 gene clusters from this map using K-means clustering (Suppl. Fig. S2). From each cluster, we selected the gene most associated with risk according to its ANOVA F-score. Diversity-based selection using gene clustering significantly improved on greedy selection based on F-score alone (Suppl. Fig. S6a) and yielded comparable performance to transfer-learned32 and hallmark gene sets (Suppl. Fig. S6b). To assess performance of this set of 10 genes representing co-expression clusters, we repeated this entire process within a five-fold cross-validation loop, including gene clustering. In this procedure the dataset is partitioned into five folds. For each of the five folds, we trained a model on the other four chunks to predict its values in an unbiased manner. To avoid overfitting due to patient-specific attributes, samples from the same patient on different days were always kept in the same fold.
Data Availability
Data availability is subject to the clinical trial charter
Disclosures
This study was sponsored by Vir Biotechnology, Inc., in collaboration with GlaxoSmithKline. M.C.M., L.B.S., J.diI., S.L., M.J.S., A.A., K.M., D.S., M.A., E.A., L.P., X.D., E.M., W.Y., D.C., H.W.V., P.S.P. and A.T. are employees of Vir Biotechnology Inc. and may hold shares in Vir Biotechnology Inc. D.A. and A.P are employees of and hold shares in GlaxoSmithKline.