Abstract
End-stage kidney disease (ESKD) patients are at high risk of severe COVID-19. We performed dense serial blood sampling in hospitalised and non-hospitalised ESKD patients with COVID-19 (n=256 samples from 55 patients) and used Olink immunoassays to measure 436 circulating proteins. Comparison to 51 non-infected ESKD patients revealed 221 proteins differentially expressed in COVID-19, of which 69.7% replicated in an independent cohort of 46 COVID-19 patients. 203 proteins were associated with clinical severity scores, including IL6, markers of monocyte recruitment (e.g. CCL2, CCL7), neutrophil activation (e.g proteinase-3) and epithelial injury (e.g. KRT19). Random Forests machine learning identified predictors of current or future severity such as KRT19, PARP1, PADI2, CCL7, and IL1RL1 (ST2). Survival analysis with joint models revealed 69 predictors of death including IL22RA1, CCL28, and the neutrophil-derived chemotaxin AZU1 (Azurocidin). Finally, longitudinal modelling with linear mixed models uncovered 32 proteins that display different temporal profiles in severe versus non-severe disease, including integrins and adhesion molecules. Our findings point to aberrant innate immune activation and leucocyte-endothelial interactions as central to the pathology of severe COVID-19. The data from this unique cohort of high-risk individuals provide a valuable resource for identifying drug targets in COVID-19.
Main
Coronavirus disease (COVID-19), caused by the SARS-CoV-2 virus, displays marked clinical heterogeneity from asymptomatic to fatal disease. Patients with severe disease exhibit marked inflammatory responses and immunopathology. The mechanisms underlying this remain incompletely characterised and the key molecular mediators are yet to be determined. To date, the only treatment shown to reduce mortality in randomised trials is dexamethasone [1], a corticosteroid which has broad non-specific effects on the immune system. Even with corticosteroid treatment, mortality in severe COVID-19 remains significant. There is a wide armamentarium of existing drugs that target inflammation more selectively, providing potential repurposing opportunities for the treatment of COVID-19. In order to select the most promising agents for trials, we urgently need to better understand the molecular drivers of severe disease. Proteins are the effector molecules of biology and the targets of most drugs. Therefore, proteomic profiling to identify the key mediators of severe disease provides a valuable tool for identifying and prioritising potential drug targets [2].
Risk factors for severe or fatal COVID-19 include age, male sex, non-European ancestry, obesity, diabetes mellitus, cardiovascular disease, and immunosuppression [3]. End-stage kidney disease (ESKD) is one of the strongest risk factors for severe COVID-19 (estimated hazard ratio for death 3.69) [3], and ESKD patients hospitalised with COVID-19 have a mortality of approximately 30% [4–7]. ESKD patients have a high prevalence of vascular and cardiometabolic disease (e.g. hypertension, ischaemic heart disease, diabetes), either as a result of the underlying cause of their renal disease and/or as a consequence of renal failure. In addition, ESKD results in both relative immunosuppression and chronic low-grade inflammation, which may impact viral defence and the host inflammatory response.
Here we performed proteomic profiling of serial blood samples of ESKD patients with COVID-19, leveraging the unique opportunity for longitudinal sampling in both the outpatient and inpatient settings afforded by a large multi-ethnic haemodialysis cohort (Fig. 1a). These data revealed 221 proteins that are dysregulated in COVID-19 versus matched non-infected ESKD patients. Using linear mixed models, joint models and machine learning, we identified proteins that are markers of COVID-19 severity and risk of death. Finally, we characterised the temporal dynamics of the blood proteomic response during COVID-19 infection in ESKD patients, uncovering 32 proteins that display altered trajectories in patients with severe versus non-severe disease.
Results
We recruited 55 ESKD patients with COVID-19 (Supplementary Table 1). All patients were receiving haemodialysis prior to acquiring COVID-19. Blood samples were taken as soon as feasible following COVID-19 diagnosis. At time of initial sample, 30 patients were outpatients attending haemodialysis sessions and 25 were hospitalised inpatients (Methods, Fig. 1). Following the initial blood sample, serial sampling was performed for 51/55 patients. We also recruited 51 non-infected haemodialysis patients as ESKD controls, mirroring the age, sex and ethnicity distribution of the COVID-19 cases (Extended Fig. 1a-c). We used the Olink proteomics platform to measure 436 proteins (Supplementary Table 2) in 256 plasma samples from the COVID-19 patients and the 51 control samples. The proteins measured consisted of 5 multiplex ‘panels’ focussed on proteins relevant to immuno-inflammation, cardiovascular and cardiometabolic disease. In addition, we assayed 52 serum samples from an independent cohort of 46 COVID-19 positive ESKD patients (‘validation cohort’)(Methods).
The Olink platform demonstrated highly consistent measurements in two technical replicates (Pearson’s correlation of 0.999). Principal components analysis (PCA) demonstrated differences between COVID-19 positive cases and controls, although the two groups did not separate into discrete clusters (Extended Fig. 2). Categorising samples by either the WHO severity score at time of sampling (Extended Fig. 2a) or the patient’s overall clinical course (measured by peak WHO severity score) (Extended Fig. 2b), demonstrated a gradient of COVID-19 severity, best captured by principal components 1 and 3.
Proteomic changes triggered by COVID-19
To examine the effects of COVID-19 on the plasma proteome, we performed a differential expression analysis between COVID-19 cases (n=256 samples passing quality control from 55 patients) and non-infected ESKD controls (n=51) using linear mixed models, which account for serial samples from the same individual (Methods). This revealed 221 proteins associated with COVID-19 (5% false discovery rate, FDR); the vast majority were upregulated, with only 40 downregulated (Fig. 2a, Supplementary Table 3). In order to provide a succinct and standardised nomenclature, we report proteins by the symbols of their encoding genes (see Supplementary Table 2 for a mapping of symbols to full protein names). The most strongly upregulated proteins (in terms of fold change) were DDX58, CCL7, IL6, CXCL11, KRT19 and CXCL10, and the most strongly downregulated were SERPINA5, CCL16, FABP2, PON3, ITGA11 and MMP12 (Extended Fig. 3). Notably, many of the upregulated proteins were chemotaxins. We compared our findings to those of 3 published studies [8–10] and a preprint (Supplementary Table 4). 30 of our differentially expressed proteins had been identified in the published papers, and 99 in the preprint. In total 116 had been previously reported.
In the smaller validation cohort (n=52 serum samples from 46 patients with COVID-19; Methods), we found 201 proteins that were differentially abundant between cases and controls (5% FDR) (Supplementary Table 3). PCA revealed clearer separation of infected and non-infected patients than in the primary cohort, likely reflecting the higher proportion of hospitalised patients (41 of 46 patients) in the validation cohort (Extended Fig. 4). Of the 221 differentially abundant proteins from the primary cohort, 150 (69.7%) were also identified in the validation cohort at 5% FDR (Extended Fig. 5a). Effect sizes in each dataset showed a strong correlation (r=0.804, Extended Fig. 5b). This demonstrates that our findings are highly reproducible despite differences in sample sizes and blood materials (plasma versus serum in the primary versus validation cohorts, respectively).
Proteomic markers of COVID-19 severity and predictors of death
To determine the proteomic effects of COVID-19 severity, we tested for associations between proteins and clinical severity at the time of sampling (defined by WHO severity scores, graded as mild, moderate, severe or critical), using linear mixed models with severity encoded as an ordinal predictor (Methods). This analysis revealed 203 proteins associated with severity (Fig. 2b, Supplementary Table 5). The majority of these were upregulated in more severe disease, with only 42 down-regulated. Consistent with previous reports, we found that severe COVID-19 was characterised by elevated IL6. In addition, we observed a signature of upregulated monocyte chemokines (e.g. CCL2, CCL7, CXCL10), neutrophil activation and degranulation (e.g. PRTN3, MPO) and epithelial injury (e.g. KRT19, AREG, PSIP1, GRN). (Fig. 2b,c, Fig. 3). SERPINA5 and leptin showed the greatest downregulation as COVID-19 severity increased (Fig. 2b,c). 140/203 (69%) of the severity-associated proteins were also identified as differentially abundant in COVID-19 positive versus negative analysis (Extended Fig. 6a). Most proteins upregulated in COVID-19 versus non-infected patients were also associated with more severe disease (Extended Fig. 6b). However, there were some notable exceptions (eg CCL20, IL17C, OSM) that were strongly associated with severity that not differentially expressed in infected versus non-infected patients. Comparisons to previously reported associations are reported in (Supplementary Table 6).
Molecular changes may predate clinical deterioration. Using machine learning approaches, we tested if the proteomic signature of the first blood sample for each patient in our dataset could identify whether the patient either had clinically severe COVID-19 at the time of sampling or would develop severe disease in the future. Whereas differential expression analyses consider each protein marker separately, machine learning techniques allow examination of all proteins concurrently, thus capturing non-linear relationships in the dataset. Using Random Forests, we trained a classifier on the first sample for each COVID-19 patient to predict the overall clinical course, defined by peak WHO severity. For the purposes of this analysis, we binarised clinical course into either WHO mild/moderate or severe/critical. The Random Forests method achieved 71.3% accuracy in predicting peak severity. Since the features selected by the classifier can highlight proteins of biological importance, we interrogated the model to identify key proteins by calculating feature importance metrics (Methods, Supplementary Table 7). The most important proteins in indicating the presence of current or future severe disease were KRT19, PARP1, PADI2, CCL7, IL1RL1 (ST2) and CTSL (Procathepsin L) (Fig. 4a).
9/55 patients in the primary cohort died. We therefore sought to identify proteins associated with risk of death. To utilise the dynamic nature of repeated protein measurements for prediction of death, we utilised joint models, which combine linear mixed models and Cox proportional hazards models [11,12] (Methods). This analysis identified 44 proteins for which increased concentration was associated with increased risk of death (Fig. 4b, Supplementary Table 8), including CST3, IL22RA1, AZU1, CCL28 and SPON1, and 25 proteins for which increased concentration was associated with reduced risk of death, including CD84, TNFSF12, TANK, PRKCQ and ADM.
Associations with clinical laboratory tests
A number of routine clinical laboratory tests have well characterised associations with COVID-19 (e.g. elevated inflammatory markers, d-dimer and reduced lymphocyte count) [13]. We therefore compared our proteomic data at each timepoint to contemporaneous clinical laboratory measurements using linear mixed models (Methods). We found associations between plasma proteins and all clinical measurements except troponin (Extended Fig. 7, Supplementary Table 9). Many of these proteins were also markers of severity (e.g. IL6, KRT19, IFN-gamma and CXCL10 were strongly associated with raised CRP and ferritin and reduced lymphocyte counts). Of note CCL7, a monocyte chemokine which was also identified as an important marker of severity by the Random Forests classifier, was associated with lower monocyte count and raised inflammatory markers. Elevated neutrophil count was associated with Oncostatin-M, which regulates IL6, GCSF and GMCSF production, and with the proteases MMP9 and defensin.
Longitudinal analysis reveals proteins with distinct temporal profiles according to severity
The immune response to infection is dynamic, and therefore snapshot measurements provide only partial insights. Leveraging the dense serial sampling in our dataset (Fig. 1), we modelled the temporal trajectory of each protein and asked whether or not any protein trajectories differed in patients with a severe/critical versus mild/moderate overall clinical course. This was achieved using linear mixed models with a time x severity interaction term (Methods). Using this approach, we identified 32 proteins for which there was significant interaction between time and severity, i.e. differential temporal trajectories between mild/moderate and severe/critical infections (Supplementary Table 10, Fig. 5). Among the proteins with the strongest differences according to clinical course were the integrins ITGA11 and ITGB6, the adhesion molecule ICAM1, TNFRSF10B (a receptor for TRAIL) and PLAUR, the receptor for urokinase plasminogen activator. Most of these proteins exhibited rising profiles in the more severe patients but flat profiles in milder cases. ACE2, the receptor for SARS-CoV-2, and S100A12 also displayed this pattern (Fig. 5). In contrast, abundance of ITGA11, which was also identified as reduced in the analysis of infected versus non-infected patients, fell over time in the severe group.
Discussion
Our longitudinal proteomic analysis of ESKD patients with COVID-19 has shown that patients with a trajectory towards severe clinical deterioration exhibit dynamic proteomic changes consistent with activation of the endothelium associated with integrin-mediated cell adhesion abnormalities. The analysis of this unique population of high-risk patients demonstrates that in addition of the well-known IL-6 mediated pro-inflammatory responses, severe disease is associated with dysregulation of epithelial tissue repair pathways (e.g. KRT19, PSIP1, AREG, GRN (progranulin), ITGA11), most likely a reflection of the lung and vascular damage. Our study thus puts emphasis on the importance of studying and targeting mechanisms that reduce the endothelial damage to both alleviate the severity of the infection and to reduce the chance of long-lasting complications.
ESKD is one of the strongest risk factors for death from COVID-19. In part, this likely reflects the fact that ESKD patients are enriched for cardiometabolic traits that predispose to severe COVID-19. However, in multivariable analyses adjusting for these factors, ESKD remains an independent risk factor for severe COVID-19 [3]. Moreover, there is an inverse relationship between renal function and risk of death from COVID-19 across the spectrum of chronic kidney disease. These observations support the notion that the state of ESKD per se is an important determinant of outcome in COVID-19. ESKD is well recognised as an immunosuppressed state [14–16], with defects in both innate and adaptive immunity [17–20]. Accordingly, ESKD confers increased vulnerability to viral infections including influenza and respiratory syncytial virus [21–24]. In addition, ESKD results in a chronic low-grade inflammatory state [25]. This tendency to a pro-inflammatory state, combined with reduced ability to respond to viruses, may contribute to the abnormal host response to SARS-CoV-2 infection, producing the immunopathology that leads to severe COVID-19.
Analysis of ESKD patients with or without COVID-19 revealed many proteins associated with both the presence of COVID-19 and with disease severity. 140 of these associations were common to both analyses. Members of the CCL and CXCL chemokine family featured prominently, for example CCL2 and CCL7, which are both chemokines for monocytes. Interesting these chemokines were negatively correlated with monocyte counts, suggesting recruitment of these innate cells into damaged tissues or ‘exhaustion’. Consistent with the latter, flow cytometric analyses in COVID-19 have shown evidence of phenotypic changes in the circulating monocytes such as expression of cell proliferation markers that occurs when there is premature release of monocytes from the marrow due to an emergency myelopoesis [26]. In agreement with previous studies [10,27], CXCL10, a chemokine secreted in response to IFN-gamma, was elevated in COVID-19 patients and correlated with severity. DDX58 and IL6 were also among the most strongly upregulated proteins in COVID-19 positive patients and correlated with disease severity, demonstrating the presence of similar immune responses to the ones seen in non-ESKD COVID-19 patients [27]. Of note, the neutrophil proteases PRTN3 and MPO (Fig. 3), and the neutrophil-derived protein AZU1, a chemotaxin for monocytes and fibroblasts, were associated with severe disease (Supplementary Table 5), and AZU1 was one of the strongest predictors of death, indicating that neutrophil activation and degranulation are features of severe COVID-19. Together, these chemokines, inflammatory mediators and neutrophil proteins point to an important role for monocyte recruitment and innate immune activation.
Association of IL6 with severe disease is well-established and has already received considerable attention and discussion [28,29]. Despite promising initial case reports of IL6R receptor blockade in COVID-19, convincing efficacy has not so far been demonstrated in randomised trials [30]. One explanation for this is that elevated IL6 is a downstream read-out rather than a primary driver of pathology. Alternatively, blocking IL6 signalling may be necessary but not sufficient to stop immunopathology. Interestingly, despite the association of IL6 with severity in our cross-sectional analysis, we did not identify a significant interaction between time and severity in our longitudinal analysis.
Several proteins associated with risk of death lie in pathways targeted by existing drugs. PARP1 was identified as an important marker of current or future severe COVID-19 by the Random Forests supervise learning method, and also was associated with risk of death. PARP1 inhibitors are in use for cancer [31]. PARP1 has also been associated with inflammatory and vascular disease [32]. While we cannot determine whether the association we observed is causal or simply reflects increased cell death, our data suggest that re-purposing of PARP1 inhibition in COVID-19 should be explored further. IL22RA1 was also predictive of death. This protein forms part of the receptor complex for IL22, enabling downstream signalling via JAK/STAT pathways. JAK/STAT inhibitors are in clinical use for rheumatoid arthritis and are being trialled in COVID-19 [33].
The host immune response to COVID-19 is a dynamic process, and clinical deterioration typically occurs 7-10 days after first symptoms. This underlines the importance of longitudinal analysis of serial samples rather than single snapshot measurements. By modelling temporal trajectories, we identified proteins that exhibit distinct temporal profiles in severe versus mild cases. One such protein was S100A12 (EN-RAGE) (Fig. 5). This protein was identified as associated with COVID-19 severity in another study also using Olink assays [10]. Interestingly, this protein is also associated with Kawasaki disease [34], a form of paediatric vasculitis with clinical similarities to a COVID-19-associated hyperinflammatory syndrome in children [35]. IL1-beta has been shown to be a driver of S100A12-induced sterile inflammation and activation of endothelial cells [36]. This provides support for the suggestion of targeting IL1-beta with Canakinumab in COVID-19 [37]. Other examples include the adhesion molecules ITGB6 and ICAM1, highlighting the role of leucocyte-endothelial interactions in severe COVID-19. Microvascular thrombosis is a key feature in the pathology of COVID-19, likely that this is driven by endothelial activation due to viral infection, effectively producing vasculitis. Currently, management of microthrombosis consists of anticoagulation. Our results suggest that disrupting leucocyte-endothelial interactions may be a complementary therapeutic strategy. Complement activation is associated with severe COVID-19 and may contribute to thrombo-inflammation in dialysis patients. We have previously demonstrated elevated complement activation markers, C3a and C5a, in ESKD patients with severe COVID-19 [38]. In our present analysis, complement factor H-related protein 5 (CFHR5), a protein that can increase surface complement activation [39] displayed a rising trajectory in severe disease (Fig. 5). Complement C5 inhibition is under investigation in the treatment of severe COVID-19 infection [33,40] and our findings support this approach. An intriguing association is the prominence of KRT19 in our analyses and a pre-print [27] (Supplementary Table 4,6). KRT19 is an intermediate filament protein, important for the structural integrity of epithelial cells [41]. Strikingly, evaluation of its expression in the immune system (immgen.org) shows it to be almost exclusively expressed in alveolar macrophages and this may be worthy of further investigation.
Other groups have studied the plasma or serum proteome in COVID-19 [8–10] using either mass spectrometry or immunoassays including the Olink platform. We observed greater similarities with other studies that used immunoassays [9,10,27], including the associations between severity and leucocyte chemokines (Supplementary Tables 4,6). However, we observed very little overlap with the mass spectrometry study of Shen et al [8]. This may reflect the lack of commonality in terms of proteins assayed. Mass spectrometry is less sensitive than immunoassays and so is unlikely to capture the cytokines studied here, and conversely is likely to measure many proteins that our immunoassays did not target. However, consistent with their data, we did observe increased von Willebrand factor and reduced apolipoprotein M in more severe disease.
Our study had a number of strengths in comparison to other studies. Many previous studies of COVID-19 have been limited to hospitalised cohorts and lack serial sampling. Consequently, they fail to capture either the full spectrum disease or its temporal evolution. In contrast, we were able to perform serial blood sampling in both the outpatient and inpatient setting, including longitudinal samples from the same individual before and after hospitalisation. This was possible because haemodialysis patients are unable to fully isolate as they must continue to attend for regular dialysis sessions. Another strength of our study was careful matching of ESKD control patients in terms of age, sex and ethnicity (Extended Fig. 1a-c) and the large proportion of black, asian and minority ethnic (BAME) patients who are disproportionately affected by COVID-19. Finally, we ensured appropriate statistical analysis of longitudinal data using linear mixed models and joint models. However, a limitation of our study was that we used Olink panels that measured specific proteins selected on their relevance to inflammation, immunity, cardiovascular and metabolic disease. This bias precluded formal pathway enrichment analysis.
In summary, this study reveals proteins associated with COVID-19 infection and severity, and demonstrates altered dynamic profiles between patients with severe disease and those with a more indolent course. These data provide a valuable resource for therapeutic target prioritisation.
Online Methods
Subjects and samples
Ethical approval
All participants (patients and controls) were recruited from the Imperial College Renal and Transplant Centre and its satellite dialysis units, London, and provided written informed consent prior to participation. Study ethics were reviewed by the UK National Health Service (NHS) Health Research Authority (HRA) and Health and Care Research Wales (HCRW) Research Ethics Committee (reference 20/WA/0123: The impact of COVID-19 on patients with renal disease and immunosuppressed patients). Ethical approval was given.
Primary cohort
We recruited 55 COVID-19 positive haemodialysis patients, either as outpatients or as inpatients (Supplementary Table 1). All patients were receiving chronic hospital-based outpatient haemodialysis prior to COVID-19 diagnosis. COVID-19 was confirmed in all cases with positive nasal PCR for the SARS-CoV-2 virus. Blood was collected in EDTA tubes and centrifuged to obtain plasma, and stored at –80°C. Sample processing was performed within 4 hours of venepuncture. The initial sample was taken as an outpatient for 30 patients and as an inpatient for 25. Where feasible, serial blood samples were taken. In total, 259 samples were taken (3 subsequently failed QC – see below). The median number of serial samples was 5 (range 1-10) (Extended Fig. 1d). 8 patients who were recruited as outpatients were subsequently admitted to hospital with COVID-19 over the course of the study. 27 of 55 (49.1%) patients had severe or critical disease (defined by peak WHO severity). 9 (16.4%) patients died.
In addition, we recruited 51 COVID-19 negative haemodialysis controls. COVID-19 negative haemodialysis controls were selected to mirror the cases in terms of demographic features (age, sex, ethnicity) (Extended Fig. 1a-c).
Validation cohort
We also recruited an independent cohort of 46 COVID-19 positive haemodialysis patients. 5 were outpatients and 41 were inpatients. 33 of 46 patients (71.7%) had severe or critical disease (by peak WHO severity), and 9 (19.6%) patients died. For these patients, blood was collected in serum tubes and centrifuged to obtain serum. Plasma was not collected from these patients. For 40 patients, only one sample from a single timepoint was collected, and for 6 patients, 2 samples were collected. To provide controls, we used serum samples from 11 non-infected ESKD patients (collected at the same time as plasma from a subset of the ESKD control group described above).
Clinical severity scores
Severity scoring was performed based on WHO classifications (WHO clinical management of COVID-19: Interim guidance 27 May 2020) adapted for clinical data available from electronic medical records. ‘Mild’ was defined as COVID-19 symptoms but no evidence of pneumonia and no hypoxia. ‘Moderate’ was defined as symptoms of pneumonia or hypoxia with oxygen saturation (SaO2) greater than 92% on air, or an oxygen requirement no greater than 4L/min. ‘Severe’ was defined as SaO2 less than 92% on air, or respiratory rate more than 30 per minute, or oxygen requirement more than 4L/min. ‘Critical’ was defined as organ dysfunction or shock or need for high dependency or intensive care support (i.e. the need for non-invasive ventilation or intubation). Severity scores were charted throughout a patient’s illness. We defined the overall severity/clinical course for each patient as the peak severity score that occurred during the patient’s illness.
Proteomic assays
Plasma and serum proteomic measurements were performed using Olink proximity extension immunoassays (https://www.olink.com/products/). Five 92-protein multiplex Olink panels were run (‘inflammation’, ‘immune response’, ‘cardiometabolic’, ‘cardiovascular 2’ and ‘cardiovascular 3’), resulting in 460 measurements per sample. Since a small number of proteins were measured on more than one panel, we measured a total of 436 unique proteins. The Olink assays were run using 88 samples/plate. All plates were run in a single batch. Plate layouts was carefully designed to avoid confounding of potential plate effects with biological or clinical variables of interest. To achieve this, we used an experimental design that combined ensuring case/control balance across plates with random selection of samples from each category and random ordering of allocation to wells. This is outlined in more detail as follows. We ensured that each plate contained a mixture of control and case samples. Specifically, a fixed proportion of each plate was designated for control samples. The allocation of specific control samples to each plate was performed using randomisation. For the case samples, we again used randomisation for plate assignment, with the constraint that once one sample from a given patient was allocated to a plate, all other longitudinal samples from that patient were assigned to same plate. Finally, once all the samples had been allocated to plates, the layout of samples within each plate was determined through a further randomisation step for well allocation.
Normalisation and quality assessment and control
The data was normalised using standard Olink workflows to produce relative protein abundance on a log2 scale (‘NPX’). Quality assessment was performed by a) examination of Olink internal controls, and b) inspection of boxplots, relative log expression (RLE) plots [42] and PCA. Following these steps, 3 poor quality samples were removed. In addition, 5 samples failed quality control on a single proteomic panel only, with the remaining panels passing QC. For these samples, proteins on the panel that failed QC were set to missing, and the data for the remaining proteins was retained.
Principle components analysis revealed no plate effects (Extended Fig. 8a). One serum sample was assayed twice as a technical replicate to assess assay performance (Extended Fig. 8b). We found a Pearson’s correlation between the samples of 0.999. 13 proteins were assayed more than once due to their inclusion in multiple Olink panels. For plasma, the median correlation between the assays was 0.986 with an IQR of 0.019 and a range of 0.925 to 0.998. For serum, the median correlation between the assays was 0.991 with an IQR of 0.043 and a range of 0.737 to 0.999. We removed duplicate assays at random prior to subsequent analyses.
For 11 ESKD controls, we had contemporaneous plasma and serum samples. To assess the comparability of these two matrices, we fit a linear model to compare duplicate measurements, finding low bias (slope = 0.985) and a high Pearson’s correlation of 0.957 (Extended Fig. 8c).
Missing values
Following QC, 0.22% data points were missing for the plasma dataset and 0.35% for the serum dataset. For analyses that required no missing values (PCA and supervised learning), we imputed missing values as follows. The dataset was first scaled and centred, and missing values imputed using caret’s k-nearest neighbours (kNN) method [43]. The 5 closest samples (by Euclidean distance) were used to estimate each missing value.
Differential abundance analysis
Differential protein abundance analyses between COVID-19 positive and negative samples were performed using linear mixed models, to account for the use of serial samples from the same individuals (R lme4 package [44]). This analysis compared 256 samples from 55 COVID-19 patients with 51 non-infected patients (1 sample per non-infected patients). Age, sex and ethnicity were included as covariates. We used a random intercept term to estimate the variability between individuals in the study and account for repeated measures. The regression model in R notation was: Where NPX represents the protein abundance and covid_status was a categorical variable (infected/non-infected). Sex and ethnicity were also categorical variables. Age was a quantitative variable. We calculated P values using a type 3 F test in conjunction with Satterthwaite’s method for estimating the degrees of freedom for fixed effects [45]. The regression model was fitted for each of the 436 proteins individually. Multiple testing correction was performed using the Benjamini-Hochberg method and a 5% FDR used for the significance threshold.
The same approach was used for the validation cohort. This analysis comprised 52 serum samples from 46 COVID-19 positive patients versus 11 samples from non-infected patient samples (1 sample per non-infected patient).
For testing the association of plasma proteins with the 4-level WHO severity rating (mild, moderate, severe and critical) within COVID-19 positive cases (n=256 samples from 55 patients), we used a similar linear mixed modelling approach. For this analysis, the covid_status term was replaced by a severity variable encoded using orthogonal polynomial contrasts to account for ordinal nature of severity levels.
The linear mixed modelling strategy was also employed for testing association of temporal clinical laboratory variables and protein levels, with the value of the clinical variable (as a quantitative trait) used in place of covid_status. Contemporaneous lab measurements were not available for all samples. This varied according to the clinical lab parameter. Some (eg troponin, d-dimer) were measured less frequently than full blood count and CRP. Details of the proportion of missing values for each lab parameter are included in (Supplementary Table 9).
Supervised learning
Random forest models were fit using R’s randomForest and caret packages [43,46]. Data was centred, scaled and imputed as in Data Preparation with the caveat that, during cross-validation, the pre-processing procedure was first applied on the resampled (training) data before the same method was applied without re-calculation to the holdout (test) set. To estimate model accuracy, cross-validation was carried out with 70% of the data in the training set for 10 resampling iterations; the training set was re-shuffled and the cross-validation procedure repeated 10 times. The model’s parameters were kept constant at 500 trees and an mtry value (number of proteins randomly sampled as candidates at each node) calculated as the square root of the number of features. After parameter estimation, we fit a final model trained using the entirety of the dataset. This model was used for subsequent feature extraction. Random forest feature extraction was carried out using the R randomForestExplainer package [47]. We made use of the following importance measures: accuracy decrease (the average decrease in prediction accuracy upon swapping out a feature), number of trees (the number of trees with a node corresponding to a feature) and mean minimal depth (the average depth at which a node corresponding to a feature occurs).
Survival analysis using joint modelling
Following scaling and centering, we fit linear mixed models for each protein to capture the temporal trajectories of each individual. As previously, a polynomial spline of degree 2 was used to model protein concentration with respect to time. We estimated both random intercepts and random slopes for each individual. These were joined to a Cox regression model using the jointModel package [12] in order to estimate the association of each protein with risk for death. P values were calculated using a Wald test for the association between the linear mixed model and Cox regression. FDR adjustment was applied, with 5% used as the significance threshold.
Longitudinal analysis
We also used linear mixed models to estimate the temporal profile of each protein We tested whether patients with severe disease had different temporal protein profiles from those with non-severe disease. For the purposes of this analysis, we binarised patients into severe or non-severe according to the peak WHO severity disease of their illness. Patients with a peak WHO score of mild or moderate were considered non-severe and those with a peak score of severe or critical were considered severe. We then used R’s bs function to fit a polynomial spline of degree 2 to model protein concentration with respect to the interaction between time (from symptom onset, measured in days) and disease severity (severe or non-severe) [48]. We estimated random slopes with respect to time, in addition to random intercepts, to account for each individual’s unique disease course. For each protein, we fitted the following model (R notation:) To identify proteins with distinct temporal profiles between severe and non-severe cases, we examined the P values for the time x severity interaction term. P-values were adjusted for the multiple proteins tested using the Benjamini-Hochberg method. 5% FDR used as the significance threshold.
Data Availability
Full summary statistics are included in the Supplementary Tables. Individual-level proteomic and clinical phenotyping data are available as Source Data Files. Code is available in the following GitHub repository: https://github.com/jackgisby/longitudinal_olink_proteomics
Data availability statement
Individual-level proteomic and clinical phenotyping data are available as Source Data Files 1-4.
Code availability
code is available in the following GitHub repository: https://github.com/jackgisby/longitudinal_olink_proteomics
Competing interests
The authors declare no competing interests
Author contributions
Conceived and designed the study: MCP, MB, MW, DCT, JEP.
Patient recruitment and sample collection: CLC, NM-T, ES, SPM, MFP, MW, DCT.
Led and coordinated patient recruitment: MW.
Sample processing: THM, PMM, NBB, SL, MP, FT, EF, M-A M, ED, LT, MB.
Clinical phenotyping: CLC, NM-T, SL, ES, MFP, MCP, JEP.
Data analysis: JG, AP.
Supervised the analysis: JEP.
Statistical support: PK.
Wrote the paper: JG, JEP
Results interpretation and editing the paper: MB, MCP, DT, JB.
All authors critically reviewed and approved the manuscript before submission.
Acknowledgements
The authors thank the patients who volunteered for this study and the staff at Imperial College Healthcare NHS Trust (the Imperial College Healthcare NHS Trust renal COVID-19 group and dialysis staff):
Appelbe M, Ashby DR, Brown EA, Cairns T, Charif R, Condon M, Corbett RW, Duncan N, Edwards C, Frankel A, Griffith M, Harris S, Hill P, Kousios A, Levy JB, Loucaidou M, Lightstone L, Liu L, Lucisano G, Lynch K, Mclean A, Moabi D, Muthusamy A, Nevin M, Palmer A, Parsons D, Prout V, Salisbury E, Smith C, Tam F, Tanna A, Tansey K, Tomlinson J, Webster P.
We also acknowledge the efforts of renal specialist doctors in training for assistance with recruiting patients to this study.
We also thank
- Hari and Rachna Murgai and Milan and Rishi Khosla for generous support with sample transport.
- Dr Kerry Rostron for exceptional support with laboratory facilities in challenging circumstances and the Department administrators for their help.
- Dr Brian Tom and Dr Jessica Barrett (MRC Biostatistics Unit, University of Cambridge) for statistical advice.
- Dr Arianne Richard (University of Cambridge) for critical feedback on the manuscript.
Funding statement
This research was partly funded by Community Jameel and the Imperial President’s Excellence Fund and by a UKRI-DHSC COVID-19 Rapid Response Rolling Call (MR/V027638/1) (to JEP). We also acknowledge a contribution from UKRI/NIHR through the UK Coronavirus Immunology Consortium (UK-CIC) and the National Institute for Health Research (NIHR) Biomedical Research Centre based at Imperial College Healthcare NHS Trust and Imperial College London. The views expressed are those of the author(s) and not necessarily those of the NHS, the NIHR or the Department of Health. JEP is supported by UKRI Innovation Fellowship at Health Data Research UK (MR/S004068/2). DCT is funded by a Stage 2 Wellcome-Beit Prize Clinical Research Career Development Fellowship (20661206617/A/17/Z and 206617/A/17/A). MCP is a Wellcome Trust Senior Fellow in Clinical Science (212252/Z/18/Z). NM-T and ES are supported by Wellcome Trust and Imperial College London Research Fellowships, and CLC by an Auchi Clinical Research Fellowship.
Supplementary Material
Legends for Supplementary Tables 2-10
Supplementary Table 2. Protein Annotation.
List of the 436 proteins measured. GeneID = gene symbol of the gene encoding the protein (used as the main identifier in the manuscript); UniProt = UniProt ID; Olink Assay Name = protein id used by Olink; Protein Name = full protein name; Panel name = the name of the 92 protein multiplex Olink panel on which the protein was measured.
Supplementary Table 3. Differential abundance analysis for COVID-19 positive vs negative ESKD patients in the primary and validation cohorts.
Summary statistics for all 436 proteins are shown. Pvalue = nominal p-value from linear mixed model. Adjusted Pvalue = p-values after Benjamini-Hochberg correction. Fold change = estimated fold change from regression coefficient. Proteins are ordered based on results in the primary cohort: first by whether they are significant or not (at 5% FDR), then by fold change (from positive to negative). Note the associations are not ordered by p-value so strong associations do not necessarily appear at the top of the table. Significant adjusted p-values in green and non-significant in red grey. Estimated fold changes are coloured in a gradient from red to blue for up or downregulated in COVID-19 +ve versus –ve, respectively.
Sample size for primary cohort: n= 256 plasma samples from 55 COVID-19 positive ESKD patients, versus n= 51 ESKD controls (1 sample per control patient).
Sample size for validation cohort: 52 samples from 55 COVID-19 patients and 11 non-infected patient samples (single time-point).
Supplementary Table 4. Comparison to other proteomic studies of COVID-19 positive vs negative patients. Proteins that were differentially abundant in COVID-19 +ve vs -ve patients in our data are listed (5% FDR). TRUE indicates that the protein was reported as differentially abundant in the relevant previous proteomic study. The final column summarises whether the association was previously reported in any 1 or more of the 4 studies. We have not harmonised significance thresholds between studies: we simply report whether the authors declared the protein significant by the threshold of their study.
Supplementary Table 5. Associations of proteins and COVID-19 severity (primary cohort). Summary statistics for all 436 proteins are shown. Pvalue = nominal p-value from linear mixed model. Adjusted Pvalue = p-values after Benjamini-Hochberg correction. Fold change = estimated fold change from regression coefficient. Proteins are ordered first by whether they are significant or not (at 5% FDR), then by linear gradient (effect size) from positive to negative. Note the associations are not ordered by p-value so strong associations do not necessarily appear at the top of the table.
Supplementary Table 6. Comparison to other proteomic studies of COVID-19 severity. Proteins that were associated with severity in our data are listed (5% FDR). TRUE indicates that the protein was reported as associated with severity in the relevant previous proteomic study. The final column summarises whether the association was previously reported in any 1 or more of the 4 studies. We have not harmonised significance thresholds between studies: we simply report whether the authors declared the protein significant by the threshold of their study.
Supplementary Table 7. Predictors of clinical course from Random Forests. Importance metrics for each protein for prediction according to a random forest model trained to predict current or future severe/critical disease using the first sample of each patient. Proteins are ordered by mean minimal depth across all trees – this was used as the primary importance metric.
Supplementary Table 8. Proteomic predictors of fatal COVID-19. Summary statistics from joint models for fatal disease. Results for all 436 proteins are shown. “Is significant” indicates significance (green) or not (grey) at 5% FDR. The association coefficient for each protein indicates the direction and magnitude of the estimated log relative risk for death (red: higher protein levels increases risk of death, blue the opposite). 95% confidence intervals are provided.
Supplementary Table 9. Associations of proteins and clinical laboratory measurements.
Clinical variable = clinical lab tests: white cell count, lymphocyte count, neutrophil count, monocyte count, C-reactive protein, ferritin, d-dimer, troponin.
Results are shown for all 436 proteins against all 8 lab measurements.
Adjusted p-value = p-value from linear mixed model after Benjamini-Hochberg correction.
Gradient indicates effect size and direction. A positive gradient (red) indicates higher “Is significant” indicates significance (green) or not (grey) at 5% FDR.
Contemporaneous clinical laboratory tests were not available for all plasma samples. The proportion of samples for which contemporaneous lab tests were available were: white cell count 66%, neutrophils 66%, monocytes 66%, lymphocytes 66%, CRP 64%, ferritin 36%, troponin 35%, d-dimer 30%.
Supplementary Table 10. Longitudinal proteomic profiling with linear mixed models.
Summary statistics from the linear mixed models used to identify proteins with differential temporal trajectories between mild/moderate (n=28) and severe/critical COVID-19 patients (n=27).
Summary statistics for all 436 proteins are shown.
Pvalue = nominal p-value from linear mixed model for the interaction term between time from symptom onset (days) and severity (as a binary variable: mild-moderate or severe-critical). Adjusted Pvalue = p-values after Benjamini-Hochberg correction. “Is significant” indicates significance (green) or not (grey) at 5% FDR.