ABSTRACT
Background and objectives An unmet need exists for validated quantitative tools to measure multiple sclerosis (MS) disease activity and progression. We developed a custom immunoassay-based MS disease activity (MSDA) Test incorporating 18 protein concentrations into an algorithm to calculate four Disease Pathway scores (Immunomodulation, Neuroinflammation, Myelin Biology, and Neuroaxonal Integrity) and an overall Disease Activity score. The objective was to clinically validate the MSDA Test based on associations between scores and clinical/radiographic assessments.
Methods Serum samples (N=614) from patients with MS at multiple sites were split into Train (n=426; algorithm development) and Test (n=188; evaluation) subsets. Subsets were stratified by demographics, sample counts per site, and gadolinium-positive (Gd+) lesion counts; age and sex were used to demographically adjust protein concentrations. MSDA Test results were evaluated for potential association with Gd+ lesion presence/absence, new and enlarging (N/E) T2 lesion presence, and active versus stable disease status (composite endpoint combining radiographic and clinical evidence of disease activity).
Results A multi-protein model was developed (trained and cross-validated) using the Train subset. When applied to the Test subset, the model classified the Gd+ lesion presence/absence, N/E T2 lesion presence, and active versus stable disease status assessments with an area under the receiver operating characteristic (AUROC) of 0.781, 0.750, and 0.768, respectively. In each case, the multi-protein model had significantly (bootstrapped, one-sided p<0.05) greater AUROC performance when compared with the top-performing, demographically adjusted (by age and sex) single-protein model based on neurofilament light polypeptide chain. Algorithmic score thresholds corresponded to low, moderate, or high levels of disease activity. Based on the Test subset, the diagnostic odds ratios determined that the odds of having ≥1 Gd+ lesions among samples with a moderate/high Disease Activity score were 4.49 times that of a low Disease Activity score. The odds of having ≥2 Gd+ lesions among samples with a high Disease Activity score were 20.99 times that of a low/moderate Disease Activity score.
Discussion The MSDA Test was clinically validated; the multi-protein model had greater performance compared with the top-performing single-protein model. The MSDA Test may serve as a quantitative and objective tool to enhance care for MS.
INTRODUCTION
Multiple sclerosis (MS) is an autoimmune, chronic, neuroinflammatory disease of the central nervous system,1, 2 with a complex disease course and variable symptoms or manifestations.1 The clinical course following the first clinical manifestation of MS, or clinically isolated syndrome (CIS), can vary, but most patients transition to having an MS diagnosis.3, 4 Approximately 15% of patients have primary progressive MS, which is usually diagnosed following symptom onset, has no periods of remission, and has a worse prognosis compared with other types of MS.5-7 More than 85% of patients have relapsing-remitting MS (RRMS),7-9 which is characterized by clinical exacerbations, or relapses, followed by periods of clinical remission, or recovery, as inflammation resolves and remyelination occurs.1, 5, 7, 9, 10 Most patients with RRMS enter a progressive phase, which presents as accumulating loss of neurological function over time, a result of demyelination, neuroinflammation, accumulation of neuroaxonal damage, and brain atrophy.11, 12 These manifestations contribute to a progressively worsening disability, namely, secondary progressive MS, which can occur with or without further relapses.5, 8, 13
The heterogenous variations in the clinical disease course of MS have made diagnosis and prognosis difficult.9, 14 Although early diagnosis was based primarily on clinical evidence, the 2017 revision of the McDonald criteria for MS diagnosis has been updated to place greater emphasis on radiographic evidence using magnetic resonance imaging (MRI; eg, dissemination or spread over a minimum of two distinct areas of the central nervous system [CNS], including the brain, spinal cord, and optic nerves and at two different time points), as well as the presence of oligoclonal bands in the cerebral spinal fluid (CSF). Dissemination of lesions can be evaluated by gadolinium enhanced or T2-weighted imaging.15, 16 Although the McDonald criteria was recently updated to combine clinical manifestations and radiographic imaging,15 these criteria do not always accurately predict disease course, activity, progression, recurrence, or treatment response.13, 17, 18 To date, there are no validated clinical tests that leverage multiple serum biomarkers to track disease activity or disease progression in patients with MS. As such, there is an unmet need for clinically validated, objective, quantitative tests that can accurately monitor MS disease activity and progression.14, 19
A multi-protein, serum-based biomarker assay was developed to quantitatively measure disease activity using protein concentrations of 18 biomarkers in the serum of patients with all types of MS. The custom multi-protein assay panel was developed and analytically validated using the Olink® Proximity Extension Assay (Olink Proteomics, Uppsala, Sweden) technology.20 The comprehensive analytical characterization of this MS disease activity (MSDA) Test was described previously. Briefly, 18 proteins were selected for inclusion into the panel based on results from previously performed research and development studies and incorporated into a final algorithm for calculating four Disease Pathway scores (Immunomodulation, Neuroinflammation, Myelin Biology, and Neuroaxonal Integrity) scores and an overall Disease Activity score (Supplementary Table 1).21 The objective of the study was to clinically validate the MSDA Test by evaluating the associations of the overall Disease Activity score and the four Disease Pathway scores with gadolinium-positive (Gd+) lesions, new and enlarging T2 (N/E T2) lesions, and active or stable disease status. MS disease status was a combination assessment of Gd+ lesions, N/E T2 lesions, and clinical relapse status.
METHODS
Patient samples
A total of 614 serum samples were included from two different sources: 448 retrospective samples from the Serially Unified Multicenter MS Investigation [SUMMIT] consortium; 166 prospective samples from the Rocky Mountain Multiple Sclerosis Clinic (RMMSC). SUMMIT samples were sent to Octave Bioscience, Inc. from three independent sites and studies: Comprehensive Longitudinal Investigation of Multiple Sclerosis at the Brigham and Women’s Hospital (CLIMB; n=195), American University of Beirut Medical Center Study (AMIR; n=202), and University of Massachusetts MS Center—Family Study of Demyelinating Disease (FSDD; n=51). Patient samples selected for this study were intentionally enriched for the presence of Gd+ lesions compared with the general MS population.
SUMMIT was an international multicenter, prospectively enrolled MS cohort study with standardized data structure and analysis groups that can be stratified by demographics, clinical measures, disease relapses, MRI measures, or blood sampling.22 Samples from RMMSC were collected as part of the matched serum and MRI for the Disease Activity Test Development Study. Cross-sectional (samples from a single time point) and longitudinal samples (samples from multiple time points from the same patient) from both SUMMIT and RMMSC were included in the analysis. Serum specimens were collected using standard venipuncture and processing protocols. Samples were transferred to Octave Bioscience, Inc. for analysis and stored at −65□C.
Clinical and radiographic data for biostatistical analysis
An annotated data set containing clinical and radiographical data was obtained for each serum sample. Individual data sets with additional demographic, clinical, and radiographic fields were collected; a combined data set was matched and adjusted so that the annotations were consistently labeled across the study sites and stratified, meaning that the data were split into the Train and Test subsets with a balanced distribution across demographic characteristics, sample counts per site, and Gd+ lesion counts.
Bioanalytical analysis
All serum samples were analyzed as part of a single experiment performed over several days. Assay plates contained up to 72 samples analyzed in a single well each; four serum controls, three calibrators, and a blank control, were described previously21 and assayed in triplicate. Analytical runs were stratified independently of the clinical Train/Test randomization, ensuring that assay plates had balanced distribution across sites, demographics, and presence/absence of or number of Gd+ lesions. Analysis was performed cross-sectionally, using the protein concentrations as predictors and algorithmic features.
Pre-processing and quality control was performed using the Olink® Normalized Protein Expression (NPX) Manager software (Olink Proteomics, Uppsala, Sweden). Built-in quality control (three internal controls that were added into all samples and the external controls) enabled control over the technical performance of the assay at each step of the analysis. These internal controls consisted of an incubation control, extension control, and a detection control. Quality control was performed per assay run and for individual samples at each step of the analysis. Standard deviations for each internal control were established to be below a predetermined threshold (ie, 0.2 NPX) for the entire plate. Median values were calculated for the incubation and detection controls, respectively, for a sample plate. The result of each internal control was required to be within ±0.3 NPX from the plate median. If any of these internal controls deviated from this range, the sample failed quality control and was reanalyzed.
External controls consisted of serum pools with endogenous protein concentrations established at expected levels. Acceptability of a plate run was based on the percent recovery of the serum pools relative to their expected values (ie, ±3 SD). Individual samples or plates that failed the analytical quality control process were rerun.
Assessments
The primary and exploratory assessments of the study included evaluation of the association of the multi-protein and single-protein models to the radiographic (Gd+ and N/E T2 lesions) and clinical (active vs stable disease status) assessments. Radiographic annotations were derived from brain MRIs for all patients in the study and from spine MRIs when available. Association of single-protein and multi-protein models with the presence or absence of Gd+ lesions, as determined on a matched MRI performed within 60 days of the blood draw, was considered the primary endpoint. Next, the final models were evaluated relative to additional disease activity exploratory assessments (N/E T2 lesions and active vs stable disease status).
Samples were considered active if any Gd+ lesion was present, if any N/E T2 lesion was present, or if there was evidence of a clinical relapse within 30 days. Samples were otherwise considered stable (including samples missing N/E T2 lesion/clinical relapse data). This approach was used to further refine the overall Disease Activity and Disease Pathway algorithms (eg, Immunomodulation, Neuroinflammation, Myelin Biology, and Neuroaxonal Integrity). Association of the results from multi-protein and single-protein modeling with disease progression was an additional exploratory assessment, which is not reported in this manuscript. Classification models were fit to the data with proteomic results as the independent variables and presence or absence of Gd+ lesions as the dependent variable.
Statistical analysis
The study hypothesis was that a multi-protein model would significantly associate with clinical and radiographic disease activity endpoints and be superior to the highest performing, single-protein model based on the protein biomarkers included in the multi-protein model.
Inclusion criteria for the study required that information for the primary endpoint reference standard (eg, presence or absence of Gd+ lesions) was available for all patient samples. Upon completion of analytical quality control processes and any necessary reanalysis, index test data (eg, protein concentrations) were available for each sample. For secondary and exploratory endpoints, samples with missing data were excluded from the statistical analysis.
The entire data set (N=614) was split into Train and Test subsets. The Train subset was designed to optimize algorithms and included 70% of the total available samples; the Test subset established the performance specifications of the MSDA Test and comprised 30% of the total available samples. The subsets were stratified to ensure a balanced distribution across demographic characteristics that included age, sex, and disease duration; sample counts per site; and Gd+ lesion counts. Analysis was performed in the Train subset to assess unexpected differences in biomarker ranges resulting from preanalytical processing at the four sites from which the samples were obtained.
A two-layer stacked classifier using L2-penalized logistic regression models, which leveraged biological categorizations of biomarkers to calculate Disease Pathway scores, was ultimately developed. A score-based algorithm enabled the four Disease Pathway scores and an overall Disease Activity score to be derived from the probabilities from the Pathway and Disease Activity model and calculated for individual samples (Figure 1). Full details on the algorithm and model parameters are presented elsewhere.23
MSDA Test stacked classifier flow chart and biological characterizations model. The first layer of the model consisted of four Disease Pathway algorithms. The second layer of the model utilized the four Disease Pathway probabilities along with the individual age- and sex-adjusted biomarker concentrations as meta-features to determine an overall Disease Activity score that reflected the probability of disease activity. Thresholds were established based on the count of Gd+ lesions for the Disease Activity score, which corresponded to low (1.0–4.0), moderate (4.5–7.0), or high (7.5–10.0) levels of disease activity. MSDA, multiple sclerosis disease activity.
Single-protein models were fit using L2-penalized logistic regression with presence or absence of Gd+ lesions as the dependent variable and an intercept and the protein biomarker as independent variables.
Protein concentrations were limit of quantitation (LOQ)-imputed, log10-transformed, and demographically adjusted (with age and sex, based on Ordinary Least Squares [OLS] modeling) prior to being used in the Disease Pathway, Disease Activity, and single-protein models. Sex and age were selected as demographic adjustment variables (if there was a dependence with the protein biomarkers) since they are routinely collected in a clinical setting when blood samples are taken for analysis. Previous research and development studies, samples from a cohort of healthy controls, and those from the Train subset were used to establish the biomarker-specific demographic adjustment strategy, which included removing protein concentration outliers, accounting for OLS coefficient sign consistency across the three studies and establishing statistical significance related to both age and sex.23
Metrics for model performance including the area under the receiver operating characteristic (AUROC), sensitivity, specificity, positive predictive value (PPV), negative predictive value (NPV), accuracy, and odds ratios were used to evaluate model performance. The prevalence of Gd+ lesions was enriched in this dataset, and it is important to note that PPV, NPV, and accuracy all depend on the prevalence. An L2-penalization was used to optimize the model and minimize overfitting when training to ensure generalizability to the Test subset. The multi-protein classification model was compared with the top-performing single-protein model (demographically corrected neurofilament light polypeptide chain [NfL] for all disease activity endpoints) to assess the statistical significance of differences in AUROC performance using a boot-strapped (1000 iterations), one-sided test for significance using the pROC24 package in R.25
The importance of each protein biomarker included in the Disease Activity model was evaluated using mlxtend26 by the mean decrease in the AUROC after permutation (repeated 1000 times) of only that protein marker in the trained Pathway and Disease Activity models, compared with the AUROC for predictions with no permutation. Proteins that displayed a larger decrease in the AUROC were more important for the Disease Activity model. NfL was identified as the most important biomarker, followed by tumor necrosis factor superfamily member 13B (TNFSF13B) (Supplementary Figure 1).
Statistical analysis was performed to characterize the algorithm at the model level (eg, prior to scoring individual samples) and at the score level. Thresholds were established for the score levels that corresponded to Disease Activity scores based on the number of Gd+ lesions. Patients with no lesions were considered to have low (1.0–4.0) disease activity, patients with 1 Gd+ lesion had moderate (4.5–7.0) disease activity, and patients with ≥2 Gd+ lesions had high (7.5–10.0) disease activity. The low versus moderate/high threshold was selected based on sensitivity. The rationale for selecting sensitivity for the low versus moderate/high threshold was that the presence of any number of Gd+ lesion(s) was an accurate and reliable indicator of active disease. The low/moderate versus high threshold was selected based on accuracy. The rationale for selecting accuracy for the low/moderate versus high threshold was that all samples within this score range were expected to have radiographic evidence of disease activity; the primary factor would therefore be optimizing the ability to distinguish a single lesion versus ≥2 Gd+ lesions.
Statistical analysis for the Train subset was performed by the Octave Bioscience Data Science Team who remained blinded to the analytical results (eg, protein concentrations) and clinical assessments in the Test subset until the algorithm was finalized. To investigate generalizability of the Disease Activity and Pathway algorithms determined using the Train subset, samples from earlier research and development studies were also evaluated prior to the Test subset analysis (analyses not reported here).
Standard protocol approvals, registration, and patient consents
The study was approved by the Mass General Brigham institutional review board (Somerville, MA, USA) and the WCG institutional review board (Puyallup, WA, USA). All patients provided written informed consent.
Data availability
Access to anonymized data not published within this article and the study protocol can be made available by request from any qualified investigator once a data-sharing agreement is in place.
RESULTS
Demographic and patient characteristics
A total of 614 serum samples were included from two different sources and split into Train (n=426; algorithm development) and Test (n=188; evaluation) subsets. Patient demographics and characteristics were well balanced between Train and Test subsets (Table 1).
Demographic and patient characteristics for the Train and Test subsets and the entire dataset.
Single-protein model evaluation
The AUROC of demographically corrected, log10-transformed, LOQ-imputed individual biomarkers evaluated in the Train and Test subsets ranged from 0.436 to 0.726. NfL was the highest performing protein, with an AUROC of 0.726 for the Test subset. The biomarkers that correlated with nominal significance (p<0.05, no multiple hypothesis testing correction) with Gd+ lesion presence included NfL, cluster of differentiation 6 (CD6), C-X-C motif chemokine ligand 13 (CXCL13), interleukin 12β (IL-12β), and TNFSF13B (Supplementary Table 2).
Multi-protein model performance and optimization
Prior to incorporation of the scoring algorithm, a threshold for the multi-protein stacked classifier was chosen to provide a sensitivity of at least 0.80 for the Train subset. This threshold led to a sensitivity of 0.684 and a specificity of 0.714 in the Test subset. The AUROC of the multi-protein stacked classifier was 0.807 and 0.781 for the Train and Test subsets, respectively, based on Gd+ lesion presence (Supplementary Table 3).
Clinical validation of the multi-protein model
The multi-protein model developed on the Gd+ lesion presence endpoint using the Train subset was then applied to the Test subset. The final model performance achieved an AUROC of 0.781 relative to the Gd+ lesion presence/absence endpoint, 0.750 relative to the N/E T2 lesion presence endpoint, and 0.768 relative to the active/stable disease endpoint. In each case, the multi-protein model was found to have significantly greater (p<0.05) performance when compared with the top-performing single-protein model based on demographically corrected NfL (p<0.05).
By comparison, the single-protein model for demographically corrected NfL had an AUROC of 0.726 for Gd+ lesion presence, 0.660 for N/E T2 lesion presence, and 0.683 for active/stable disease. The multi-protein model also outperformed NfL models based on log10-transformed, LOQ-imputed NfL protein concentrations with no demographic correction. The AUROCs for the single-protein, demographically uncorrected NfL models were 0.694, 0.619, and 0.645 for Gd+ lesion presence, N/E T2 lesion presence, and active versus stable disease, respectively (Figure 2).
Summary of final performance of the multi-protein model and top performing single-protein demographically corrected (NfL) model in AUROC based on primary (A: Gd+ lesion presence) and exploratory assessments (B: N/E T2 lesion presence and C: active/stable disease status). P-values were from a boot-strapped (1000 iterations), one-sided test for significance comparing the multi-protein and single-protein models. AUROC performance and p-values are also shown for a single-protein NfL model with no demographic correction. AUROC = area under the receiver operating characteristic curve; DC = demographic correction; Gd+ = gadolinium-positive; N/E = new/enlarging; NfL = neurofilament light polypeptide chain.
Once the final model was optimized, score-level performance of the model in the Train and Test subsets was evaluated using 2 × 2 confusion matrices. Score-level precision for the Train subset is shown in Table 2. Selection of the low versus moderate/high threshold was based on sensitivity (for the comparison of no lesions vs ≥1 Gd+ lesions); selection of the low/medium versus high threshold was based on accuracy (for the comparison of 0 or 1 vs ≥2 Gd+ lesions).
Score level evaluation of the model in the Train and Test subsets by number of Gd+ lesions.
For Test samples with ≥1 versus 0 Gd+ lesions, the sensitivity and NPV for the low versus moderate/high cutoff were determined to be 0.737 and 0.775, respectively. A diagnostic odds ratio demonstrated that the odds of having ≥1 Gd+ lesions among samples with a moderate/high Disease Activity score was 4.49 times the odds of having ≥1 Gd+ lesions among samples with a low Disease Activity score (Table 2). A comparison of the performance of the Disease Activity score in these samples using the multi-protein model with that of the highest performing single-protein model based on demographically corrected NfL demonstrated that the multi-protein demographically corrected model outperformed the single-protein demographically corrected NfL model (Figure 3).
Analysis of disease activity using the A) top-performing single-protein demographically corrected (NfL) model and B) multi-protein model in patients with ≥1 Gd+ lesions. Orange bars = Test subset; Blue bars = Train subset. Gd+ = gadolinium-positive; NfL = neurofilament light polypeptide chain.
For the Test samples with either 0 or 1 Gd+ lesions when compared with those samples with ≥2 Gd+ lesions, the accuracy at the Disease Activity score to predict low/moderate versus high cutoff was determined to be 0.894. The diagnostic odds ratio demonstrated that the odds of having ≥2 Gd+ lesions among samples with a high Disease Activity score were 20.99 times the odds of having ≥2 Gd+ lesions among samples with a low/moderate Disease Activity score (Table 2).
The overall performance of the Disease Activity score and four Disease Pathway scores is shown in Figure 4. The score distribution and respective box plots with Gd+ lesions for each of the four Disease Pathway scores are shown in Supplementary Figure 2. The centering and scaling strategy for the four Disease Pathway scores resulted in sufficient correlation to the overall Disease Activity score while retaining an independent signal. Finally, a stacked bar plot of the results for the Train and Test subsets demonstrated that the calculated Disease Activity score reflected both the likelihood and severity of radiographic disease activity, based on the presence or absence of and count of Gd+ lesions. As shown, patients without Gd+ lesions had a low Disease Activity score (in blue), patients with high Gd+ lesions had a high Disease Activity score (in orange) and those with moderate disease activity had a medium range of Disease Activity score (in yellow; Figure 5).
Correlation of the Disease Activity and Disease Pathway (A: Immunomodulation, B: Neuroinflammation, C: Myelin Biology, and D: Neuroaxonal Integrity) scores in the multi-protein model. The solid line indicates the linear regression fit between the Disease Activity and Disease Pathway scores, and the equations for the lines are given in the upper left corner of each figure. DA = Disease Activity; R2 = coefficient of determination.
Stacked bar plots of the Disease Activity score in the A) Train and B) Test subsets. Low disease activity (blue) was associated with no Gd+ lesions, moderate disease activity (yellow) was associated with ≥1 Gd+ lesions, and high disease activity (orange) was associated with ≥2 Gd+ lesions. Bars for Disease Activity scores for which there were no samples are blank. DA = Disease Activity; Gd+ = gadolinium-positive.
The distribution of individual biomarkers, and the Disease Activity and Disease Pathway scores obtained from the MSDA Test were analyzed for each of six disease-modifying therapy (DMT) categories (anti-CD20s, natalizumab, interferons, dimethyl fumarate, fingolimod, and glatiramer acetate) in 466 samples, reflecting the therapy the patient was on at the time of the blood draw. The remaining samples did not fall in one of these six categories and were categorized as either other (treated with a DMT that was not analyzed independently due to low sample count) or blank (information relating to therapy was not provided in the clinical dataset). Significant analysis of variance (ANOVA) Bonferroni corrected p-values were observed for 14 of the 18 individual biomarkers utilized in the MSDA Test algorithm (Supplementary Figure 3). Samples (n=40) from the anti-CD20 category had the lowest Disease Activity score on average (3.11 ± 1.77), followed by samples (n=129) from the natalizumab category (4.17 ± 1.63). The highest Disease Activity score on average was associated with samples (n=62) from the glatiramer acetate category (6.39 ± 1.67; Supplementary Table 4).
Classification of evidence
The MSDA Test is a multi-protein, serum-based biomarker assay designed to quantitatively measure disease activity using the protein levels of biomarkers present in the serum of patients with MS. Protein concentrations were LOQ imputed, log10 transformed, and demographically adjusted for age and sex. The combination of multiple proteins was used to calculate four Disease Pathway scores and an overall Disease Activity score. The protein selection was intended to reflect the various biological pathways associated with MS pathophysiology. Using associations of the results from the multi-protein and single-protein models and the radiographic/clinical assessments, the MSDA Test was clinically validated in this study. This study provided Class II evidence to demonstrate the clinical validation of MSDA Test for disease activity assessments in MS.
DISCUSSION
There are currently no validated clinical tests that leverage multiple serum biomarkers to monitor disease activity or progression in patients with MS. We have previously established that the MSDA Test is accurate, sensitive, precise, and robust, which serves as a critical first step in the validation of this assay.23 The MSDA Test uses 18 protein biomarkers, which reflect various biological pathways associated with MS pathophysiology.
In this study, we successfully demonstrated the clinical validation of the MSDA Test. All disease activity assessments, namely, Gd+ lesions, N/E T2 lesions, and active/stable disease status, showed association with the Disease Pathway and overall Disease Activity scores from the MSDA Test. The multi-protein model had significantly (bootstrapped, one-sided p<0.05) greater performance compared with the top-performing single-protein model based on demographically corrected NfL in all assessments. The MSDA Test performed well when differentiating samples from patients with ≥1 Gd+ lesions versus no lesions, as well as when differentiating samples from patients with ≥2 Gd+ lesions versus 0 or 1 lesions.
The highest-performing single-protein biomarker for all disease activity endpoints was NfL. NfL has been demonstrated to be a prognostic indicator of disease activity in MS.27-31 Elevated serum and CSF levels of NfL correlate with neuronal cell damage and brain atrophy.28, 29, 32-34 It has been used to predict long-term clinical outcomes of MS34, 35 and guide treatment decisions.28, 29, 34, 36, 37 However, NfL is not specific for MS and is elevated in several neurodegenerative diseases.38-40 Furthermore, the protein is released into CSF and blood as a result of neuroaxonal damage, reflecting pathophysiology downstream of immune-mediated inflammatory pathways.29, 38, 39 In single-protein analyses, high-performing biomarkers in addition to NfL in this study included IL-12β, CXCL13, and TNFSF13B, all of which have been found to be potential biomarkers for MS in other studies.29, 34, 41-46 Our analysis revealed that the multi-protein model outperformed single-protein models for each of these biomarkers, which indicated a more accurate representation of the various pathways, processes, and cell types involved in a complex disease state, such as MS, by a combination of biomarkers.20
The MSDA Test was developed to favor sensitivity over specificity. In biomarker validation studies, there is typically a tradeoff between sensitivity and specificity for assay development.14 Development of a highly sensitive model, which can produce a higher degree of false-positive results and have reduced specificity is critical to the identification of patients with subradiographic and subclinical disease activity. Detection of early-stage MS remains challenging when using conventional clinical or radiographic assessments.17-19 We believe that the MSDA Test utilizes a well-balanced sensitivity and specificity combination, which can play a key role in the identification of patients with subradiographic and subclinical MS prior to detection of clear clinical or radiographic manifestations. This more sensitive detection will allow for optimal and timely treatment, which can positively impacts patient outcomes.47 Another advantage of a blood-based approach is that disease activity can be detected regardless of where in the CNS lesions have occurred. Brain MRIs are more frequently used in the assessment of patients with MS, although spinal lesions are a common occurrence in MS as well.15, 48
A disease activity measurement tool should reflect therapeutic efficacy and be characterized relative to the biological impact of various mechanisms of action. In the DMT analysis, the lowest average Disease Activity scores were observed in the anti-CD20 and natalizumab groups, which represent the highest efficacy therapies in our categorizations. Lower Disease Activity scores were observed in patients without radiographic evidence of disease activity (0 Gd+ lesions) and there was a direct correlation between Disease Activity and Disease Pathway scores with Gd+ lesion counts across all DMT categories. Future studies will expand upon the DMT analysis to factor in a patient’s overall disease duration, their duration on the DMT, and their previous DMT history.49
Our study has limitations. The patient samples in this study were obtained from four different sites. Practice at different sites, including sample preparation techniques, may introduce potential differences between the data sets. Despite this challenge, the MSDA Test showed successful performance in the test subset, demonstrating its promise of real-world performance.
With the successful clinical validation of the MSDA Test, we envision several potential uses in the future, including a routine surveillance test to better monitor disease activity and progression (eg, distinguish inflammation from silent disease progression), especially in patients considered to have stable disease, and to track new/worsening symptoms, as well as an evaluation test of treatment response, or in consideration of alternative treatment options. We also wish to expand the analysis to investigate the association between overall Disease Activity and Disease Pathway scores and additional assessments (eg, Expanded Disability Status Scale; Patient Determined Disease Steps). Evaluation of the MSDA Test in a larger population of patients with MS in a real-world setting is also valuable. The MSDA Test is intended to complement standard radiographic imaging and clinical assessment and promote individualized disease management.14, 19, 50
STUDY FUNDING
This study was funded by Octave Bioscience, Inc. and in part by the U.S. Department of Defense (W81XWH2110633 to T Chitnis).
DISCLOSURES
Tanuja Chitnis has received compensation for consulting from Biogen, Novartis Pharmaceuticals, Roche Genentech, and Sanofi Genzyme, and has received research support from the National Institutes of Health, National MS Society, US Department of Defense, EMD Serono, I-Mab Biopharma, Mallinckrodt ARD, Novartis Pharmaceuticals, Octave Bioscience, Inc., Roche Genentech, and Tiziana Life Sciences. This research was conducted in part with the support of the Department of Defense through the Multiple Sclerosis Research Program under Award No. W81XWH-18-1-0648 (to T. Chitnis).
John Foley has received research support from Biogen, Novartis, Adamas, Octave Bioscience, Inc., Genentech, and Mallinckrodt, has received speakers’ honoraria and acted as a consultant for EMD Serono, Genzyme, Novartis, Biogen, and Genentech, has equity interest in Octave Bioscience Inc., and is the founder of InterPro Bioscience.
Carolina Ionete has received research support from Biogen, Serono, Genentech, NMSS, and Department of Defense, and received compensation for advisory board activity from Sanofi-Genzyme.
Nabil K. El Ayoubi has received support to attend scientific educational courses from Novartis, Merck Serono, Sanofi, Biologix, and has received speaker honoraria for scientific presentations on Multiple Sclerosis from Biologix, Sanofi, Merck Serono, and Novartis.
Shrishti Saxena, Patricia Gaitan-Walsh, Anu Paul, and Fermisk Saleh have no disclosures.
Hrishikesh Lokhande has received research support from the US Department of Defense and Octave Bioscience, Inc.
Howard Weiner has received research support from the Department of Defense, Genentech, Inc., National Institutes of Health, National Multiple Sclerosis Society, Novartis and Sanofi Genzyme. He has received compensation for consulting from Genentech, Inc, IM Therapeutics, IMAB Biopharma, MedDay Pharmaceuticals, Tiziana Life Sciences and vTv Therapeutics.
Jennifer L. Venzie is an employee of The Lockwood Group and provided editorial support funded by Octave Bioscience, Inc.
Ferhan Qureshi, Anisha Keshavan, Kian Jalaleddini, and Ati Ghoreyshi are employees of Octave Bioscience, Inc.
Michael J. Becich, Fatima Rubio da Costa, Victor M. Gehman, and Fujun Zhang were employees of Octave Bioscience, Inc., at the time the study was completed. Samia J. Khoury has received compensation for scientific advisory board activity from Merck and Roche.
Supplementary Materials
List of the 18 proteins selected for assay inclusion.
Single-protein model performance of individual biomarker proteins (LOQ-imputed, log10 transformed, and demographically corrected for age and sex) based on Gd+ lesion assessments. Nominal p-values from a two-sided T-test assuming equal variances between groups (no Gd+ lesions vs ≥1 Gd+ lesions) are reported.
Model-level performance of the multi-protein model in the Train and Test subsets.
Mean±SDDisease Activity and Disease Pathway scores categorized by disease-modifying therapy.
Feature importance demonstrated by mean AUROC decrease (no Gd+ lesions vs ≥1 Gd+ lesions) after permutation (1000 times) for each biomarker in the Disease Activity stacked classifier model and the input Pathway models. Biomarkers with a larger positive decrease in AUROC have greater importance for the Disease Activity model. NfL was identified as the most important feature, followed by TNFSF13B. Error bars correspond to 2.5th and 97.5th percentiles. AUROC = area under the receiver operating characteristic; Gd+ = gadolinium-positive; NfL = neurofilament light polypeptide chain; TNFSF13B = tumor necrosis factor superfamily member 13B.
Score distribution and respective box plots with Gd+ lesions for A) Immunomodulation, B) Neuroinflammation, C) Myelin Biology, and D) Neuroaxonal Integrity scores in the multi-protein model. DA = Disease Activity; Gd+ = gadolinium-positive.
Significant ANOVA Bonferroni corrected p-values, based on biomarkers across DMT categories, are marked in red. Nominal ANOVA p-values are reported. ANOVA = analysis of variance; DMT = disease-modifying therapy.
ACKNOWLEDGMENTS
Copyediting assistance was provided by The Lockwood Group (Stamford, CT) and was funded by Octave Bioscience, Inc.
Appendix
Appendix 1. Author contributions
Footnotes
↵* Employee of Octave Bioscience, Inc., at the time the study was completed
Previous presentation: This work was previously presented in part at the 37th, Congress of the European Committee for Treatment and Research in Multiple Sclerosis (ECTRIMS), Vienna, Austria (October 13□15, 2021)
Research support: The study was funded by Octave Bioscience, Inc.
T. Chitnis: tchitnis{at}rics.bwh.harvard.edu, S. Saxena: ssaxena3{at}bwh.harvard.edu, P. Gaitan-Walsh: pgaitanwalsh{at}bwh.harvard.edu, H. Lokhande: hlokhande{at}bwh.harvard.edu; A. Paul: apaul12{at}bwh.harvard.edu, F. Saleh: fermisk.x{at}gmail.com, H. Weiner: hweiner{at}rics.bwh.harvard.edu
J. Foley: ldjfoley{at}gmail.com
C. Ionete: carolina.ionete{at}umassmemorial.org
N.K. El Ayoubi: ne42{at}aub.edu.lb, S.J. Khoury: sk88{at}aub.edu.lb
J. Venzie: jvenzie{at}thelockwoodgrp.com
F. Qureshi: fqureshi{at}octavebio.com, M.J. Becich: mjbecich{at}gmail.com, F. Rubio da Costa: fatima.rubio{at}gmail.com, V.M. Gehman: vmgehman{at}gmail.com, F. Zhang: fujun2{at}yahoo.com, A. Keshavan: akeshavan{at}octavebio.com, K. Jalaleddini: kianj{at}octavebio.com, A. Ghoreyshi: aghoreyshi{at}octavebio.com
GLOSSARY
- AMIR
- American University of Beirut Medical Center Study
- ANOVA
- analysis of variance
- AUROC
- area under the receiver operating characteristic
- CD6
- cluster of differentiation 6
- CIS
- clinically isolated syndrome
- CLIMB
- Comprehensive Longitudinal Investigation of Multiple Sclerosis at the Brigham and Women’s Hospital
- CNS
- central nervous system
- CSF
- cerebral spinal fluid
- CXCL13
- C-X-C motif chemokine ligand 13
- DMT
- disease-modifying therapy
- FSDD
- Family Study of Demyelinating Disease
- Gd+
- gadolinium-positive
- IL-12β
- interleukin 12β
- LOQ
- limit of quantitation
- MRI
- magnetic resonance imaging
- MS
- multiple sclerosis
- MSDA
- multiple sclerosis disease activity
- N/E
- new and enlarging
- NfL
- neurofilament light polypeptide chain
- NPV
- negative predictive value
- NPX
- normalized protein expression
- OLS
- ordinary least squares
- PPV
- positive predictive value
- RMMSC
- Rocky Mountain Multiple Sclerosis Clinic
- RRMS
- relapsing-remitting multiple sclerosis
- SUMMIT
- Serially Unified Multicenter Multiple Sclerosis Investigation
- TNFSF13B
- tumor necrosis factor superfamily member 13B