Abstract
Modeling treatment effect could identify a subgroup of individuals who experience greater benefit from disease modifying therapy, allowing for predictive enrichment to increase the power of future clinical trials. We use deep learning to estimate the conditional average treatment effect for individuals taking disease modifying therapies for multiple sclerosis, using their baseline clinical and imaging characteristics. Data were obtained as part of three placebo-controlled randomized clinical trials: ORATORIO, OLYMPUS and ARPEGGIO, investigating the efficacy of ocrelizumab, rituximab and laquinimod, respectively. A shuffled mix of participants having received ocrelizumab or rituximab, anti-CD20-antibodies, was separated into a training (70%) and testing (30%) dataset, but we also performed nested cross-validation to improve the generalization error estimate. Data from ARPEGGIO served as additional external validation. An ensemble of multitask multilayer perceptrons was trained to predict the rate of disability progression on both active treatment and placebo to estimate the conditional average treatment effect. The model was able to separate responders and non-responders across a range of predicted effect sizes. Notably, the average treatment effect for the anti-CD20-antibody test set during nested cross-validation was significantly greater when selecting the model’s prediction for the top 50% (HR 0.625, p=0.008) or the top 25% (HR 0.521, p=0.013) most responsive individuals, compared to HR 0.835 (p=0.154) for the entire group. The model trained on the anti-CD20-antibody dataset could also identify responders to laquinimod, finding a significant treatment effect in the top 30% of individuals (HR 0.352, p=0.043). We observed enrichment across a broad range of baseline features in the responder subgroups: younger, more men, shorter disease duration, higher disability scores, and more lesional activity. By simulating a 1-year study where only the 50% predicted to be most responsive are randomized, we could achieve 80% power to detect a significant difference with 6 times less participants than a clinical trial without enrichment. Subgroups of individuals with primary progressive multiple sclerosis who respond favourably to disease modifying therapies can therefore be identified based on their baseline characteristics, even when no significant treatment effect can be found at the whole-group level. The approach allows for predictive enrichment of future clinical trials, as well as personalized treatment selection in the clinic.
Introduction
Multiple disease modifying therapies have been successfully developed for the treatment of RRMS using the strategy of performing relatively short and small phase 2 trials with an MRI endpoint for establishing proof of concept and finding the optimal dose, before proceeding to longer, more expensive phase 3 trials. The absence of analogous MRI endpoints for progressive multiple sclerosis have hampered progress in developing drugs for this clinical phase of the disease. As proceeding directly to large, phase 3 trials is expensive and risky, most programs having followed this path have failed to adequately demonstrate efficacy.
It is often the case that medications are more effective in some patients than others. Selecting such a subgroup for inclusion in a clinical trial in order to increase its power is a technique called predictive enrichment.1 A drug proven to be efficacious in an enriched trial can later be tested in a larger, more inclusive population. This sequence prevents efficacious medications from having their effect diluted in early clinical trials due to inclusion of a population that is too heterogeneous. As an example, Bovis et al.2 used CPH models to predict a more responsive sub-group of RRMS patients to laquinimod, a medication whose average treatment effect in the original phase 3 studies was insufficient for drug approval.
To achieve the goal of enriching clinical trials with more responsive individuals, a machine learning problem can be formulated as the prediction of the ITE, or the difference between a person’s rate of progression on treatment and that on placebo. This formulation is grounded in the theory of causal inference, and the related sub-fields of heterogeneous treatment effects and uplift modeling have contributed numerous approaches adapted to machine learning (reviewed elsewhere3). Arguably some of the most popular methods have been tree-based approaches (see Radcliffe and Surry4 for an example) which model treatment effect directly, and meta-learning approaches5 which use base models trained on the outcome of interest for the downstream task of treatment effect prediction.
We implement a learning framework based on an ensemble of multitask MLPs, a type of deep neural network architecture, to predict ITE using readily available clinical information (demographic characteristics and clinical disability scores) and scalar MRI metrics (lesional and volumetric) obtained at a screening visit. This approach can be used to identify a sub-group of more responsive individuals for the purpose of clinical trial enrichment, using a desirable ITE threshold. In this work, we study the population of patients with PPMS that were exposed to anti-CD20 monoclonal antibodies in two clinical trials, ORATORIO (NCT01194570) and OLYMPUS (NCT00087529). We also test our model on individuals exposed to laquinimod as part of the ARPEGGIO trial (NCT02284568), a medication with a completely different mechanism of action, to assess whether learned predictors of response could be mechanism-agnostic.
Materials and methods
Data
Data were acquired as part of three randomized placebo-controlled clinical trials: ORATORIO,6 OLYMPUS7 and ARPEGGIO.8 Anti-CD20 monoclonal antibodies, henceforth abbreviated anti-CD20-Abs (ocrelizumab in ORATORIO and rituximab in OLYMPUS) are primarily thought to act through B-cell depletion. Laquinimod (in ARPEGGIO) has a different and complex mechanism of action, thought to involve immunomodulatory effects on innate immune cell lineages (dendritic cells and monocytes peripherally, and microglia and astrocytes centrally). All three trials enrolled adults with PPMS and had similar inclusion criteria, shown in Supplementary Table 1. We excluded participants who spent less than 24 weeks in the trial, who had less than two clinical visits, or who were missing one or more input features at the baseline visit.
We used all clinical and MRI metrics (features) that were consistently recorded as part of the data available from the three trials, amounting to 19 features. Values were recorded at the baseline visit (immediately before randomized treatment allocation), and are a combination of binary (e.g. sex), ordinal (e.g. disability scores), and continuous variables (e.g. age, T2 lesion volume). Lesion segmentation and brain volume estimation were done according to the individual study’s methodology. Means and standard deviations for each feature distribution are shown in Table 1. The feature distributions are similar for all three treatment groups. However, some clinically meaningful differences are found in the MRI metrics, where the anti-CD20-Ab group has a higher average Gad count and T2 lesion volume compared to the other treatment group. The following right-skewed distributions were log-transformed: normalized brain volume, T2 lesion volume, timed 25-foot walk (T25FW), and 9-hole peg test (9HPT). Gad counts were binned into ten bins as follows: 0, 1, 2, 3, 4, 5-6, 7-9, 10-14, 15-19 and 20+ lesions. Finally, to improve convergence during gradient descent,9 all non-binary features were standardized by subtracting the mean and dividing by the standard deviation, both calculated from the training dataset.
Treatment effect modeling
We are interested in predicting the ITE, τi, defined according to the Neyman/Rubin Potential Outcome Framework:10 where Yi(1) and Yi(0) represent the outcome of participant i when given treatment and control medications, respectively. The Fundamental Problem of Causal Inference11 is that the ITE is unobservable because only one of the two outcomes is realized in any given patient, dictated by their treatment allocation. Yi (1) and Yi(0) are therefore termed potential outcomes or, alternatively, factual (observed) and counterfactual (not observed) outcomes.
Ground-truth can nevertheless be observed at the group level in specific situations, such as randomized control trials, because treatment allocation is independent of the outcome. We provide a detailed discussion of two important estimands, the average treatment effect and the CATE in Supplementary Material Section 1. Briefly, the average treatment effect represents the average effect when considering the entire population, while CATE considers a sub-population selected based on patient characteristics. The following discussion focuses on the problem of CATE estimation, which we use to frame the problem of predicting treatment response.
The best estimator for the CATE is conveniently also the best estimator for the ITE in terms of mean squared error (see Equation 2 in Künzel et al.5). Several frameworks have been developed to model the CATE, but a simple metalearning approach which decomposes the estimation into sub-tasks that can be solved using any supervised machine learning model provides a flexible starting point.5 For a broader survey of methods, see the survey on uplift modeling by Gutierrez & Gérardy3 (the uplift literature has contributed extensively to the field of causal inference, particularly when dealing with randomized experiments from an econometrics perspective).
In the present work, a MLP was used as the base model for its high representational power, flexible architecture, and ability to integrate into larger end-to-end-trainable neural networks consisting of different modules (such as convolutional neural networks or attention modules). We used a multitask architecture with two output heads, one for the modeling the potential outcome on treatment, , and the other to model the potential outcome on placebo, . For inference, we compute can compute the CATE estimate given a feature vector xi as:
An additional correction to increase robustness to distributional shifts that could occur as a result of our tuning procedures is applied to at inference time to produce our final ITE estimate, (see Supplementary Material Section 1 and 2 for details).
This multitask approach can be seen as a variant of the T-Learner described by Künzel et al.,5 except that the two base models in our case share weights in the first layers. Our network is similar to that conceptualized by Alaa et al,12 but without the propensity network used to correct for any dependence between the treatment allocation and the outcome conditional on the features, given that our dataset is from a randomized controlled trial.
To decrease the size of the hyperparameter search space, we fixed the number of layers and only tuned the layer width. We used two common hidden layers and one treatment-specific hidden layer (Fig. 1). More common or treatment-specific layers could be used if necessary, but given the low dimensionality of our feature-space and the relatively low number of instances, we opted to keep the network’s depth small to avoid over-fitting. The inductive bias behind our choice of using a multitask architecture is that disability progression can have both disease-specific and treatment-specific predictors which can be encoded into the common and treatment-specific hidden layer representations, respectively. Consequently, the common hidden layers can learn from all the available data, irrespective of treatment allocation. Rectified linear unit (ReLU) activation functions were used for non-linearity.
We compared the performance of the multitask MLP on our validation metric, the weighted ADabc (defined below in the “Validation metrics” subsection) to the following linear (or log-linear) models as baselines: ridge regression and CPH. We also compared it to a popular non-linear heterogeneous treatment effect estimator, uplift forest.4 Ridge regression models were used as base models inside both an S-Learner and T-learner configuration (as defined by Künzel et al.5), while the CPH model was used as the base model for a T-learner only (as implemented by Bovis et al.2). The uplift forest was used as a standalone heterogeneous treatment effect estimator.
Outcome definition
The primary outcome used in clinical trials assessing the efficacy of therapeutic agents on disease progression is the time to CDP at 12, or 24 weeks. We chose to use CDP at 24 weeks (CDP24) because it is a more robust indication that the disability accrual will be sustained over 5 years.13 CDP24 is based on progression on the EDSS, a scale from 0 (no disability) to 10 (death), in discrete 0.5 increments. We define CDP24 as an increase in the EDSS of 0.5 for baseline EDSS values > 5.5, or an increase of 1.0 for baseline EDSS values ≤ 5.5, sustained over 24 weeks. This difference in the increment required to confirm disability progression is commonly adopted in clinical trials, and partially accounts for the finding that patients transition through different stages of the EDSS at different rates.14
While it is possible to predict time-to-event using traditional machine learning methods if workarounds are used to address right-censored data or using machine learning frameworks specifically developed to model survival data (reviewed elsewhere15), we choose not to model time-to-CDP24 because of limitations inherent in this metric. As outlined by Healy et al.,16 CDP reflects not only the rate of progression but also the current stage of disease, which is problematic because the stage is represented by a discretized EDSS at a single baseline visit. This results in a noisy outcome label which could make it harder for a model to learn a representation that relates to the progressive biology which we are trying to model.
We therefore model the rate of progression directly by fitting a linear regression model onto the EDSS values of each individual participant over multiple visits (see Supplementary Material Section 3 for details) and take its slope to be the outcome label yi that our model uses for training. One advantage of the slope outcome over time-to-CDP24 is that it can be modeled using any type of regression model. After model training using the slope outcome, we revert to time-to-CDP24 for model evaluation to facilitate comparison with treatment effect survival metrics reported in the original clinical trial publications.
Training
We trained our model using mini-batch gradient descent with a batch size of 128 and the Adam optimizer.17 To prevent overfitting, we monitored the concordance index (C-index) on the validation set during CV, and early-stopped model training at the epoch with the highest C-index, up to a maximum of 100 epochs. Dropout and L2 regularization were used, along with a max-norm constraint on the weights18 to further prevent overfitting.
Batches were sampled in a stratified fashion to maintain the proportions of participants receiving active treatment and placebo. The mean squared error was computed from each instance’s squared error at the output-head with the available factual (ground-truth) outcome. Furthermore, the squared errors from each output head were weighted by ns/(2 * nt), where ns represents the total number of participants in the training split and nt represents the number of participants in the treatment arm corresponding to the output head of interest. This compensates for treatment allocation imbalance in the dataset.
Hyperparameter tuning and experimental setup
A random search over 100 different hyperparameter combinations was used to find the combination with the best validation performance (see Supplementary Table 2 for the full search set).
In our first experiment, we randomly shuffled data from ORATORIO and OLYMPUS given that the two active arms test drugs (ocrelizumab and rituximab, respectively) with the same mechanism of action (n = 1,119). We will refer to this dataset as the anti-CD20-Ab dataset. We then split this dataset into a training (70%) and testing (30%) set. We kept the data from ARPEGGIO as a second held-out test set (n = 323). The training set was subjected to 10-fold CV for hyperparameter tuning.
In our second experiment, we used 5×8 nested CV19 on the anti-CD20-Ab dataset to ensure our model training procedure was robust to the choice of training and test split. This involves adding an outer loop to the usual CV procedure, such that the entire dataset is split into five sets, one of which (20% of the total) is used as an unseen test set for each outer fold. Its corresponding training set (the 80% remaining) is subjected to 8-fold CV (validation sets are therefore 10% of the original dataset) for tuning and model selection in the inner loop. This results in five different models being selected and tested on a different test set, amounting to test predictions for every individual in the dataset.
In our third experiment, we retrained a model on the entire anti-CD20-Ab dataset and tested it on the ARPEGGIO dataset. The training set was subjected to 10-fold CV.
We used CV aggregation, or crogging,20 to improve the generalization error estimate using our metrics. Crogging involves aggregating all validation set predictions (rather than the validation metrics) and computing one validation metric for the entire CV procedure. We also used crogging in the outer loop of the nested CV experiment, therefore computing one test metric for the entire dataset.
We aimed to reduce variance by using the early-stopped models obtained from each CV fold as members of an ensemble. This ensemble’s prediction is the median of its members’ predictions (as opposed to the mean, which is more sensitive to outliers), and is used for inference on the unseen test set.
Hyperparameter tuning for the baseline models (ridge regression and CPH) was done on same folds and with the same metrics as for the MLP. The uplift forest was tuned only for the weighted ADabc (defined below) given that it estimates treatment effect directly.
Validation metrics
The best model was selected during CV on the basis of two validation metrics: the C-index of the factual predictions, and the weighted ADabc.
The first validation metric, the C-index, is defined as the proportion of correctly ordered pairs over all admissible pairs.21 We used the C-index instead of the mean squared error for tuning because we found empirically that models tuned for C-index performed better during CV in terms of their weighted ADabc. It has the advantage over other non-parametric ranking metrics such as Spearman’s rank correlation coefficient of being able to deal with censored data, if the chosen outcome label of interest would be a time-to-event variable, but in the case where a slope outcome is used either would provide an appropriate measure of rank ordering.
The second validation metric, a modified (weighted) version of the ADabc described by Zhao et al.,22 directly measures the quality of the treatment effect prediction. It is a measure derived from the area under the AD(c) curve, which is the average treatment effect for individuals who are predicted to respond (according to ) more than a desired response threshold c. A model capable of ranking responders appropriately should have an AD(c) curve that is almost monotonically increasing with a large area under the curve. ADabc is in unit years, and a larger positive number therefore indicates the ability for the model to separate responders across a range of predicted treatment effect sizes. See Supplementary Material Section 4 for a more detailed discussion including how we weigh the ADabc. For ease of interpretation, we will refer to c in terms of the percentile for c among all in the test set (e.g. the 80th percentile threshold for c represents the top 20% most responsive individuals according to our model’s prediction ).
We combine both validation metrics during tuning by choosing the model with the highest weighted ADabc among all models that fall within 1 SD of the best performing model based on the C-index. The SD of the best performing model’s C-index is calculated from the C-indices obtained in the individual CV folds.
Statistical analysis
Hazard ratios were calculated using CPH models with associated p-values from log-rank tests. Right censoring times were clamped at 2 years at inference for all experiments except for the effect size calculation in the simulated phase 2 clinical trial in the Results section “Simulating a one-year phase 2 clinical trial enriched with predicted responders”, where the right censoring times were clamped at 1 year.
Note that we multiplied all treatment effect values by −1 to simplify interpretation in the Results section, such that a positive effect indicates improvement, while a negative effect indicates worsening on treatment.
Software
All experiments were implemented in Python 3.8.23 MLP models were implemented using the Pytorch library.24 Scikit-Learn25 was used for the implementation of ridge regression, while Lifelines26 was used for CPH. CausalML27 was used for implementing the uplift forest. For reproducibility, the same random seed was used for data splitting and model initialization across all experiments.
Data availability
Data used in this work are controlled by pharmaceutical companies and therefore are not publicly available. Access requests should be forwarded to data controllers via the corresponding author.
Results
Predicting response to anti-CD20 monoclonal antibodies
We first trained an ensemble of multitask MLPs on 70% (n = 784) of the mixed dataset consisting of participants from ORATORIO and OLYMPUS. A histogram of test predictions on the 30% (n = 335) test set is shown in Supplementary Fig. 1. Varying the threshold c from the 10th percentile to the 80th percentile of predicted treatment effect results in the AD(c) curve shown in Fig. 2 (ADabc = 0.0679). It is appropriately increasing throughout its range with a positive ADabc, demonstrating the ability for the model to rank response to anti-CD20-Abs accurately, in a way that reflects the group-level ground-truth.
Kaplan-Meyer curves of the factual time-to-CDP24 in predicted responders and non-responders are shown in Fig. 4 at various percentile thresholds for c. We observe a decrease in HR from 0.787 (p=0.292) when including the entire test population to 0.395 (p=0.0279) at the 75th percentile threshold for c (including only the top 25% most responsive individuals). At this 75th percentile threshold, the nonresponder group has a HR of 1.02 (p=0.938), which suggests that a large part of the beneficial effect visible at the whole-group level comes from a very small proportion of patients. The lowest HR achieved is 0.3 (95% CI 0.100-0.901, p=0.0229) at the 85th percentile threshold, but beyond this point the sample size becomes too small and the confidence intervals too large to provide a meaningful estimate.
Due to the relatively small test size (n = 335), to better estimate the generalization error of the training and model selection procedure, we performed 5×8 nested CV,19 yielding test-time predictions for the entire dataset (n = 1,119). The outer-fold crogging is expectedly consistent with our previous experiment (ADabc = 0.0415), and provides more confidence in the treatment effect estimates due to the increased test set size. The effect is very similar at the 50th percentile threshold (HR 0.625, p=0.008), while it slightly less beyond the 75th percentile. From this estimate, we expect to reach a HR of 0.521 (p=0.013) with a 75th percentile threshold, with a peak HR of 0.338 (p=0.013) at the 92nd percentile threshold.
We verified whether the predictive ability of the model is similar if we consider men and women separately. The model tested on men alone achieved a weighted ADabc of 0.057, while the one tested on women alone achieved 0.093, indicating a greater power in isolating responders in women. Taking the 75th percentile for comparison, the model reaches a HR of 0.293 (p<0.001) with women compared to 0.555 (p=0.090) with men.
Simulating a one-year phase 2 clinical trial enriched with predicted responders
To understand the effect of enriching a future clinical trial studying novel B-cell depleting agents for reduction of disability progression, we simulated a 1-year randomized clinical trial using populations enriched with predicted responders. Using our model to predict sub-groups of responders to anti-CD20-Abs across a variety of thresholds, we can calculate the 1-year CDP24 event rate and 1-year HR for these sub-groups, which can then be used for sample size estimation. Table 2 shows the sample size needed to have 80% power to detect a significant difference with α = 0.05 across various degrees of enrichment.
The 50th percentile is the threshold for inclusion of participants that provides the best compromise between needing a small number (n = 497) of participants to show a significant treatment effect, while still randomizing most screened participants. This process involves screening a total of 944 individuals and selecting the top 50% who are predicted to be most responsive for enrollment in the trial. This leads to a 6-fold reduction in the number of patients that need to be randomized, and a 3-fold reduction in the number of patients that need to be screened, compared to the scenario where all participants are randomized into a one-year study (n = 3,068).
Group characteristics of predicted responders and non-responders
We provide group-level characteristics of the predicted responders in comparison with non-responders at the 50th percentile threshold in Table 3. See Supplementary Table 3 for the statistics at the 75th percentile threshold. We observe enrichment across a broad range of input features in the responder sub-group: younger, more men, shorter disease duration, higher disability scores, and more lesional activity. Normalized brain volume was the only feature which did not differ between the two groups at either threshold. A lower weight was only significantly different in the single test set, and not in the nested CV aggregate.
Predicting response to laquinimod
Finally, to determine whether the same model could be predictive of treatment response to medications with different mechanisms of action, and to provide a second validation for the model trained on the single 70% training set in the first anti-CD20-Ab experiment, we tested it on a separate clinical trial that was not used during training: ARPEGGIO. As a second step, we retrained the model on 100% of the anti-CD20-Ab dataset, and again tested this retrained model on the ARPEGGIO dataset. The results are shown in Table 4. We see that the model trained on the single anti-CD20-Ab training set also generalizes to this second unseen test set (ADabc = 0.024). The treatment effect increases almost monotonically and reaches significance (p < 0.05) at the 80th percentile threshold. The retrained model performs better (ADabc = 0.0436), obtaining a HR of 0.352 (p=0.043) at the 70th percentile and 0.196 (p=0.010) at the 80th percentile. The Kaplan-Meyer curves for the 70th and 80th percentile thresholds are shown in Supplementary Fig. 2. We show the group characteristics for predicted responders using a 50th percentile and a 75th percentile threshold in Supplementary Tables 4 and 5, respectively. The significant groupwise differences are largely similar to those obtained on the anti-CD20-Ab dataset, with a few notable exceptions. A taller height and a higher normalized brain volume are present in the ARPEGGIO responder group, whereas no difference is observed in weight, Gad count and T2 lesion volume. Altogether, this out of distribution generalization provides strong evidence suggesting that this approach can truly find underlying predictors of treatment response which are at least partly drug mechanism agnostic.
Comparison to baseline models
To determine whether non-linear models (such as an MLP with ReLU activations) provide any performance gains over linear models, we compared the multitask MLP to the commonly used ridge regression trained on the slope outcome and to a CPH model (which is log-linear) trained directly on the time-to-CDP24 outcome. We used an uplift forest as a non-linear baseline.
The ADabc achieved by each model when tested on both the anti-CD20-Ab and the laquinimod datasets is shown in Table 5, along with the average p-value obtained from log-rank tests at each of the 8 percentile thresholds used to compute the AD(c). The multitask MLP outperforms all other linear (and log-linear) baselines, along with the non-linear uplift forest which fails to identify responders (negative weighted ADabc) on our dataset.
Discussion
This work addresses an important limitation in the current treatment of PPMS, whereby no biomarker adequately predictive of treatment response is available to guide the choice of treatment for individuals, either in the clinic or in clinical trials. It is well recognized that the disability trajectory of individuals with multiple sclerosis and their response to specific treatments are highly heterogeneous. We therefore set the stage for treatment effect prediction with the primary goal of increasing the efficiency of early phase clinical trials in PPMS, but also secondarily to help make better treatment decisions in the clinic. We use a multitask MLP architecture for CATE estimation, and through a one-year phase 2 clinical trial simulation demonstrate the benefit of predictive enrichment, which could reduce the number of patients randomized by 6-fold while screening 3 times less patients than would be required for a short (one year) clinical trial in PPMS. We go through numerous steps to validate our approach, in particular by using nested CV and crogging to better estimate the generalization error and by showing generalization to a different unseen test set. The latter demonstrates that this approach can likely be useful on future clinical trials even if they study different medications.
Although our results illustrate a potential benefit of using predictive models in the clinic to assist in selecting patients who are more likely to benefit from anti-CD20-Abs, it remains an open question whether patients for whom our model predicted minimal response over two years could benefit from longer periods of administration. Answering this question would require longer-term observational data.
Interpretability of neural networks is a growing field in the machine learning literature (reviewed elsewhere28) and how to best interpret the predictions of neural networks remains an area of active research. We therefore do not attempt to provide in-depth interpretations of the network’s predictions, but instead analyse the group-level statistics of the predicted responders at various thresholds of response. The more responsive groups are enriched in numerous baseline features, including a younger age, more men, higher disability scores, and more lesional activity. Similarly, in subgroup analyses from OLYMPUS, an age less than 51 years and presence of gad lesions at baseline was also found to be associated with increased response.7 In a study by Bovis et al.,2 a response scoring function obtained via a T-learner composed of CPH models in RRMS found increased age, female sex, more Gad lesions and a higher normalized brain volume to be associated with better response. Conversely, Signori et al.29 found an opposite relationship between age and response in RRMS, demonstrating that different types of models looking at different combinations of features can find different optimal solutions.
Finding more individuals with greater lesional activity (particularly Gad, and to a lesser extent T2 lesions) in the responder subgroups would support a role for these lesions in progressive biology. However, this role appears to be indirect, as most of the progression in multiple sclerosis occurs independent of relapse activity. The presence of multiple pathophysiological mechanisms to explain progression in multiple sclerosis involving both inflammation and neurodegeneration has been postulated,30 and smouldering inflammation associated with slowly enlarging lesions31 could be a potential effector for anti-CD20-Ab’s effect on progression. The alternative hypothesis that lesional activity in the responder group could be a marker of more aggressive disease for which CDP is more likely to occur is not supported by our results, since our model does not work solely by identifying a more rapidly progressive sub-group. This is evidenced by the reduction in the frequency of CDP in the treatment group for the more responsive individuals (see Table 2).
Interestingly, despite a balanced dataset with respect to gender, our model was better at identifying responders in women compared to men. This is particularly interesting because most of the responders are men. To deploy ethically fair point-of-care tools, every effort should be made to ensure that easily identifiable groups of individuals (e.g. based on gender, age, and ethnicity) all benefit equally from such tools. Potential ways to address this in future work would be through the use of fine tuning on larger observational datasets, optimizing models for each identifiable sub-group separately, or through more complex loss-weighing schemes.
Limitations of this work include the choice of model. Although our MLP outperformed linear baselines, MLPs are notoriously more difficult to tune and at higher risk of overfitting. We made heavy use of several regularization schemes to prevent this (shallow/narrow network, dropout, weight decay, max-norm constraint and early-stopping). This approach is not the easiest to implement nor the most computationally efficient, but it provided the best results, suggesting inherent non-linearities in the dataset that benefit from ReLU networks and their compositional expressivity. Our hyperparameter tuning procedure is also one of many that can be designed. Optimizing for the weighted ADabc directly could potentially provide better results and is the subject of ongoing work. Finally we used MRI-derived metrics that came from the individual clinical trials and that offered expert corrected lesion counts and volumes. Extracting features from the MRI images themselves through convolutional neural networks is the subject of ongoing work.
In conclusion, we demonstrate the utility of CATE estimation for predictive enrichment of clinical trials aimed at increasing the efficiency of the drug development process in PPMS. We were able to find subgroups of increasingly responsive individuals to anti-CD20 therapies. Our model was able to generalize to a medication with a very different mechanism of action, laquinimod, suggesting that there might be common predictors for treatment effect independent of mechanism, which would facilitate the use of such a model for planning future clinical trials. This flexible training paradigm and multitask model architecture can easily be integrated into larger neural networks to benefit from data-driven imaging feature extraction through convolutional neural networks. The use of this approach is not limited to enrichment of clinical trials and can also be used for precision medicine in the clinic when deciding whether initiation of a therapy is worthwhile, by predicting response of individual patients based on their unique characteristics.
Data Availability
Data used in this work are controlled by pharmaceutical companies and therefore are not publicly available. Access requests should be forwarded to data controllers via the corresponding author.
Funding
This investigation was supported (in part) by an award from the International Progressive Multiple Sclerosis Alliance (award reference number PA-1412-02420), by an endMS Personnel Award from the Multiple Sclerosis Society of Canada (Falet, JR), and by a Canada Graduate Scholarship-Masters Award from the Canadian Institutes of Health Research (Falet, JR). Falet, JR is also being supported through the Fonds de recherche Santé / Ministère de la Santé et des Services sociaux training program for specialty medicine residents with an interest in pursuing a research career, Phase 1.
Competing interests
Arnold, DL, reports consulting fees from Albert Charitable Trust, Alexion Pharma, Biogen, Celgene, Frequency Therapeutics, Genentech, Med-Ex Learning, Merck, Novartis, Population Council, Receptos, Roche, and Sanofi-Aventis, grants from Biogen, Immunotec and Novartis, and an equity interest in NeuroRx. Sormani, MP, has received personal compensation for consulting services and for speaking activities from Merck, Teva, Novartis, Roche, Sanofi Genzyme, Medday, GeNeuro, and Biogen. Precup, D, works part-time for DeepMind. The remaining authors report no competing interests.
Abbreviations
- CATE
- Conditional average treatment effect
- CDP
- Confirmed Disability Progression
- CPH
- Cox Proportional Hazards
- CV
- Cross-validation
- EDSS
- Expanded Disability Status Scale
- Gad
- Gadolinium-enhancing lesion
- ITE
- Individual treatment effect
- MLP
- Multi-layer perceptron
- PPMS
- Primary progressive multiple sclerosis
- RRMS
- Relapsing-remitting multiple sclerosis