Abstract
Determining which features drive the treatment effect for individual patients has long been a complex and critical question in clinical decision-making. Evidence from randomized controlled trials (RCTs) is the gold standard for guiding treatment decisions. However, individual patient differences often complicate the application of RCT findings, leading to suboptimal treatment decisions. Traditional subgroup analyses fall short because of limitations related to data dimensionality, data type, and study design. To overcome these limitations, we propose CODE-XAI, a framework that interprets Conditional Average Treatment Effect (CATE) models using Explainable AI (XAI) to perform feature discovery. CODE-XAI provides feature attribution at the individual subject level, enhancing our understanding of treatment responses. We benchmark these XAI methods using semi-synthetic data and RCTs, demonstrating their effectiveness in uncovering feature contributions and enabling cross-cohort analysis, advancing precision medicine and scientific discovery.
Introduction
Quantifying the influence of an intervention on a given result is a quintessential issue researchers face in numerous high-stakes applications [1, 2]. In medicine, healthcare professionals use available evidence to decide which treatments could improve an individual patient’s health [2]. Randomized controlled clinical trials (RCTs) are the current gold standard for determining treatment effects [3]. However, applying such evidence towards treatment decisions for individual patients can be complicated by patient characteristics and clinical practice settings that differ from the strictly controlled conditions enforced during RCTs. As a result, clinicians are left guessing whether the treatment identified in the RCT will benefit an individual patient who differs in some way from those studied.
Attempts to understand why treatments are effective, and thus maximize their application, have traditionally been relegated to secondary objectives of RCTs that lack the power to drive changes in clinical practice. Subgroup analysis focuses on treatment outcome differences across patients based on observed covariates [2, 4, 5]. However, as data dimensionality increases, the number of potential subgroups increases exponentially, quickly overwhelming their application to patients and practice in the real world [6, 7]. Subgrouping also typically relies on categorical variables while many features are continuous, and converting continuous features to categorical variables can lead to loss of important information, difficulty in determining the number and boundaries of categories, and risk of false discovery [7]. Moreover, it requires balanced treatment and control allocation within each subgroup, complicating the analysis of features or subgroups not accounted for in the original trial design [6, 8]. Finally, subgroup analysis fails both to provide insights into how individual characteristics affect treatment efficacy and to allow cross-cohort comparisons, even among groups with similar treatments or features.
To more effectively understand and quantify treatment effects, researchers have developed Conditional Average Treatment Effect (CATE) models [9]. CATE models aim to adjust for imbalances between control and treatment groups and leverage observed covariates to enhance the estimation of treatment effects. Numerous proposed approaches [10–13] address the question of how the treatment affects the outcome. However, these methods are tailored for optimal prediction and do not inform robust feature or subgroup discovery. They fall short of answering two vital questions related to why estimations drive specific outcomes: (1) which features drive the treatment effect? and (2) why do individual responses to treatments vary? Such factors differ across cohorts and are diverse and complex, so simply measuring the treatment effect is insufficient to identify them. Thus, the need to interpret these CATE models provides a unique opportunity to answer these important questions.
To overcome these deficiencies, we propose CODE-XAI, a framework that discovers the features that drive treatment effects by interpreting CATE models using Explainable AI (XAI) [14, 15]. In particular, local explanation methods [16], such as Integrated Gradients (IG) [17] and Shapley values [18, 19], can address the issue of which feature drives the treatment effect for a given individual. These methods are favorable because they decompose the treatment effect (i.e., the CATE model’s output) into each feature’s contribution directly, without grouping or feature conversion [20]. Additionally, they enable feature attribution on the individual level in a usable way, enhancing our understanding of why certain individuals may respond more favorably to treatment than others.
To obtain reliable attribution scores from CODE-XAI, we employed an ensemble approach and introduced benchmarking techniques that assess both CATE and XAI methods. Moreover, we propose a novel subpopulation analysis using Shapley values on various baselines to uncover clinical feature interactions and resolve conflicting results across different trials. We then tested CODE-XAI against the two most common hurdles encountered when applying RCTs to real-world practice: differences in patient characteristics and alternative clinical practice settings. Finally, we demonstrate that CODE-XAI can successfully distill RCT treatment effects to the level of the individual patient.
Results
Benchmarking CATE and XAI on Real-World Clinical Data
We next examine the performance of both CATE models and their explanations on real-world datasets. We first train CATE models for each cohort, including IST-3 [21], CRASH-2 [22], ACCORD [23], and SPRINT [24], and we obtain explanations with the methods described in Section 0.2. Details of cohort description, datasets, and model implementations are in Appendix S1. We also conduct additional experiments in semi-synthetic environments to examine each explanation method (S4.1).
Estimating Real-World Treatment Effect with Ensemble CATEs
To obtain an accurate explanation, we first train CATE models to emulate treatment effects from four well-known randomized controlled trials [21–24]. We select the best-performing models according to their pseudo-outcome surrogate (Appendix S4.2), finding that X-learner outperforms other models in IST-3, CRASH-2, and SPRINT, while DR-learner performs best in ACCORD (Table S4).
Table 1 presents an ensemble estimate of the average treatment effect (ATE) for each cohort, including uncertainty estimates. CATE estimates for IST-3 and CRASH-2 are consistent with their reported findings [21, 22]. Interestingly, for the blood pressure control trials, i.e., SPRINT and ACCORD, the CATE model provides more optimistic estimates, showing improvements of 1.6% and 1.2% in primary outcomes, respectively, compared to 0.54% and 0.22% reported originally. The estimated ATE for SPRINT is also larger than that for ACCORD.
Enhanced Robustness in Explanations with Ensemble Models
We next demonstrate the importance of interpreting ensemble models rather than a single model. Measured by cosine similarity between explanations (Section 0.3), single-model explanations exhibit low similarity and high variance, with scores of 0.13, 0.15, 0.15, and 0.21, as depicted in Figure 2(b-top). In contrast, ensemble explanations display greater consistency and robustness. As shown in Figure 2(b-middle), the average explanation similarity (Shapley value) within the ensemble increases from 0.6 with 10 models to 0.8 with 20 models, highlighting the enhanced reliability and consistency of explanations achieved through the ensemble approach.
Benchmarking XAI Methods on Real-World Clinical Data
To examine the obtained explanation, a commonly used approach is an ablation study, where features are systematically added or removed based on their importance ranking [25]. However, individual-based ablation studies suffer from baseline selection bias (S4.3) and are computationally expensive for ensemble models. Instead, we propose a distillation technique (Section 0.3) that focuses on global explanations, training student models on globally ranked features to emulate the CATE model’s outputs across varying feature budgets.
As we show in Figure 2(c), both Shapley-mean and IG-mean consistently demonstrate lower distillation loss (mean squared error) across the SPRINT, ACCORD, and IST-3 datasets under various feature budgets. In contrast, within the CRASH-2 dataset, the performance of all methods is comparable except for Saliency, likely due to the dataset containing only 10 features, simplifying the task of identifying influential features. Our proposed evaluation shows that explanation methods with a population mean as the baseline outperform constant baselines.
In Table S6, we show the best methods and their top 5 features across different RCTs. In the CRASH-2 dataset, the top features identified by IG-mean as important factors for the treatment effect include injury type, gender, age, and GCS score; in contrast, Saliency ranks heart rate, respiratory rate, and capillary refill time as the top features.
Insights by Explaining CATE with Shapley Value
We now show how to use feature attributions obtained from CATE models to analyze clinical trials and their advantages relative to traditional subgroup analysis. Since gradient-based attribution is difficult to interpret[17], we employ Shapley value explanation methods to analyze clinical cohorts.
Global Feature Identification: Shapley Values versus RCT Findings
To assess the effectiveness of Shapley values in feature discovery, we compute the Spearman’s rank correlation [26] between the global explanation from Shapley values and the features reported in original studies. For RCTs, we employ reported interaction p-values as proxies for feature ranking [27].
As shown in Table 2 and Figure 2(b-bottom), we observe a significant correlation between Shapley rankings and reported features in CRASH-2, IST-3, and SPRINT (0.8, 0.54, and 0.6, respectively), where CATE models accurately predict treatment effects. However, the correlation is low (0.05) in the ACCORD study. This is expected, given that no significant features have been reported [23] and that explanations are less reliable when the CATE model struggles (Table 1).
IST-3: Analyzing Features’ Contribution to rt-PA Treatment Effect through Shapley Value
Here, we analyze clinical features in IST-3, a clinical trial that assesses the efficacy of intravenous rt-PA in acute ischaemic stroke patients. Compared to traditional subgroup analysis, which requires subgrouping and computing risk or odds ratios, Shapley values enable direct analysis of feature impact at both individual and group levels. They provide individual explanations [14, 18] by decomposing the total treatment effect into each feature’s contribution for every individual.
In Figure 3 (b), the upper force plot shows an example patient who experienced an increased survival probability of 11%, significantly above the ATE, which is 1.6%. The red bar indicates features that contribute positively to the treatment effect, including a high NIHSS score, TACI, and usage of anti-platelet within 48 hours; the blue bar indicates features that reduce the treatment effect, including atrial fibrillation history and higher systolic blood pressure. Conversely, the individual in the bottom force plot, a male patient with low NIHSS scores and PACI, had a treatment effect diminished by 11%.
On the cohort level, we analyze feature importance in the IST-3 trial by averaging Shapley values across the cohort. Results show that the NIH Stroke Scale (NIHSS), a neurological examination for stroke evaluation, is the most influential feature affecting the efficacy of rt-PA; see Figure 3(c). Further, without categorizing features or creating numerous subgroups, we can easily examine the impact of continuous features. The Shapley plot indicates that patients with higher NIHSS scores, depicted by the red cluster, demonstrate a pronounced improvement in treatment outcomes when administered rt-PA, in contrast to those with lower NIHSS scores, marked by the blue cluster. This observation is consistent with prior research [21, 28], which also identified a significant interaction between NIHSS scores and rt-PA treatment effectiveness.
Notably, the second most impactful feature is the type or syndrome of the stroke. In Figure 3(c), rt-PA exhibits enhanced benefits for patients diagnosed with TACI and PACI, a finding consistent with the original IST-3 study and reported in several stroke-related studies [29]. Our findings also reveal that factors such as receiving an anti-platelet drug within 48 hours and infarction history significantly influence the effect of rt-PA, which previous studies have also discovered [29, 30].
IST-3: Subgroup Analysis with Shapley Value
We now extend the analysis to multiple features and identify subgroups that are more responsive to rt-PA treatment. For instance, in Figure 3(c), we analyze gender and NIHSS and their combined influence on the treatment effect. We observe that, at the same NIHSS scores, males and females exhibit different treatment efficacy. In male patients (red dots) with lower NIHSS scores (< 15), rt-PA appears less effective, whereas its effectiveness increases in males with higher NIHSS scores (> 15).
To obtain deeper insights into the contributions of specific features within a particular subgroup, we modify the baseline used in Shapley value calculations (Section 0.3). We thereby compare male individuals or female individuals to male or female baselines by adjusting our research question to: Which features are important for males or for females compared to other males or females? In this case, the significance of gender is no longer present.
Within the male population, while the NIHSS score remains the most critical feature, the order of importance of other features shifts; see Figure S10(b). Conversely, when analyzing female patients against a female baseline, the significance of NIHSS diminishes, and TACI emerges as the most influential feature, followed by anti-platelet usage (Figure S10(b)). Interestingly, although most feature trends remain consistent when using the population baseline, the effects of pre-stroke anti-platelet therapy differ between genders. Its usage seems to counteract the benefits of rt-PA in males while enhancing its effects in female patients. This finding is consistent with several studies that emphasize the positive impact of anti-platelet therapy on women, as reported by [31].
Deciphering Treatment Effects When Patients are Different
A common reason why RCTs cannot be applied to more general populations is variation in patient characteristics that influence treatment effects. To address this issue, we stress-tested CODE-XAI’s ability to identify key differences in patient characteristics driving alternative treatment outcomes in the setting of intensive blood pressure management using two notable RCTs. The SPRINT trial showed that intensive blood pressure management reduced cardiovascular events and mortality in high-risk, non-diabetic patients, whereas the ACCORD trial found no significant benefit when the same treatment was applied to patients with type 2 diabetes [23, 24].
Discrepancies in Predictive Features
We first compared the top features affecting treatment outcomes in both trials. Interestingly, despite overall similarities between the cohorts, the top features affecting the treatment effect for each trial were quite different. In the SPRINT trial, age was the most significant factor influencing blood pressure control, followed by gender, statin usage, chronic kidney disease (CKD) history, and cardiovascular disease (CVD) history; see Figure 4(a-bottom). Conversely, in ACCORD, the most significant feature affecting the treatment effect was a history of CVD, followed by gender, aspirin use, number of antihypertensive medications, and an individual’s ethnicity.
Additionally, when examining the identified features’ clusters, the SPRINT trial showed a clear effect of feature pairs, e.g., age and CVD history or age and gender (Figure 4(c-bottom, d-bottom)). However, such effects were absent in the ACCORD trial. In some cases, the combined effect of features seems to be reversed, e.g., for glucose level and aspirin usage; see Figure 4(b-top).
Analyzing ACCORD with a SPRINT Baseline
Using CODE-XAI, we directly addressed the question: which features are important for ACCORD individuals compared to the SPRINT population? We achieved this by simply substituting the baseline with an example individual from the SPRINT cohort (S3.2).
Upon reassessing the top features from both cohorts and reanalyzing the feature rankings, we observed that fasting glucose (fpg) emerged as a prominent feature in ACCORD, but it ranked 14th among the 18 clinical features in SPRINT; see Figure S11(a). By identifying fasting glucose as a key driver of the treatment effect, CODE-XAI correctly and independently identified the underlying key patient characteristic, i.e., the presence of diabetes, most likely driving the difference in treatment effect between the two trials. Moreover, CODE-XAI independently provided a clear and usable treatment metric (fasting glucose) for clinicians seeking to manage blood pressure in diabetic patients.
To further investigate the impact of glucose on the effectiveness of blood pressure control in the ACCORD study, we analyzed the treatment uplift using Qini scores and uplift scores (S2.2.1) among patients with varying glucose levels. As we show in Figure 4(f-left) and Table S7, the uplift score and Qini score for the original ACCORD cohort were 3.8 × 10⁻³ and 2.2 × 10⁻³, respectively, significantly lower than those for the SPRINT study, i.e., 7.5 × 10⁻² and 3.9 × 10⁻², respectively. However, when excluding patients with glucose levels exceeding 300 mg/dL (the maximum observed value in the SPRINT cohort), the average treatment effect of ACCORD increased by 39.5% for the uplift score and 36.3% for the Qini score.
Using CODE-XAI, we thus unravel these conflicting results in trials. Our analysis highlights variances in glucose levels as a potential explanatory factor for the observed disparities in treatment outcomes between the two studies.
Applying CODE-XAI across Clinical Practice Settings
Here, we test the ability of CODE-XAI to identify important features in treatment effects when a proven treatment is applied to a different clinical setting. For this test, we used the treatment of traumatic bleeding after injury using tranexamic acid (TXA), a drug that is used to stabilize blood clots to reduce bleeding after injury. Strong randomized data favor the use of TXA for trauma victims at risk of significant bleeding if given at hospital admission and within 3 hours of injury [22]. Time from injury has emerged as having an important effect on TXA efficacy. Consequently, clinical practice has steadily crept towards using this drug at the scene of injury or during transport (pre-hospital), despite the lack of randomized evidence for its efficacy in this alternative practice setting. In this scenario, using data made available by the CRASH-2 study investigators [22] and our local trauma center registry, we asked CODE-XAI to identify which features were most important for TXA efficacy when TXA was given in the hospital setting vs. when it was given pre-hospital (Appendix S1.3). We then validated the features selected by CODE-XAI in the new pre-hospital setting by computing the treatment effect gain. We also compared them to features identified during a more recent randomized controlled trial of TXA given specifically in the pre-hospital setting [32].
We first compared the top features based on their Shapley values. As shown in Figure S13(a-left), in the pre-hospital setting, the top features were time-to-injury, GCS score, trauma type, and a new effect, age. We then examined the treatment effects among different age groups. As shown in Figure 4(f-right) and Table S7, the uplift score and Qini score for our pre-hospital cohort are 5 × 10⁻⁴ and −5 × 10⁻⁴, respectively. Surprisingly, after excluding patients older than 45 years in the pre-hospital setting, the scores increase to 5 × 10⁻³ and 8 × 10⁻⁴, respectively. This finding indicates that, in the pre-hospital setting, CODE-XAI identifies age as a new and potentially crucial correlate of TXA efficacy. This result was validated by the similar emergence of age as a new treatment effect for TXA efficacy in the PATCH study, a randomized controlled trial of TXA administered to injured patients in the pre-hospital clinical setting [32]. This result highlights the ability of CODE-XAI to identify important treatment effects when randomized clinical trial data are applied to different clinical practice settings.
Discussion
The use of explainable AI (XAI) in the life sciences continues to expand [33–35]; however, its application, robustness, validity, and trustworthiness remain largely unexplored [25, 36–38]. We demonstrate that providing a deeper understanding of CATE dynamics with XAI can extend the capability of RCTs to unveil real-world clinical insights and help physicians make better-informed decisions. In doing so, we present a framework, CODE-XAI, that rigorously explains these models, overcoming the hurdles involved in applying randomized controlled trial data toward real-world use in a robust and explainable way.
We first showcase that ensemble CATE models can reliably estimate treatment effects using real-world clinical data by comparing with factual outcomes and benchmarking pseudo-outcomes for model selection [39]. We then demonstrate that an ensemble explanation is more robust than the best single model. However, since examining explanations from an ensemble is not straightforward, we highlight the importance of global explanations and propose using knowledge distillation to benchmark feature attribution methods. This differentiates our method from those reliant on unrealistic assumptions regarding oracle accessibility in real-world scenarios [38], benchmark tests susceptible to inherent biases [15], or evaluations that are inefficient for ensemble models [25].
A natural use case of CODE-XAI is to analyze driving features for treatment effects across various trials in healthcare. We demonstrate how to use the ensemble Shapley value to analyze well-known RCTs [21–24]. Compared to traditional analysis, our approach provides not only subgroup analysis but individual analysis without the need to analyze millions of strata [5]. By analyzing individual features, we observe how a single feature can have varying effects on treatment outcomes (Figure 3). Such explanations of patient response differences can be particularly useful for clinical practitioners making individual treatment decisions. Similarly, with features at hand, we identify subgroups that would respond better to certain treatments in real-world settings (Figure 4), which can help researchers identify scientific insights that require further investigation.
CODE-XAI can also untangle conflicting results between trials and identify crucial covariates. We analyze two well-known trials, ACCORD [23] and SPRINT [24], which both evaluated blood pressure control but showed conflicting results, presumably due to differences in trial subject characteristics. Notably, we observe that glucose plays a significant role in the treatment effect, thus independently identifying the key difference between subjects enrolled in the two trials, i.e., the presence of diabetes. In addition, fasting glucose was identified as an important and clinically relevant treatment effect for clinicians to consider when expanding intensive blood pressure control to real-world populations. We also investigated how CODE-XAI could inform important treatment effects when translating RCT knowledge across differing clinical practice settings. When examining TXA efficacy across in- and out-of-hospital practice settings, CODE-XAI identified age as a vital treatment effect explaining differences in efficacy. These results suggest that CODE-XAI can help clinicians identify key variations between study cohorts that explain outcome differences despite seemingly overlapping demographics, treatments, and outcomes.
However, the effectiveness of explanations is limited by the performance of the CATE models. Though these models work well in controlled settings such as RCTs, it is difficult to obtain a reliable CATE model from observational studies with imbalanced treatment assignments. In the presence of unobserved confounders, the identifiability assumption would be violated, invalidating CATE model efficacy [9, 39] and leading to biased explanations [38]. Therefore, a promising research direction involves developing methods to impute robust attribution scores to mitigate selection bias. Additionally, some works incorporate causal knowledge to enhance the accuracy of feature attribution [40], but this assumption is often impractical in real-world experiments.
To conclude, we present a new approach to performing clinical feature discovery by explaining CATE models with XAI. We propose evaluation methods to assess CATE models with XAI in real-world clinical trials. Our framework, CODE-XAI, demonstrates several advantages compared to traditional subgroup analysis, including individual explanation, subpopulation analysis, and cross-cohort examinations. In an era where precision medicine and individualized treatments are taking center stage, understanding the nuances of treatment effects is more crucial than ever.
Methods
This section describes: (1) CATE models, (2) XAI methods and ensemble explanation, and (3) evaluation of ensemble explanation. We include detailed descriptions of these topics in Appendix S1 (dataset), S2 (potential outcome framework), and S3 (explanation methods).
0.1 CATE Models
0.1.1 Model Design, Evaluation, and Cross Examination
Under the potential outcome framework [9] (S2), meta-learners [38] represent a class of nonparametric CATE estimation methods. These methods approach treatment effect estimation for binary treatments as an imputation problem for missing counterfactual outcomes. They simplify the task by decomposing it into multiple sub-regression problems, often termed pseudo-outcomes [2], which can be solved using any standard supervised machine learning (ML) methods.
CATE estimation methods include T-learner [2], X-learner [41], DR-learner [13], and R-learner [42]. These methods estimate CATE by learning nuisance functions η to identify the optimal τ∗:
$$\tau^{*} = \arg\min_{\tau} \; \mathbb{E}_{\mathcal{D}}\left[ \mathcal{L}(\tau; \eta) \right],$$
where ℒ is the pseudo-outcome loss, which depends on the learner, and 𝒟 is the training distribution.
This work uses a diverse range of CATE models, including meta-learners such as S-learner, T-learner, X-learner, DR-learner, and R-learner as well as representation learners like Dragonnet[43], TARNets[44], CFR[11], and DR-CFR. See Appendix S2.1.1 for further details regarding the structures, training procedures, and implementation of these models.
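As a minimal illustration of the meta-learner idea described above, the following sketch implements a T-learner on a single covariate: one outcome regression is fitted per treatment arm, and the CATE estimate is their difference. The helper names (`fit_line`, `t_learner`), the linear base learner, and the toy data are assumptions made for illustration; they are not the models used in this work.

```python
def fit_line(xs, ys):
    """Ordinary least squares for y = a + b * x with a single covariate."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return lambda x: intercept + slope * x


def t_learner(xs, ws, ys):
    """Fit one outcome model per arm and return tau_hat(x) = mu1(x) - mu0(x)."""
    treated = [(x, y) for x, w, y in zip(xs, ws, ys) if w == 1]
    control = [(x, y) for x, w, y in zip(xs, ws, ys) if w == 0]
    mu1 = fit_line(*zip(*treated))  # outcome model for the treated arm
    mu0 = fit_line(*zip(*control))  # outcome model for the control arm
    return lambda x: mu1(x) - mu0(x)


# Toy data (hypothetical): y = x + w * 2x, so the true CATE is 2x.
xs = [0.0, 1.0, 2.0, 3.0, 0.5, 1.5, 2.5, 3.5]
ws = [0, 0, 0, 0, 1, 1, 1, 1]
ys = [x + w * 2.0 * x for x, w in zip(xs, ws)]

tau_hat = t_learner(xs, ws, ys)
print(tau_hat(2.0))  # estimated treatment effect at x = 2
```

Other meta-learners differ mainly in how the two regressions are combined into a pseudo-outcome, so the same structure would carry over with a different final step.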
To evaluate CATE models, we employ pseudo-outcome surrogate criteria (S2.2) with a 5-fold validation technique. Additionally, to assess model performance across different cohorts, we utilize the Qini curve and Uplift curve (S2.2.1), which base model evaluation on observed treatment outcomes.
0.2 Explaining CATEs with Feature Attribution Methods
Once the best-performing models were identified, we used explainability (XAI) methods[14] to obtain feature contributions, i.e., explanations, for CATE treatment effects. XAI methods decompose model output into each feature’s contribution on the individual level with respect to a baseline; they effectively address the specific question: What is the contribution of each feature for an individual compared to the average person within a specific cohort? Specifically, we choose methods for CATE models that meet specific criteria (S3.1), including Integrated Gradients[17] and Shapley values[18].
Integrated Gradients (IG)
IG assigns importance to input features by approximating the integral of a model’s gradients along the path from a baseline input to the actual input [17]. For a given trained CATE model τ, the IG attribution for an explicand $x$, a variable $x_i$, and a baseline $x'$ is:
$$\mathrm{IG}_i(\tau, x, x') = (x_i - x'_i) \int_{0}^{1} \frac{\partial \tau\big(x' + \alpha (x - x')\big)}{\partial x_i} \, d\alpha.$$
Typically, the zero vector serves as the baseline, denoted as $x' = 0$. This means feature contributions are measured relative to their absence.
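The path integral in IG is usually approximated numerically. The sketch below is one way to do so under stated assumptions: a midpoint Riemann sum over the interpolation path, with central finite differences standing in for automatic differentiation. The function `tau` is a hypothetical toy model, not a trained CATE estimator.

```python
def numerical_gradient(f, x, eps=1e-6):
    """Central finite-difference gradient of f at x (stand-in for autodiff)."""
    grad = []
    for i in range(len(x)):
        upper = list(x)
        lower = list(x)
        upper[i] += eps
        lower[i] -= eps
        grad.append((f(upper) - f(lower)) / (2 * eps))
    return grad


def integrated_gradients(f, x, baseline, steps=200):
    """Midpoint Riemann-sum approximation of Integrated Gradients."""
    total = [0.0] * len(x)
    for k in range(steps):
        alpha = (k + 0.5) / steps
        # Point on the straight line from the baseline to the explicand.
        point = [b + alpha * (xi - b) for xi, b in zip(x, baseline)]
        grad = numerical_gradient(f, point)
        total = [t + g for t, g in zip(total, grad)]
    return [(xi - b) * t / steps for xi, b, t in zip(x, baseline, total)]


def tau(x):
    """Hypothetical toy 'CATE model' with an interaction term."""
    return 2.0 * x[0] + x[0] * x[1]


x = [1.0, 3.0]
baseline = [0.0, 0.0]
attr = integrated_gradients(tau, x, baseline)

# Completeness: the attributions sum to tau(x) - tau(baseline).
print(attr, sum(attr), tau(x) - tau(baseline))
```

The completeness check at the end is a convenient sanity test for any IG implementation, since the attributions should add up to the difference between the model output at the explicand and at the baseline.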
Shapley Value
The Shapley value, a concept derived from cooperative game theory, offers a unique approach to feature attributions [18]. For any prediction model, it assigns each feature an importance value by averaging its marginal contribution over all possible combinations of feature presence or absence. Mathematically, for a CATE model τ, the exact Shapley value for a feature $x_i$ is defined as:
$$\phi_i(\tau, x) = \sum_{S \subseteq N \setminus \{i\}} \frac{|S|! \, (|N| - |S| - 1)!}{|N|!} \left[ \tau(x_{S \cup \{i\}}) - \tau(x_S) \right],$$
where $N$ is the set of all features, $S$ is any subset of $N$ that does not include feature $x_i$, and $x_S$ denotes the input in which only the features in $S$ are present.
However, computing the exact Shapley value can be computationally intensive, especially for models with a large number of features. Therefore, in practice, an approximation method such as Shapley Value Sampling [19], Baseline Shapley [45], or KernelSHAP [18] is often used. This work experiments with various methods, including Vanilla Gradient (Saliency), Integrated Gradients with 0 as the baseline (IG-0), Integrated Gradients with the population mean as the baseline (IG-mean), Baseline Shapley with 0 as the baseline (Shapley-0), and Baseline Shapley with the population mean as the baseline (Shapley-mean). Additional detail about these methods is in Appendix S3.
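For a small number of features, the Shapley value can be computed exactly by enumerating subsets. The sketch below does this in the Baseline Shapley style, where absent features are replaced by their baseline values; this replacement rule, the function names, and the toy model `tau` are assumptions for illustration. It also shows how changing the baseline changes the question being asked, which is the mechanism behind the subpopulation analysis.

```python
from itertools import combinations
from math import factorial


def baseline_shapley(f, x, baseline):
    """Exact Shapley values; features outside a coalition take baseline values."""
    n = len(x)

    def value(subset):
        z = [x[i] if i in subset else baseline[i] for i in range(n)]
        return f(z)

    phi = []
    for i in range(n):
        others = [j for j in range(n) if j != i]
        total = 0.0
        for size in range(n):
            weight = factorial(size) * factorial(n - size - 1) / factorial(n)
            for s in combinations(others, size):
                coalition = set(s)
                total += weight * (value(coalition | {i}) - value(coalition))
        phi.append(total)
    return phi


def tau(x):
    """Hypothetical toy 'CATE model' with an interaction term."""
    return 2.0 * x[0] + x[0] * x[1]


x = [1.0, 3.0]

# Zero baseline: contributions relative to feature absence.
phi_zero = baseline_shapley(tau, x, [0.0, 0.0])

# Baseline that shares the first feature's value with the explicand:
# that feature no longer receives any attribution.
phi_matched = baseline_shapley(tau, x, [1.0, 1.0])

print(phi_zero, phi_matched)
```

The enumeration costs on the order of 2^n model evaluations per feature, which is why sampling-based approximations are preferred for larger feature sets.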
0.2.1 Ensemble-based CATE Estimation and Explanation
Despite the progress in CATE models based on neural networks, their stability in real-world datasets remains an issue due to the inherent randomness encountered during model initialization and training [46]. To address this, we employ an ensemble approach [47] within CODE-XAI. We train individual CATE models $\tau_i(x)$ with different random seeds $i$. The ensemble CATE estimator, $\tau_e$, and its ensemble explanation, $\phi_j$, for a feature, $j$, and an explicand, $x$, can be computed as:
$$\tau_e(x) = \frac{1}{N} \sum_{i=1}^{N} \tau_i(x), \qquad \phi_j(\tau_e, x) = \frac{1}{N} \sum_{i=1}^{N} \phi_j(\tau_i, x),$$
where $N$ is the number of models in an ensemble. This method enhances both the model’s and explanation’s stability by averaging out variability.
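The averaging step can be sketched as follows. To keep the example self-contained, the ensemble members are hypothetical linear models, for which the baseline attribution of feature j is simply the weight times the difference between the feature and its baseline; in practice each member would be a trained CATE model with its own attribution vector.

```python
def linear_model(weights):
    """Hypothetical ensemble member: tau_i(x) = sum_j w_j * x_j."""
    return lambda x: sum(w * v for w, v in zip(weights, x))


def linear_attribution(weights, x, baseline):
    """Attribution of a linear model relative to a baseline: w_j * (x_j - b_j)."""
    return [w * (v - b) for w, v, b in zip(weights, x, baseline)]


def ensemble_predict(models, x):
    """Ensemble CATE estimate: mean of the single-model predictions."""
    return sum(m(x) for m in models) / len(models)


def ensemble_attribution(attributions):
    """Ensemble explanation: feature-wise mean of single-model attributions."""
    n_models = len(attributions)
    n_features = len(attributions[0])
    return [sum(a[j] for a in attributions) / n_models for j in range(n_features)]


# Three members standing in for models trained with different seeds.
all_weights = [[1.0, 2.0], [3.0, 0.0], [2.0, 1.0]]
models = [linear_model(w) for w in all_weights]

x = [1.0, 2.0]
baseline = [0.0, 0.0]

tau_e = ensemble_predict(models, x)
phi_e = ensemble_attribution([linear_attribution(w, x, baseline) for w in all_weights])

print(tau_e, phi_e)
```

Because the ensemble output is a plain average, the averaged attributions still sum to the ensemble prediction minus the ensemble prediction at the baseline.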
0.3 Examining Explainability Methods on CATEs
In this section, we introduce methods that assess the explanations of CATE.
Explanation Robustness Assessment
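To evaluate the effect of the number of single models in an ensemble on explanation stability, we first train $L$ ensembles, each with $k$ single models, and then calculate the pairwise cosine similarity of their explanations. Given feature attributions $\phi(\cdot)$ for the $l$th ensemble, $\tau_e^{(l)}$, composed of $k$ single models, the average cosine similarity $\cos(\theta_k)$ is:
$$\cos(\theta_k) = \frac{2}{L(L-1)} \sum_{l=1}^{L} \sum_{m=l+1}^{L} \frac{\phi(\tau_e^{(l)}) \cdot \phi(\tau_e^{(m)})}{\lVert \phi(\tau_e^{(l)}) \rVert \, \lVert \phi(\tau_e^{(m)}) \rVert}.$$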
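A minimal sketch of this robustness measure, assuming each ensemble's explanation is available as a plain attribution vector, averages the cosine similarity over all unordered pairs of ensembles. The three vectors below are hypothetical and chosen only to make the arithmetic easy to follow.

```python
from itertools import combinations
from math import sqrt


def cosine(u, v):
    """Cosine similarity between two attribution vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = sqrt(sum(a * a for a in u))
    norm_v = sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def mean_pairwise_cosine(explanations):
    """Average cosine similarity over all unordered pairs of explanations."""
    pairs = list(combinations(explanations, 2))
    return sum(cosine(u, v) for u, v in pairs) / len(pairs)


# Hypothetical attribution vectors from three ensembles for one explicand.
explanations = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
similarity = mean_pairwise_cosine(explanations)
print(similarity)
```

Repeating the computation for several ensemble sizes k would give the kind of similarity-versus-size curve used to judge how many models are needed for a stable explanation.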
Examining Ensemble Explanation via Knowledge Distillation
Though ablation studies offer a convenient way to inspect explanation methods, their choice of baseline can potentially favor particular explanation methods [15, 48]. To address this, we introduce an evaluation approach rooted in knowledge distillation [49], wherein the student model is trained to emulate the behavior of the teacher model. However, retraining models using local explanation rankings is resource-intensive given the myriad combinations of feature subsets [25]. We circumvent this by retraining with a global explanation ranking. Intuitively, an optimal explanation method should also highlight impactful features on a global level. To quantify the efficacy of an explanation method, we propose using the knowledge distillation loss, $\mathcal{E}_{KD}$. Formally, this evaluation is defined as
$$\mathcal{E}_{KD} = \mathbb{E}_{X \sim \mathcal{D}} \left[ \big( \tau_s(X_k) - \tau_e(X) \big)^2 \right],$$
where $\tau_s$ is a student model, $X_k$ represents the top $k$ features ranked by their average absolute attribution scores across training samples, and $\tau_e(X)$ is the output from the ensemble (teacher) model. If the identified features are predictive of the treatment effect, $\mathcal{E}_{KD}$ would be low in the testing set.
Our approach shares similarities with the Remove-and-Retrain (ROAR) method; however, in our setting, ROAR requires retraining every model in an ensemble whenever a feature is removed, imposing a heavy computational cost[25]. In contrast, our approach requires only a single student model at every removing step, significantly enhancing computational efficiency. Notably, knowledge distillation is the only way to obtain comparable model performance for an ensemble, as shown in [50]. This approach also bypasses the dilemma when selecting a baseline [15, 48]. Additionally, feature contribution on a global (cohort) level facilitates human evaluation[24, 51].
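The distillation-based evaluation can be sketched as the following loop: rank features by mean absolute attribution, fit one student per feature budget k on the top-k features to mimic the teacher, and record the mean squared error on held-out samples. The linear student, the toy teacher, and all function names are assumptions for illustration; the student architecture used in this work is not specified here.

```python
import random


def solve(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [bi] for row, bi in zip(a, b)]
    for c in range(n):
        p = max(range(c, n), key=lambda r: abs(m[r][c]))
        m[c], m[p] = m[p], m[c]
        for r in range(c + 1, n):
            f = m[r][c] / m[c][c]
            for k in range(c, n + 1):
                m[r][k] -= f * m[c][k]
    x = [0.0] * n
    for r in reversed(range(n)):
        x[r] = (m[r][n] - sum(m[r][k] * x[k] for k in range(r + 1, n))) / m[r][r]
    return x


def fit_linear(rows, targets):
    """Least-squares linear student with intercept (an assumed student class)."""
    design = [[1.0] + list(r) for r in rows]
    d = len(design[0])
    ata = [[sum(row[i] * row[j] for row in design) for j in range(d)] for i in range(d)]
    aty = [sum(row[i] * t for row, t in zip(design, targets)) for i in range(d)]
    beta = solve(ata, aty)
    return lambda r: beta[0] + sum(b * v for b, v in zip(beta[1:], r))


def distillation_losses(train_x, test_x, teacher, attributions):
    """Rank features globally, then fit one student per feature budget k."""
    n_feat = len(train_x[0])
    scores = [sum(abs(a[j]) for a in attributions) / len(attributions) for j in range(n_feat)]
    ranking = sorted(range(n_feat), key=lambda j: -scores[j])
    losses = {}
    for k in range(1, n_feat + 1):
        keep = ranking[:k]
        student = fit_linear([[x[j] for j in keep] for x in train_x],
                             [teacher(x) for x in train_x])
        errors = [(student([x[j] for j in keep]) - teacher(x)) ** 2 for x in test_x]
        losses[k] = sum(errors) / len(errors)
    return ranking, losses


rng = random.Random(0)
data = [[rng.gauss(0, 1) for _ in range(3)] for _ in range(200)]
train_x, test_x = data[:150], data[150:]

weights = [3.0, 1.0, 0.0]  # hypothetical teacher: third feature is irrelevant
teacher = lambda x: sum(w * v for w, v in zip(weights, x))
means = [sum(x[j] for x in train_x) / len(train_x) for j in range(3)]
# Attributions of a linear teacher relative to the population-mean baseline.
attributions = [[w * (x[j] - means[j]) for j, w in enumerate(weights)] for x in train_x]

ranking, losses = distillation_losses(train_x, test_x, teacher, attributions)
print(ranking, losses)
```

An explanation method that ranks the truly influential features first should reach a low loss at a small budget, whereas a poor ranking would need more features to match the teacher.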
Global Feature Identification
Alternatively, if the ground-truth explanation or important feature is available, we propose computing Spearman’s rank correlation [26] between the rankings derived from the explanation methods and the oracle. Specifically, in the context of the treatment effect, we consider interaction p-values [27] as ground truth. A lower p-value indicates a higher likelihood of a feature being an important factor in the treatment effect. To evaluate an explanation method in identifying important features on the global level, we compute
$$\rho\big(r_{\phi}(\tau_e),\, g(p)\big),$$
where ρ is the Spearman’s rank correlation, $r_{\phi}(\tau_e)$ denotes the global ranking according to the explanation method ϕ and model $\tau_e$, and $g(p)$ indicates the ranking based on interaction p-values.
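The comparison against reported interaction p-values can be sketched as below. Spearman's correlation is computed as the Pearson correlation of average ranks, and p-values are negated so that a lower p-value corresponds to a higher rank. The importance scores and p-values are hypothetical numbers chosen for illustration, not results from any trial.

```python
def average_ranks(values):
    """Ranks starting at 1, with tied values sharing their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = shared
        i = j + 1
    return ranks


def spearman(a, b):
    """Spearman's rho as the Pearson correlation of the rank vectors."""
    ra, rb = average_ranks(a), average_ranks(b)
    n = len(a)
    mean_a = sum(ra) / n
    mean_b = sum(rb) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(ra, rb))
    var_a = sum((x - mean_a) ** 2 for x in ra)
    var_b = sum((y - mean_b) ** 2 for y in rb)
    return cov / (var_a * var_b) ** 0.5


# Hypothetical global importance (mean |attribution|) for four features,
# and hypothetical interaction p-values for the same features.
importance = [0.40, 0.25, 0.10, 0.05]
p_values = [0.01, 0.20, 0.03, 0.60]

# Negate p-values so that smaller p-values rank as more important.
rho = spearman(importance, [-p for p in p_values])
print(rho)
```

Here the two rankings disagree only on the order of the second and third features, which yields a correlation below 1 but still clearly positive.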
Data Availability
All data except the UW Harborview data are available online. The generation process for synthetic datasets is available on GitHub at https://github.com/AliciaCurth/CATENets. The IST-3 dataset is publicly accessible at https://datashare.ed.ac.uk/handle/10283/1931. The CRASH-2 dataset can be accessed at https://freebird.lshtm.ac.uk/index.php/available-trials/, with treatment allocations available upon request. Both the ACCORD and SPRINT datasets are available upon request at https://biolincc.nhlbi.nih.gov/home/.
Ethics declarations
Competing interests
The authors declare no competing interests.
Acknowledgement
We extend our gratitude to the CRASH-2 investigators for sharing treatment allocation data, and to the researchers in the Lee lab for their valuable discussions.
Footnotes
* indicates co-senior authorship