Abstract
Aim To review and appraise the quality of studies that present models for causal inference of time-varying treatment effects in the adult intensive care unit (ICU) and give recommendations to improve future research practice.
Methods We searched Embase, MEDLINE ALL, Web of Science Core Collection, Google Scholar, medRxiv, and bioRxiv up to March 2, 2022. Studies that present models for causal inference that deal with time-varying treatments in adult ICU patients were included. From the included studies, data was extracted about the study setting and applied methodology. Quality of reporting (QOR) of target trial components and causal assumptions (ie, conditional exchangeability, positivity and consistency) were assessed.
Results 1,714 titles were screened and 60 studies were included, of which 36 (60%) were published in the last 5 years. G methods were the most commonly used (n=40/60, 67%), further divided into inverse-probability-of-treatment weighting (n=36/40, 90%) and the parametric G formula (n=4/40, 10%). The remaining studies (n=20/60, 33%) used reinforcement learning methods. Overall, most studies (n=36/60, 60%) considered static treatment regimes. Only ten (17%) studies fully reported all five target trial components (ie, eligibility criteria, treatment strategies, follow-up period, outcome and analysis plan). The ‘treatment strategies’ and ‘analysis plan’ components were not (fully) reported in 38% and 48% of the studies, respectively. The ‘causal assumptions’ (ie, conditional exchangeability, positivity and consistency) remained unmentioned in 35%, 68% and 88% of the studies, respectively. All three causal assumptions were mentioned (or a check for potential violations was reported) in only six (10%) studies. Sixteen studies (27%) estimated the treatment effect both by adjusting for baseline confounding and by adjusting for baseline and treatment-affected time-varying confounding, which often led to substantial changes in treatment effect estimates.
Conclusions Studies that present models for causal inference in the ICU were found to have incomplete or missing reporting of target trial components and causal assumptions. To achieve actionable artificial intelligence in the ICU, we advocate careful consideration of the causal question of interest, the use of target trial emulation, usage of appropriate causal inference methods and acknowledgement (and ideally examination of potential violations) of the causal assumptions.
Systematic review registration PROSPERO (CRD42022324014)
Introduction
Many treatment choices in the intensive care unit (ICU) are made quickly, based on patient characteristics that are changing and monitored in real-time. Given this dynamic and data-rich environment, the ICU is pre-eminently a place where artificial intelligence (AI) holds the promise to aid clinical decision making.1–3 So far, however, most AI models developed for the ICU remain within the prototyping phase.4,5 One explanation for this may be that most models in the ICU are built for the task of prediction, ie, mapping input data to (future) patient outcomes.6 However, even a very accurate prediction of, for instance, sepsis,7 does not tell a physician what to do in order to treat or prevent it. In other words, prediction models are seldom actionable. For AI that assists clinicians in what to do, ie, ‘actionable AI’, models need to take into account cause and effect.
Causal inference (CI) represents the task of estimating causal effects by comparing patient outcomes under multiple counterfactual treatments.6,8 The most widely used method for CI is a randomized controlled trial (RCT). Through randomization (coupled with full compliance), the difference in outcome between treatment groups can be interpreted as a causal treatment effect. Because carrying out RCTs may be infeasible due to cost, time, and ethical constraints, observational studies are sometimes the only alternative. CI using observational data can be seen as an attempt to emulate the RCT that would have answered the question of interest (ie, the ‘target trial’).9 With such an approach, however, treatment is not assigned randomly and extra adjustment for confounding is required. In the simple situation of a time-fixed (or ‘point’) treatment (figure 1, panel 1),10,11 this can be achieved by ‘standard methods’ like regression or propensity-score (PS) analyses.12 However, ICU physicians are typically confronted with treatment decisions which occur at multiple time-points, ie, time-varying treatments (figure 1, panel 1).10,11 Estimating the effect of time-varying treatments using observational data is often complicated by treatment-confounder feedback,13 leading to ‘treatment-affected time-varying confounding’ (TTC, panel 1)11,14,15. Usage of standard methods in the presence of TTC leads to bias.16,17 Inverse-probability-of-treatment weighting (IPTW), the parametric G formula and G estimation (collectively known as ‘G methods’, panel 1) were introduced by Robins18 to estimate causal effects in the presence of TTC, making them well-suited for CI in the ICU. Time-varying treatments can be further subdivided into static (STRs) and dynamic treatment regimes (DTRs, figure 1, panel 1). The latter type is most common in the ICU, as treatment choices are typically dynamically re-evaluated based on the evolving patient state. 
For example, rather than deciding upon admission to administer vasopressors daily, an ICU physician reconsiders giving this treatment throughout the ICU stay based on the patient’s response. Hence, the practical question of interest often requires a comparison of DTRs. Reinforcement learning (RL)19 is another class of methods which, like G methods, can be used to estimate optimal DTRs and has been increasingly applied to ICU data.20 Partly due to the different language used to describe similar concepts (table S1), studies applying G methods and RL may appear to belong to completely separate disciplines. However, the two show great similarities, and both can be used to build actionable AI models.
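The bias that standard methods incur under TTC can be illustrated with a small simulation. The sketch below is not taken from any of the reviewed studies: the data-generating mechanism, the variable names (`U`, `A0`, `L1`, `A1`, `Y`) and all numeric parameters are assumptions chosen for illustration. Neither treatment affects the outcome, so any non-null coefficient is bias. Adjusting for the treatment-affected confounder `L1` in a regression yields a spurious effect for the first treatment, whereas IPTW recovers the null.

```python
import numpy as np


def fit_linear(X, y, w=None):
    """Ordinary or weighted least squares; returns one coefficient per column of X."""
    if w is not None:
        root_w = np.sqrt(w)
        X, y = X * root_w[:, None], y * root_w
    coef, *_ = np.linalg.lstsq(X, y, rcond=None)
    return coef


def simulate(n=200_000, seed=0):
    """Hypothetical two-time-point data with treatment-confounder feedback."""
    rng = np.random.default_rng(seed)
    U = rng.binomial(1, 0.5, n)                     # unmeasured severity
    A0 = rng.binomial(1, 0.5, n)                    # first treatment (randomized)
    L1 = rng.binomial(1, 0.2 + 0.3 * A0 + 0.4 * U)  # confounder affected by A0 and U
    A1 = rng.binomial(1, 0.2 + 0.6 * L1)            # second treatment depends on L1
    Y = 2.0 * U + rng.normal(0.0, 1.0, n)           # outcome: true treatment effects are zero
    return A0, L1, A1, Y


def compare_estimators(A0, L1, A1, Y):
    ones = np.ones_like(Y)

    # Standard method: include the time-varying confounder as a covariate.
    adjusted = fit_linear(np.column_stack([ones, A0, A1, L1]), Y)

    # IPTW: stabilized weight P(A1 | A0) / P(A1 | A0, L1), estimated within strata.
    # A0 is randomized here, so its contribution to the weight is constant.
    p_num = np.empty_like(Y)
    p_den = np.empty_like(Y)
    for a0 in (0, 1):
        in_a0 = A0 == a0
        p1 = A1[in_a0].mean()
        p_num[in_a0] = np.where(A1[in_a0] == 1, p1, 1 - p1)
        for l1 in (0, 1):
            stratum = in_a0 & (L1 == l1)
            p1 = A1[stratum].mean()
            p_den[stratum] = np.where(A1[stratum] == 1, p1, 1 - p1)
    weights = p_num / p_den
    iptw = fit_linear(np.column_stack([ones, A0, A1]), Y, w=weights)

    return {
        "adjusted_A0": adjusted[1], "adjusted_A1": adjusted[2],
        "iptw_A0": iptw[1], "iptw_A1": iptw[2],
    }


if __name__ == "__main__":
    for name, value in compare_estimators(*simulate()).items():
        print(f"{name}: {value:+.3f}")
```

In this toy setting the bias arises because `L1` is a common effect of the first treatment and the unmeasured severity; conditioning on it links the two. Real ICU data involve many more time points and confounders, so the sketch only conveys the mechanism.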
To move towards actionable AI in the ICU, our review provides an overview of CI studies concerning time-varying treatments in the ICU, discusses quality of reporting and gives recommendations to improve future research practice.
Panel 1: Glossary
Time-fixed treatment
a treatment that only occurs at the start of follow-up (eg, one shot antibiotics at ICU admission), or does not change over time (eg, genotype).
Time-varying treatment
any treatment that is not time-fixed. Time-varying treatments can be subdivided into static and dynamic treatment regimes.
Static treatment regime (STR)
a treatment regime that is not tailored to evolving patient characteristics (eg, ‘treat patient with daily antibiotics during ICU admission’).
Dynamic treatment regime (DTR)
a treatment regime where the treatment decisions depend on changing patient characteristics and/or treatment history (eg, ‘treat patient with antibiotics until procalcitonin drops below 0.5 μg/L’).
Treatment-affected time-varying confounding (TTC)
time-varying confounding in which one or more time-varying confounders are affected by previous treatment.
G methods
a class of methods proposed to appropriately adjust for TTC in the estimation of time-varying treatment effects, including inverse-probability-of-treatment weighting (IPTW), the parametric G formula, and G estimation.
Reinforcement learning (RL)
a class of methods that deals with the problem of sequential decision making which returns an optimized treatment regime, including (among others) Q-learning and policy iteration.
Off-policy evaluation (OPE)
the task of estimating the value of an (optimized) treatment regime (or ‘policy’) using data from patients who received treatments that do not conform to this regime (eg, observational data). OPE methods fall into two main categories: importance-sampling and model-based methods (doubly robust methods borrow ideas from both).
Causal assumptions
Conditional exchangeability
Exchangeability means that the risk of an outcome (eg, mortality) in the untreated group (eg, those who did not receive antibiotics) would have been the same as the risk in the treated group (eg, those who received antibiotics), had the patients in the untreated group received treatment. In observational data, exchangeability generally does not hold due to confounding and/or selection bias, and, therefore, CI requires the assumption that all confounders are measured and adjusted for to achieve exchangeability conditional on the measured confounders.
Positivity
One can only estimate the causal effect of a treatment by comparing data of treated and untreated patients. Therefore, in all subgroups (or ‘strata’) defined by specific combinations of the confounder values, there must be treated and untreated patients. In other words, treatment should occur with some positive probability in all confounder strata (ie, the positivity assumption).
Consistency
Consistency assumes that the outcome for a given treatment will be the same, irrespective of the way treatments are ‘assigned’. This is often plausible for medical treatments, but less obvious for treatments that are modifiable by a variety of means, such as body-mass index which could be caused by diet or a metabolic disorder.
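For readers who prefer notation, the three causal assumptions above can be written compactly for a time-fixed treatment. The symbols are assumptions introduced here for illustration and do not appear elsewhere in this review: $A$ denotes the treatment, $L$ the measured confounders, $Y$ the observed outcome and $Y^{a}$ the counterfactual outcome under treatment level $a$. For time-varying treatments, analogous statements hold at each time point, conditional on the treatment and confounder history.

```latex
\begin{align*}
\text{Conditional exchangeability:}\quad & Y^{a} \perp\!\!\!\perp A \mid L \quad \text{for all } a,\\
\text{Positivity:}\quad & \Pr(A = a \mid L = l) > 0 \quad \text{for all } a \text{ and all } l \text{ with } \Pr(L = l) > 0,\\
\text{Consistency:}\quad & A = a \;\Rightarrow\; Y = Y^{a}.
\end{align*}
```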
Methods
This systematic review was conducted in accordance with the Preferred Reporting Items for Systematic Reviews and Meta-Analyses (PRISMA) guidelines,21 and the protocol was registered in the online PROSPERO database (CRD42022324014).22
Search strategy
Candidate articles were identified through a comprehensive search in Embase, MEDLINE ALL, Web of Science Core Collection, Google Scholar, medRxiv, and bioRxiv up to 2 March 2022, with no start date. We developed a search strategy that could be modified for each database (appendix A). Search terms included relevant controlled vocabulary terms and free text variations for CI, G methods, or common RL methods, combined with ICU related terms.
Eligibility criteria
We included any primary research article, conference proceeding or pre-print paper that presents models for the task of CI concerning time-varying treatments in adult (≥ 17 years of age) patients admitted to the ICU. Articles were not eligible if they used data from an RCT (unless the treatment of interest was not the randomized treatment), focused on the introduction of new methodology, or were abstract-only publications, reviews, opinion pieces, or surveys. Duplicates and articles not written in English were also excluded.
Study selection
We used a two-stage approach for screening: first by title and abstract and then by full article text. One reviewer (JS) screened the titles and abstracts. Full text articles were then screened and selected. Studies for which uncertainty remained for eligibility were independently screened in full-text by two other reviewers (JK and MvG), and conflicts were resolved by discussion between the three reviewers. For both title-abstract and full text screening, reasons for exclusion were recorded.
Data extraction
Data was extracted by using a standardised data extraction form. Uncertainties were resolved by discussion between three reviewers (JS, JK and JL). The items for extraction were based on the STROBE (STrengthening the Reporting of OBservational studies in Epidemiology) checklist,23 supplemented by method-specific items. We extracted the following items from all included studies: details of study variables (ie, studied treatment and primary outcome), the number of included ICUs, usage of open-source database(s), number of participants included, studied treatment type (figure 1), and used CI method. In addition, we extracted the following method-specific items: the usage of methods to reduce extreme weights (ie, weight stabilization24 and truncation25) for studies using IPTW and the off-policy evaluation26 (OPE, panel 1) method used for studies using RL. Finally, if a study estimated the treatment effect both by adjusting for baseline confounding and by adjusting for baseline confounding and TTC, we also collected these different estimates.
Quality of reporting
To assess the quality of reporting (QOR) of the included studies, we judged the reporting of the components of the target trial framework9 and the causal assumptions (panel 1).
Target trial components
The ‘target trial framework’, introduced by Hernán and Robins9, consists of seven main components. We judged the reporting of five of these: eligibility criteria, treatment strategies, follow-up period, outcome and analysis plan (table 1). We scored the analysis plan component as ‘reported’ if one could reproduce the modelling given the required data. For studies using RL, we judged the ‘treatment strategies’ and ‘outcome’ components as ‘reported’ if the definitions of the used action space and reward were reported, respectively. We split the follow-up period component into three subcomponents: time-zero (or ‘index date’), end of follow-up, and time resolution (ie, the time steps in which the treatment level is considered the same).27 We split the analysis plan component into specific subcomponents depending on the CI method used (table S2). We scored the target trial components that are split into subcomponents as ‘reported’, ‘partially reported’ and ‘not reported’ if all, some, or none of the subcomponents were reported, respectively. Because (outside RCTs) ICU physicians are typically aware of the treatment patients receive, one cannot emulate target trials with blind assignment. Therefore, we did not consider the ‘assignment procedures’ component. Also, we did not consider the ‘causal contrast of interest’ component because an intention-to-treat analysis based on observational data is rarely possible.9
Causal assumptions
The task of CI relies on strong assumptions, including conditional exchangeability, positivity and consistency (hereafter referred to as ‘causal assumptions’, panel 1).8 Violations of these assumptions lead to biased estimates; acknowledging the assumptions is therefore important and, ideally, potential violations are examined. We scored each study using three levels of increasingly good reporting quality: (1) assumption not mentioned, (2) assumption mentioned, and (3) attempt to check for potential violations of the assumption reported. For the conditional exchangeability assumption, we distinguished two types of attempts to check for potential violations: ‘indirect approaches’9 and ‘bias analyses’.28 For the positivity assumption, we considered the examination of the distribution of the estimated (stabilized) inverse-probability-of-treatment (IPT) weights as an attempt to check for potential violations.29 Approaches to check potential violations of consistency do not exist; therefore, mentioning the consistency assumption (level 2) was considered the highest level of reporting quality.
Evidence synthesis
We tabulated extracted study items for each study individually and grouped by CI method used. QOR results concerning the target trial components and the causal assumptions are summarised as percentages using bar charts, and QOR results for each study individually were tabulated. For the reporting of the target trial components, we made separate tables for each group of studies that used a specific CI method. The treatment effect estimates reported by studies that estimated the treatment effect both by adjusting for baseline confounding and by adjusting for baseline confounding and TTC were plotted as point estimates with corresponding 95% confidence intervals.
Results
Our search identified 1,714 unique articles, of which 1,605 were excluded based on title and abstract screening. We screened 109 full texts, 60 of which met the eligibility criteria and were included in the review (figure 2). The articles were published between 2005 and 2021 in 36 different journals and conference proceedings, with a steadily growing number of articles per year starting around 2010 (figure S1). A reference list of all included studies and the list with collected items per study can be found in appendix B and table S3, respectively. Most studies applied G methods (n=40, 67%), of which 36 (60%) used IPTW and four (7%) the parametric G formula. Twenty (33%) studies used RL methods (table 2). The three most frequently studied treatment categories were nosocomial infections (n=8, 13%), anti-inflammatory drugs (n=6, 10%) and sedatives and analgesics (n=6, 10%). Most studies (n=32, 53%) considered mortality (at varying follow-up times) as the primary outcome. Thirty-one studies (52%) included data from at least two different ICUs. Studies that used RL generally included more patients than studies that used G methods, with a median of 7,513 (IQR 5,252 to 18,340) versus 1,451 patients (IQR 421 to 2,914) and relied more often on open-source ICU databases (75% vs 15%). In total, 21 (35%) of the studies used one or more open-source ICU database, among which the Medical Information Mart for Intensive Care (MIMIC)-III database30 was the most frequently used (n=19, 32%). In contrast to RL studies (which inherently deal with DTRs), only three31–33 of the 40 studies (8%) that used G methods considered DTRs.
Method-specific items
Among the studies that used IPTW (n=36), 17 applied stabilized weights, one applied weight truncation, and eight applied both weight stabilization and truncation. Among the studies that applied RL to real (ie, not simulated) patient data (n=16), seven used an importance-sampling-based34, model-based35,36 or doubly robust37 OPE method, or a combination of these. Eight studies used the so-called ‘U-curve method’38 (panel 1) and for six of these, this was the only reported OPE method. In three studies, the OPE method was not reported (figure S2).
Quality of reporting
Target trial components
The ‘eligibility criteria’ and ‘outcome’ components were reported in 58 (97%) and 59 (98%) of the studies, respectively (figure 3a). We scored the ‘treatment strategies’, ‘follow-up period’ and ‘analysis plan’ components as partially or not reported in 23 (38%), 16 (27%) and 29 (48%) of the studies, respectively. All five target trial components were fully reported in only ten (17%) studies.31,39–47 The reporting of the target trial components grouped by CI method used is summarized in figures S3-S5 and tabulated for each individual study in tables S4-S6.
Causal assumptions
The conditional exchangeability assumption remained unmentioned in 21 (35%) studies, was mentioned in 25 (42%), and an attempt to check for potential violations was reported in 14 (23%, figure 3b). Among the studies that reported a check for potential violations, four studies48–51 performed a bias analysis. The positivity assumption remained unmentioned in 41 (68%) studies, was mentioned in three (5%), and a check for potential violations was reported in 16 (27%). The consistency assumption was mentioned in seven (12%) of the studies. All three assumptions were mentioned (or a check for potential violations was reported) in only six (10%) studies.31,33,42,43,51,52 The reporting of assumptions grouped by CI method used is summarised in figures S6-S8 and individual results for all studies are tabulated in table S7. In general, the causal assumptions remained unmentioned more often in studies that applied RL than in those that applied G methods (figures S6-S8). All studies that reported a check for potential violations of the conditional exchangeability assumption also mentioned this assumption, whereas for the positivity assumption, seven out of 16 studies that reported a check for potential violations did not explicitly mention positivity (table S7).
Adjusting for TTC
Eighteen studies (30%) estimated the treatment effect both by adjusting for baseline confounding and by adjusting for baseline confounding and TTC. For most of these studies, the point estimates of the treatment effects changed substantially after adjusting for both baseline confounding and TTC, moving the effect estimate towards or away from the null, or even reversing its direction (figure 4).
Discussion
Our review of 60 published studies found a wide variety of treatments being studied, with a predominant focus on STRs, despite DTRs being most relevant in the ICU setting. We found incomplete reporting of the target trial components in most studies, among which the ‘treatment strategies’ and ‘analysis plan’ were incompletely reported most often. The causal assumptions often remained unmentioned, and this was especially noticeable in studies that applied RL methods.
ROBINS-I53 is a tool developed for assessing the risk of bias (ROB) in CI studies using observational data. Instead of assessing the ROB using this tool, we chose to assess the QOR, for two reasons. First, to fairly assess the ROB, the emulated target trial needs to be well reported, which was often not the case in the included studies (figure 3a). Second, ROB assessment would require expert knowledge of each specific treatment-outcome relationship studied in the included articles, which is beyond the scope of this review.
G methods and RL methods are often perceived as separate disciplines, but show great similarities. For example, Q-learning54 (an RL method, used by many of the included studies46,47,55–57) is very similar (and under certain conditions even algebraically equivalent) to G estimation (a G method).58 An important difference is that G methods are used for modelling both STRs and DTRs, while RL methods typically deal with DTRs. As both G methods and RL methods perform the same CI task (ie, finding optimal treatment regimes), both rely on the same strong causal assumptions, which should be acknowledged. While the consistency assumption is often plausible for treatments in the ICU, violations of the conditional exchangeability and positivity assumptions are more likely and should be examined. Prior to examining violations of the causal assumptions, one needs a research question that is truly of interest in the ICU, a clear description of the target trial, and usage of a CI method that is appropriate for the type of studied treatment. The results of our review have led to five recommendations to improve future CI research and move towards actionable AI in the ICU (panel 2).
Recommendations for future research
Ask the right research question
Treatments of interest in the ICU are typically DTRs and, therefore, this type of treatment would be expected to be the focus of CI research in the ICU. However, 93% of the studies that used G methods studied STRs. To illustrate that many of these studies are considering research questions that are not truly of interest in the ICU, we will explore some examples. Zhang and colleagues[57] divided patients into two groups according to whether they received diuretics within the first two days of ICU admission or not. Thus, the emulated target trial answers the question of whether or not to administer diuretics at the start of ICU admission. However, we argue that the question an ICU physician is really interested in is when to administer diuretics throughout the whole ICU stay, taking into account changing patient characteristics such as fluid balance (especially at later ICU stages). In addition, many of the included studies emulated target trials comparing ‘giving treatment sometime during follow-up’ versus ‘never giving treatment’. For example, Bailly and colleagues[52] studied the effect of systemic antifungal therapy, comparing a treated group (those who received antifungals during their ICU stay) with an untreated group (those who never received antifungals). As giving treatment ‘sometime during follow-up’ can be done in many ways, the estimated treatment effect is ill-defined and typically not truly of interest. In other words, the studies by Zhang[57] and Bailly[52] both serve as examples of emulated RCTs that would never be conducted in the ICU.
Describe the question as a target trial emulation
To identify flaws in the relevance of a research question and correctness of the analysis, it is useful to describe the research question as a target trial emulation using the target trial framework9. Many of the included studies lacked a clear description of the ‘treatment strategies’ component of the target trial, that is, which treatment regimes are compared in the target trial. For example, Arabi and colleagues59 used IPTW to study the effect of corticosteroid therapy for ICU patients with Middle East Respiratory Syndrome. However, it remains unclear which treatment regimes (eg, ‘treat daily with corticosteroids’) are being compared. Moreover, roughly half of the included studies lacked a complete description of the ‘analysis plan’ component and are therefore not reproducible. We advocate a detailed description that allows reproduction of the used methodology, ideally accompanied by code and (example) data.
Use methods that suit the research question
We excluded 227 studies that modelled time-fixed treatments (figure 2). As time-fixed treatments in the ICU are rare, we hypothesize that in many of these studies, the implicit treatment of interest is time-varying. Research questions concerning time-varying treatments may be reformulated into simplified, time-fixed versions, because standard methods are easier to implement or high-quality, longitudinal data is unavailable. One may argue that, if the bias introduced by TTC16,17 is negligible, standard methods suffice for CI in time-varying treatments as well. However, empirical results from studies included in this review suggest that adjusting for TTC can lead to substantial differences in effect estimates and sometimes even to opposite conclusions (figure 4). Hence, it is possible that many of the excluded studies that implicitly studied time-varying treatments but modelled these as if they are time-fixed, published biased effect estimates. We advocate adjustment for TTC in any CI study where the treatment of interest is time-varying. Modelling DTRs is slightly more complex than STRs (which may be a reason for the focus on STRs among the included studies) and therefore requires different approaches. Various methods exist to find optimal DTRs, either from a set of pre-specified regimes or directly from data (for an overview, we refer to the book by Chakraborty and Moodie60). Among the included studies in this review, for example, Shahn and colleagues31 used ‘artificial censoring/IPW’60–62 to estimate the optimal fluid-limiting treatment regime for sepsis patients among a pre-specified set of DTRs (ie, ‘fluid caps’). Wang and colleagues33 used the parametric G formula to estimate the per-protocol (PP) effect of ‘low tidal volume ventilation’, a pre-specified DTR that was compared with standard care in an earlier RCT.63 Here, the target trial corresponds to the original RCT, but with full compliance. 
RL methods and G estimation can be used to approximate optimal DTRs without a pre-specified set of regimes. In RL studies, finding the optimal treatment regime (often referred to as the optimal ‘policy’, table S1) is typically followed by a validation step to quantify the value of the optimized regime (ie, OPE, panel 1). The ‘U-curve method’38 (a specific OPE method) is common among the included RL studies (figure S2) and is based on associating the difference between the (observed) clinician’s treatment regime and the (estimated) optimal treatment regime with patient outcome. As it completely ignores the potential effect of confounders, we recommend avoiding this method. We argue that G methods are essentially OPE methods and therefore, these could (and maybe should) be used to evaluate optimal treatment regimes found in RL studies.
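As a contrast to the U-curve method, an importance-sampling OPE estimate can be sketched in a few lines. This is a generic illustration, not the implementation used by any included study: the function names and the toy inputs are hypothetical, and in practice the clinicians' (behaviour) policy probabilities have to be estimated from data, so the estimate still relies on the causal assumptions discussed in this review. Each patient trajectory is weighted by the product over time of the ratio between the probability of the observed treatment under the evaluated regime and under the clinicians' regime.

```python
import numpy as np


def trajectory_weights(pi_eval, pi_behav):
    """Per-trajectory importance weights.

    pi_eval[i][t]  : probability of the observed action at step t of trajectory i
                     under the regime being evaluated.
    pi_behav[i][t] : probability of that same action under the (estimated)
                     clinician regime. Trajectories may differ in length.
    """
    return np.array([
        np.prod(np.asarray(e, dtype=float) / np.asarray(b, dtype=float))
        for e, b in zip(pi_eval, pi_behav)
    ])


def weighted_importance_sampling(returns, weights):
    """Self-normalized (weighted) importance-sampling estimate of the regime's value."""
    returns = np.asarray(returns, dtype=float)
    weights = np.asarray(weights, dtype=float)
    return float(np.sum(weights * returns) / np.sum(weights))


if __name__ == "__main__":
    # Three toy trajectories with two decisions each (illustrative numbers only).
    pi_eval = [[1.0, 1.0], [1.0, 0.0], [0.5, 1.0]]
    pi_behav = [[0.5, 0.5], [0.5, 0.5], [0.5, 0.25]]
    returns = [1.0, 0.0, 0.0]  # eg, 1 = survived, 0 = died
    w = trajectory_weights(pi_eval, pi_behav)
    print(w, weighted_importance_sampling(returns, w))
```

A trajectory that deviates from a deterministic evaluated regime at any step receives weight zero, which is why few trajectories may end up contributing to the estimate when many treatment decisions are involved.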
Mind the conditional exchangeability assumption
Conditional exchangeability is never guaranteed using observational data as the absence of unmeasured confounders is not verifiable in the data. To think about residual confounding or selection bias, incorporation of subject-matter expertise is key. Causal diagrams (represented by directed acyclic graphs)64,65 are a visual way to represent this expert knowledge and can be useful to describe potential sources of bias. There are different approaches to quantify how potential violations of conditional exchangeability would affect the study results.66 Indirect approaches consider, for instance, the effect of adding additional confounders.9 A ‘bias analysis’ (or sensitivity analysis)28 examines the characteristics of potential unmeasured confounders and can be useful to quantify how much bias they would produce as a function of those characteristics.
Mind the positivity assumption
The positivity assumption, on the contrary, is verifiable, although this is rather complex for time-varying treatments29 and, given the dynamic nature of ICU treatment, violations are expected in the ICU setting. The intuition for this assumption is that one can only study a treatment regime using data of patients who have received treatment that conforms to this regime. The number of patient treatment histories that match the treatment regime of interest (ie, the ‘effective sample size’67) shrinks with the number of treatment decisions in the patient’s history (which tends to be high in the ICU). For example, Gottesman and colleagues38 applied RL to a dataset of 3,855 patients to find an optimal treatment regime for sepsis, but the effective sample size for this regime was only a few dozen. A small effective sample size makes positivity violations likely and leads to high uncertainties in estimated treatment effects. A straightforward way to tackle this challenge is to increase the sample size. Therefore, we advocate more usage (if appropriate) of the four currently available open-source ICU databases.68 However, increasing the sample size does not guarantee increasing the effective sample size, as the patients in the extra dataset may not be treated according to the regime of interest. Hence, another way to increase the effective sample size is to minimize the mismatches between the treatment regime(s) of interest and those observed in the data, for instance, by avoiding modelling treatment regimes which differ greatly from the treatment protocol in place.
To detect (but not rule out) violations of the positivity assumption, examination of the distribution of the estimated (stabilized) IPT weights can be useful.29 This was common among the included studies that used IPTW (n=16/36, table S7), but is recommended in studies that use other CI methods as well. For RL studies that use an importance-sampling69 OPE method, it is recommended to examine the distribution of importance weights38 (which is closely related to examination of IPT weights).
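A weight diagnostic of the kind described above can be as simple as the following sketch. The function name and the chosen summary statistics are assumptions made for illustration. The effective sample size is computed here with Kish's approximation, (sum of weights)² / (sum of squared weights), which is one common choice and not necessarily the definition used in the cited work. For stabilized IPT weights, a mean far from 1 or a very large maximum is commonly taken as a warning sign.

```python
import numpy as np


def weight_summary(weights):
    """Summarize IPT or importance weights to flag possible positivity problems."""
    w = np.asarray(weights, dtype=float)
    return {
        "n": int(w.size),
        "mean": float(w.mean()),    # stabilized weights are expected to average about 1
        "p99": float(np.percentile(w, 99)),
        "max": float(w.max()),      # a few huge weights dominate the estimate
        "kish_ess": float(w.sum() ** 2 / np.sum(w ** 2)),
    }


if __name__ == "__main__":
    # Equal weights: every patient contributes, so the ESS equals the sample size.
    print(weight_summary(np.ones(10)))
    # Skewed weights: one patient carries no information for the regime of interest.
    print(weight_summary([4.0, 0.0, 4.0]))
```

A well-behaved weight distribution does not prove that positivity holds, in line with the remark above that such checks can detect but not rule out violations.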
In studies using IPTW, weight stabilization and truncation can be used to limit high uncertainties in the effect estimates. Weight stabilization can improve the precision of effect estimates without the introduction of bias. However, a model based on stabilized weights results in a slightly different effect estimate compared to non-stabilized weights70 and should be interpreted carefully. Weight truncation also improves precision, but at the expense of bias. We recommend examining the influence of this bias by checking how the effect estimates change under progressive truncation of the IPT weights.25
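The weight stabilization and progressive truncation described above can be sketched as follows. This is a minimal illustration under stated assumptions: the per-time-point probabilities of the treatment actually received are taken as given (in practice they come from fitted treatment models), the function names and truncation levels are hypothetical, and `estimator` stands for whatever weighted outcome model a study uses.

```python
import numpy as np


def stabilized_weights(p_num, p_den):
    """Stabilized IPT weight per patient at the end of follow-up.

    p_num, p_den: arrays (patients x time points) with the probability of the
    treatment actually received at each time point. The numerator conditions on
    treatment history (and baseline covariates) only; the denominator also
    conditions on time-varying confounders.
    """
    ratio = np.asarray(p_num, dtype=float) / np.asarray(p_den, dtype=float)
    return np.prod(ratio, axis=1)


def truncate(weights, lower_pct, upper_pct):
    """Set weights below/above the given percentiles to the percentile values."""
    low, high = np.percentile(weights, [lower_pct, upper_pct])
    return np.clip(weights, low, high)


def progressive_truncation(weights, estimator,
                           levels=((0, 100), (1, 99), (5, 95), (10, 90))):
    """Re-estimate the effect under increasingly strict truncation.

    `estimator` is any user-supplied callable mapping a weight vector to an
    effect estimate. Large shifts across levels suggest that a few extreme
    weights drive the result.
    """
    return {level: estimator(truncate(weights, *level)) for level in levels}


def weighted_mean_difference(y, a, w):
    """Toy estimator: weighted mean outcome in treated minus untreated."""
    y, a, w = (np.asarray(v, dtype=float) for v in (y, a, w))
    return float(np.average(y[a == 1], weights=w[a == 1])
                 - np.average(y[a == 0], weights=w[a == 0]))


if __name__ == "__main__":
    rng = np.random.default_rng(1)
    a = rng.binomial(1, 0.5, 500)
    y = rng.normal(0.0, 1.0, 500)
    w = rng.lognormal(0.0, 1.0, 500)  # illustrative weights with a heavy right tail
    print(progressive_truncation(w, lambda tw: weighted_mean_difference(y, a, tw)))
```

Reporting the whole set of estimates, rather than a single truncation level chosen after the fact, makes the trade-off between precision and bias visible to the reader.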
Study limitations
First, whereas efforts were made to ensure that the literature search was comprehensive, we could have missed studies for different reasons. Some research may have used non-conventional terminology to describe the used CI method, or used a CI method which was not included in our search strategy. For example, dynamic weighted ordinary least squares (dWOLS)71,72 is a relatively new method which has been used to model DTRs in the ICU setting in several studies.73,74 This method benefits from properties of both Q-learning (an RL method) and G-estimation (a G method) and may be an interesting direction for future research. Second, only one reviewer (JS) performed title-abstract screening and item collection, although thorough discussions with the other reviewers (MvG, JK, JL) occurred in case of uncertainty. Third, items that were not collected could be of interest for future investigation. For example, we did not differentiate RL further into specific RL methods.
Study strengths
This systematic review stresses the importance of causality for actionable AI and provides a contemporary overview of CI research in the ICU literature. We describe shortcomings of the identified studies in terms of reporting and provide guidance for improving future CI research. These recommendations are not limited to the ICU but apply to medical CI research as a whole. Unlike other systematic reviews on time-varying medical treatments,75–77 we did not limit our focus to either G methods or RL, but rather acknowledge that both method classes can be used to perform CI tasks and therefore hold the promise of bringing actionable AI to the bedside.
Conclusion
Towards actionable AI in the ICU, we concur with the guidance of editors of critical care journals78,79 to change the focus of observational research in the ICU from prediction to CI. To unlock this potential in a trustworthy and responsible manner, we advocate developing models for CI that focus on clinically relevant treatments, describing the research question as a target trial emulation, choosing CI methods appropriate to the treatment of interest, and acknowledging (and ideally examining potential violations of) the causal assumptions.
Panel 2: Summary of recommendations for future research
Ask the right research question
When developing a model for CI, consider clinically relevant treatments. In the ICU, treatment decisions typically occur at multiple time points during admission (ie, time-varying treatments) and often depend on the patient’s response to treatment (ie, dynamic treatment regimes).
Describe the question as a target trial emulation
To identify flaws in the relevance of a research question and correctness of the analysis, it is useful to imagine a randomized trial that would have answered the research question (ie, the target trial), describe its components using the target trial framework and emulate it to the extent possible.
Use methods that suit the research question
Standard methods (like regression) are easy to implement and suffice for time-fixed treatments, but lead to biased estimates when used for time-varying treatments. Therefore, adjustment for bias introduced by treatment-affected time-varying confounding is always recommended when dealing with time-varying treatments. Modelling of dynamic treatment regimes requires slightly different approaches compared to static treatment regimes.
Mind the conditional exchangeability assumption
CI is not possible based on data alone, and expert knowledge is key to reasoning about the causal structure between the treatment and outcome of interest. Representing this expert knowledge in causal diagrams is useful to visualize potential sources of bias. Although violations of this assumption can never be completely ruled out using observational data, several approaches exist to examine potential violations. For example, a bias analysis can help to quantify how much bias unmeasured confounders could produce.
Mind the positivity assumption
This assumption is verifiable, although verification is rather complex for time-varying treatments, and violations are expected given the dynamic nature of the ICU. Violations could be minimized by increasing the sample size (eg, by making more use of open-source ICU databases) and by studying treatment regimes that are similar to those observed in the data. Examination of estimated inverse-probability-of-treatment weights is useful to detect (but not rule out) positivity violations.
Data Availability
All data produced in the present work are contained in the manuscript.
Appendix
Appendix A: Embase search strategy
Embase.com
(‘causal inference’/de OR ‘causal model’/de OR ‘causal modeling’/de OR ‘inverse probability weighting’/de OR ((causal NEAR/3 (inferen* OR model*)) OR ((causal OR average-treatment* OR individuali*-treatment* OR personali*-treatment*) NEXT/1 (effect*)) OR time-vary*-confound* OR g-computation* OR g-estimation* OR g-formula* OR doubly-robust OR counterfactual* OR (inverse-probabilit* NEAR/3 (weight* OR estimat*)) OR ((marginal-structur* OR structural-nest* OR causal-effect* OR causal-graphic* OR causal-inferen* OR condition*-outcome* OR sequen*-cox*) NEAR/3 (method* OR model*)) OR TAR-Net OR (Treatment*-Agnost* NEAR/3 Representat* NEAR/3 Network*) OR double-machine-learning OR anchor*-regress* OR x-learner* OR t-learner* OR s-learner* OR q-learning OR q-network OR reinforcement*-learn* OR ((policy OR value) NEXT/1 iteration*) OR temporal-differen* OR actor-critic* OR (Markov NEAR/3 decision NEAR/3 process*)):ab,ti) AND (‘intensive care’/exp OR ‘intensive care unit’/exp OR ‘critically ill patient’/de OR ‘critical illness’/de OR ‘artificial ventilation’/exp OR ‘mechanical ventilator’/exp OR (intensive-care* OR critical-care* OR critical*-ill* OR icu OR ((mechanic* OR artificial*) NEAR/3 ventilat*)):Ab,ti,jt) NOT [conference abstract]/lim AND [english]/lim NOT (‘pediatric intensive care unit’/de OR ‘neonatal intensive care unit’/de OR child/exp OR pediatrics/exp OR (nicu OR picu OR nicus OR picus OR infant* OR child* OR neonat* OR newborn* OR pediatr* OR paediatr*):ab,ti)
Medline ALL
(((caus* ADJ3 (inferen* OR model*)) OR ((causal OR average-treatment* OR individuali*-treatment* OR personali*-treatment*) ADJ (effect* OR method*)) OR time-vary*-confound* OR g-computation* OR g-estimation* OR g-formula* OR doubly-robust-estimation* OR counterfactual* OR (inverse-probabilit* ADJ3 (weight* OR estimat*)) OR ((marginal-structur* OR structural-nest* OR causal-effect* OR causal-graphic* OR causal-inferen* OR semi-paramet* OR semiparamet* OR fully-paramet*) ADJ3 (method* OR model*)) OR TAR-Net OR (Treatment*-Agnost* ADJ3 Representat* ADJ3 Network*) OR double-machine-learning OR anchor*-regress* OR x-learner* OR t-learner* OR s-learner* OR q-learning OR q-network OR reinforcement*-learn* OR ((policy OR value) ADJ iteration*) OR temporal-differen* OR actor-critic* OR (Markov ADJ3 decision ADJ3 process*)).ab,ti. OR (RL OR IRL).ti.) AND (exp Intensive Care Units/ OR Critical Illness/ OR exp Respiration, Artificial/ OR exp Ventilators, Mechanical/ OR (intensive-care* OR critical-care* OR critical*-ill* OR icu OR ((mechanic* OR artificial*) ADJ3 ventilat*)).ab,ti,jt) NOT (conference abstract) AND english.la. NOT (Intensive Care Units, Pediatric/de OR Intensive Care Units, Neonatal/de OR exp Child/ OR exp pediatrics/ OR (nicu OR picu OR nicus OR picus OR infant* OR child* OR neonat* OR newborn* OR pediatr* OR paediatr*).ti,ab)
Web of Science Core Collection
TS=(((causal NEAR/2 (inferen* OR model*)) OR ((causal OR average-treatment* OR individuali*-treatment* OR personali*-treatment*) NEAR/1 (effect*)) OR time-vary*-confound* OR g-computation* OR g-estimation* OR g-formula* OR doubly-robust OR counterfactual* OR (inverse-probabilit* NEAR/2 (weight* OR estimat*)) OR ((marginal-structur* OR structural-nest* OR causal-effect* OR causal-graphic* OR causal-inferen* OR condition*-outcome* OR sequen*-cox*) NEAR/2 (method* OR model*)) OR TAR-Net OR (Treatment*-Agnost* NEAR/2 Representat* NEAR/2 Network*) OR double-machine-learning OR anchor*-regress* OR x-learner* OR t-learner* OR s-learner* OR q-learning OR q-network OR reinforcement*-learn* OR ((policy OR value) NEAR/1 iteration*) OR temporal-differen* OR actor-critic* OR (Markov NEAR/2 decision NEAR/2 process*)) AND (intensive-care* OR critical-care* OR critical*-ill* OR icu OR ((mechanic* OR artificial*) NEAR/2 ventilat*)) NOT (nicu OR picu OR nicus OR picus OR infant* OR child* OR neonat* OR newborn* OR pediatr* OR paediatr*)) AND DT=(Article OR Review OR Letter OR Early Access)
Google Scholar
Searched with 2 different queries:
“causal inference”|”marginal structural models”|”g-formula”|”structural nested models”|”reinforcement learning” “intensive|critical care”
Only the first 200 results
“causal inference”|”marginal structural models”|”g-formula”|”structural nested models”|”reinforcement learning” intitle:”intensive|critical care”
All 389 results
MedRxiv and BioRxiv
searched via Google with the following query:
inurl:medrxiv|biorxiv filetype:pdf “causal inference”|”marginal structural models”|”g-formula”|”structural nested models”|”reinforcement learning” “intensive|critical care”
When: 2 March 2022
Settings:
SafeSearch Filters turned on
Auto-complete with trending searches: Show popular searches
Region Settings: Current region (i.e. the Netherlands)