Abstract
Background Observational research provides a unique opportunity to learn causal effects when randomized trials are unavailable, but obtaining the correct estimates hinges on a multitude of design and analysis choices. We illustrate the advantages of modern causal inference methods and compare to standard research practice to estimate the effect of corticosteroids on mortality in hospitalized COVID-19 patients in an observational dataset. We use several large RCTs to benchmark our results.
Methods Our retrospective data consists of 3,298 COVID-19 patients hospitalized at New York-Presbyterian March 1-May 15, 2020. We design our study using the target trial framework. We estimate the effect of an intervention consisting of six days of corticosteroids administered at the time of severe hypoxia and contrast with an intervention consisting of no corticosteroids. The dataset includes dozens of time-varying confounders. We estimate the causal effects using a doubly robust estimator where the probabilities of treatment, outcome, and censoring are estimated using flexible regressions via super learning. We compare these analyses to standard practice in clinical research, consisting of two approaches: (i)Cox models for an exposure of corticosteroids receipt within various time windows of hypoxia, and (ii)Cox time-varying model where the exposure is daily administration of corticosteroids beginning at hospitalization.
Results Our target trial emulation estimates corticosteroids to reduce 28-day mortality from 32% (95% confidence interval: 31-34) to 23% (21-24). This is qualitatively identical to the WHO’s RCT meta-analysis result. Hazard ratios from the Cox models range in size and direction from 0.50 (0.41-0.62) to 1.08 (0.80-1.47) and all study designs suffer from various forms of bias.
Conclusion We demonstrate that clinical research based on observational data can unveil true causal relations. However, the correctness of these effect estimates requires designing and analyzing the data based on principles which are different from the current standard in clinical research. Widespread communication and adoption of these design and analytical techniques is of high importance for the improvement of clinical research.
Introduction
Observational databases are invaluable resources when randomized controlled trials (RCTs) are infeasible or unavailable. However, the correctness of the conclusions gleaned from analyses of observational data hinges on the careful consideration of study design principles and choice of estimation methodology.
Threats to the validity of causality are pervasive in the clinical literature(1–4). A major reason for the failure to address these biases is the widespread adoption of a model-first approach to observational research. In this approach, a model is first chosen according to the data type and outcome of interest, and the quantity used to answer the research question is automatically determined by the model choice. For example, when faced with a time-to-event outcome, researchers automatically resort to a Cox regression model. It is common practice to then use the coefficients of the model or transformations thereof (e.g., hazard ratios) as the answer to the clinical question of interest.
A model-first approach induces multiple problems for the estimation of causal effects(5). First, model parameters often do not represent quantities of scientific interest or well-defined causal effects(6)). Second, assumptions such as the proportional hazards assumption used in Cox models are rarely correct in medical research since hazards cannot be proportional when a treatment effect changes over time(7). Third, regression models cannot correctly handle time-dependent feedback between confounders, treatment, and the outcome(1). Fourth, the model-first approach yields a tendency to interpret all coefficients in the model, which is a mistake known as the Table 2 fallacy(8). Lastly, model-first analyses often employ less-than-optimal model selection techniques, which may lead to improper variance estimates and model misspecification bias(9).
Recent developments in the causal inference literature provide researchers with a number of tools to alleviate the aforementioned biases. Frameworks such as the target trial emulation(10) and roadmap for causal inference(11) allow researchers to proceed with a question-first approach. Instead of defaulting to effect measures provided by regression models, a question-first approach begins by defining a hypothetical target trial and subsequent target of inference that answers the scientific question of interest. This is the so-called estimand, or quantity to be estimated. After the estimand is chosen, the most appropriate statistical technique may be selected. Incorporating these principles can help clarify the research question, determine study eligibility requirements, identify enrollment and follow-up times, decide whether sufficient confounder data are available, and more(12,13). A question-first approach also allows researchers the freedom to select an estimation technique which mitigates model misspecification biases and increases the likelihood of obtaining a correct estimate.
We hypothesize that a question-first approach will have improved success in recovering causal effects as opposed to a model-first approach. To test this hypothesis, we use a retrospective cohort of 3,298 coronavirus 2019 (COVID-19) patients hospitalized at New-York-Presbyterian Hospital (NYPH) March 1-May 15, 2020. Lack of guidance for clinical practice at the beginning of the pandemic meant that high variability existed in the administration and timing of corticosteroids. While provider practice variability aids in the estimation of causal effects, the resulting complex longitudinal treatment patterns can complicate study design and analytical methods. Our dataset together with results from numerous RCTs on corticosteroids provide a unique opportunity to assess various design and analysis methods. We benchmark the results of our analyses against the effect measures obtained in the World Health Organization (WHO)’s RCT meta-analysis(14).
Methods
Hypothetical target trial
Population
Inclusion criteria is all adult patients with COVID-19 who were admitted to NYPH Weill Cornell, Lower Manhattan Hospital, or NYPH Queens. Cases are confirmed through reverse-transcriptase–polymerase chain-reaction assays performed on nasopharyngeal swab specimens. Patients who have chronic use of corticosteroids prior to hospitalization or who are transferred into the hospital from an outside hospital are excluded.
Intervention strategies
Patients are randomized on their first day of hospitalization to receive either (1)standard of care therapy (without corticosteroids) or (2)standard of care plus a corticosteroid regimen to be administered if and when criteria for severe hypoxia are met. The corticosteroid dosage is a minimum of 0.5 mg/kg body weight of methylprednisolone equivalent per 24-hour period and the duration of therapy is six days(15). Corticosteroids include prednisone, prednisolone, methylprednisolone, hydrocortisone, and dexamethasone and choice of drug is at the attending physician’s discretion. Severe hypoxia is defined as the initiation of high flow nasal cannula, venti-mask, noninvasive or invasive mechanical ventilation, or an oxygen saturation of <93% after the patient is on 6 Liters of supplemental oxygen via nasal cannula.
Outcome and estimand
The primary outcome is 28-day mortality from time of randomization. The estimand of interest is the difference in 28-day mortality rates between the two treatment strategies.
Data analysis plan
In this hypothetical target trial with no loss-to-follow-up, we analyze the difference in proportion of patients who experienced the outcome between those who were randomized to the “standard of care” treatment regime and those who were randomized to the “standard of care plus corticosteroids at time of hypoxia” treatment regime.
Emulation using observational data
Cohort description and data source
The target trial emulation uses retrospective data from patients who meet the hypothetical trial’s eligibility criteria between March 1 and May 15, 2020. Demographics, comorbidity, intubation, death, and discharge data were manually abstracted by trained medical professionals into a secure REDCap database(16). These were supplemented with a data repository housing laboratory, procedure, diagnosis, medication, and flowsheet data documented as part of standard care(17). Patients are followed for 28 days from hospitalization and lost to follow-up by discharge or transfer to an external hospital system.
Treatment strategies and measurement
To emulate the target trial corticosteroid treatment strategy, we estimate the effect of a hypothetical dynamic treatment regime(18), whereby each patient is administered six days of corticosteroids if and when they meet severe hypoxia criteria. This dynamic regime is contrasted with a static regime where patients never receive corticosteroids.
We measure severe hypoxia using vital signs (for oxygen saturation) and flowsheet data (for supplemental oxygen) and define it in the same way as our target trial. We measure corticosteroid exposure using the Medical Administration Record. We compute cumulative mg/kg dosing of corticosteroids over rolling 24-hour windows, and if a patient received >0.5 mg/kg methylprednisolone equivalent, they are denoted as having corticosteroids exposure that day.
Unlike our target trial, patients in the observational study are subject to loss-to-follow-up. Thus, our emulation requires a hypothetical intervention whereby patients are not loss to follow-up, so that we can observe their 28-day mortality status. A sample of observed data is shown in Supplemental Figure 1. An illustration of the treatment regimes as they relate to the observed data are shown in Figure 1.
Confounders
In contrast to the hypothetical trial, treatment assignment in the observational study is not randomized and depends on physiological characteristics of each patient. Correct emulation requires: (i) careful consideration of all possible confounders, and (ii) careful adjustment for these confounders in data analysis.
Baseline confounders include age, sex, race, ethnicity, Body Mass Index (BMI), comorbidities (coronary artery disease (CAD), cerebral vascular event, hypertension, diabetes mellitus (DM), cirrhosis, chronic obstructive pulmonary disease, active cancer, asthma, interstitial lung disease, chronic kidney disease (CKD/ESRD), immunosuppression, human immunodeficiency virus (HIV)-infection, home oxygen use), mode of respiratory support within three hours of hospital arrival, and hospital admission location.
Time-dependent confounders include heart rate, pulse oximetry percentage, respiratory rate, temperature, systolic/diastolic blood pressure, blood urea nitrogen (BUN)-creatinine ratio, creatinine, neutrophils, lymphocytes, platelets, bilirubin, blood glucose, D-dimers, C-reactive protein, activated partial thromboplastin time, prothrombin time, arterial partial pressures of oxygen and carbon dioxide, and level of supplemental oxygen support. Figure 2 summarizes the relationship between confounders, treatment, and outcomes in the form of a Directed Acyclic Graph (DAG).
Outcome and estimand
Our estimand of interest is the difference in 28-day mortality rates in a hypothetical world where we had implemented the two different corticosteroid treatment strategies, as well as an intervention to prevent loss-to-follow-up. Under the assumption that treatment and loss-to-follow-up each day are randomized conditional on baseline and time-dependent confounders, this estimand is identified by the longitudinal g-computation formula(19). This longitudinal g-computation formula for our two corticosteroids treatment regimens with a censoring intervention will be our estimand of interest.
Data analysis plan
Correct emulation of a target trial requires proper adjustment for measured confounding through estimation of the g-computation formula. It is important to use estimation methods capable of fitting the data using flexible mathematical relationships so that confounding is appropriately removed, especially when the number of baseline and time-dependent confounders is large.
Several methods can be used to estimate the g-computation formula (e.g., inverse probability weighting (IPW), parametric g-formula, targeted minimum loss-based estimators (TMLE), sequentially doubly robust estimators (SDR), etc.)(20,21). These estimation methods rely on two kinds of mathematical models: (i)models of the outcome as a function of the time-dependent confounders, and (ii)models of treatment as a function of time-dependent confounders. Methods that use only one of these models are often called singly robust, because their correctness relies on the ability to correctly specify one of the models (e.g., IPW relies on estimating treatment models correctly). Estimation methods that use both of these models are often called doubly robust, because they remain correct under misspecification of one of the two models.
Furthermore, doubly robust estimators such as TMLE and SDR allow the use of machine learning to flexibly fit relevant treatment and outcome regressions(22,23). This is desirable because these regression functions might include complex relationships between exposures and treatments, and capturing those relationships is not possible using simpler models such as the Cox proportional hazards(24).
The primary analysis is conducted using SDR estimation with a dynamic intervention, time-varying confounders, and a time-to-event outcome. An ensemble of machine learning models using the super learner algorithm is used to estimate the regressions for treatment and outcome(25,26). Additional methodological details are available in Supplemental Materials. A code tutorial is available at https://github.com/kathoffman/steroids-trial-emulation.
Emulation of model-first approaches common in clinical literature
For contrast with the target trial emulation strategy, we review methodology of papers cited in Chaharom et al.’s(27) meta-analysis, and then analyze the data using study designs common in other corticosteroids for COVID-19 observational research. The data source, outcome, and confounders are the same as the above target trial. Modifications to the cohort and treatment definitions to accommodate the model-first approaches are outlined below.
Cox models using a point treatment
The first approach we explore is a regression for mortality with a point (as opposed to time-varying) treatment variable. The inclusion criteria and time zero are defined as the time of meeting hypoxia criteria, which is the intended indication for corticosteroids. A study design using this analytical approach entails a number of choices, including:
Defining a range of time relative to inclusion criteria for a patient to be considered “treated”.
Deciding whether to exclude patients treated before the inclusion time.
Deciding how to handle patients who died during the treatment time window.
Deciding how to handle patients treated after the treatment time window.
We fit Cox proportional hazards models using data sets obtained from various design choices, summarized in Table 1. All baseline confounders, time-dependent confounders from day zero, and the corticosteroid exposure are included as variables in the Cox models. The exponentiated coefficient for corticosteroids is interpreted as the hazard ratio for corticosteroid exposure within the defined treatment window for moderate-to-severe COVID-19 patients.
Time-varying Cox models
In our second model-first approach, we fit a time-varying Cox model for time to mortality up to 28 days from the day of hospitalization. This model uses our entire cohort and contains all baseline and time-dependent confounders, as well as daily corticosteroid administration. The coefficient for corticosteroids is exponentiated and used as an estimate of the hazard ratio for corticosteroids.
Software
All data were analyzed in R version 4.0.3 with open-source packages ggplot2, gtsummary, survival, survminer, lmtp, and sl3(28–34).
RCT benchmark
Several RCTs have established the effectiveness of corticosteroids in the treatment of moderate-to-severe COVID-19 patients(35–37). The WHO performed a meta-analysis of seven such RCTs and estimated the odds ratio (OR) of mortality to be 0.66 (95% CI (0.53-0.82)(14). We use this estimate, as well as supporting evidence from other RCT meta-analyses(27,38) to benchmark our results.
Results
Target trial emulation
The final cohort includes 3,298 patients of a median age 65 (IQR 53, 77) and 60% males. The median BMI was 27 (IQR 23-31). There were 1033 (31%) patients with DM, 460 (14%) with CAD, 1780 (54%) with hypertension, and 159 (4.8%) with CKD/ESRD. Table 2 shows baseline characteristics of the cohort, overall and stratified by any corticosteroid exposure. There were 1,690 patients who reached the randomization criteria of severe hypoxia and 423 patients who received corticosteroids at any point during follow-up. 699 patients died before 28 days.
In the target trial emulation analysis, all 3,298 patients who were admitted to the hospital are analyzed. The estimated mortality rate under a hypothetical intervention of no corticosteroids is 32% (95% CI 31-34). The estimated mortality rate under a hypothetical intervention in which corticosteroids are administered for six days upon patients becoming severely hypoxic is 23% (21-24). This yields an estimated mortality reduction of 9.6% (8.8-10.4) if this policy had been implemented.
Model-first approaches
In a subset of 1,690 patients who met severe hypoxia, 72 patients received corticosteroids within one day of hypoxia and 191 patients received corticosteroids within 5 days of hypoxia. There were 18 and 451 patients who died within one and five days of hypoxia without receiving corticosteroids, respectively.
Model A, which defined corticosteroid exposure as anytime during hospitalization, yielded an HR of 0.50 (0.41-0.62). Models B-I, which placed either a one- or five-day limit on corticosteroids treatment from the time of hypoxia, yielded mostly non-significant HRs in both directions (B: 0.95 (0.66-1.37), C: 0.92 (0.63-1.33), D: 0.89 (0.56-1.41), E: 0.66 (0.41-1.04), G: 1.05 (0.77-1.45), H: 1.04 (0.75-1.45)). The exception to this was Model I, which excluded patients who died before five days and estimated the HR to be 0.63 (0.48-0.83). Model F also reached statistical significance, 0.77 (0.60-0.99), and was the result of a 5-day treatment window with no exclusion or censoring variations. The time-varying Cox model (J), yielded an HR of 1.08 (0.80-1.47). HRs for the model-first approaches are summarized in Figure 3.
Discussion
Our research illustrates how a question-first approach can aid in devising an optimal design and choice of estimation procedure for an analysis of observational data. Specifically, we show that using the target trial framework succeeds in recovering the benchmark causal effect obtained in RCTs. Our final estimate that corticosteroids would reduce overall 28-day mortality in a hospitalized cohort by 9.6% is equivalent to an OR of 0.62, which is qualitatively identical to the WHO’s estimate of 0.66. Our study design allowed us to create a realistic trial with a meaningful intervention, i.e. randomize patients at hospitalization but do not give corticosteroids unless the patient becomes severely hypoxic. Our analysis plan enabled us to flexibly adjust for a large number of potential time-dependent confounders of the treatment, censoring, and outcome mechanisms.
In contrast, the majority of our model-first approaches could not recover the RCT benchmark using the same data source. This finding aligns with other corticosteroids research; a recent meta-analysis containing observational analyses on over 18,000 patients found no overall effect for corticosteroids on mortality (OR 1.12, (0.83–1.50))(27). The task of creating reliable evidence from complex longitudinal data is not an easy one, and many of these studies suffer from flawed designs.
We found most studies in the current observational corticosteroids literature allowed the “treated” group to receive corticosteroids anytime during hospitalization(39–41). This is problematic because it introduces immortal time and biases results towards a protective effect of corticosteroids(42). A few studies did limit the treatment time frame in an effort to diminish immortal time bias. The “grace period” for treatment was handled in various ways, e.g. excluding patients who die prior to a time window after inclusion criteria(43), or excluding patients who receive treatment after the treatment window ends(44,45). Both exclusions are incorrect and may lead to bias and spurious associations(1). An alternative to exclusion is censoring patients at their time of receiving treatment if that time is after the treatment window passes. However, since Cox regression cannot handle time-dependent censoring, this biases results in favor of corticosteroids(1).
In addition to these issues, it is often unclear in the current literature how patients who receive corticosteroids prior to meeting inclusion criteria are handled in the analysis(39–41,46). A related issue is that corticosteroids can, according to RCTs, affect severity of illness (e.g. severe hypoxia). All of the point treatment studies are thus subject to this form of collider bias(47). Although the time-varying treatment approach does not suffer from the same time-alignment biases as the point-exposure design, the time-varying Cox model cannot properly account for time-dependent confounders(1), such as the relationship between intubation, corticosteroids, and mortality. These biases appear in our model-first results; the only study designs which demonstrate a protective effect of corticosteroids in line with the RCT benchmark suffer from extreme immortal time bias through undefined or extended treatment time windows (A, I).
There are limitations of our study. First, while the study time frame is ideal in terms of corticosteroid experimentation, it includes New York City’s initial pandemic surge conditions and rapidly changing clinical practice. We cannot rule out the presence of unmeasured confounding. Second, we did not have the data to look at individual corticosteroid types, making an exact comparison to a specific randomized trial impossible.
Despite these limitations, our study serves as an example in which the current standard for clinical research methods fail to recover the correct treatment effect where a modern causal inference method succeeds. Using observational data to guide clinical practice is possible, but relies on the incorporation of advanced epidemiological and statistical methodology principles. We hope this study and accompanying technical guide encourages the incorporation of these innovative techniques into study designs and statistical analyses of observational data.
Data Availability
Data produced in the present study are available in HIPAA compliant de-identified form upon reasonable request to the authors.
Funding
This research study was not funded by any grants. Dr. Edward Schenck is supported by NHLBI HL151876 and reports consulting for Axle Informatics regarding Coronavirus vaccine clinical trial through NIAID outside of the current work. Dr. Michael Satlin is supported by research grants from Allergan, Merck, BioFire Diagnostics, and SNIPRBiome and reports consulting payments from Shionogi outside of the current work.
Acknowledgements
The authors thank all of the healthcare workers who courageously expanded their roles during the pandemic’s surge conditions. This work was made possible through data provided by the Cornell COVID-19 Registry, led by Parag G. Goyal, M.D., Justin Choi, M.D., Laura Pinheiro, Ph.D., and Monika Safford, M.D., of Weill Cornell Medicine. The authors also thank the contributions to this work of the Architecture for Research Computing in Health team.
Footnotes
Text cut to a 3,000 word limit, supplemental materials updated, Table 1 added (and Figure 1 moved to Supplemental), abstract updated, minor formatting updates.