Abstract
In medicine, randomized clinical trials (RCT) are the gold standard for informing treatment decisions. Observational comparative effectiveness research (CER) is often plagued by selection bias, and expert-selected covariates may not be sufficient to adjust for confounding. We explore how the unstructured clinical text in electronic medical records (EMR) can be used to reduce selection bias and improve medical practice. We develop a method based on natural language processing to uncover interpretable potential confounders from the clinical text. We validate our method by comparing the hazard ratio (HR) from survival analysis with and without the confounders against the results from established RCTs. We apply our method to four study cohorts built from localized prostate and lung cancer datasets from the Stanford Cancer Institute Research Database and show that our method adjusts the HR estimate towards the RCT results. We further confirm that the uncovered terms can be interpreted by an oncologist as potential confounders. This research more credible causal inference using data from EMRs, offers a transparent way to improve the design of observational CER, and could inform high-stakes medical decisions. Our method can also be applied to studies within and beyond medicine to extract important information from observational data to support decisions.
1 Introduction
As the number of highly targeted cancer treatments increase, it is increasingly difficult for oncologists to decide on optimal treatment practices. In the recent years, medicine has seen the reversal of 146 standard medical practices [1], and many unanswered questions remain on treatment decisions in oncology. The gold standard for assessing treatment effects is randomized clinical trials (RCT). However, RCTs can be very expensive, time-consuming, and limited by the lack of external validity [2, 3]. Hence, there has been a growing interest in using observational data to compare and evaluate the effectiveness of clinical interventions, also known as comparative effectiveness research (CER) [2].
Many studies have used large-scale observational registries such as the Surveillance, Epidemiology, and Ends Results (SEER) and National Cancer Data Base (NCDB) to perform CER. However, such studies may be unreliable due to the systemic bias present in observational data and the presence of unmeasured confounders [1, 2, 4]. Moreover, population-based CERs in oncology often also face small data challenges. Electronic medical records (EMRs) are another source of rich observational information on patient demographics and past medical history. We hypothesize that the more detailed unstructured data present in EMRs can be harnessed to reduce confounding compared to prior CER studies.
We study how EMRs, especially clinical text, can be used to reduce selection bias in observational CER studies and better inform treatment decisions in oncology. A confounder is a variable that is associated with both treatment assignment and the potential outcomes a subject would have under different treatment regimes. In the presence of confounders, the correlation between treatment assignment and outcomes cannot be interpreted as causal. One way that confounding may arise is when patients are selected for a treatment group on the basis of the severity of their illness. In such a case, failing to adjust for patient severity can lead to selection bias when attempting to estimate causal effects. For example, surgery tends to be performed on younger or healthier patients; certain doctors or institutions may prefer one treatment over another, and this creates confounding if those doctors or institutions treat patients with systematically different severity. Studies based on a small set of covariates tend not to capture the important confounders and result in biased estimates [5, 6]. Observational studies are more reliable when we can better control for these confounders. While structured EMR, such as billing codes, can be used to encode expert-curated patient characteristics, studies suggest that administrative claims data may contain errors [7, 8] and expert-curated covariates may not capture all potential confounding [5, 9]. EMR clinical text is a potential source of additional information about factors that might relate to both treatment assignment and prognosis.
We propose an automated approach using natural language processing (NLP) to uncover interpretable potential confounders from the EMRs for treatment decisions. For high-stake settings such as cancer treatment decisions, it is important to design models that are interpretable for trust and understanding [10]. NLP can be used to process the unstructured clinical notes and create covariates that supplement the traditional covariates identified through expert opinion. We then augment our dataset with covariates that impact both treatment assignment and patient outcomes, where attempting to estimate causal effects while omitting such variables leads to biased estimates [11, 12]. Finally, we use methods designed to estimate causal effects in observational studies with observed confounders to estimate treatment effects in our augmented data set. We show that controlling for these confounders appears to reduce selection bias when compared against the results from established RCTs and clinical judgement.
We apply our method to localized prostate and lung cancer patients. Based on cohorts from established RCTs, we built four treatment groups for comparison. We uncovered interpretable potential confounders from clinical text and validated the potential confounders against the results from the RCTs. Simple NLP techniques (e.g. lemmatization, entity identification) were used to construct a bag-of-words representation of the frequently occurring terms. A Lasso model [13] was then used to select the terms that are predictive of both the treatment and survival outcome as potential confounders. Finally, we validated our method by comparing the hazard ratio (HR) from survival analysis with and without the confounders.
Our main contribution is presenting an approach to uncover interpretable potential confounders from clinical text. Existing work in observational causal inference rarely employs unstructured data [14, 15, 16], and most NLP studies on clinical text focus on prediction or classification settings [17, 18, 19, 8]. Our paper is the first to uncover interpretable potential confounders from clinical notes for causal analysis on cancer therapies, and one of the few works that combines NLP and causal inference in a time-to-event setting. Our method allows researchers to extract and control for confounders that are not typically available., This appears to be a useful step for future observational CER studies to help reduce selection bias unique to that dataset. The research presented can help unlock the potential of clinical notes to help clinicians understand current clinical practice and support future medical decisions. 3.
1.1 Related Work
In the past decade, there has been a growing interest in using observational data for clinical decision making and causal inference in oncology [2]. However, such studies are often unreliable, and many observational studies have been refuted by RCTs soon after [2, 4]. For example, Yeh et al. [6] performed a comparison of surgery vs. radiotherapy for oropharynx cancer and suggested that surgery may be superior to radiation for quality of life outcomes. A few years later, this claim was refuted by an RCT study Nichols et al. [20], which showed that radiation is in fact superior to surgery in terms of 1-year quality of life scores. A similar example is seen with prostate cancer. In 2016, Wallis et al. [5] showed through population-based studies that surgery is superior to radiation for early-stage prostate cancer for overall and prostate-cancer specific survival; a few months later, the finding was refuted by Hamdy et al. [21], which showed that surgery and radiation are equivalent in terms of overall and prostate -cancer specific survival. Many other studies have shown the fallibility of population CERs that rely on expert-curated features to draw conclusions about treatment effects [2, 22].
Beyond clinical studies, there is a relatively large literature on performing causal inference from observational data. Various papers have explored how to correct for bias when evaluating average treatment effect (ATE) from observational studies with propensity score matching or weighting [23, 14]; see [24] for a review. There is also a growing amount of literature that adapts machine learning models, such as random forest or regularized regression, for doubly-robust ATE estimation in high-dimensional settings [25, 26, 27, 12, 28]. However, most of the methods do not include unstructured data.
Recent literature has shown the usefulness of conditioning on textual data to adjust for confounding [29, 30, 31, 32]. Roberts et al. [30] proposes text matching to employ textual data for causal inference. Mozer et al. [29] applies text matching to patient charts texts for a medical procedure evaluation; however, they focus on continuous outcomes and rely mostly on expert-curated terms from the clinical text. Veitch et al. [31] is another work that employ unstructured data for causal inference; however, they rely on black-box models that are not interpretable. Moreover, many existing causal inference methods are developed for continuous outcomes and do not transfer easily to the time-to-event outcomes for survival analysis used in oncology. Of the ones that perform causal inference on time-to-event outcomes for medical applications [15, 16], we did not find any that include unstructured data in a systematic way. Austin [16] presents methods for using propensity scores to reduce bias in observational studies with time-to-event outcomes. Our study leverages some of the ideas and methods in this literature to develop our approach for identifying and evaluating the potential confounders from the unstructured clinical notes. Keith et al. [32] presents a review of the literature on using textual data to adjust for confounding. Our paper contributes to this literature by addressing obstacles in using NLP methods to remove confounding.
There is also a growing literature that seeks to better employ EMRs for clinical tasks. Existing work has employed structured EMR data and unstructured clinical notes for survival prediction and analysis [18, 33], predicting metastatic recurrence [17], clinical risk prediction [34], and prediction of multiple medical events [19]. However, most current work involving EMRs focuses on prediction tasks. In studies that include the unstructured notes, most use deep learning to produce context-rich embedding representations of words or documents [18, 19]. While these representations are highly accurate for prediction tasks, they are often black-box and very difficult to interpret for causal insights. Our approach differs in that we use simple NLP techniques (e.g. entity identification, bag-of-words) to generate matrix representations that can be easily mapped to specific words and phrases. This increases the interpretability of our method and allows us to explain our confounders to clinicians.
Our study advances both the clinical and causal inference literature by using NLP to perform causal inference on clinical text in time-to-event settings. We hope this will inform clinical practice and improve patient outcomes.
2 Results
We apply our methods to localized prostate and stage I non-small cell lung cancer (NSCLC) patients and compare the results against established RCTs. We select these diseases due to data availability and having established clinical RCTs for validation. After filtering and assignment, we include 1,822 patients for prostate cancer, with 988 surgery patients, 385 radiation patients, and 449 active monitoring patients; the average follow-up time is 4.11 years. For stage I NSCLC, we include 749 patients, with 492 surgery patients and 257 radiation patients; the average follow-up time is 4.96 years. The patient characteristic descriptions of the prostate cancer cohort are shown in Table 1 and the NSCLC cohort are shown in Table 2. Please see Section 4.1 for more details on the patient selection process.
We use the findings from established RCTs and clinical judgement as a benchmark for evaluating our results. For localized prostate cancer, Hamdy et al. [21] compared active monitoring, radical prostatectomy, and external-beam radiotherapy. A total of 1,643 patients were included in the study, with 553 men assigned to surgery, 545 men assigned to radiotherapy, and 545 men to active monitoring. They observed no significant difference among the groups for prostate -cancer or all-cause mortality (P = 0.48 and P = 0.87 respectively). Similarly, a recent study showed that difference in treatment effects for surgery vs. radiation observed from observational studies is entirely due to treatment selection bias [9]. For stage I NSCLC, The Chang et al. [35] study is a pooled study comparing stereotactic ablative radiotherapy (SABR) to surgery. A total of 58 patients were included, with 31 patients assigned to SABR and 27 to surgery. The study observed that SABR had slightly better overall survival than surgery (P = 0.037), but claims to be consistent with the clinical judgement that surgery is equipoise to radiation.
Following the design of Hamdy et al. [21] and Chang et al. [35], we evaluate our results for the following four treatment groups for an outcome of all-cause mortality:
• surgery vs. radiation for prostate cancer
• surgery vs. monitoring for prostate cancer
• radiation vs. monitoring for prostate cancer
• surgery vs. radiation for stage I NSCLC
We do not analyze other treatment groups for lung cancer due to patient count constraints.
Our approach identifies covariates that are likely potential confounders in this particular dataset from the high-dimensional and high-noise EMR data. These covariates are interpretable as they are represented by structured data or words from a bag-of-words matrix. To evaluate the effectiveness of the potential confounders selected in the model, we use these potential confounders to perform survival analysis for the treatment groups for prostate and stage I NSCLC. We compare the results of various methods for time-to-event analysis in terms of HR. Although we cannot know what the true HR is, we suggest that using medical notes improves on the traditional covariates. We compare our results against existing RCTs to evaluate how the confounders we have uncovered can help correct for selection bias. The overall workflow is shown in Figure 1. Supplement A details the covariates extracted from the structured data.
2.1 Potential Confounders
We show that our methods uncover terms that are predictive of both the treatment and survival outcome. Hence, these are potential confounders that should be controlled for in observational CERs to reduce selection bias. Please see Supplement C for a discussion on the structures of potential confounding our method can capture.
We select the intersection covariates from our treatment and outcome prediction models as the potential confounders. We base this idea on the selection of union variables to reduce confounding when performing causal inference on observational data in the case of continuous outcomes [12]. However, in survival analysis, it is recommended that the covariates analyzed be constrained by the statistical 1 in 10/20 rule of thumb with respect to the the event count [36, 37]. In our high-dimensional setting, the union of covariates that are predictive of treatment and outcome yield too many potential confounders relative to the sample size. Hence, we use the intersect as a heuristic to focus on the most important confounders.
In Figure 2, we illustrate the unpenalized coefficients of covariates from two models, the treatment assignment model and the survival outcome model. For each covariate, the x-axis plots the coefficient from the treatment prediction model while the y-axis plots the coefficient from the survival outcome model. Each covariate is labeled by the text next to it. The intersection covariates, intersect, are shown in blue; these are the covariates that have strong effects in both models. For the structured covariates, we illustrate in black the coefficients for the covariates that were not selected; these coefficients are closer to at least one of the axes in the figure. We do not illustrate the coefficients for unstructured covariates that are not selected, as there are a large number of these covariates. The axes are labeled to indicate which treatment the coefficient predicts and whether the coefficient is indicative of a good or bad survival prognosis. For example, in the treatment model, patients with a high “bladder” word-occurrence have a higher likelihood of receiving surgery; in the outcome model, patients with a high “bladder” occurrence have a lower likelihood of survival.
In Supplement E, we show the R2 correlation among all the selected covariates for each treatment group.
2.2 Evaluation of Potential Confounders
We evaluate these potential confounders by comparing the results on 3 covariate combinations:
structured: Using only the structured covariates. We use this as a baseline because these are covariates that are typically used in retrospective oncology studies and are readily available in the structured data [5].
intersect: Using only the intersection covariates identified as confounders.
struct+intersect: Using the union of the structured and intersection variables.
We then perform survival analysis using univariate Cox proportional hazard models (Cox-PH) with propensity score matching (matching), univariate Cox-PH model with inverse propensity score weighting (IPTW), and multivariate Cox-PH model with inverse propensity score weighting (multi.coxph). In Figure 3, we show the hazard ratio (HR) of the effect of treatment for each study cohort when the selected covariates are included in analysis. An HR below 1 indicates that patients with the second treatment are more likely to survive than those with the first treatments. An HR above 1 indicates the opposite, and an HR equal to 1 indicates that the two treatments are equipose. For each HR estimate, we also show the 95% confidence interval (CI). Please see Section 4.5 for more details on the methods.
We observe that with the additional covariates, we are able to shift the estimate of the HR towards the direction of the RCT for an outcome of all-cause mortality. We also compare the covariate-specific HR of each of the selected covariates in terms of univariate and multivariate Cox-PH analysis for an all-cause mortality outcome in Tables 3-6.
In Figure 3A and Table 3, we show the results with surgery vs. radiation for prostate cancer. The RCT reports no significant difference between surgery vs. radiation for localized prostate cancer [21]. With structured, we observe a significant effect that radiation is superior to surgery, a result that disagrees with most retrospective studies [5]. We observe a significant shift in the HR towards equipoise with the additional identified confounders for intersection and struct+intersect. For structured, we observe an HR of 2.51 with 95% CI (2.39-4.55) and p-value of 0.002 with multi.coxph. For struct+intersect, we estimate an HR of 1.54 with 95% CI (0.78-3.03) and p-value of 0.214 with multi.coxph. We shift the HR point estimate by 0.97, or 38.6%, towards equipose.
In Figure 3B and Table 4, we show the results of surgery vs. active monitoring for prostate cancer. Hamdy et al. [21], the RCT, reports the HR for surgery vs. active monitoring as 0.93 with 95% CI (0.65, 1.35) and p-value of 0.92. With structured, we again have a significant effect that active monitoring is superior to surgery; this disagrees with most retrospective studies [5] and Hamdy et al. [21]. We again observe a significant shift in the HR towards equipoise with the additional identified confounders. For structured, we observe an HR of 2.71 with 95% CI (1.55-4.75) and p-value < 0.001 with multi.coxph. For struct+intersect, we estimate an HR of 1.10 with 95% CI (0.55-2.21) and p-value of 0.781 with multi.coxph. We shift the HR point estimate by 1.61, or 59.1%, towards equipose.
In Figure 3C and Table 5, we show the results of radiation vs. active monitoring for prostate cancer. We do not see as significant a shift with radiation vs. active monitoring. Hamdy et al. [21] records the HR for radiation vs. active monitoring as 0.94 with 95% CI of (0.65, 1.36) and p-value of 0.92. We observe that the matching results are not very far from the RCT results matching estimated the HR closest to the RCT results when compared against IPTW and multi.coxph. All results with intersect and struct+intersect shift the HR estimate slightly towards equipose, with the most shift of 0.32 by intersect and IPTW; this is closely followed by a shift of 0.20, or 45.5%, with intersect and multi.coxph. We suspect this While the adjusted results are not as close to the RCT results as compared to Figures 3a-b, the HR estimate are all shifted towards the RCT results in terms of bias reduction for each of the data and method combination. We suspect the less significant shift may be due to the smaller dataset available for radiation vs. active monitoring or the confounding not being observable within the text.
In Figure 3D and Table 6, we show the results with surgery vs. radiation for stage I NSCLC. With structured, we observe a significant effect that surgery is superior to radiation. The results from Chang et al. [35] and clinical judgement tells us that surgery and radiation should be about equipose for stage I NSCLC. The shift is not as significant as with prostate cancer, but we also note that the established clinical standard for lung cancer is not as well studied. We do observe a more significant shift with multi.coxph there is a slight shift with IPTW and matching. For structured, we observe an HR of 0.39 with 95% CI (0.30-0.51) and p-value < 0.001 with multi.coxph. For struct+intersect, we estimate an HR of 0.54 with 95% CI (0.40-0.53) and p-value < 0.001 with multi.coxph. We shift the HR point estimate by 0.15 towards equipose. We suspect the small changes with IPTW and matching are While the adjusted results are not as close to the RCT results as compared to Figures 3a-b, the HR estimates are all shifted towards equipose in terms of bias reduction for each combination. We suspect the less significant shift is again due to the even smaller data size of stage I NSCLC. The doubly-robust method of multi.coxph seem to perform better under these settings.
Overall, our methods uncover several potential confounders that can reduce selection bias in observational data. Although our method cannot uncover all potential confounders, we are able to uncover confounders that are not usually included in expert-selected covariates. Supplementary analysis of propensity scores and covariate balance plots for each analysis are seen in Supplement D.
2.3 Potential Confounder Interpretation
We show that the potential confounders we have uncovered are interpretable through clinical expertise. We examine the effect on survival for each selected covariate in term of univariate and multivariate survival analysis with a Cox-PH model. In univariate analysis, a single covariate is regressed on the survival outcome and describes the survival with respect to a single covariate. In multivariate analysis, all the selected covariates are regressed on the survival outcome and describe each covariate’s effect on survival while adjusting for the impact of all selected covariates. For a particular variable, an HR below 1 indicates that the covariate is a positive predictor of survival, an HR above 1 indicates a negative predictor of survival, and an HR equal to 1 means that the variable does not seem to effect survival.
2.3.1 Prostate Cancer
For surgery vs. radiation and surgery vs. active monitoring, patient age, bladder, and urothelial are chosen as intersection covariates. Moreover, they are also shown to be significant through both univariate and multivariate covariate analysis in Tables 3 and 4.
Patient age is a known confounder in treatment decision and survival outcomes. Older patients are more likely to receive radiation due to surgery risk. However, older patients also have higher mortality. In Figure 2a-c, we observe that patients with higher struct:patient age, i.e. older patients, are more likely to receive radiation and a bad prognosis.
We hypothesize that text:bladder and text:urothelial are identified because prostate cancer patients often have bladder symptom issues and can also have urothelial cancer. Most retrospective prostate cancer studies have not excluded patients with early stage bladder cancer [5]. Examples of text:bladder in the clinical notes are “he notes incomplete bladder emptying”, “evidence of benign prostatic hyperplasia and chronic bladder outlet obstruction”, and “diagnosis of bladder cancer”. Examples of text:urothelial in the notes are “pathology showed high grade urothelial carcinoma with muscle present and not definitively involved”, “it was read as a high grade urothelial cancer which involved the stroma of the prostate as well as the bladder”. Patients with bladder cancer or bladder issues are more likely to get surgery than radiation. Radiation does not work well for bladder cancer. Patients with bladder problems may prefer surgery because radiation can irritate the bladder and cause urinary problems. However, these are also patients with higher mortality and more health issues. In Figure 2a-b, we observe that text:bladder and text:urothelial are more common in patients who received surgery and had a bad prognosis.
Moreover, for this particular dataset, we note that the confounding appear to be observable. The bias of surgery being worse than radiation and monitoring is due to a group of patients who are diagnosed with prostate cancer through a resection for bladder cancer or other bladder issues. When a patient with bladder cancer has a cystoprostatectomy in which the bladder and prostate are both removed, a pathologist can sometimes find a prostate tumor in the pathology specimen. Bladder cancer patients tend to be older, have more medical issues, and a higher mortality rate. However, these patients often have low clinical stage for prostate cancer. The terms bladder and urothelial describe this group of patients. Our method is able to capture some characteristics of this group and use this to reduce selection bias.
For radiation vs. active monitoring, we do not observe confounders that present a significant shift in treatment HR in Table 5. It can be that the confounding here is not as easily observable or our method is unable to identify it. We can identify interesting potential confounders, such as text:resident. From Figure 2c, we observe that text:resident is more common in patients who received radiation and had a bad prognosis. This term likely refers to both resident physicians and the patient being a resident of a long-term care facility or skilled nursing facility. Both uses of the term could reduce survival time: inpatients at teaching hospitals have much of their care delivered by resident physicians, and frequent inpatient stays or nursing facility residency could both indicate a sicker patient.
2.3.2 Lung Cancer
We examine Table 6 for the intersection covariates through univariate and multivariate analysis. We observe that some of the significant terms are patient age, male, race api, diagnosis year, alk, left.low, and severe.
We note that age, gender, race, and diagnosis year are known confounders for treatment decision and outcome.
The covariate alk points to the ALK mutation for NSCLC. About 5% of NSCLCs have a rearrangement in a gene called ALK; the ALK gene rearrangement produces an abnormal ALK protein that causes the cells to grow and spread. This change is often seen in non-smokers (or light smokers) who are younger and who have the adenocarcinoma subtype of NSCLC [38]. It’s been observed that patients with the ALK mutation have worse disease-free survival, citing higher rates of recurrence and metastasis [38]. Alternatively, we hypothesize that alk is significant because the ALK mutation is mutually exclusive from the EGFR mutation [39]. The EGFR mutation is often present in asian patients and EGFR patients typically have better survival. Hence, the significance of alk can be related to the absence of the EGFR mutation. In Figure 2d, we observe that text:alk is more common in patients who received radiation and had a bad prognosis.
The covariate left.low can point to NSCLC on the lower left node of the lung. Studies have observed that lung cancer on the lower lobe or lower left lobe has worse survival [40, 41]. This can also be related to the absence of the EGFR mutation, since EGFR mutation occur less frequently in the lower lobe [40]. In Figure 2d, we observe that text:left.low is also more common in patients who received radiation and had a bad prognosis.
The covariate text:nipple can indicate a history of breast cancer. Studies have shown that patients with a history of breast cancer are diagnosed with lower stages of NSCLC and show better prognosis when compared to women with first NSCLC, perhaps due to heightened surveillance compared to the general population [42]. In Figure 2d, we observe that text:nipple is more common in patients who received surgery and had a good prognosis; both effects have also been observed in Milano et al. [42].
The covariate text:sponge can refer to sponges used for surgical preparations. Sponge is commonly used in surgery and can be an indication that the patient has some history of receiving surgery. Patients who receive surgery tend to be healthier and have better survival. In Figure 2d, text:sponge is more common in patients who received surgery and had a good prognosis.
The covariate severe and text:rib could be pointing to a severe conditions related to lung and other problems that indicate poor overall health and performance status, which has been shown to be related to a patient’s survival outcomes [43]. Examples of text:severe include phrases such as “severe pulmonary hypertension”, “severe COPD”, or “severe emphysema”. Examples of text:rib include phrases such as “rib fractures” or “rib shadows”. In Figure 2d, we observe that both text:severe and text:rib are more common in patients who received radiation and had a bad prognosis. Similarly, we also observe other terms that could describe the severity of lung cancer - such as squamous.cell text:pulmonary.nodule, or text:silhouette - or overall health levels - textdsalert, attention. text:cyst, or text:discomfort Overall, we are able to uncover some potential confounders that are easy to interpret and capture useful clinical insights.
3 Discussion
We have demonstrated how causal inference methods can be used to draw more reliable conclusions from population-based studies. Our paper shows that 1) clinical notes, or unstructured data, can be an important source for uncovering confounders, and 2) current clinical tools can be augmented with machine learning methods to provide better decision support. Furthermore, our experimental framework can be easily adapted to use textual data to reduce selection bias in retrospective studies more generally.
Our method can be used to improve clinical practice. Due to the simplicity of the machine learning tools employed, our method can be easily implemented as an additional step in the design of observational CER studies. Our results also show that the method is generalizable to different types of cancer and for various types of study cohort comparisons. With the continued digitization of clinical notes and the increasing access to EMRs, we recommend this as an essential step for any researcher seeking to draw clinical insights from observational data. The terms uncovered with our method can not only be used to improve observational CERs, but also be used to generate interpretable insights about current clinical practice. The uncovering of relevant information and subsequent insights can then be used to inform high-stakes medical decisions.
We believe that our work is the first to explore the potential of including unstructured clinical notes to reduce selection bias in oncology settings. We are also one of the first works to incorporate unstructured data into causal inference estimators and Cox-PH models. Although our method has been developed to address a specific problem in oncology and applied in the clinical setting, it can also be easily adapted for application in any observational study that seeks to incorporate unstructured text. We propose our method as an automated selection procedure that can be used to supplement expert opinion when uncovering potential confounders for a particular observational study population. There is much work to be done in using NLP and unstructured text for causal inference. Our work present a simple and flexible way to generate interpretable causal insights from text of any sort.
Our study also has several limitations. First, we use simple NLP methods to process the clinical notes and extract the top 500 or 1000 features for variable selection. In the process, much information in the text nodes is discarded and the sequence of past medical events are not taken into account. We choose this setup due to the the small sample size of oncology study cohorts, which makes it difficult to train more complicated models for textual processing. In theory, the more work that is placed into the clinical notes preprocessing and the higher quality of the features generated from these notes, the more informative the uncovered potential confounders will be. For future work, we hope to explore how other NLP techniques, such as topic modeling or clustering, can be used to build even higher quality features from the unstructured text. There are also an increasing number of deep learning models that can be used to identify interpretable insights [19]. We are interested in how these deep learning methods can be applied to generate causal insights on another study population with larger sample size.
Fourth, we rely on the proportional hazard assumption for our Cox-PH models. In cases of many covariates, the assumption may be violated. We feel the simplicity and interpretability of the model by practitioners outweigh the increased complexity. For EMR datasets with many covariates, the assumption is often used and does not seem to present a practical issue [33]. Future work could explore alternative models that do not rely on the assumption [44].
Fifth, more work can be done to mitigate immortal time bias in our HR estimates. We discuss our approach in Section 4.2. An alternative method to address this problem would be to use a time-dependent Cox-PH model [45].
Eighth,
Third, our approach of selecting intersection covariates is an empirical approach designed for uncovering the most valuable potential confounders. While our approach seemed to work well empirically in this study, more experimentation and analysis can be done to help verify its validity in the future.
Sixth,
Seventh,
C
Finally, the validity of causal inference models cannot be determined without prospective experimental data. Therefore, the uncovered confounders and estimated HR can only be validated by clinicians. We are identifying potential candidates for the bias and then evaluating these candidates of bias against RCTs.
Many challenges still remain for employing unstructured data for causal inference analysis and medical settings. We hope this work interests both clinical practitioners augmenting existing clinical support tools and researchers using textual data to reduce confounding in observational data. We hope our workflow, problem framing, and experimental design can serve as such a sandbox for testing more complex algorithms or adapting to other application areas. Ultimately, we hope this research will find causal information in clinical notes and provide a transparent way for machine learning to inform medical decision making.
4 Methods
4.1 Dataset
With approval of the Stanford Institutional Review Board (IRB), we curate a dataset of non-metastatic prostate and lung cancer patients from the Stanford Cancer Institute Research Database (SCIRDB). The database includes patients seen in the Stanford Health Care (SHC) system from 2008 to 2019 for prostate cancer and 2000 to 2019 for lung cancer. SHC clinical sites include one academic hospital, one freestanding cancer center, and several outpatient clinics. From SCIRDB, we pull a total of 3,638 prostate cancer patients with 552,009 clinical notes and 3,274 non-small cell lung cancer (NSCLC) patients with 648,505 clinical notes. The clinical notes include progress notes, letters, discharge summaries, emergency department notes, history and physical notes, and treatment planning notes.
For each patient, we also pull the structured EMR and data from the inpatient billing system. From the California Cancer Registry (CCR), we pull the available initial treatment information, cancer staging, tumor description, date of diagnosis, date of death, and date of last follow-up for these selected patients. For NSCLC, we also pull the recorded Epic cancer staging information.
4.2 Study Cohort
We build our study cohorts from SCIRDB with reference to existing observational study principles and clinical expertise. We try our best to select patients for each treatment group built from the EMRs to match the RCTs criteria.
For each patient, we combine all treatments with the same Diagnosis ID in the CCR as the initial line of treatment. For patients with multiple diagnosis id, we keep the first record of treatment. For prostate cancer, patients without a recorded treatment are labeled as active monitoring. To avoid explicit revelation of the treatment choice, we only include notes more than 2 months before treatment start date for prostate cancer and 1 month for NSCLC. We rely on domain expertise to determine the 1 or 2-month pre-treatment cutoffs. Lung cancer patients typically have higher mortality and tend to start treatment pretty quickly. For prostate cancer, patients progress more slowly and get second opinions before making a treatment decision. We then select for patients with at least one note before the specified time. We select only patients who survived at least 6 months past their date of diagnosis to mitigate immortal-time bias [45]. Because we extract only initial treatments (rather than treatments for cancer recurrence) as recorded in SEER, most of the treatments are administered within 6 months of the diagnosis date [48]. This is similar to the setup for traditional landmark analysis [45]. To ensure the proportional hazard condition, patients who are still living are censored at time of last follow-up [49]. The patient filtering and cohort selection process are shown in Figure 4.
For patients with unknown clinical stage but known pathological stage, we impute the clinical stage by training a clinical stage classification model using the pathological stage and other patient information. Pathological stage is usually a little higher than clinical stage due to the staging based on biopsy samples instead of imaging; hence, it is inaccurate to group them together. Clinical stage is more frequently used for similar observational studies [21, 35] and it is more rigorous to impute the missing clinical stage with a model trained on the pathological stage and other relevant covariates. We train the clinical stage imputation model with patient age, pathological stage, diagnosis year, and tumor grade. For NSCLC, tumor grade is not included due to missing information. For both prostate and NSCLC, we train and validate a random forest model [50, 51] on patients with both clinical and pathological stage available. The imputed stages are used as the clinical stage for those patients. For patients with both clinical and pathological stage missing, we are able to fill in some through clinical chart reviews.
We assign patients to the treatment groups based on the initial treatment decision to capture the intent to treat rather than the actual treatments administered. We assign patients with only surgery records into the surgery group and patients with only radiation records into the radiation group. For patients with both radiation and surgery, patients who received surgery first are assigned to the surgery group and patients who received radiation first to the radiation group. For prostate cancer, patients with no recorded treatment are assigned to the active monitoring group. For NSCLC, only patients with clinical stage I are included.
4.3 Data Processing and Representation
We build the covariates used for uncovering confounders through the process shown in Figure 1A. We compile the data from SCIRDB, CCR, and Epic for each patient.
We include age, race, ethnicity, clinical stage, and diagnosis year as part of the structured data. For prostate cancer, we also include tumor grade For NSCLC, we also include gender. Based on age range categories used in Li et al. [52], we form the categorical variable patient age by splitting age into ranges of ≤ 49 years-old, 5-year buckets from 50 − 84 years-old, and ≥ 85 years-old. Race and ethnicity are encoded as one-hot vectors, with each feature indicating one race or ethnicity. Race is combined based on what is done in Li et al. [52]. We select these structured covariates because they are commonly accepted by clinicians as potential confounders and often included in CER studies [5]. For race, race unknown is not included as a covariate. For ethnicity, only hispanic is included as a covariate. For tumor grade, patients with unknown grade are imputed with the median grade value. The indicator variable grade unknown is added to indicate which patients have been imputed. The covariates tumor grade and grade unknown are not included for NSCLC due to missing information of tumor grade and clinical judgement. In the end, we have 9 structured covariates for prostate cancer and 7 structured covariates for NSCLC. [7, 8].
We build word frequency representations of the clinical notes for the unstructured covariates. For each patient, we compile notes within the specified time (i.e. 2 months prior to treatment start date for prostate cancer and 1 month prior for NSCLC). We only use notes from before treatment so that we are not predicting survival outcome with information unavailable at the time of treatment decision. The different time windows for the two diseases was selected as NSCLC treatment generally starts more quickly than prostate cancer treatment due to the more rapidly progressing nature of the cancer. The notes are segmented based on clinical field labels (e.g. “IMPRESSION:”, “HISTORY:”), tab spaces, NLTK sentence tokenization [53]. To remove noise, we remove clinical field labels and two sentences from the beginning and end of each document. We also remove sentences with common locations (e.g. “Stanford Medical Center”, “Palo Alto”) and medical doctor names (e.g. “xx xx, M.D.”) as these are often prefix or suffix to note documents. To avoid including conditions patients do not have, we remove sentences if they contain less than 15 words including a negation term (i.e. “no”, “denies”, “does not”, “none”). For example, this prevents us from extracting “smoking” as a covariates from “No history of smoking.” We then identify biomedical entities from the preprocessed clinical notes with scispaCy [54]. scispaCy is a spaCy [55] based model for processing biomedical, scientific, and clinical text. The scispaCy models identifies a list of all the entities in the text that exist in a biomedical dictionary, such as the Unified Medical Language System [56]. We then lemma-tize and combine all biomedical entities identified from the sentences for each patient into a single document. To further remove noise, we remove stop words using a combination of the NLTK stopwords [53] and data-specific stopwords such as medical units (e.g., “lb”, “oz”, “mmhg”), time terms (e.g., “months”, “days”), and medical or Stanford specific terms (e.g., “stanford”, “patient”, “doctor”) that are very common but irrelevant to the task at hand. We also create a dictionary of synonyms in the dataset and use the dictionary to combine these words. The dictionary includes lexical variations that are not reduced to the same root during lemmatization (e.g. “abnormality” → “abnormal”, “consult” → “consultation”), abbreviations (e.g. “hx” → “history”, “fu” → “followup”), and common synonyms (e.g. “assistance → “service”, “action” → “movement”).
Finally, we remove punctuation and generate term frequency representations of the text using bag-of-words (BOW) with term frequency–inverse document frequency (TF-IDF) weighting [57]. Bag-of-words (BOW) model is a simplifying representation in natural language processing. It represents text (such as sentence or document) as a vector of word occurrence count. TF-IDF, is a score that reweighs the BOW matrix to reflect how important a word is to a document in a collection or corpus. We implement this with scikit-learn [51]. For prostate cancer, we select for the top 500 most frequent features using only unigrams. For NSCLC, we select for the top 1000 most frequent features using both unigrams and bigrams, and apply a document frequency threshold strictly lower than 0.7 to filter out dataset-specific stopwords. Although there are more prostate cancer patients, the lower number of death events makes it more difficult to include as many covariates when performing survival analysis. Hence, we have 500 unstructured covariates for prostate cancer and 1000 unstructured covariates for NSCLC.
We scale and normalize both the structured and unstructured covariates before concatenating them. In total, we build 509 covariates for prostate cancer and 1007 covariates for NSCLC. These covariates are then used to uncover potential dataset-specific confounders.
4.4 Outcomes
We define our survival outcome as (Yi, Ei), where Yi ∈ Z+ is the number of survival days since the diagnosis and Ei ∈ {0, 1} is an indicator for whether a death event has been observed during follow-up. The treatment, Wi ∈ {0, 1}, is an indicator for either surgery, radiation, or monitoring, depending on the treatment group. The covariates, Xi, includes the structured dataset pulled from the EMR data and the bag-of-words matrix representation generated from EMR notes.
4.5 Uncover and Evaluate Confounders
We uncover interpretable potential confounders from the covariates and evaluate the confounders we’ve identified with survival analysis. The approach is shown in Figure 1B.
We find the potential confounders by identifying covariates that are predictive of both treatment and survival outcome. We train prediction models for treatment (Wi = 1) and the survival outcome (Yi, Ei) with Lasso [13] using glmnet [58]. Lasso is a L1−penalized linear that can produce coefficients for covariates that are exactly zero, and is, hence, often used for creating sparse models [59] or variable selection [12]. We select the intersection of covariates with non-zero coefficients from both the treatment and survival outcome models as potential confounders. For surgery vs. radiation and surgery vs. active monitoring for prostate cancer, we select the intersection covariates that correspond to the Lasso shrinkage penalty for the most regularized model such that the error is within one standard error of the minimum, lambda.1se. With radiation vs. monitoring for prostate cancer and surgery vs. radiation for stage I NSCLC, we select the intersection covariates that correspond to the shrinkage penalty that gives the minimum mean cross-validated error, lambda.min. The intersection terms selected are more stable with lambda.1se. However, we choose lambda.min for the latter two treatment groups because lambda.1se did not select any covariates from the text.
We then evaluate each of the covariate combinations with propensity score adjusted survival analysis. Propensity scores for patient i is the probability of receiving the treatment of interest, Wi = 1, given the covariates Xi [60]. Conditional on the propensity score, the distribution of observed covariates is expected to be the same in both branches of the treatment group. It is often used to reduce the effect of confounding in observationalstudies [60, 61]. In survival analysis, the hazard rate h(t|X) is the probability the patient will die within time t given covariates X. The HR is the ratio of the hazard rate of the two treatments. In survival outcomes analysis, the HR is treated as the treatment effect of choosing the treatment of interest, Wi = 1.
We use the Cox-proportional hazard (Cox-PH) model to perform survival regression [62]. We assume the proportional hazards condition [63], which states that covariates are multiplicatively related to the hazard, e.g., a covariate may halve a subject’s hazard at any given time t while the baseline hazard may vary. Hence, the effect of covariates estimated by any proportional hazards model can be reported as the HR of the covariate.
In a Cox-PH model, the hazard rate of an individual is a linear function of their static covariates and a population-level baseline hazard that changes over time. We adjust for covariates (e.g. patient age, race white, etc.) against duration of survival and a binary variable indicating whether the outcome event has occurred. We estimate where p is the number of covariates, b0(t) the baseline hazard, bW the effect size of the treatment, and bj the effect size of the jth covariate. The HR for a covariate is equal to ebi. We define the HR of the treatment as ebw. The Lasso regularization can also be applied to a Cox-PH model for variable selection.
We use 3 methods to estimate the HR:
• Nearest-Neighbor Matching on Propensity Score (matching) [16]: We perform nearest-neighbor propensity score matching (NNM) on selected covariates and estimate the HR on the matched population using a univariate Cox-PH model regressed on the treatment.
• Inverse Propensity of Treatment Weighting (IPTW) [64, 16]: We estimate the HR using a univariate Cox-PH model regressed on the treatment with inverse propensity score weighting with stabilization [64]. The weights are defined as
• Multivariate Cox Proportional Hazard (multi.coxph) [62, 65, 5]: We estimate the HR using a multivariate regression model on the treatment and selected covariates to see how covariates interact with each other. The multivariate model is also weighted with the inverse propensity scores above to form a doubly-robust model.
All Cox-PH models are trained using the survival R package [66] with robust variance. NNM is performed using the Matching R package [67].
We estimate the propensity scores using logistic regression [68] with glmnet [58], stochastic gradient boosting [69] with gbm [70], and generalized random forests with grf [25]. We select the propensity score estimation method with the best overlap and covariate balance post propensity score adjustment.
We then compare the 3 methods for estimating HR using forest plots.
For each covariate in struct+intersect, we also show the univariate and multivariate Cox-PH model HR, 95% HR confidence interval, and p-value, using the analysis presented in Bradburn et al. [65]. Note that for the multivariate Cox-PH covariate analysis, we do not weight the model with the inverse propensity scores.
5 Data availability
The datasets analyzed for the study are not publicly available. The EHR data cannot be redistributed to researchers other than those approved through the Stanford Institutional Review Board. We have therefore given detailed description of our data selection and processing pipeline in the Methods section.
6 Code availability
The training and statistical evaluation code can be made available upon request to the corresponding author.
7 Author Contributions
J.Z. contributed to data acquisition, data processing, study design, methodology, implementation, interpretation of results, and drafted and revised the paper. M.F.G. contributed to the acquisition of data, study design, interpretation of results, and revised the paper draft. D.L.R. contributed to the acquisition of data, provision of computational resources, methodology, and revised the paper draft. S.A. contributed to study design, methodology, interpretation of results, and revised the paper draft. R.D.S. contributed to the acquisition of data, study design, methodology, interpretation of results, and revised the paper draft. All authors contributed to the study conception, provided feedback during the work development, and gave approval for the submission.
Data Availability
The data is not available for sharing.
8 Competing Interests
The authors declare that there are no competing interests.
a Glossary of Structured Covariates
Table 7 shows a glossary of the structured features extracted from the EMRs.
B approach identifies covariates thaDictionary of Synonymns
• {“abnormality” → “abnormal”, “admission” → “admit”, “assistance” → “assistant”, “bilateral” → “bilaterally”, “bleeding” → “bleed”, “consult” → “consultation”, “diagnostic” → “diagnosis”, “evaluate” → “evaluation”, “hx” → “history”, “functional” → “function”, “fu” → “followup”, “gentleman” → “man”, “disease” → “illness”, “imaging” → “image”, “improvement” → “improve”, “invasion” → “invasive”, “action” → “movement”, “neurologic” → “neurological”, “operative” → “operation”, ”polyps” → “polyp”, “postop” → “postoperative”, “pulse” → “rate”, “reaction” → “reactive”, “refer” → “referral”, “removal” → “remove”, “resp” → “respiratory”, “smoke” → “smoking”, “assistance” → “service”, “spinal” → “spine”, “surgical” → “surgery”, “assessment” → “test”, “testing” → “test”, “therapeutic” → “therapy”, “treat” → “treatment”, “visualize” → “visual”}
C Confounders
Confounding is a major challenge when estimating causal effect from observational studies. The structure of confounding can be represented by causal diagrams. In Figure 5, we present a series of Directed Acyclic Graphs (DAGs) that show different causal structures with potential confounding, based on the examples in Herńan and Robins [11], Chapter 7. In the following diagrams, we define Y as the outcome, W as the treatment, and X as the covariate that has been identified as a potential confounder.
Figure 5A shows the most natural case of confounding. Treatment W is a cause of outcome Y and confounder X is a cause of both W and Y. Therefore, the association between W and Y includes both the direct causal effect and an indirect “backdoor” path from W to Y through X. Conditioning on the confounder X blocks this second path, allowing the accurate estimation of the causal effect of W on Y. Examples of this type of confounder include fixed patient characteristics such as a patient’s age or cancer clinical stage. For example, older patients can have worse survival outcomes, so doctors might assign different treatments to older patients; cancer patients with higher clinical stage can also have worse survival outcomes, affecting doctors’ treatment decisions. Our method is designed to uncover and adjust for this type of confounder, as is appropriate.
Figures 5B and 5C show different structures where W is a cause of Y, where X is a cause of one and associated with the other through an unmeasured cause U. In these cases, conditioning on confounder X blocks the “backdoor” path between W and Y ; in such cases, conditioning on X is necessary to avoid bias in estimating the causal effect of W on Y, since part of the correlation between X and Y arises due to the relationship between (U) and both W and Y. Confounding of type 5B can arise when U represents a type of lung cancer mutation. Even if a mutation test is not performed (and so the mutation is unobserved), the mutation (U) affects the symptoms recorded in the notes (X), as well as the patient outcomes (Y). An example of 5B uncovered from our NSCLC study is left.low, tumor location on the left lower lobe. Location of cancer (X) directly effects treatment decisions, and EGFR mutated lung cancer (U) is less likely to be positioned in the lower lobe [40]. For these examples, conditioning on the text describing the cancer location is appropriate, and is necessary if the backdoor path is not blocked by other covariates X.
Figure 5D shows a structure where W is a cause of Y, and X is a cause of neither of them. In 5D, conditioning on X will not eliminate the backdoor path between W and Y, but it might reduce its effect. Confounding of type 5D may arise when U represents patient “performance status” (a measure of how the disease impacts the patient’s daily living abilities), which is not typically recorded in the patient’s chart as a structured field, but can directly effect both treatment decision and survival outcome [43]. Some examples of 5D uncovered from our NSCLC study include discomfort, alert, and attention.
Figure 5E shows a structure where W is a cause of X, but not a cause of Y. In such cases, conditioning on X introduces a backdoor path between W and Y. Confounding of the type shown in 5E can arise when X represents short-term effects post treatment and U represents patient health status. Our study design avoids such cases by only considering pre-treatment covariates.
Figure 5F shows a structure where W and Y are causes of X. In this case, the covariate X is referred to as a collider, and conditioning on X introduces a backdoor path between W and Y. Our study design also avoids such cases by only considering pre-treatment covariates.
Figure 5G show another structure where W is a cause of Y, and X is a cause of neither of them. Unlike 5D, conditioning on X in 5G opens a backdoor path between W and Y, introducing bias into the causal estimate of the effect of W on Y. Situations such as 5G are a potential limitation of our methodology, but in some cases they can be recognized through inspection and reasoning. Examples of 5G are words selected in earlier iterations of our study such as menlo, referring to the location Menlo Park, CA. A patient’s education level (U1) can effect both their treatment preference and also their living location; a patient’s socioeconomic status (U2) can effect their survival outcome and living location.
Although menlo is associated with treatment and outcome through U1 and U2, treating it as a confounder would introduce bias. We were able to filter out some of these terms by selecting only biomedical-related terms. Clinical expertise is needed to avoid scenario 5G.
D#Propensity Score and Covariate Balance Plots
We show the propensity scores and covariate balance plots for each of the results plotted in Figure 3. In Figure 6, we show the plots for structured. In Figure 7, we show the plots for intersect. In Figure 8, we show the plots for struct+intersect.
E#Covariate Correlation
We show the R2 values for all combinations of the selected covariates for each of the treatment groups. The R2 is the square of the correlation from linear regression. It measures the proportion of variation in the dependent variable that can be attributed to the independent variable. In Table 8, we show the R2 values for surgery vs. radiation for prostate cancer. In Table 9, we show the R2 values for surgery vs. monitoring for prostate cancer. In Table 10, we show the R2 values for radiation vs. monitoring for prostate cancer. In Table 11, we show the R2 values for surgery vs. radiation for NSCLC.
Acknowledgements
For data acquisition, we also thank A. Solomon Henry and Douglas Wood. The research is supported by funding from the Stanford Human-Centered Artificial Intelligence Institute and Department of Management Science and Engineering.
Footnotes
We added an additional supplement and corrections/revisions throughout the paper.
References
- [1].↵
- [2].↵
- [3].↵
- [4].↵
- [5].↵
- [6].↵
- [7].↵
- [8].↵
- [9].↵
- [10].↵
- [11].↵
- [12].↵
- [13].↵
- [14].↵
- [15].↵
- [16].↵
- [17].↵
- [18].↵
- [19].↵
- [20].↵
- [21].↵
- [22].↵
- [23].↵
- [24].↵
- [25].↵
- [26].↵
- [27].↵
- [28].↵
- [29].↵
- [30].↵
- [31].↵
- [32].↵
- [33].↵
- [34].↵
- [35].↵
- [36].↵
- [37].↵
- [38].↵
- [39].↵
- [40].↵
- [41].↵
- [42].↵
- [43].↵
- [44].↵
- [45].↵
- [46].↵
- [47].↵
- [48].↵
- [49].↵
- [50].↵
- [51].↵
- [52].↵
- [53].↵
- [54].↵
- [55].↵
- [56].↵
- [57].↵
- [58].↵
- [59].↵
- [60].↵
- [61].↵
- [62].↵
- [63].↵
- [64].↵
- [65].↵
- [66].↵
- [67].↵
- [68].↵
- [69].↵
- [70].↵