Abstract
In medicine, randomized clinical trials (RCT) are the gold standard for informing treatment decisions. Observational comparative effectiveness research (CER) is often plagued by selection bias, and expert-selected covariates may not be sufficient to adjust for confounding. We explore how the unstructured clinical text in electronic medical records (EMR) can be used to reduce selection bias and improve medical practice. We develop a method based on natural language processing to uncover interpretable potential confounders from the clinical text. We validate our method by comparing the hazard ratio (HR) from survival analysis with and without the confounders against the results from established RCTs. We apply our method to four study cohorts built from localized prostate and lung cancer datasets from the Stanford Cancer Institute Research Database and show that our method adjusts the HR estimate towards the RCT results. We further confirm that the uncovered terms can be interpreted by an oncologist as potential confounders. This research helps enable more credible causal inference using data from EMRs, offers a transparent way to improve the design of observational CER, and could inform high-stake medical decisions. Our method can also be applied to studies within and beyond medicine to extract important information from observational data to support decisions.
1 Introduction
As the number of highly targeted cancer treatments increase, it is increasingly difficult for oncologists to decide on optimal treatment practices. In the recent years, medicine has seen the reversal of 146 standard medical practices [1], and many unanswered questions remain on treatment decisions in oncology. The gold standard for assessing treatment effects is randomized clinical trials (RCT). However, RCTs can be very expensive, time-consuming, and limited by the lack of external validity [2, 3]. Hence, there has been a growing interest in using observational data to compare and evaluate the effectiveness of clinical interventions, also known as comparative effectiveness research (CER) [2].
Many studies have used large-scale observational registries such as the Surveillance, Epidemiology, and Ends Results (SEER) and National Cancer Data Base (NCDB) to perform CER. However, such studies may be unreliable due to the systemic bias present in observational data and the presence of unmeasured confounders [1, 2, 4]. Moreover, population-based CERs in oncology often also face small data challenges. Electronic medical records (EMRs) are another source of rich observational information on patient demographics and past medical history. We hypothesize that the more detailed unstructured data present in EMRs can be harnessed to reduce confounding compared to prior CER studies.
We study how EMRs, especially clinical text, can be used to reduce selection bias in observational CER studies and better inform treatment decisions in oncology. A confounder is a variable that is associated with both treatment assignment and the potential outcomes a subject would have under different treatment regimes. In the presence of confounders, the correlation between treatment assignment and outcomes cannot be interpreted as causal. One way that confounding may arise is when patients are selected for a treatment group on the basis of the severity of their illness. In such a case, failing to adjust for patient severity can lead to selection bias when attempting to estimate causal effects. For example, surgery tends to be performed on younger or healthier patients; certain doctors or institutions may prefer one treatment over another, and this creates confounding if those doctors or institutions treat patients with systematically different severity. Studies based on a small set of covariates tend not to capture the important confounders and result in biased estimates [5, 6]. Observational studies are more reliable when we can better control for these confounders. EMR clinical text is a potential source of additional information about factors that might relate to both treatment assignment and prognosis.
We use natural language processing (NLP) to uncover interpretable potential confounders from the EMRs for treatment decisions. For high-stake settings such as cancer treatment decisions, it is important to design models that are interpretable for trust and understanding [7]. NLP can be used to process the unstructured clinical notes and create covariates. We then augment our dataset with covariates that impact both treatment assignment and patient outcomes, where attempting to estimate causal effects while omitting such variables leads to biased estimates [8, 9]. Finally, we use methods designed to estimate causal effects in observational studies with observed confounders to estimate treatment effects in our augmented data set. We show that controlling for these confounders appears to reduce selection bias when compared against the results from established RCTs and clinical judgement.
We apply our method to localized prostate and lung cancer patients. Based on cohorts from established RCTs, we built four treatment groups for comparison. We uncovered inter-pretable potential confounders from clinical text and validated the potential confounders against the results from the RCTs. Simple NLP techniques (e.g. lemmatization, entity identification) were used to construct a bag-of-words representation of the frequently occurring terms. A Lasso model [10] was then used to select the terms that are predictive of both the treatment and survival outcome as potential confounders. Finally, we validated our method by comparing the hazard ratio (HR) from survival analysis with and without the confounders.
Our main contribution is presenting an approach to uncover interpretable potential confounders from clinical text. Existing work in observational causal inference rarely employs unstructured data [11, 12, 13, 14], and most NLP studies on clinical text focus on prediction settings [15, 16, 17, 18]. Our paper is the first to uncover interpretable potential confounders from clinical notes for causal analysis on cancer therapies, and one of the few works that combines NLP and causal inference in a time-to-event setting. Our method allows researchers to extract and control for confounders that are not typically available. This appears to be a useful step for future observational CER studies to help reduce selection bias unique to that dataset. The research presented can help unlock the potential of clinical notes to help clinicians understand current clinical practice and support future medical decisions.
1.1 Related Work
In the past decade, there has been a growing interest in using observational data for clinical decision making and causal inference in oncology [2]. However, such studies are often unreliable, and many observational studies have been refuted by RCTs soon after [2, 4]. For example, Yeh et al. [6] performed a comparison of surgery vs. radiotherapy for oropharynx cancer and suggested that surgery may be superior to radiation for quality of life outcomes. A few years later, this claim was refuted by an RCT study Nichols et al. [19], which showed that radiation is in fact superior to surgery in terms of 1-year quality of life scores. A similar example is seen with prostate cancer. In 2016, Wallis et al. [5] showed through population-based studies that surgery is superior to radiation for earlystage prostate cancer for overall and prostate-cancer specific survival; a few months later, the finding was refuted by Hamdy et al. [20], which showed that surgery and radiation are equivalent in terms of overall and prostate-cancer specific survival. Many other studies have shown the fallibility of population CERs that rely on expert-curated features to draw conclusions about treatment effects [2, 21].
Beyond clinical studies, there is a relatively large literature on performing causal inference from observational data. Various papers have explored how to correct for bias when evaluating average treatment effect (ATE) from observational studies with propensity score matching or weighting [22, 11, 12]; see [23] for a review. There is also a growing amount of literature that adapts machine learning models, such as random forest or regularized regression, for doubly-robust ATE estimation in high-dimensional settings [24, 25, 26, 27, 9, 28]. However, most of the methods do not include unstructured data. Veitch et al. [29] is one of the few works that employ unstructured data for causal inference; however, they rely on black-box models that are not interpretable. Moreover, many causal inference methods are developed for continuous outcomes and do not transfer easily to the time-to-event outcomes for survival analysis used in oncology. Of the ones that perform causal inference on time-to-event outcomes for medical applications [13, 14], we did not find any that include unstructured data in a systematic way. Austin [14] presents methods for using propensity scores to reduce bias in observational studies with time-to-event outcomes. Our study leverages some of the ideas and methods in this literature to develop our approach for identifying and evaluating the potential confounders from the unstructured clinical notes.
There is also a growing literature that seeks to better employ EMRs for clinical tasks. Existing work has employed structured EMR data and unstructured clinical notes for survival prediction and analysis [17, 30], predicting metastatic recurrence [15, 16], clinical risk prediction [31], and prediction of multiple medical events [18]. However, most current work involving EMRs focuses on prediction tasks. In studies that include the unstructured notes, most use deep learning to produce context-rich embedding representations of words or documents [17, 18]. While these representations are highly accurate for prediction tasks, they are often black-box and very difficult to interpret for causal insights. Our approach differs in that we use simple NLP techniques (e.g. entity identification, bag-of-words) to generate matrix representations that can be easily mapped to specific words and phrases. This increases the interpretability of our method and allows us to explain our confounders to clinicians.
Our study advances both the clinical and causal inference literature by using NLP to perform causal inference on clinical text in time-to-event settings. We hope this will inform clinical practice and improve patient outcomes.
2 Results
We apply our methods to localized prostate and stage I non-small cell lung cancer (NSCLC) patients and compare the results against established RCTs. We select these diseases due to data availability and having established clinical RCTs for validation. After filtering and assignment, we include 1,822 patients for prostate cancer, with 988 surgery patients, 385 radiation patients, and 449 active monitoring patients; the average follow-up time is 4.11 years. For stage I NSCLC, we include 749 patients, with 492 surgery patients and 257 radiation patients; the average follow-up time is 4.96 years. The patient characteristic descriptions of the prostate cancer cohort are shown in Table 1 and the NSCLC cohort are shown in Table 2.
We use the findings from established RCTs and clinical judgement as a benchmark for evaluating our results. For localized prostate cancer, Hamdy et al. [20] compared active monitoring, radical prostatectomy, and external-beam radiotherapy. A total of 1,643 patients were included in the study, with 553 men assigned to surgery, 545 men assigned to radiotherapy, and 545 men to active monitoring. They observed no significant difference among the groups for prostate-cancer or all-cause mortality (P = 0.48 and P = 0.87 respectively). Similarly, a recent study showed that difference in treatment effects for surgery vs. radiation observed from observational studies is entirely due to treatment selection bias [32]. For stage I NSCLC, The Chang et al. [33] study is a pooled study comparing stereotactic ablative radiotherapy (SABR) to surgery. A total of 58 patients were included, with 31 patients assigned to SABR and 27 to surgery. The study observed that SABR had slightly better overall survival than surgery (P = 0.037), but claims to be consistent with the clinical judgement that surgery is equipoise to radiation.
Following the design of Hamdy et al. [20] and Chang et al. [33], we evaluate our results for the following four treatment groups for an outcome of all-cause mortality:
surgery vs. radiation for prostate cancer
surgery vs. monitoring for prostate cancer
radiation vs. monitoring for prostate cancer
surgery vs. radiation for stage I NSCLC
We do not analyze other treatment groups for lung cancer due to patient count constraints. Our approach identifies covariates that are likely potential confounders in this particular dataset from the high-dimensional and high-noise EMR data. These covariates are interpretable as they are represented by structured data or words from a bag-of-words matrix. To evaluate the effectiveness of the potential confounders selected in the model, we use these potential confounders to perform survival analysis for the treatment groups for prostate and stage I NSCLC. We compare the results of various methods for time-to-event analysis in terms of HR. Although we cannot know what the true HR is, we suggest that using medical notes improves on the traditional covariates. We compare our results against existing RCTs to evaluate how the confounders we have uncovered can help correct for selection bias. The overall workflow is shown in Figure 1.
2.1 Potential Confounders
We show that our methods uncover terms that are predictive of both the treatment and survival outcome. Hence, these are potential confounders that should be controlled for in observational CERs to reduce selection bias.
We select the intersection covariates from our treatment and outcome prediction models as the potential confounders. We base this idea on the selection of union variables to reduce confounding when performing causal inference on observational data in the case of continuous outcomes [9]. However, in survival analysis, it is recommended that the covariates analyzed be constrained by the statistical 1 in 10/20 rule of thumb with respect to the the event count [34, 35]. In our high-dimensional setting, the union of covariates that are predictive of treatment and outcome yield too many potential confounders relative to the sample size. Hence, we use the intersect as a heuristic to focus on the most important confounders.
In Figure 2, we illustrate the unpenalized coefficients of covariates from two models, the treatment assignment model and the survival outcome model. For each covariate, the x-axis plots the coefficient from the treatment prediction model while the y-axis plots the coefficient from the survival outcome model. Each covariate is labeled by the text next to it. The intersection covariates, intersect, are shown in blue; these are the covariates that have strong effects in both models. For the structured covariates, we illustrate in black the coefficients for the covariates that were not selected; these coefficients are closer to at least one of the axes in the figure. We do not illustrate the coefficients for unstructured covariates that are not selected, as there are a large number of these covariates. The axes are labeled to indicate which treatment the coefficient predicts and whether the coefficient is indicative of a good or bad survival prognosis. For example, in the treatment model, patients with a high “bladder” word-occurrence have a higher likelihood of receiving surgery; in the outcome model, patients with a high “bladder” occurrence have a lower likelihood of survival.
In Supplementary A.2, we show the R2 correlation among all the selected covariates for each treatment group.
2.2 Evaluation of Potential Confounders
We evaluate these potential confounders by comparing the results on 3 covariate combinations:
structured: Using only the structured covariates. We use this as a baseline because these are covariates that are typically used in retrospective oncology studies and are readily available in the structured data [5].
intersect: Using only the intersection covariates identified as confounders.
struct+intersect: Using the union of the structured and intersection variables.
In Figure 3, we show the hazard ratio (HR) of the effect of treatment for each study cohort when the selected covariates are included in analysis. An HR below 1 indicates that patients with the second treatment are more likely to survive than those with the first treatments. An HR above 1 indicates the opposite, and an HR equal to 1 indicates that the two treatments are equipose.
We observe that with the additional covariates, we are able to shift the estimate of the HR towards the direction of the RCT for an outcome of all-cause mortality. We also compare the covariate-specific HR of each of the selected covariates in terms of univariate and multivariate Cox-PH analysis for an all-cause mortality outcome in Tables 3–6.
In Figure 3A and Table 3, we show the results with surgery vs. radiation for prostate cancer. The RCT reports no significant difference between surgery vs. radiation for localized prostate cancer [20]. With structured, we observe a significant effect that radiation is superior to surgery, a result that disagrees with most retrospective studies [5]. We observe a significant shift in the HR towards equipoise with the additional identified confounders for intersection and struct+intersect. For structured, we observe an HR of 2.51 with 95% CI (2.39-4.55) and p-value of 0.002 with multi.coxph. For struct+intersect, we estimate an HR of 1.54 with 95% CI (0.78-3.03) and p-value of 0.214 with multi.coxph. We shift the HR point estimate by 0.97, or 38.6%, towards equipose.
In Figure 3B and Table 4, we show the results of surgery vs. active monitoring for prostate cancer. Hamdy et al. [20], the RCT, reports the HR for surgery vs. active monitoring as 0.93 with 95% CI (0.65, 1.35) and p-value of 0.92. With structured, we again have a significant effect that active monitoring is superior to surgery; this disagrees with most retrospective studies [5] and Hamdy et al. [20]. We again observe a significant shift in the HR towards equipoise with the additional identified confounders. For structured, we observe an HR of 2.71 with 95% CI (1.55-4.75) and p-value < 0.001 with multi.coxph. For struct+intersect, we estimate an HR of 1.10 with 95% CI (0.55-2.21) and p-value of 0.781 with multi.coxph. We shift the HR point estimate by 1.61, or 59.1%, towards equipose.
In Figure 3C and Table 5, we show the results of radiation vs. active monitoring for prostate cancer. We do not see as significant a shift with radiation vs. active monitoring. Hamdy et al. [20] records the HR for radiation vs. active monitoring as 0.94 with 95% CI of (0.65, 1.36) and p-value of 0.92. We observe that the matching results are not very far from the RCT results. All results with intersect and struct+intersect shift the HR estimate slightly towards equipose, with the most shift of 0.32 by intersect and IPTW. We suspect this may be due to the smaller dataset available for radiation vs. active monitoring or the confounding not being observable within the text.
In Figure 3D and Table 6, we show the results with surgery vs. radiation for stage I NSCLC. With structured, we observe a significant effect that surgery is superior to radiation. The results from Chang et al. [33] and clinical judgement tells us that surgery and radiation should be about equipose for stage I NSCLC. The shift is not as significant as with prostate cancer, but we also note that the established clinical standard for lung cancer is not as well studied. We do observe a more significant shift with multi.coxph; there is a slight shift with IPTW and matching. For structured, we observe an HR of 0.39 with 95% CI (0.30-0.51) and p-value < 0.001 with multi.coxph. For struct+intersect, we estimate an HR of 0.54 with 95% CI (0.40-0.53) and p-value < 0.001 with multi.coxph. We shift the HR point estimate by 0.15 towards equipose. We suspect the small changes with IPTW and matching are again due to the even smaller data size of stage I NSCLC. The doubly-robust method of multi.coxph seem to perform better under these settings.
Overall, our methods uncover several potential confounders that can reduce selection bias in observational data. Although our method cannot uncover all potential confounders, we are able to uncover confounders that are not usually included in expert-selected covariates. Supplementary analysis of propensity scores and covariate balance plots for each analysis are seen in Supplementary A.1.
2.3 Potential Confounder Interpretation
We show that the potential confounders we have uncovered are interpretable through clinical expertise. We examine the effect on survival for each selected covariate in term of univariate and multivariate survival analysis with a Cox-PH model. In univariate analysis, a single covariate is regressed on the survival outcome and describes the survival with respect to a single covariate. In multivariate analysis, all the selected covariates are regressed on the survival outcome and describe each covariate’s effect on survival while adjusting for the impact of all selected covariates. For a particular variable, an HR below 1 indicates that the covariate is a positive predictor of survival, an HR above 1 indicates a negative predictor of survival, and an HR equal to 1 means that the variable does not seem to effect survival.
2.3.1 Prostate Cancer
For surgery vs. radiation and surgery vs. active monitoring, pat age, bladder, and urothelial are chosen as intersection covariates. Moreover, they are also shown to be significant through both univariate and multivariate covariate analysis in Tables 3 and 4.
Patient age is a known confounder in treatment decision and survival outcomes. Older patients are more likely to receive radiation due to surgery risk. However, older patients also have higher mortality.
Patients with bladder cancer or bladder issues are more likely to get surgery than radiation. Radiation does not work well for bladder cancer. Patients with bladder problems may prefer surgery because radiation can irritate the bladder and cause urinary problems. However, these are also patients with higher mortality and more health issues.
Moreover, for this particular dataset, we note that the confounding appear to be observable. The bias of surgery being worse than radiation and monitoring is due to a group of patients who are diagnosed with prostate cancer through a resection for bladder cancer or other bladder issues. When a patient with bladder cancer has a cystoprostatectomy in which the bladder and prostate are both removed, a pathologist can sometimes find a prostate tumor in the pathology specimen. Bladder cancer patients tend to be older, have more medical issues, and a higher mortality rate. However, these patients often have low clinical stage for prostate cancer. The terms bladder and urothelial describe this group of patients. Our method is able to capture some characteristics of this group and use this to reduce selection bias.
For radiation vs. active monitoring, we do not observe confounders that are significant in Table 5. bladder is picked up but does not seem to have a significant effect. We do see some interesting terms that could indicate poor health, such as abnormal, dysfunction, or wound. We also observe selection of medical conditions that could confound treatment, such as lymphoma or oropharynx. It can be that the confounding here is not as easily observable or our method is unable to identify it.
2.3.2 Lung Cancer
We examine Table 6 for the intersection covariates through univariate and multivariate analysis. We observe that some of the significant terms are pat_age, male, race_api, diagnosis_year, alk, left.low, and severe.
We note that age, gender, race, and diagnosis year are known confounders for treatment decision and outcome.
The covariate alk points to the ALK mutation for NSCLC. About 5% of NSCLCs have a rearrangement in a gene called ALK; the ALK gene rearrangement produces an abnormal ALK protein that causes the cells to grow and spread. This change is often seen in non-smokers (or light smokers) who are younger and who have the adenocarcinoma subtype of NSCLC [36]. We hypothesize that alk is significant because the ALK mutation is mutually exclusive from the EGFR mutation [37]. The EGFR mutation is often present in asian patients and EGFR patients typically have better survival. Hence, the significance of alk can be due to picking up on the absence of the EGFR mutation.
The covariate left.low can point to NSCLC on the lower left node of the lung. Studies have observed that lung cancer on the lower lobe or lower left lobe has worse survival [38, 39]. The covariate severe could be pointing to a severe conditions related to lung and other problems that indicate poor overall health. Similarly, we also observe other terms that could describe the severity of lung cancer - such as squamous.cell, pulmonary.nodule, or silhouette - or overall health levels - alert, attention, cyst, or discomfort.
Overall, we are able to uncover some potential confounders that are easy to interpret and capture useful clinical insights.
3 Discussion
We have demonstrated how causal inference methods can be used to draw more reliable conclusions from population-based studies. Our paper shows that 1) clinical notes, or unstructured data, can be an important source for uncovering confounders, and 2) current clinical tools can be augmented with machine learning methods to provide better decision support. Furthermore, our experimental framework can be easily adapted to use textual data to reduce selection bias in retrospective studies more generally.
Our method can be used to improve clinical practice. Due to the simplicity of the machine learning tools employed, our method can be easily implemented as an additional step in the design of observational CER studies. Our results also show that our method is generalizable to different types of cancer and for various types of study cohort comparisons. With the continued digitization of clinical notes and the increasing access to EMRs, we recommend this as an essential step for any researcher seeking to draw clinical insights from observational data. The terms uncovered with our method can not only be used to improve observational CERs, but also be used to generate interpretable insights about current clinical practice. The uncovering of relevant information and subsequent insights can then be used to inform high-stakes medical decisions.
We believe that our work is the first to explore the potential of including unstructured clinical notes to reduce selection bias in oncology settings. We are also one of the first works to incorporate unstructured data into causal inference estimators and Cox-PH models. Although our method has been developed to address a specific problem in oncology and applied in the clinical setting, it can also be easily adapted for application in any observational study that seeks to incorporate unstructured text. There is much work to be done in using NLP and unstructured text for causal inference. Our work present a simple and flexible way to generate interpretable causal insights from text of any sort. Our study also has several limitations. First, we use simple NLP methods to process the clinical notes and extract the top 500 or 1000 features for variable selection. In the process, much information in the text nodes is discarded and the sequence of past medical events are not taken into account. We choose this setup due to the the small sample size of oncology study cohorts, which makes it difficult to train more complicated models for textual processing. In theory, the more work that is placed into the clinical notes preprocessing and the higher quality of the features generated from these notes, the more informative the uncovered potential confounders will be. For future work, we hope to explore how other NLP techniques, such as topic modeling or clustering, can be used to build even higher quality features from the unstructured text. There are also an increasing number of deep learning models that can be used to identify interpretable insights [18]. We are interested in how these deep learning methods can be applied to generate causal insights on another study population with larger sample size.
Second, our method can only uncover potential confounders that can be observed in notes. There are many sources of confounding in observational data and even rich EMR data cannot capture everything. If the confounding is unknown and unobservable, no method to our knowledge will be able to adjust for it. Hence, it would be good practice to perform sensitivity analysis to evaluate the result’s robustness to unknown confounding. Third, our approach of selecting intersection covariates is an empirical approach designed for uncovering the most valuable potential confounders. While our approach seemed to work well empirically in this study, one could do simulation studies to help verify its validity in the future.
Finally, the validity of causal inference models cannot be determined without prospective experimental data. Therefore, the uncovered confounders and estimated HR can only be validated by clinicians. We are identifying potential candidates for the bias and then evaluating these candidates of bias against RCTs.
Many challenges still remain for employing unstructured data for causal inference analysis and medical settings. We hope this work interests both clinical practitioners augmenting existing clinical support tools and researchers using textual data to reduce confounding in observational data. We hope our workflow, problem framing, and experimental design can serve as such a sandbox for testing more complex algorithms or adapting to other application areas. Ultimately, we hope this research will find causal information in clinical notes and provide a transparent way for machine learning to inform medical decision making.
4 Methods
4.1 Dataset
With approval of the Stanford Institutional Review Board (IRB), we curate a dataset of non-metastatic prostate and lung cancer patients from the Stanford Cancer Institute Research Database (SCIRDB). The database includes patients seen in the Stanford Health Care (SHC) system from 2008 to 2019 for prostate cancer and 2000 to 2019 for lung cancer. SHC clinical sites include one academic hospital, one freestanding cancer center, and several outpatient clinics. From SCIRDB, we pull a total of 3,638 prostate cancer patients with 552,009 clinical notes and 3,274 non-small cell lung cancer (NSCLC) patients with 648,505 clinical notes. The clinical notes include progress notes, letters, discharge summaries, emergency department notes, history and physical notes, and treatment planning notes.
For each patient, we also pull the structured EMR and data from the inpatient billing system. From the California Cancer Registry (CCR), we pull the available initial treatment information, cancer staging, tumor description, date of diagnosis, date of death, and date of last follow-up for these selected patients. For NSCLC, we also pull the recorded Epic cancer staging information.
4.2 Study Cohort
We build our study cohorts from SCIRDB with reference to existing observational study principles and clinical expertise.
For each patient, we combine all treatments with the same Diagnosis ID in the CCR as the initial line of treatment. For patients with multiple diagnosis id, we keep the first record of treatment. For prostate cancer, patients without a recorded treatment are labeled as active monitoring. To avoid explicit revelation of the treatment choice, we only include notes more than 2 months before treatment start date for prostate cancer and 1 month for NSCLC. We then select for patients with at least one note before the specified time. We also perform landmark analysis to avoid immortal-time bias [40] and select only patients who survived at least 6 months past their date of diagnosis. Landmark analysis refers to the practice of designating a time point during the follow-up period and analyzing only patients who have survived until the landmark time [40]. To ensure the proportional hazard condition, patients who are still living are censored at time of last follow-up [41]. The patient filtering and cohort selection process is shown in Figure 4.
For patients with unknown clinical stage but known pathological stage, we impute the clinical stage by training a clinical stage classification model using the pathological stage and other patient information. Pathological stage is usually a little higher than clinical stage due to the staging based on biopsy samples instead of imaging; hence, it is inaccurate to group them together. Clinical stage is more frequently used for similar observational studies [20, 33] and it is more rigorous to impute the missing clinical stage with a model trained on the pathological stage and other relevant covariates. We train the clinical stage imputation model with pat_age, pathological_stage, diagnosis_year, and tumor_grade. For NSCLC, tumor_grade is not included due to missing information. For both prostate and NSCLC, we train and validate a random forest model [42, 43] on patients with both clinical and pathological stage available. The imputed stages are used as the clinical stage for those patients. For patients with both clinical and pathological stage missing, we are able to fill in some through clinical chart reviews.
To form the treatment groups, we assign patients with only surgery records into the surgery group and patients with only radiation records into the radiation group. For patients with both radiation and surgery, patients who received surgery first are assigned to the surgery group and patients who received radiation first to the radiation group. For prostate cancer, patients with no recorded treatment are assigned to the active monitoring group. For NSCLC, only patients with clinical stage I are included.
4.3 Data Processing and Representation
We build the covariates used for uncovering confounders through the process shown in Figure 1A. We compile the data from SCIRDB, CCR, and Epic for each patient.
We include age, race, ethnicity, clinical stage, and diagnosis year as part of the structured data. For prostate cancer, we also include tumor grade. For NSCLC, we also include gender. Based on age range categories used in Li et al. [44], we form the categorical variable pat_age by splitting age into ranges of ≤ 49 years-old, 5-year buckets from 50 – 84 years-old, and ≥ 85 years-old. Race and ethnicity are encoded as one-hot vectors, with each feature indicating one race or ethnicity. Race is combined based on what is done in Li et al. [44]. We select these structured covariates because they are commonly accepted by clinicians as potential confounders and often included in CER studies [5]. For race, race_unknown is not included as a covariate. For ethnicity, only hispanic is included as a covariate. For tumor grade, patients with unknown grade are imputed with the median grade value. The indicator variable grade_unknown is added to indicate which patients have been imputed. The covariates tumor_grade and grade_unknown are not included for NSCLC due to missing information of tumor grade and clinical judgement. In the end, we have 9 structured covariates for prostate cancer and 7 structured covariates for NSCLC.
We build word frequency representations of the clinical notes for the unstructured covariates. For each patient, we compile notes within the specified time (i.e. 2 months prior to treatment start date for prostate cancer and 1 month prior for NSCLC). We only use notes from before treatment so that we are not predicting survival outcome with information unavailable at the time of treatment decision. The different time windows for the two diseases was selected as NSCLC treatment generally starts more quickly than prostate cancer treatment due to the more rapidly progressing nature of the cancer. The notes are segmented based on clinical field labels (e.g. “IMPRESSION:”, “HISTORY:”), tab spaces, NLTK sentence tokenization [45]. To remove noise, we remove clinical field labels and two sentences from the beginning and end of each document. We also remove sentences with common locations (e.g. “Stanford Medical Center”, “Palo Alto”) and medical doctor names (e.g. “xx xx, M.D.”) as these are often prefix or suffix to note documents. To avoid including conditions patients do not have, we remove sentences if they contain a negation term (i.e. “no”, “denies”, “does not”, “none”) and contain less than 15 words.
We then identify biomedical entities from the preprocessed clinical notes with scispaCy [46]. scispaCy is a spaCy [47] based model for processing biomedical, scientific, and clinical text. The scispaCy models identifies a list of all the entities in the text that exist in a biomedical dictionary, such as the Unified Medical Language System [48]. We then lemmatize and combine all biomedical entities identified from the sentences for each patient into a single document. To further remove noise, we remove stop words using a combination of the NLTK stopwords [45] and data-specific stopwords such as medical units (e.g., “lb”, “oz”, “mmhg”), time terms (e.g., “months”, “days”), and medical or Stanford specific terms (e.g., “stanford”, “patient”, “doctor”) that are very common but irrelevant to the task at hand. We also create a dictionary of synonyms in the dataset and use the dictionary to combine these words. The dictionary includes lexical variations that are not reduced to the same root during lemmatization (e.g. “abnormality” → “abnormal”, “consult” → “consultation”), abbreviations (e.g. “hx” → “history”, “fu” → “followup”), and common synonyms (e.g. “assistance → “service”, “action” → “movement”).
Finally, we remove punctuation and generate term frequency representations of the text using bag-of-words (BOW) with Term frequency–inverse document frequency (TF-IDF) weighting [49]. Bag-of-words (BOW) model is a simplifying representation in natural language processing. It represents text (such as sentence or document) as a vector of word occurrence count. TF-IDF, is a score that reweighs the BOW matrix to reflect how important a word is to a document in a collection or corpus. We implement this with scikit-learn [43]. For prostate cancer, we use max features = 500 with ngrams = (1, 1). Although there are more prostate cancer patients, the lower number of death events makes it difficult to include a large number of covariates. For NSCLC, we use max features = 1000, ngrams = (1, 2) (including single words and two word phrases), and max df = 0.7 (filters out dataset-specific stopwords). Hence, we have 500 unstructured covariates for prostate cancer and 1000 unstructured covariates for NSCLC.
We scale and normalize both the structured and unstructured covariates before concatenating them. In total, we build 509 covariates for prostate cancer and 1007 covariates for NSCLC. These covariates are then used to uncover potential dataset-specific confounders.
4.4 Outcomes
We define our survival outcome as (Yi, Ei), where Yi ∈ Z+ is the number of survival days since the diagnosis and Ei ∈ {0, 1} is an indicator for whether a death event has been observed during follow-up. The treatment, Wi ∈ {0, 1}, is an indicator for either surgery, radiation, or monitoring, depending on the treatment group. The covariates, Xi, includes the structured dataset pulled from the EMR data and the bag-of-words matrix representation generated from EMR notes.
4.5 Uncover and Evaluate Confounders
We uncover interpretable potential confounders from the covariates and evaluate the confounders we’ve identified with survival analysis. The approach is shown in Figure 1B.
We find the potential confounders by identifying covariates that are predictive of both treatment and survival outcome. We train prediction models for treatment (Wi = 1) and the survival outcome (Yi, Ei) with Lasso [10] using glmnet [50]. Lasso is a L1−penalized linear that can produce coefficients for covariates that are exactly zero, and is, hence, often used for creating sparse models [51] or variable selection [9]. We select the intersection of covariates with non-zero coefficients from both the treatment and survival outcome models as potential confounders. For surgery vs. radiation and surgery vs. active monitoring for prostate cancer, we use lambda.1se to select the intersection covariates. With radiation vs. monitoring for prostate cancer and surgery vs. radiation for stage I NSCLC, we use lambda.min to select the intersection covariates. The intersection terms selected are more stable with lambda.1se. However, we choose lambda.min for the latter two treatment groups because lambda.1se did not select any covariates from the text.
We then evaluate each of the covariate combinations with propensity score adjusted survival analysis. Propensity scores for patient i is the probability of receiving the treatment of interest, Wi = 1, given the covariates Xi [52]. Conditional on the propensity score, the distribution of observed covariates is expected to be the same in both branches of the treatment group. It is often used to reduce the effect of confounding in observational studies [52, 53]. In survival analysis, the hazard rate h(t|X) is the probability the patient will die within time t given covariates X. The HR is the ratio of the hazard rate of the two treatments. In survival outcomes analysis, the HR is treated as the treatment effect of choosing the treatment of interest, Wi = 1.
We use the Cox-proportional hazard (Cox-PH) model to perform survival regression [54]. We assume the proportional hazards condition [55], which states that covariates are multiplicatively related to the hazard, e.g., a covariate may halve a subject’s hazard at any given time t while the baseline hazard may vary. Hence, the effect of covariates estimated by any proportional hazards model can be reported as the HR of the covariate.
In a Cox-PH model, the hazard rate of an individual is a linear function of their static covariates and a population-level baseline hazard that changes over time. We adjust for covariates (e.g. pat_age, race white, etc.) against duration of survival and a binary variable indicating whether the outcome event has occurred. We estimate where p is the number of covariates, b0(t) the baseline hazard, bW the effect size of the treatment, and bj the effect size of the jth covariate. The HR for a covariate is equal to . We define the HR of the treatment as .
We use 3 methods to estimate the HR:
Nearest-Neighbor Matching on Propensity Score (matching) [14]: We perform nearest-neighbor propensity score matching (NNM) on selected covariates and estimate the HR on the matched population using a univariate Cox-PH model regressed on the treatment.
Inverse Propensity of Treatment Weighting (IPTW) [56, 14]: We estimate the HR using a univariate Cox-PH model regressed on the treatment with inverse propensity score weighting with stabilization [56]. The weights are defined as
Multivariate Cox Proportional Hazard (multi.coxph) [54, 57, 5]: We estimate the HR using a multivariate regression model on the treatment and selected covariates to see how covariates interact with each other. The multivariate model is also weighted with the inverse propensity scores above to form a doubly-robust model.
All Cox-PH models are trained using the survival R package [58] with robust variance. NNM is performed using the Matching R package [59].
We estimate the propensity scores using logistic regression [60] with glmnet [50], stochastic gradient boosting [61] with gbm [62], and generalized random forests with grf [24]. We select the propensity score estimation method with the best overlap and covariate balance post propensity score adjustment.
We then compare the 3 methods for estimating HR using forest plots.
For each covariate in struct+intersect, we also show the univariate and multivariate Cox-PH model HR, 95% HR confidence interval, and p-value, using the analysis presented in Bradburn et al. [57]. Note that for the multivariate Cox-PH covariate analysis, we do not weight the model with the inverse propensity scores.
Data Availability
The data is not available for sharing.
A. Supplement
A.1 Propensity Score and Covariate Balance Plots
We show the propensity scores and covariate balance plots for each of the results plotted in Figure 3. In Figure 5, we show the plots for structured. In Figure 6, we show the plots for intersect. In Figure 7, we show the plots for struct+intersect.
A.2 Covariate Correlation
We show the R2 values for all combinations of the selected covariates for each of the treatment groups. The R2 is the square of the correlation from linear regression. It measures the proportion of variation in the dependent variable that can be attributed to the independent variable. In Table 7, we show the R2 values for surgery vs. radiation for prostate cancer. In Table 8, we show the R2 values for surgery vs. monitoring for prostate cancer. In Table 9, we show the R2 values for radiation vs. monitoring for prostate cancer. In Table 10, we show the R2 values for surgery vs. radiation for NSCLC.
5 Acknowledgements
The research is supported by funding from the Stanford Human-Centered Artificial Intelligence Institute and Department of Management Science and Engineering.
Footnotes
This research was supported by a seed grant from the Stanford Institute for Human-Centered Artificial Intelligence.
Athey thanks the Sloan Foundation and the Office of Naval Research grant ONR N00014-17-1-2131 for generous support.
References
- [1].↵
- [2].↵
- [3].↵
- [4].↵
- [5].↵
- [6].↵
- [7].↵
- [8].↵
- [9].↵
- [10].↵
- [11].↵
- [12].↵
- [13].↵
- [14].↵
- [15].↵
- [16].↵
- [17].↵
- [18].↵
- [19].↵
- [20].↵
- [21].↵
- [22].↵
- [23].↵
- [24].↵
- [25].↵
- [26].↵
- [27].↵
- [28].↵
- [29].↵
- [30].↵
- [31].↵
- [32].↵
- [33].↵
- [34].↵
- [35].↵
- [36].↵
- [37].↵
- [38].↵
- [39].↵
- [40].↵
- [41].↵
- [42].↵
- [43].↵
- [44].↵
- [45].↵
- [46].↵
- [47].↵
- [48].↵
- [49].↵
- [50].↵
- [51].↵
- [52].↵
- [53].↵
- [54].↵
- [55].↵
- [56].↵
- [57].↵
- [58].↵
- [59].↵
- [60].↵
- [61].↵
- [62].↵