Abstract
Background There has been increased enthusiasm regarding the use of real-world data (RWD) for clinical evidence generation. Cancer registries are important RWD sources that rely on data abstraction from the medical record; however, patients with unknown or missing data elements are under-represented in studies that use such data sources. Therefore, we sought to determine the prevalence of missing data and its associated survival outcomes among cancer patients.
Methods All data elements within the National Cancer Database were reviewed for missing or unknown values for the three most common cancers in the United States diagnosed from 2006 to 2015. Patient records with incompletely abstracted data elements were identified. Prevalence of patients with missing data and their associated survival outcomes were determined.
Results A total of 1,198,749 non-small cell lung cancer (NSCLC), 2,120,775 breast cancer, and 1,158,635 prostate cancer patients were included for analysis. For NSCLC, there were 851,295 (71.0%) patients with missing data in data elements identified as variables of interest; 2-year survival was 33.2% for patients with missing data and 51.6% for patients with complete data (p<0.001). For breast cancer, there were 1,161,096 (54.8%) patients with missing data; 2-year survival was 93.2% for patients with missing data and 93.9% for patients with complete data (p<0.001). For prostate cancer, there were 460,167 (39.7%) patients with missing data; 2-year survival was 91.0% for patients with missing data and 95.6% for patients with complete data (p<0.001). There were significant differences in demographics, tumor characteristics, cancer stage, and treatments received between cancer patients with complete and incomplete records.
Conclusions The majority of patients in a large cancer registry-based RWD source have missing data that was unable to be ascertained from the medical record. Moreover, patients with missing data show worse survival than those with more complete documentation. Increasing documentation quality and adoption of rigorous missing data correction methods are needed to best leverage RWD for clinical advancements.
Introduction
Real-world evidence derived from real-world data (RWD) holds substantial promise to accelerate innovation within oncology. RWD, which includes data on patient health status and/or the delivery of health care collected routinely,1 is becoming increasingly relevant due to the high cost and slow pace of randomized clinical trials as well as the growth of near real-time access to data from electronic health records (EHRs) and other digital sources of comprehensive health-related data. RWD sources may represent a more flexible, cost-effective way to investigate clinical interventions and can supplement clinical trials. Within oncology, there have been investments in developing RWD sources as sources of clinical evidence both at the national level and within professional societies.2-5 Moreover, the FDA has released guidance documents on the use of real-world evidence reflecting increased interest in using RWD for regulatory decisions.6-8 Despite the comprehensiveness of these sources, generating meaningful insights from RWD requires strict adherence to good observational research practices, particularly in how missing data are handled.
Cancer registries have long been established as important sources of RWD within oncology for generating insights spanning cancer epidemiology and comparative effectiveness of therapies.2,9 Data quality, including the completeness of data elements, is a major consideration when working with these registries to generate evidence. This is particularly germane given emerging evidence suggesting that treatment-associated survival outcomes from registry studies and from similar randomized controlled trials are not concordant.10-12 As the reliance on clinical registries for evidence generation is likely to grow, there is a critical need to assess the quality of clinical evidence generated from registry and other RWD sources, as well as their adherence to best data practices.
Of note, cancer registries rely on trained tumor registrars to abstract and record data from the patient medical record. Lack of quality documentation within the medical record can lead to incompletely abstracted data elements, and therefore to unknown or missing data values within cancer registries.13-15 While there are a variety of methods to account for missing data, patients with unknown values are likely under-represented in RWD studies, as it is common practice to simply exclude patients without complete information in variables used in cohort construction.16-19 However, because missing data within registries is a surrogate for poor quality documentation, such data may not be “missing completely at random”, and the exclusion of such patients may introduce significant bias in studies based on these data. There is a paucity of literature studying the impact of incompletely recorded data within cancer registry datasets.
In our study, we aim to characterize the impact of unknown documentation across multiple cancer types within a large national registry-based RWD source. Specifically, we examine the prevalence of incomplete data across the three most common cancer types, and whether characteristics and survival outcomes of patients with missing data are comparable to those with complete data.
Methods
We examined the prevalence of incomplete records and associated cancer patient survival in a large RWD registry commonly used for comparative effectiveness studies in oncology for the three most common cancers in the United States. We compared survival differences between patients who have complete versus incomplete data.
Data source
The National Cancer Database (NCDB) is a cancer registry established in the 1980s and jointly sponsored by the American College of Surgeons Commission on Cancer (CoC) and the American Cancer Society.20 All CoC-accredited facilities report eligible cases to the NCDB, and it is estimated that over 70 percent of newly diagnosed cancer cases in the United States are represented.21 There have been a large number of observational studies using NCDB data, some of which have been cited to support national cancer treatment guidelines.22 Data reporting to the NCDB follows national registry coding standards, and the recording of individual data elements has detailed specifications from the CoC.23 Trained tumor registrars at individual cancer programs abstract specified data elements from patient charts in accordance with CoC registry data standards.24 If a data element cannot be identified within the patient record, it may be recorded with a blank or unknown value following registry coding guidelines.
For each patient record within the NCDB, there are over 130 variables in the Participant User File (PUF) capturing a range of facility and patient information, tumor characteristics, treatment information, and cancer outcomes.19 Specific information abstracted includes complex data elements such as comorbidity score, cancer stage, diagnostic procedures, and treatment information including receipt of surgery, radiation, and chemotherapy, among others. Data elements in our study are categorized into facility and patient demographics (Demographics), tumor characteristics (Cancer Identification), cancer stage (Stage), cancer treatments (Treatment), and survival outcomes (Outcomes) variables based on the NCDB data dictionary.
Missing data ascertainment
We identified 96 variables which were in use for all diagnosis years and disease sites included in our analysis. From these, we identified variables containing missing data in at least one patient record. Missing data were defined as either “blank” or “unknown” for a variable included in the database. Two clinical oncologists reviewed all variables and excluded variables where blank data entry was allowed by the NCDB data dictionary and may not reflect incomplete clinical documentation. For example, days from diagnosis to chemotherapy is recorded as blank if the patient never received chemotherapy, and this variable was therefore excluded from analysis. Furthermore, to be conservative, TNM pathologic staging was not considered missing if pathologic stage was not documented or a pathologic specimen was not collected. A final 63 variables were identified as “variables of interest” to compare patients with and without completely documented data (Figure 1). All variables are listed in supplemental eTable 1.
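The missing-data definition above can be illustrated with a minimal Python sketch. This is not the authors' Stata code; the field names, the toy records, and the use of the literal string "unknown" as the unknown code are assumptions for illustration (actual NCDB unknown codes differ by variable).

```python
# Hypothetical tokens treated as missing; real NCDB "unknown" codes vary by variable.
MISSING_TOKENS = {"", "unknown"}


def is_missing(value):
    """Return True for None, a blank string, or 'unknown' (case-insensitive)."""
    if value is None:
        return True
    return str(value).strip().lower() in MISSING_TOKENS


def has_incomplete_record(record, variables_of_interest):
    """Return True if any variable of interest is missing in this record."""
    return any(is_missing(record.get(var)) for var in variables_of_interest)


# Toy records with invented field names, for illustration only.
variables_of_interest = ["insurance", "grade", "clinical_stage"]
records = [
    {"insurance": "private", "grade": "2", "clinical_stage": "II"},
    {"insurance": "Unknown", "grade": "3", "clinical_stage": "IV"},
    {"insurance": "medicare", "grade": "", "clinical_stage": "I"},
]

# One flag per patient: True means at least one variable of interest is missing.
flags = [has_incomplete_record(r, variables_of_interest) for r in records]

# Prevalence of incomplete records in the toy cohort.
prevalence = sum(flags) / len(flags)
```

Under this sketch, a patient counts as having incomplete data as soon as a single variable of interest is blank or unknown, which is why the prevalence grows with the number of variables examined.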
Patient selection
We chose the three most common cancers with the largest number of cancer-related deaths in the United States for analysis.25 We identified cancer cases in the 2019 release of the non-small cell lung cancer (NSCLC), breast cancer, and prostate cancer PUFs. All cancer stages were included. The impact of incomplete data was assessed by stage as defined by the NCDB analytic stage group, which combines pathologic and clinical stage by using clinical stage where pathologic stage is missing; we used this variable because a recognized limitation of the NCDB is that a significant portion of patients have missing stage information.26
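The fallback rule described for the analytic stage group can be sketched in a few lines of Python. The function name and the tokens treated as missing are hypothetical; this only illustrates the rule as stated in the text, not the NCDB's actual derivation logic.

```python
def analytic_stage(pathologic_stage, clinical_stage):
    """Use pathologic stage when available; otherwise fall back to clinical stage.

    Returns None when both are missing. Blank strings and 'unknown' are
    treated as missing here, which is an assumption for illustration.
    """
    def missing(value):
        return value is None or str(value).strip().lower() in {"", "unknown"}

    if not missing(pathologic_stage):
        return pathologic_stage
    if not missing(clinical_stage):
        return clinical_stage
    return None


# Toy examples: pathologic stage wins when present, clinical stage fills gaps.
stage_a = analytic_stage("II", "I")
stage_b = analytic_stage("", "IV")
stage_c = analytic_stage(None, "unknown")
```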
We included patients with cancer diagnosis years from 2006 to 2015 for analysis. Due to changes in data coding rules which introduced new variables and lack of survival information for the most recent diagnosis year, we excluded cases diagnosed in 2016 (most recent diagnosis year available). Given changes in data reporting standards and completeness over time, we examined cancer cases diagnosed in the most recent 10 years prior to 2016.
Statistical analysis
We calculated the percentage of patients with missing values in all included variables. We used standard descriptive statistics, chi-squared test, and Wilcoxon rank-sum test to show differences in patient, tumor, and treatment characteristics between patients with incomplete versus complete data. A patient record was not used for comparison of a given patient, tumor, or treatment characteristic if it had a missing value in the variable being compared; the numbers of such records are tabulated in supplemental eTable 2. To evaluate the association of incomplete data with survival outcomes, we used Kaplan-Meier estimates to compare overall survival between patients with incomplete versus complete data. Log-rank test was used to identify statistically significant differences in survival. Sub-analyses stratifying by cancer stage and treatment were performed to identify groups of patients for which missing data was associated with the largest differences in survival.
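For readers less familiar with the Kaplan-Meier estimator, a minimal pure-Python sketch follows. It is not the Stata routine used in the study, and the follow-up times and event indicators are invented toy values; it only shows how a 2-year (24-month) survival estimate is read off the product-limit curve in the presence of censoring.

```python
def kaplan_meier(times, events):
    """Return [(time, survival)] at each distinct time with at least one death.

    times:  follow-up time per patient (months in this toy example)
    events: 1 if the patient died at that time, 0 if censored
    """
    pairs = sorted(zip(times, events))
    n_at_risk = len(pairs)
    survival = 1.0
    curve = []
    i = 0
    while i < len(pairs):
        t = pairs[i][0]
        deaths = 0
        removed = 0
        # Group all patients sharing the same follow-up time.
        while i < len(pairs) and pairs[i][0] == t:
            deaths += pairs[i][1]
            removed += 1
            i += 1
        if deaths > 0:
            # Product-limit step: multiply by the conditional survival at t.
            survival *= 1 - deaths / n_at_risk
            curve.append((t, survival))
        # Deaths and censored patients both leave the risk set after t.
        n_at_risk -= removed
    return curve


def survival_at(curve, t):
    """Read the step function: last estimate at or before time t (1.0 if none)."""
    estimate = 1.0
    for time, surv in curve:
        if time > t:
            break
        estimate = surv
    return estimate


# Invented toy cohort: five patients, two censored.
times = [6, 12, 18, 30, 36]
events = [1, 0, 1, 0, 1]
curve = kaplan_meier(times, events)
two_year_survival = survival_at(curve, 24)
```

In the study, one such curve would be estimated for each group (incomplete versus complete data) and the curves compared with a log-rank test, which this sketch does not implement.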
As a sensitivity analysis, we also tested an alternative approach of identifying variables for which data was missing in 1-20% of patient records. This range was determined a priori since <1% missing is likely to have little impact on outcomes of RWD studies, and a large percentage missing is more likely to be reflective of explainable differences in coding rules rather than poor documentation quality. For example, high percentages of missing clinical stage in the NCDB have been attributable to changes in stage coding rules.27 Different percentage thresholds of missing data were also tested (supplemental data).
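The variable-selection rule used in this sensitivity analysis can be sketched as follows. The helper names, field names, and toy data are hypothetical, and treating both range bounds as inclusive is an assumption, since the text does not say how boundary values were handled.

```python
def is_missing(value):
    """Blank, None, or 'unknown' counts as missing (illustrative definition)."""
    return value is None or str(value).strip().lower() in {"", "unknown"}


def select_variables_by_missingness(records, variables, low=0.01, high=0.20):
    """Keep variables whose fraction of missing values lies in [low, high]."""
    selected = []
    for var in variables:
        n_missing = sum(1 for r in records if is_missing(r.get(var)))
        fraction = n_missing / len(records)
        if low <= fraction <= high:
            selected.append(var)
    return selected


# Toy cohort of 10 records with invented variables "a", "b", "c".
records = [{"a": "x", "b": "y", "c": "z"} for _ in range(10)]
records[0]["a"] = ""              # "a" is missing in 10% of records
for i in range(5):
    records[i]["b"] = "unknown"   # "b" is missing in 50% of records
# "c" is never missing (0%).

selected = select_variables_by_missingness(records, ["a", "b", "c"])
wider = select_variables_by_missingness(records, ["a", "b", "c"], low=0.05, high=0.50)
```

Changing `low` and `high` corresponds to testing alternative thresholds; only the selected variables would then be used to classify patients as having incomplete versus complete data.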
Statistical analysis was performed using Stata 16 (StataCorp LLC, College Station, Texas). Our code is available at: https://github.com/Aneja-Lab-Yale
Results
Distribution of variables and incomplete data
Of the 96 data elements included for analysis, there were 22 (22.9%) demographics, 11 (11.5%) tumor characteristics, 18 (18.8%) cancer stage, 41 (42.7%) treatment, and 4 (4.2%) outcomes variables. After limiting to variables of interest, there were 14 (22.2%) demographics, 6 (9.5%) tumor characteristics, 13 (20.6%) cancer stage, and 30 (47.6%) treatment variables (Figure 1). The majority of patients had missing data values in at least one variable of interest across all three disease sites (Table 1).
Patient, cancer, and treatment characteristics
There were significant differences in demographics, tumor characteristics, and cancer treatments received between patients with incomplete and complete data. Patients with missing data were more likely to be non-white, not insured, have more advanced stage disease, and less likely to receive surgery for all three disease sites. Cancer cases diagnosed in earlier years were more likely to have missing data (Table 2).
Association of incomplete data with overall survival
For NSCLC, there were 851,295 patients with incomplete data and 347,454 patients with complete data in variables of interest; 2-year survival was 33.2% for patients with missing data and 51.6% for patients with complete data (p<0.001). For breast cancer, there were 1,161,096 patients with missing data and 959,679 patients with complete data; 2-year survival was 93.2% for patients with missing data and 93.9% for patients with complete data (p<0.001). For prostate cancer, there were 460,167 patients with missing data and 698,468 patients with complete data; 2-year survival was 91.0% for patients with missing data and 95.6% for patients with complete data (p<0.001). This equates to an absolute 2-year survival difference of 18.4 percentage points for NSCLC patients, 0.7 percentage points for breast cancer patients, and 4.6 percentage points for prostate cancer patients overall. Kaplan-Meier survival curves are shown in Figure 2.
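The absolute differences quoted above follow directly from the reported 2-year survival estimates. The short Python check below uses only numbers stated in this section; the dictionary layout is an illustrative choice.

```python
# Reported 2-year overall survival (%) as (missing data, complete data).
two_year_survival = {
    "NSCLC": (33.2, 51.6),
    "breast": (93.2, 93.9),
    "prostate": (91.0, 95.6),
}

# Absolute difference in percentage points: complete minus missing.
absolute_difference = {
    site: round(complete - missing, 1)
    for site, (missing, complete) in two_year_survival.items()
}

# Reported patient counts as (missing data, complete data); each pair sums
# to the cohort size reported for that disease site.
counts = {
    "NSCLC": (851_295, 347_454),
    "breast": (1_161_096, 959_679),
    "prostate": (460_167, 698_468),
}
cohort_size = {site: missing + complete for site, (missing, complete) in counts.items()}
```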
Large survival differences persisted among patients with metastatic disease when stratified by cancer stage. There were also survival differences in patients with non-metastatic disease, although the absolute difference was small for breast (0.4%) and prostate (1.3%) cancer patients, as compared to larger absolute survival differences of 4.6% and 16.7% in breast and prostate cancer patients with metastatic disease, respectively (p<0.001 for both, Figure 3). For metastatic NSCLC patients, 2-year survival was 13.1% for patients with missing data and 15.0% for patients with complete data (p<0.001); whereas among non-metastatic NSCLC patients, 2-year survival was 51.5% for patients with missing data and 63.2% for patients with complete data (p<0.001). Survival stratified by stage is shown in supplemental eFigures 1-3.
While rates of cancer treatment differed between patients with and without missing data, survival differences were also observed when stratified by receipt of surgery, radiation, or chemotherapy, although the effect was less pronounced amongst prostate and breast cancer patients (supplemental eFigure 4).
Trends in data completeness and cancer stage over time
There were temporal changes in the level of incomplete data from 2006 to 2015. The percentage of patients with missing data decreased from 81.8% to 67.1% (p<0.001) for NSCLC, from 78.1% to 46.5% (p<0.001) for breast cancer, and from 50.7% to 31.8% (p<0.001) for prostate cancer (supplemental eFigure 5). The distribution of cancer stage at diagnosis within the NCDB has also changed over time, as has previously been described. For example, there is an increased proportion of metastatic NSCLC at diagnosis in recent years, possibly attributable to increased use of advanced cancer imaging.28,29 The change in overall stage is shown in supplemental eFigure 6. Moreover, we show survival differences persisted when further stratified by year of diagnosis (supplemental eFigure 7).
Sensitivity analysis using different percentages of missing data
When repeating our analysis using variables for which data was missing in 1-20% of patient records, for NSCLC, there were 622,831 patients with missing data and 575,918 patients without missing data in variables of interest; 2-year survival was 33.9% for patients with missing data and 43.5% for patients without missing data (p<0.001). For breast cancer, there were 1,481,729 patients with missing data and 639,046 patients without missing data in variables of interest; 2-year survival was 92.4% for patients with missing data and 96.0% for patients without missing data (p<0.001). For prostate cancer, there were 700,523 patients with missing data and 458,112 patients without missing data in variables of interest; 2-year survival was 91.7% for patients with missing data and 97.0% for patients without missing data (p<0.001, supplemental eFigure 8). Survival differences also persisted when we tested different thresholds using either 1-5% or 5-30% missing data as the cutoff (supplemental eFigure 9).
Discussion
In a large national cancer registry, we showed a high prevalence of incomplete patient records in all three of the most common cancer types, with the majority of patients missing essential data. The missing data has marked implications for clinical care and research and suggests that there are major gaps in documenting and capturing data via the medical records for patients with cancer. Moreover, patients with missing data were not representative of the overall cancer population and had significantly lower survival than those with complete data. These findings were replicated across NSCLC, breast, and prostate cancer patients. While the survival differences in patients with non-metastatic breast and prostate cancer were small, when stratified by cancer stage, larger differences persisted among patients with metastatic disease, suggesting differences in survival were driven by metastatic patients with incomplete data, whose courses of oncologic care have increased complexity.
Furthermore, we showed significant differences in terms of demographics, tumor characteristics, and treatments received between patients with missing data and patients without missing data. For example, incomplete records are more prevalent among Black patients and patients from other minority groups, reflecting long-standing disparities in access to healthcare and cancer treatment.30-32 Patients with fewer comorbid conditions also appeared to more frequently have missing data, which may reflect less available documentation due to fewer medical visits. Advanced stage patients were significantly more likely to have missing data. We hypothesize this is due to increased complexity of care in advanced cancers, leading to increased difficulty in documenting and abstracting all data elements.33 The small survival differences in early-stage breast and prostate cancer patients are consistent with this hypothesis, in that definitive and adjuvant therapeutic management options in these settings are relatively less complex. Moreover, completion of primary site surgery at the CoC facility submitting the cancer case also helps to enable more complete documentation and easier data abstraction, as primary cancer and treatment information would be available at the same facility.
Our findings have major implications for RWD studies. While incomplete documentation is ubiquitous in RWD sources, observational studies using large cancer registries often simply exclude patients with missing data, and how missing data is handled is inconsistently reported in the medical literature.34,35 Despite an increasing number of papers describing approaches for correcting missing data in observational studies, the practice of handling incomplete data amongst RWD sources has been slow to change.36 Recent systematic comparisons of registry studies and randomized trials do not demonstrate concordant results.10,12 Poor quality documentation is therefore a major obstacle to modern RWD sources and can introduce significant biases when measuring survival outcomes. We demonstrate a consistent association of missing data with worse survival across multiple cancer types, particularly amongst patients with advanced cancers. Therefore, adequate correction of missing data will be crucial for accurate treatment effect estimation in RWD studies.37,38
The clinical explanations for survival differences associated with incomplete data are likely multi-factorial. There were significant differences in distribution of cancer stage between patients with and without missing data. Underlying demographic disparities, differences in year of diagnosis, and treatments received are also contributory factors. There are also likely uncaptured confounders inherent to the observational nature of RWD studies. Our findings are corroborated by other studies examining missing data as a potential source of bias among RWD sources.39-42 Our results also corroborate previous analyses showing significant under-ascertainment of stage and treatment data within cancer-specific registries.43-45 Given that fragmented or multi-facility care more often occurs among advanced-stage patients with increased treatment complexity, this is another plausible explanation for differences in cancer patient survival.46 Since registry data abstraction necessarily depends on information available within the patient record, patients receiving care from multiple facilities are significantly more likely to have incomplete information. Therefore, fragmented care may be an explanatory link between documentation quality and survival outcomes, particularly affecting patients with complex disease courses.47-49
There are limitations to our analysis. We examined survival and cannot draw conclusions on other outcomes such as toxicity, disease recurrence, or causes of death, which are also likely associated with incomplete patient records. The dataset within our study is an observational cancer registry, and there may be limitations in the data abstraction process precluding complete capture of the medical record. However, all RWD sources likely face these limitations to a varying degree, and our analysis therefore should be interpreted as an exemplification of incomplete documentation within RWD sources in oncology.50 Our study population is also heterogeneous. The patients’ cancer treatment paradigms, including receipt and sequence of local and systemic therapies, necessarily differ and do not reflect one specific clinical scenario. Nevertheless, survival differences between patients with missing and complete data persisted despite adjusting for multiple tumor and treatment-related factors. Additionally, the proportion of patients with incomplete data also depends on the number of variables examined, since it is more difficult to have complete documentation for a larger number of data elements. Given the large number of variables within the NCDB, we also undertook an alternative analysis in which variables with missing data in 1-20% of patient records were chosen as variables of interest to identify patients with incomplete versus complete data. We also tested this assumption in a sensitivity analysis, which showed that survival differences persisted when using either 1-5% or 5-30% missing as the cutoff.
Conclusions
In conclusion, we show that the majority of patients in a large cancer registry-based RWD source are subject to incomplete data. Patients with missing data that was unable to be ascertained from the medical record show worse survival than those with more complete documentation. Increasing documentation quality and adoption of rigorous missing data correction methods are needed to best leverage RWD for clinical advancements.
Data Availability
The primary data is available by application through the American College of Surgeons (https://www.facs.org/quality-programs/cancer/ncdb/puf). The datasets generated in our analysis can be reproduced using code available at https://github.com/Aneja-Lab-Yale