Abstract
Background There has been increased enthusiasm regarding the use of real-world data (RWD) for clinical evidence generation. Cancer registries are important RWD sources that rely on data abstraction from the medical record; however, patients with unknown or missing data elements are under-represented in studies that use such data sources. Therefore, we sought to determine the prevalence of missing data and its associated survival outcomes among cancer patients.
Methods All data elements within the National Cancer Database were reviewed for missing or unknown values for the three most common cancers in the United States diagnosed from 2006 to 2015. Patient records with incompletely abstracted data elements were identified. Prevalence of patients with missing data and their associated survival outcomes were determined.
Results A total of 1,198,749 non-small cell lung cancer (NSCLC), 2,120,775 breast cancer, and 1,158,635 prostate cancer patients were included for analysis. For NSCLC, there were 851,295 (71.0%) patients with missing data in data elements identified as variables of interest; 2-year survival was 33.2% for patients with missing data and 51.6% for patients with complete data (p<0.001). For breast cancer, there were 1,161,096 (54.8%) patients with missing data; 2-year survival was 93.2% for patients with missing data and 93.9% for patients with complete data (p<0.001). For prostate cancer, there were 460,167 (39.7%) patients with missing data; 2-year survival was 91.0% for patients with missing data and 95.6% for patients with complete data (p<0.001). There were significant differences in demographics, tumor characteristics, cancer stage, and treatments received between cancer patients with complete and incomplete records.
Conclusions The majority of patients in a large cancer registry-based RWD source have data elements that could not be ascertained from the medical record. Moreover, patients with missing data show worse survival than those with more complete documentation. Improving documentation quality and adopting rigorous missing-data correction methods are needed to best leverage RWD for clinical advancements.
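The two quantities reported in the abstract, the prevalence of records with missing data and 2-year survival by completeness group, can be illustrated with a minimal sketch. This is not the authors' analysis code: the field names, the `MISSING_CODES` sentinel values, and the choice of a Kaplan-Meier estimate are assumptions made for illustration, and the abstract does not state which survival estimator or test was used.

```python
# Hypothetical sentinel values marking an unabstracted data element.
# The real NCDB codebook uses item-specific codes, not reproduced here.
MISSING_CODES = {None, "", "unknown"}


def has_missing(record, variables):
    """True if any variable of interest is absent or coded as unknown."""
    return any(record.get(v) in MISSING_CODES for v in variables)


def missing_prevalence(records, variables):
    """Fraction of records with at least one missing variable of interest."""
    flags = [has_missing(r, variables) for r in records]
    return sum(flags) / len(flags)


def km_survival(times, events, horizon):
    """Kaplan-Meier survival probability at `horizon`.

    times[i] is follow-up time (same unit as horizon);
    events[i] is True for death and False for censoring.
    """
    surv = 1.0
    death_times = sorted({t for t, e in zip(times, events) if e and t <= horizon})
    for t in death_times:
        at_risk = sum(1 for u in times if u >= t)
        deaths = sum(1 for u, e in zip(times, events) if e and u == t)
        surv *= 1 - deaths / at_risk
    return surv


# Toy records with hypothetical field names (not NCDB item names).
records = [
    {"stage": "I", "grade": "2", "months": 6, "dead": True},
    {"stage": "unknown", "grade": "2", "months": 12, "dead": False},
    {"stage": "II", "grade": None, "months": 30, "dead": True},
    {"stage": "III", "grade": "3", "months": 30, "dead": False},
]
variables = ["stage", "grade"]

prevalence = missing_prevalence(records, variables)
survival_24m = km_survival(
    [r["months"] for r in records],
    [r["dead"] for r in records],
    horizon=24,
)
```

In practice the survival estimate would be computed separately for the complete and incomplete groups and the two curves compared with a formal test; the toy data here only show the mechanics.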
Competing Interest Statement
In the past three years, Harlan Krumholz received expenses and/or personal fees from UnitedHealth, IBM Watson Health, Element Science, Aetna, Facebook, the Siegfried and Jensen Law Firm, Arnold and Porter Law Firm, Martin/Baughman Law Firm, F-Prime, and the National Center for Cardiovascular Diseases in Beijing. He is an owner of Refactor Health and HugoHealth, and had grants and/or contracts from the Centers for Medicare & Medicaid Services, Medtronic, the U.S. Food and Drug Administration, Johnson & Johnson, and the Shenzhen Center for Health Information.
Funding Statement
This study was funded by grants from the American Cancer Society, the Agency for Healthcare Research and Quality, and the National Cancer Institute.
Author Declarations
I confirm all relevant ethical guidelines have been followed, and any necessary IRB and/or ethics committee approvals have been obtained.
Yes
The details of the IRB/oversight body that provided approval or exemption for the research described are given below:
This study uses de-identified information and was provided exemption by the Yale University Human Investigation Committee.
All necessary patient/participant consent has been obtained and the appropriate institutional forms have been archived.
Yes
I understand that all clinical trials and any other prospective interventional studies must be registered with an ICMJE-approved registry, such as ClinicalTrials.gov. I confirm that any such study reported in the manuscript has been registered and the trial registration ID is provided (note: if posting a prospective study registered retrospectively, please provide a statement in the trial ID field explaining why the study was not registered in advance).
Yes
I have followed all appropriate research reporting guidelines and uploaded the relevant EQUATOR Network research reporting checklist(s) and other pertinent material as supplementary files, if applicable.
Yes
Data Availability
The primary data are available by application through the American College of Surgeons (https://www.facs.org/quality-programs/cancer/ncdb/puf). The datasets generated in our analysis can be reproduced using code available at https://github.com/Aneja-Lab-Yale