Abstract
Background Routine clinical data from clinical charts are indispensable for retrospective and prospective observational studies and clinical trials. Their reproducibility is often not assessed.
Objective To develop a prostate cancer-specific database with a defined source hierarchy for clinical annotations in conjunction with molecular profiling and to evaluate data reproducibility.
Design, setting, and participants For men with prostate cancer and clinical-grade paired tumor–normal sequencing, we performed team-based retrospective data collection from the electronic medical record at a comprehensive cancer center. We developed an open-source R package for data processing. We assessed reproducibility using blinded repeat annotation by a reference medical oncologist.
Outcome measurements and statistical analysis We evaluated completeness of data elements, reproducibility of team-based annotation compared to the reference, and impact of measurement error on bias in survival analyses.
Results and limitations Data elements on demographics, diagnosis and staging, disease state at the time of procuring a genomically characterized sample, and clinical outcomes were piloted and then abstracted for 2,261 patients (with 2,631 samples). Completeness of data elements was generally high. Compared with the repeat annotation by a medical oncologist blinded to the database (100 patients/samples), reproducibility of annotations was high to very high; T stage, metastasis date, and presence and date of castration resistance had lower reproducibility. Impact of measurement error on estimates for strong prognostic factors was modest.
Conclusions With a prostate cancer-specific data dictionary and quality control measures, manual clinical annotations by a multidisciplinary team can be scalable and reproducible. The data dictionary and the R package for reproducible data processing are freely available to increase data quality in clinical prostate cancer research.
Patient summary Information in the medical record is the backbone for clinical research on prostate cancer. The tools provided in this study can increase quality and efficiency of this research.
1. Background
Clinical data have a central role in any clinical research study. In prostate cancer, data elements often include demographics, cancer characteristics at diagnosis, time-updated information on the disease course, and clinical outcomes such as metastasis and survival. Defining which elements are measured and how has been recognized as critical for the success of clinical trials, leading to standardized definitions for metastatic castration-resistant prostate cancer by the Prostate Cancer Working Group [1].
Based on the premise that high-quality clinical data coupled with genomic profiling could identify predictive and prognostic genomic alterations [2], large-scale data extraction efforts from medical records are underway. Examples include the Genomics Evidence Neoplasia Information Exchange (Project GENIE) by the American Association for Cancer Research [3] and the Foundation Medicine—Flatiron Health database [4]. How well such pan-cancer approaches capture elements relevant to prostate cancer is unclear, as is the reproducibility of manual clinical annotations by investigators at medical centers.
A key source of clinical data is the medical record. Many studies are hospital-based observational studies that entirely rely on information from the medical record. Yet even prospective observational studies and clinical trials have the medical record as the sole source for key data elements, such as Gleason score, prostate-specific antigen, and staging. Data are distributed across narrative reports or structured data sources and are often internally discordant [5]. With notable exceptions [6], it is often not reported from what sources, how, and by whom clinical data are collected for research and how they are prepared for analysis.
In this study, we designed, piloted, and implemented a clinical database for prostate cancer research (Fig. 1). We describe and share prostate cancer-specific data elements for manual curation and a software pipeline to preprocess, recode, and deidentify the resulting dataset for analyses. We also report results from a reproducibility study using this framework.
2. Methods
2.1 Design and Implementation of a Clinical Database for Clinical Prostate Cancer Research
The clinical research database was designed for data from all men with prostate cancer who had provided written informed consent for an institutional review board-approved study of tumor–normal genomic profiling through MSK-IMPACT [8, 9]. The study was conducted in accordance with the U.S. Common Rule.
First, we designed data elements applicable to prostate cancer research, led by a board-certified medical oncologist and adapting Prostate Cancer Working Group 3 recommendations [1] as much as necessary for data retrieval from the medical record. These bespoke data elements were designed to be useful for prostate cancer research, without reference to cancer data capture models [3,7–9] existing or in development in late 2017, and they were not intended for interoperability across other tumor types.
The four data categories for each patient are demographics/at-diagnosis characteristics (“baseline form”); information about genomically profiled specimens (“sample form”); outcome data (“freeze form”); and lines of therapy (“treatment form”). Nearly all data elements are structured data, predominantly binary or categorical selections from predefined lists. Numeric and date values are captured through single-line text fields, for which data formats are recommended by written instructions (“enter PSA in ng/ml”) and allow for mixed-format entries, such as a PSA of “4.2”, “>1000”, or “undetectable.” For each data element, the source hierarchy is defined. Brief additional instructions address common questions, how missing data should be coded, and whether incomplete or discordant data need to be escalated for review.
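The mixed-format entries described above need to be normalized before analysis. The following is a minimal Python sketch of one way to do so; it is not the code of the prostateredcap package. The names `PsaValue`, `parse_psa`, and `ASSUMED_DETECTION_LIMIT`, as well as the choice to code "undetectable" as below an assumed assay limit, are illustrative assumptions.

```python
import re
from typing import NamedTuple, Optional


class PsaValue(NamedTuple):
    value: Optional[float]  # PSA in ng/ml; None if the entry cannot be parsed
    qualifier: str          # "=", ">", "<", or "missing"


# Assumption for illustration: "undetectable" is coded as below this limit.
ASSUMED_DETECTION_LIMIT = 0.05

# Optional "<" or ">" followed by a non-negative decimal number.
PSA_PATTERN = re.compile(r"([<>]?)\s*(\d+(?:\.\d+)?)")


def parse_psa(raw: str) -> PsaValue:
    """Convert a free-text PSA entry into a number plus a qualifier."""
    text = raw.strip().lower()
    if text == "":
        return PsaValue(None, "missing")
    if text == "undetectable":
        return PsaValue(ASSUMED_DETECTION_LIMIT, "<")
    match = PSA_PATTERN.fullmatch(text)
    if match is None:
        # Unrecognized entries are left missing rather than guessed.
        return PsaValue(None, "missing")
    sign, number = match.groups()
    return PsaValue(float(number), sign or "=")
```

Keeping the qualifier separate from the numeric value lets an analyst decide later how to treat censored values such as ">1000" instead of silently discarding that information.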
Second, we implemented this preliminary set of data definitions in a Research Electronic Data Capture (REDCap) database, a research study database with a secure web application that is free to academic institutions [10]. (The database software is exchangeable.) We then piloted data extraction from the medical record. After a set of 20 patient records, we revised data elements, source hierarchy, and instructions based on feasibility and an informal assessment of reproducibility by a clinician. For example, biochemical recurrence was removed from the data dictionary, given feasibility challenges. A further pilot with 80 records followed, after which the data dictionary was finalized (examples are in Table 1).
Third, we scaled data extraction and completed the data on all patients who had had MSK-IMPACT profiling for prostate cancer. The current manuscript describes patients included by December 2019. Weekly data capture “in real time” has since been implemented, adding patients with genomic profiling, currently MSK-IMPACT [11] and MSK-ACCESS [12].
Extraction was done by a team of clinical research study assistants who specifically support clinical research on genitourinary cancers and who underwent supervised hands-on training on prostate cancer data extraction. Clinical subspecialty fellows (urology, radiation oncology) collaborated on extractions, as did a medical student with a background as a research study assistant.
2.2 Quality Control and Data Processing
We addressed data quality and reproducibility during two key steps, data entry and data processing. During data entry, questions on data elements were flagged as queries in order to open issues on specific data fields of an individual patient/sample record, route them to colleagues, and track their completion. Queries were resolved by discussion between research study assistants or escalated to project leaders, an epidemiologist with a background in internal medicine and a medical oncologist with a specialist practice in prostate cancer.
Raw data entered in the database, even if largely in structured fields, require substantial processing. Steps include, but are not limited to: (1) recoding of many categorical variables (e.g., the many combinations of Gleason patterns are collapsed to five Gleason grade groups for analyses); (2) imputation of date variables (e.g., “03/2015” should be converted into an appropriate date format for the mid-point of March 2015); (3) calculation of time intervals (e.g., a sequencing date of April 12, 2015 and a death date of June 12, 2016 correspond to 14.0 months of follow-up for overall survival from the time of sequencing); (4) creation of time-varying covariates (e.g., castration-resistance status at the time of genomic sequencing, based on the occurrence and date of castration resistance); (5) removal of protected health information that is required for the preceding steps (e.g., exact date of cancer diagnosis); (6) assessment for internal consistency (e.g., if stage is “M1,” the date of developing metastases cannot be months after diagnosis).
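Steps (1) to (4) can be illustrated with a short Python sketch. It is not the implementation in the R package. The function names, the use of day 15 as the mid-point of a month, and the average month length of 365.25/12 days are assumptions made for this example.

```python
from datetime import date
from typing import Optional

DAYS_PER_MONTH = 365.25 / 12  # assumed convention for converting days to months


def gleason_grade_group(primary: int, secondary: int) -> int:
    """Step 1: collapse Gleason patterns into grade groups 1 to 5."""
    total = primary + secondary
    if total <= 6:
        return 1
    if total == 7:
        return 2 if primary == 3 else 3  # 3+4 vs. 4+3
    if total == 8:
        return 4
    return 5


def impute_date(raw: str) -> date:
    """Step 2: parse "MM/DD/YYYY"; impute day 15 for "MM/YYYY" (assumed mid-point)."""
    parts = [int(part) for part in raw.strip().split("/")]
    if len(parts) == 3:
        month, day, year = parts
    elif len(parts) == 2:
        month, year = parts
        day = 15
    else:
        raise ValueError(f"Unrecognized date format: {raw!r}")
    return date(year, month, day)


def months_between(start: date, end: date) -> float:
    """Step 3: time interval in months, for example for follow-up time."""
    return (end - start).days / DAYS_PER_MONTH


def crpc_at_sampling(crpc_date: Optional[date], sample_date: date) -> bool:
    """Step 4: castration resistance present on or before the sample date."""
    return crpc_date is not None and crpc_date <= sample_date
```

With these conventions, `months_between(date(2015, 4, 12), date(2016, 6, 12))` rounds to 14.0 months, matching the example in the text.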
Manual data processing in a spreadsheet program like Microsoft Excel, as we suspect is frequently done, is time-intensive, introduces additional human error, and is, by definition, not reproducible. Instead, we developed the “prostateredcap” package for the free R statistical software. The package handles data processing starting with a labeled comma-separated file exported from REDCap, data de-identification, and consistency checks. In our experience, the latter step flagged approximately 10% of all records for missingness in required data elements or internal discrepancies, the vast majority of which were fixable. The output dataset with data elements recommended for analysis (see Table 1 for examples) is directly suitable for statistical analyses and can easily be merged, e.g., with molecular data, such as OncoKB-annotated MSK-IMPACT sample-level genomic data [13].
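To make the idea of automated consistency checks concrete, the following Python sketch flags records with missing required elements or with the internal discrepancy mentioned above (stage M1 with a metastasis date well after diagnosis). It does not reproduce the checks or the interface of the prostateredcap package. The field names, the list of required fields, and the tolerance of 31 days are hypothetical.

```python
from datetime import date

# Hypothetical field names and tolerance, chosen for illustration only.
REQUIRED_FIELDS = ("diagnosis_date", "m_stage")
TOLERANCE_DAYS = 31


def check_record(record: dict) -> list:
    """Return a list of human-readable issues; an empty list means no flags."""
    issues = []
    for field in REQUIRED_FIELDS:
        if record.get(field) is None:
            issues.append(f"missing required field: {field}")

    diagnosis = record.get("diagnosis_date")
    metastasis = record.get("metastasis_date")
    if (
        record.get("m_stage") == "M1"
        and diagnosis is not None
        and metastasis is not None
        and (metastasis - diagnosis).days > TOLERANCE_DAYS
    ):
        issues.append("stage M1, but metastasis date is well after diagnosis")
    return issues
```

Returning a list of issues per record, rather than stopping at the first problem, makes it straightforward to route all flags for a patient back to the annotation team in one pass.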
2.3 Reproducibility study
To assess the completeness and reproducibility of annotations, we conducted a nested quality control study based on 100 patients and tumor samples (one per patient), with 50 randomly selected samples from metastatic castration-sensitive disease and 50 randomly selected samples from metastatic castration-resistant disease at the time of sample procurement. Blinded to the team-based annotations in the REDCap database, a board-certified medical oncologist reviewed the full medical record to re-extract data elements selected for the reproducibility study, without being limited to the narrow source hierarchies defined for the team-based annotation.
Completeness of data elements was expressed as proportions (percentages). Confidence intervals (CIs) for these and other proportions were score test-based [14]. Dates that could not be reached because of censoring were excluded from denominators.
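A common score test-based interval for a proportion is the Wilson interval. The Python sketch below implements it; whether the analysis used exactly this form is an assumption, and the function name `wilson_interval` is illustrative.

```python
import math


def wilson_interval(successes: int, n: int, z: float = 1.959964) -> tuple:
    """Wilson score confidence interval for a proportion (default: 95%)."""
    if n <= 0:
        raise ValueError("n must be positive")
    if not 0 <= successes <= n:
        raise ValueError("successes must be between 0 and n")
    p = successes / n
    denominator = 1 + z**2 / n
    center = (p + z**2 / (2 * n)) / denominator
    half_width = (
        z * math.sqrt(p * (1 - p) / n + z**2 / (4 * n**2)) / denominator
    )
    return (center - half_width, center + half_width)
```

Unlike the simple Wald interval, this interval stays within 0 and 1 and remains informative when a data element is 100% complete. As a check under the assumption of 67 agreeing annotations out of 100, the sketch gives roughly 57% to 75%.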
Reliability of annotations for binary variables (e.g., present/absent) was evaluated by comparing team-based annotations to the medical oncologist as the reference “gold standard” and expressed as sensitivity, specificity, positive predictive value, and negative predictive value. To probe for differential misclassification based on the amount of time covered by the medical record, we repeated analyses after stratifying by stage at diagnosis (M0/metastatic recurrence years after primary therapy vs. M1/de novo metastatic).
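The four measures follow from the two-by-two table of team-based against reference annotations. A minimal Python sketch, with the illustrative function name `accuracy_measures` and boolean inputs (True meaning the feature is recorded as present), is:

```python
def accuracy_measures(team: list, reference: list) -> dict:
    """Compare binary team-based annotations with the reference annotations."""
    if len(team) != len(reference):
        raise ValueError("Both annotation lists must have the same length")
    tp = fp = fn = tn = 0
    for team_value, reference_value in zip(team, reference):
        if team_value and reference_value:
            tp += 1  # present in both
        elif team_value and not reference_value:
            fp += 1  # recorded by the team, absent in the reference
        elif not team_value and reference_value:
            fn += 1  # missed by the team
        else:
            tn += 1  # absent in both

    def ratio(numerator: int, denominator: int) -> float:
        return numerator / denominator if denominator else float("nan")

    return {
        "sensitivity": ratio(tp, tp + fn),
        "specificity": ratio(tn, tn + fp),
        "ppv": ratio(tp, tp + fp),
        "npv": ratio(tn, tn + fn),
    }
```

Sensitivity and specificity condition on the reference annotation, whereas the predictive values condition on the team-based annotation and therefore also depend on how common the feature is.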
For categorical variables (e.g., Gleason pattern; T stage), we calculated the proportion of agreement between gold standard and team-based annotations as well as Cohen’s κ, which accounts for agreement due to chance. Missing values were included as a separate category.
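As a sketch of this calculation, the Python function below returns the observed proportion of agreement and Cohen's κ, computed as (observed − expected) / (1 − expected), where expected agreement comes from the marginal category frequencies. Coding missing values as `None` and the function name are assumptions for this example.

```python
from collections import Counter


def agreement_and_kappa(team: list, reference: list) -> tuple:
    """Return (proportion of agreement, Cohen's kappa) for two annotators."""
    if len(team) != len(reference) or not team:
        raise ValueError("Need two non-empty lists of equal length")
    n = len(team)
    # Missing values (None) form their own category.
    team_labels = ["missing" if value is None else value for value in team]
    ref_labels = ["missing" if value is None else value for value in reference]

    observed = sum(a == b for a, b in zip(team_labels, ref_labels)) / n

    team_counts = Counter(team_labels)
    ref_counts = Counter(ref_labels)
    categories = set(team_counts) | set(ref_counts)
    # Agreement expected by chance from the marginal frequencies.
    expected = sum(team_counts[c] * ref_counts[c] for c in categories) / n**2

    if expected == 1:
        return observed, float("nan")  # kappa is undefined with one category
    kappa = (observed - expected) / (1 - expected)
    return observed, kappa
```

When one category dominates, as with adenocarcinoma histology, expected agreement is high, so κ can be low even if the proportion of agreement is high.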
For date variables, we expressed the time difference between dates from team-based annotations and gold-standard annotations as median (2.5th, 97.5th percentile).
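This summary can be sketched in Python as follows. The conversion of days to months with 365.25/12 days per month and the linear interpolation between order statistics for percentiles are assumptions of this example, as are the function names.

```python
from datetime import date
from statistics import median

DAYS_PER_MONTH = 365.25 / 12  # assumed conversion from days to months


def month_differences(team_dates: list, reference_dates: list) -> list:
    """Team-based minus reference date, in months (negative means earlier)."""
    return [
        (team - reference).days / DAYS_PER_MONTH
        for team, reference in zip(team_dates, reference_dates)
    ]


def percentile(values: list, q: float) -> float:
    """Percentile with linear interpolation between order statistics."""
    ordered = sorted(values)
    if not ordered:
        raise ValueError("values must not be empty")
    position = (len(ordered) - 1) * q
    lower = int(position)
    upper = min(lower + 1, len(ordered) - 1)
    fraction = position - lower
    return ordered[lower] + fraction * (ordered[upper] - ordered[lower])


def summarize(differences: list) -> tuple:
    """Return (median, 2.5th percentile, 97.5th percentile)."""
    return (
        median(differences),
        percentile(differences, 0.025),
        percentile(differences, 0.975),
    )
```

Reporting the 2.5th and 97.5th percentiles alongside the median shows the spread of disagreement, which a median of zero alone would hide.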
To evaluate the impact of measurement error on scientific inference, we compared inferential results from using team-based annotations to those from using gold-standard annotations. For four strongly prognostic exposures measured at cancer diagnosis (age; prostate-specific antigen; primary treatment with androgen deprivation; Gleason score, per grade group), we quantified associations with three outcomes (castration resistance, metastasis, and death) using univariable Cox proportional hazards regression. These models serve only to demonstrate the impact of measurement error; they ignore late entry and are not suited for subject-matter inference.
3. Results
The prostate cancer clinical-genomic database was manually curated with clinical data on 2,261 men with prostate cancer (Table 2), including 2,631 genomically profiled samples, with a median of 1 sample per person (maximum, 5). Men were diagnosed with prostate cancer between 1987 and 2019 (median year of diagnosis 2014) at a median age of 63 years (interquartile range 56–68, range 36–94). The first tumor sample per person was obtained a median of 3 months after diagnosis (interquartile range 0–42) and underwent paired tumor–normal sequencing between 2014 and 2019. Survival follow-up after sequencing of the first sample, available on 2,204 men (97%), was a median of 30 months (interquartile range, 16–46).
In the reproducibility study (Table 2), the majority of the selected data elements were 100% complete (Fig. 2). Completeness ranged from 55% to 99% for elements of clinical TNM staging, self-reported race, biopsy Gleason score, and presence of variant histologies, both for the team-based annotation and the gold standard annotation.
To assess reproducibility of binary data elements, we first evaluated sensitivity and specificity, thus taking the perspective of the gold standard and indicating what proportions of patients with any given feature (e.g., nodal metastasis at diagnosis) present or absent were correctly recorded as such by the team-based annotation (Fig. 2A, middle panel). For 7 data elements, both sensitivity and specificity of the team-based annotations reached or exceeded 90%. The 9 data elements with lower reproducibility were nodal metastases at diagnosis (stage N1; sensitivity 85%; specificity 76%); primary treatments with any form of radiation therapy (sensitivity 88%) or prostatectomy (sensitivity 88%); presence of prostatic tumor tissue (sensitivity 59%), lung metastases (sensitivity 80%), and other soft-tissue metastases (sensitivity 47%) at sample procurement; and absence of lymph node metastases at sample procurement (specificity 72%). Finally, specificity for absence of castration resistance by end of follow-up was only modest (62%, 95% CI 44–77).
We then evaluated positive and negative predictive values as quantifications of the probability of features being present or absent if recorded as such in the team-based annotations. These estimates, also incorporating feature prevalence, inform use of the team-based annotations when a gold standard is not available. With the exceptions of primary treatments as well as prostatic disease and other soft-tissue disease at sample procurement, predictive values were generally high (Fig. 2A, right panel).
For categorical data elements on baseline characteristics (Fig. 2B), including staging and histopathology, agreement between annotations was generally about 90%, with the exception of sub-categories of tumor (T) stage (agreement 67%, 95% CI 57–75). Agreement for T stage and variant histology was partially driven by chance, as indicated by lower Cohen’s κ (Fig. 2B), given that many tumors had missing T stage and most were adenocarcinomas.
Dates of birth, diagnosis, and sample procurement, as well as censor dates, were very similar between team-based and gold standard annotations (Fig. 2C). The outcomes of metastasis and castration resistance showed notable date differences, even if without directional bias on average (median difference, 0 months). For metastasis, 95% of team-based dates (in 95/100 patients) fell between 14 months earlier and 9 months later than the gold-standard annotation; for castration resistance, 95% of date differences were between 13 months earlier and 23 months later.
To assess the impact of measurement error in team-based annotations, we quantified the association of four baseline characteristics that are known strong prognostic factors (age at diagnosis, Gleason score, PSA, and treatment that included androgen deprivation therapy) with clinical outcomes. The outcomes were, in order of decreasing measurement error, castration resistance, metastasis, and overall survival. Hazard ratios for all four prognostic factors and overall survival did not differ between team-based and gold-standard annotations, as expected given the absence of measurement error for the outcome (Fig. 3). There were minor differences for metastasis, driven by date differences in when metastasis was recorded to have occurred. For castration resistance, for which team-based annotations had imperfect specificity and noticeable date differences, estimates using team-based annotations (e.g., a hazard ratio per Gleason grade group of 1.91, 95% CI 1.52–2.40) differed more noticeably, but still only slightly, from estimates using gold standard annotations (hazard ratio per Gleason grade group of 1.69, 95% CI 1.35–2.12).
4. Discussion
The prostate cancer-specific clinical research database described here is notable for four key features: a data dictionary with a defined source hierarchy that was tested for feasibility; a data extraction pipeline that makes the conversion from medical record-derived raw data to an analyzable dataset a reproducible process; a reproducibility study that openly evaluates data quality in the setting that the database was implemented; and the provision of these tools to the scientific community for re-use.
Our undertaking was pragmatic. We intended to create a clinical research database that captured data elements essential in prostate cancer that could be linked with genomic profiling data. We relied on data captured during routine clinical practice. Data extraction had to be scalable to thousands of patient records without external funding, precluding desirable approaches such as blinded parallel annotation by more than one person. Earlier versions of the database have already been useful to shed light on the interplay of genomic and clinical features in prostate cancer [15–17], as are similar databases [6,18,19].
Unsurprisingly, for some data elements, reproducibility of annotations was suboptimal, including for data elements known to be challenging like tumor T stage [20]. Outcome data can be imperfect, which highlights one challenge for establishing surrogate endpoints [21], with castration resistance or the date when metastases first occurred being examples in our study. Some data definitions that are consensus for clinical trials [1] were not suitable, e.g., for castration resistance. Increasing reproducibility on these data elements would primarily require changing clinical care by mandating laboratory tests and imaging at regular intervals, as is feasible in a clinical trial. Importantly, while we considered annotations by a medical oncologist an alloyed gold standard, the reproducibility study can ultimately merely assess whether two investigators would come to the same annotation, given the same medical record (repeatability), and not a comparison with “truth” (validity). Nevertheless, we believe that dedicated reproducibility studies like the current one should be done whenever data are collected for clinical research to help improve data quality and inform result interpretations [22].
We anticipate that the data dictionary, which can be directly uploaded into REDCap to create the database, and the data processing pipeline via the R package may be useful to other prostate cancer researchers. The data elements, their source hierarchy, and how they are post-processed can be adapted to local needs within these open tools. Feasibility, completeness, and reliability of data will differ depending on patient population, clinical setting, available data sources, the annotation approach and team, and other factors. They should not be inferred from the estimates from our cancer center. Principled approaches to improving data quality are needed. How manual approaches to clinical data curation in prostate cancer compare to larger-scale, pan-cancer, or computer-assisted (“machine learning”) data extraction would be important to evaluate, as would comparisons of such data to true gold standards.
5. Conclusions
With a prostate cancer-specific data dictionary and quality control measures, manual annotations of clinical data by a multidisciplinary team can be scalable and reproducible. The data dictionary and the R package for reproducible data processing should help increase data quality in clinical prostate cancer research.
Data Availability
Data definitions to create the REDCap database, the prostateredcap R package, an overview of data elements recommended for analysis, and an example dataset are available at https://stopsack.github.io/prostateredcap.
Take home message
This study describes the design of a prostate cancer-specific clinical database in conjunction with molecular profiling and assesses its data quality. The data dictionary and an R package for reproducible data processing for statistical analysis are freely available.
Footnotes
Funding This work was funded in part by the National Cancer Institute (1P01CA228696, to P.W. Kantoff; P30CA008748, Cancer Center Support Grant; P50CA092629, Prostate Cancer SPORE) and the Department of Defense (Early Investigator Research Award W81XWH-18-1-0330, to K.H. Stopsack; Physician Research Award W81XWH-17-1-0124, to W. Abida). D.E. Rathkopf, W. Abida, and K.H. Stopsack are Prostate Cancer Foundation Young Investigators. The funders had no role in the design and conduct of the study; collection, management, analysis, and interpretation of the data; and preparation, review, or approval of the manuscript.
Conflicts of Interest M.J. Morris is an uncompensated consultant for Bayer, Advanced Accelerator Applications, Johnson and Johnson, Novartis, and Lantheus. He is a compensated consultant for Oric, Curium, Athenex, Exelixis, and AstraZeneca. MSK receives funds for contracts for the conduct of clinical trials from Bayer, Advanced Accelerator Applications, Novartis, Corcept, Roche/Genentech, and Janssen.
D.E. Rathkopf is a consultant for Janssen, Genentech, AstraZeneca, Bayer, and Myovant Sciences, and has received research funding through her institution from Janssen Oncology, Medivation, Celgene, Takeda, Millennium, Ferring, Novartis, Taiho Pharmaceutical, AstraZeneca, Genentech/Roche, TRACON Pharma, Bayer, and Phosplatin Therapeutics.
S.F. Slovin has received research support from Sanofi-Aventis, Novartis, Poseida, and the Prostate Cancer Foundation, and honoraria for advisory boards from Clovis, Janssen, Sanofi-Aventis, and PER.
D.C. Danila has received research support from the U.S. Department of Defense, American Society of Clinical Oncology, Prostate Cancer Foundation, Stand Up 2 Cancer, Janssen Research & Development, Astellas, Medivation, Agensys, Genentech, and CreaTV; he is a consultant for Angle LLT, Axiom LLT, Janssen Research & Development, Astellas, Medivation, Pfizer, Genzyme, and Agensys.
P.W. Kantoff reports the following disclosures for the last 24-month period: he has investment interest in ConvergentRx Therapeutics, Context Therapeutics LLC, DRGT, Placon, and Seer Biosciences; he is a company board member for ConvergentRx Therapeutics, Context Therapeutics LLC; he is a consultant/scientific advisory board member for Bavarian Nordic Immunotherapeutics, DRGT, GE Healthcare, Janssen, OncoCellMDX, Progenity, Seer Biosciences, and Tarveda Therapeutics; and he serves on data safety monitoring boards for Genentech/Roche and Merck.
W. Abida reports the following disclosures: he has received honoraria from CARET, Roche, Medscape, and Aptitude Health; is a consultant for Clovis Oncology, Janssen, MORE Health, ORIC Pharmaceuticals, and Daiichi Sankyo; he has received research funding through his institution from AstraZeneca, Zenith Epigenetics, Clovis Oncology, GlaxoSmithKline, ORIC Pharmaceuticals, and Epizyme; and he has had travel/accommodations/expenses paid by GlaxoSmithKline, Clovis Oncology, and ORIC Pharmaceuticals.
N.M. Keegan, S.E. Vasselman, E.S. Barnett, B. Nweji, E.A. Carbone, A. Blum, K.A. Autio, and K.H. Stopsack report no potential conflict of interest.