Comparison of Population Characteristics in Real-World Clinical Oncology Databases in the US: Flatiron Health, SEER, and NPCR ============================================================================================================================= * Xinran Ma * Lura Long * Sharon Moon * Blythe J.S. Adamson * Shrujal S. Baxi ## ABSTRACT **Background and Objective** The Surveillance, Epidemiology, and End Results Program (SEER) program and the National Program of Cancer Registries (NPCR), are authoritative sources for population cancer surveillance and research in the US. An increasing number of recent oncology studies are based on the electronic health record (EHR)-derived de-identified databases created and maintained by Flatiron Health. This report describes the differences in the originating sources and data development processes, and compares baseline demographic characteristics in the cancer-specific databases from Flatiron Health, SEER, and NPCR, to facilitate interpretation of research findings based on these sources. **Methods** Patients with documented care from January 1, 2011 through May 31, 2019 in a series of EHR-derived Flatiron Health de-identified databases covering multiple tumor types were included. SEER incidence data (obtained from the SEER 18 database) and NPCR incidence data (obtained from the US Cancer Statistics public use database) for malignant cases diagnosed from January 1, 2011 to December 31, 2016 were included. Comparisons of demographic variables were performed across all disease-specific databases, for all patients and for the subset diagnosed with advanced-stage disease. **Results** As of May 2019, a total of 201,570 patients with 19 different cancer types were included in Flatiron Health datasets. In an overall comparison to national cancer registries, patients in the Flatiron Health databases had similar sex and geographic distributions, but appeared to be diagnosed with later stages of disease and their age distribution differs from the other datasets. For variables such as stage and race, Flatiron Health databases had a greater degree of incompleteness. There are variations in these trends by cancer types. **Conclusions** These three databases present general similarities in demographic and geographic distribution, but there are overarching differences across the populations they cover. Differences in data sourcing (medical oncology EHRs vs cancer registries), and disparities in sampling approaches and rules of data acquisition may explain some of these divergences. Furthermore, unlike the steady information flow entered into registries, the availability of medical oncology EHR-derived information reflects the extent of involvement of medical oncology clinics at different points in the specialty management of individual diseases, resulting in inter-disease variability. These differences should be considered when interpreting study results obtained with these databases. Keywords * cancer * real-world data * cancer registries ## INTRODUCTION The field of oncology is undergoing rapid evolution as our understanding of the pathophysiology of cancer expands and therapeutic progress accelerates. This environment calls for the development of tools that facilitate the translation of these advances into improvements in patient care. Recent years have seen increasing use of real-world data (RWD) as a clinical research source, from descriptive epidemiology to intervention effectiveness studies. RWD analyses generate real-world evidence (RWE) that could supplement and complement the evidence for new drug approvals (traditionally gathered from prospective studies), for health services research, policy evaluation, or as a pharmacovigilance tool (1,2). RWD can be obtained from many different sources, including billing and administrative claim activities, product and disease registries, national surveys, and electronic health records (EHRs). The appropriate and optimal source of RWD will differ by the research question. Traditionally, registries have been a key RWD source for epidemiologic and population-based outcomes studies; in the US, the Surveillance, Epidemiology, and End Results program (SEER) registries and the National Program of Cancer Registries (NPCR), have been commonly used for oncology research. The SEER Program (3), supported by the Surveillance Research Program (SRP) in the National Cancer Institute’s (NCI) Division of Cancer Control and Population Sciences (DCCPS), collects data on patient demographics, primary tumor site, morphology and stage at diagnosis, and first course of treatment; the SEER program provides cancer incidence and survival data through 16 state level population-based cancer registries currently covering approximately 34.6% of the US population. The NPCR (4) is supported by the US Centers for Disease Control and Prevention (CDC) Division of Cancer Prevention and Control, covers 97% of the US cancer population, spans 46 states, the District of Columbia, Puerto Rico, the US Pacific Island Jurisdictions, and the US Virgin Islands. Following the passage of the Health Information Technology for Economic and Clinical Health Act (HITECH) (5) in January 2009, EHR systems have been rapidly adopted in the US and are now a key source of RWD. In the field of oncology, adoption has been even swifter than in other fields, and by 2015, approximately 90% of oncology practices had already adopted EHRs (6, 7). This, accompanied by an exponential growth of data storage and mining technology, has allowed for access to detailed clinical data at an unprecedented scale. In oncology, patient-level data can be derived from EHRs to generate granular information about baseline population characteristics as well as longitudinal views of sequential treatments, interventions, and associated outcomes. Flatiron Health is an oncology-focused health technology company that generates RWD from two EHR-derived primary sources: (i) OncoEMR®□, a proprietary oncology-specific EHR used by community oncologists throughout the US, and (ii) EHR data integrations with academic research centers that enable bidirectional transmission of RWD. The underlying data collection procedures and data model architecture make Flatiron Health a fundamentally different data source from SEER or NPCR (Table 1). SEER and NPCR have been well established as research resources since their inception in 1973 and 1992, respectively. As the most recent of the three sources, data derived from Flatiron Health have become a research resource in the last five years (8–18); therefore, it has become increasingly critical to understand their features. SEER and NPCR collect specific incident disease data points in a systematic and ordered fashion, fulfilling a public health reporting mandate justified by the public health burden of cancer as a disease. Flatiron Health data collection mirrors routine oncology EHR documentation practices; upon curation, this approach yields longitudinal clinical data models with considerable depth, including clinical, genetic, and outcome data. This feature, together with its 30-day recency, makes this a suitable source for detailed investigation of contemporary trends in cancer management, including sequential time-to-event endpoints. View this table: [Table 1.](http://medrxiv.org/content/early/2020/03/18/2020.03.16.20037143/T1) Table 1. General features of SEER, NPCR and Flatiron Health databases. Researchers considering Flatiron Health data for their studies may be interested in gaining insights into the nature of the originating source, on the principles of data collection, and on the characteristics of its patient population. Descriptive statistics of the Flatiron Health, SEER, and NPCR data can be informative for the interpretation of findings when these large data sets are used in cancer research. This study aims to describe data sources and collection procedures, and to provide a detailed comparison of the demographic characteristics in the disease-specific databases from Flatiron Health, SEER, and NPCR, identifying similarities and differences within the data elements common across the three. ## METHODS ### Flatiron Health databases Flatiron Health has developed de-identified disease-specific dynamic databases (termed enhanced data marts) derived from information available in an EHR. These databases combine curated manually-abstracted unstructured data with structured data. The starting point is a single database composed only of structured data elements available within an EHR (termed the Flatiron Health research database) refreshed on a monthly basis. This large general cross-tumor cohort includes all patients with at least one International Classification of Diseases (ICD)-9 or ICD-10 cancer code and at least one unique-date clinic encounter documented in the EHR (reflected by records of vital signs, treatment administration, and/or laboratory tests) on or after January 1, 2011, from both academic and community care sites combined. From there, patient data are sampled each month into each disease-specific database with randomized approaches implemented through software code, to ensure uniform application of the sampling approach and to avoid the potential for bias. Inclusion is based on cancer-specific cohort inclusion and exclusion criteria (i.e., relevant ICD-9 and ICD-10 codes and technology-assisted abstraction of unstructured information, such as metastatic status). All manual abstraction of unstructured information, including confirmation of diagnosis and stage, is carried out by abstractors (i.e., clinical oncology nurses or tumor registrars). Clinically-relevant details specific to each cancer type are abstracted from any form of clinical documentation available in the EHR including clinic visit notes, radiology reports, pathology reports, etc. Abstractors are trained to identify and extract relevant information by following policies and procedures tested and optimized for reliability and reproducibility through iterative processes, and oversight is provided by medical oncologists. Each month, the datasets grow with new cases as well as incremental abstraction of newly-available clinical documentation from pre-existing patients. Therefore, at any given cutoff time, the size of a dataset is dependent on the initiation date for that disease-specific dataset and the overall population prevalence (which determines the growth rate). Typical data recency is 30 days. Practically speaking, a data cutoff of December 31, 2019 would include all information entered into the EHR through November 30, 2019, and subsequent cutoffs (i.e., January 31, 2020, February 29, 2020, etc) will render increasingly larger sample sizes. In addition, each database undergoes continuous audit procedures to monitor abstractor performance while proprietary technology links each curated data variable to its source documentation within the EHR, enabling subsequent review, when necessary. At the individual patient level, this approach provides a recent and robust longitudinal view into the clinical course, capturing new clinical information as it is documented within the EHR. Flatiron Health data are available for research via Institutional review board (IRB) approval of a master study protocol with waiver of informed consent (IRB # RWE-001, “The Flatiron Health Real-World Evidence Parent Protocol”, Tracking # FLI1-18-044 by the Copernicus Group IRB), obtained prior to study conduct, which covers the data from all sites represented. As of May 31, 2019, there were 19 disease-specific databases available at Flatiron Health: advanced urothelial cancer, metastatic breast cancer, early breast cancer, chronic lymphocytic leukemia (CLL), metastatic colorectal cancer, diffuse large B-cell lymphoma (DLBCL), follicular lymphoma (FL), advanced gastric/esophageal carcinoma, advanced hepatocellular carcinoma (HCC), advanced head and neck cancer, advanced melanoma, malignant pleural mesothelioma, multiple myeloma (MM), advanced non-small cell lung cancer (NSCLC), ovarian carcinoma, metastatic pancreatic carcinoma, metastatic prostate cancer, advanced renal-cell carcinoma (RCC), and small-cell lung cancer (SCLC) (Appendix). At least two unique-date clinic encounters documented in the EHR in the Flatiron Health database (reflected by records of vital signs, treatment administration, and/or laboratory tests) on or after January 1, 2011, are required for patient data to be entered into a given dataset, with the exception of a few diseases with different start dates (Appendix). At the time of this analysis, the Flatiron Health EHR-derived database included de-identified data from over 280 cancer practices representing more than 2.2 million patients and about 800 distinct sites of care from all 50 states and Puerto Rico. The distribution of patients across community and academic practices largely reflects patterns of care in the US, where most patients are treated in community clinics, but can vary for each disease. Mortality information is captured via a composite variable that uses multiple data sources (structured and unstructured EHR content, commercial sources, Social Security Death Index) and is benchmarked against the National Death Index data as a gold standard (19). ### SEER and NPCR databases The SEER Program supports most aspects of cancer surveillance research, providing analytical tools, and methodological expertise in collecting, analyzing, interpreting and disseminating population-based statistics. SEER population-based data include cancer incidence and survival data by age, sex, race, year of diagnosis, and geographic areas (including SEER registry and county). SEER releases new research data each spring based on the previous November’s data submission. Data are available across various registries and versions of the SEER Program from 1975 through 2016. Mortality information in SEER is obtained from the CDC’s National Center for Health Statistics’ National Vital Statistics System (20) and includes mortality data along with cause-specific death classification information. The NPCR cancer registries routinely capture data elements including the type, extent, and location of the cancer, the type of initial treatment, and outcomes of newly diagnosed cancers. Medical facilities such as hospitals, physician offices, and pathology laboratories send information about cancer cases to their respective central cancer registry, and each central cancer registry submits electronically de-identified demographic and clinical information to the NPCR on a yearly basis (4). Mortality information in the NPCR is obtained from the CDC’s National Center for Health Statistics’ National Vital Statistics System (20). As of May 31, 2019, the most recent information available from NPCR included new incident malignancies diagnosed through December 31, 2016. NPCR data is made available through the US Cancer Statistics dataset, which combines NPCR data and data from 4 SEER-funded states (Connecticut, Hawaii, Iowa, and New Mexico). This data provides information on 100% of the US population. In this paper, “NPCR data” was obtained through the US Cancer Statistics public use research dataset and was restricted to the 46 NPCR funded states and D.C. ## Comparative analysis ### Variables For each cancer type, demographic and clinical characteristics including race, age, region, year and stage at diagnosis were compared between the Flatiron Health and the SEER and NPCR databases. To overcome coding discrepancies across databases, cancer types were matched using ICD-9, ICD-10, and histology codes (e.g. ICD-0-3). All comparisons were unadjusted. Variables not available across two or more data sources were not included in the comparison (e.g., smoking status, treatment detail, real world progression [rwP] information). In the Flatiron Health databases, cancer staging information was collected as entered into the EHR by the treating physician or otherwise as assessed by Flatiron Health abstractors; during the study time period, the applicable staging criteria for solid tumors were those of the American Joint Commission of Cancer (AJCC) 7th edition and 8th edition manuals (21, 22), Rai staging for CLL and the International Staging System (ISS) for MM. For SEER and NPCR, diagnosis and staging information was abstracted from various sources including medical records and pathology reports. For both programs, staging information followed the Collaborative Stage coding systems (23). See additional information on variable definitions in the Appendix. ### Patient eligibility and time frames Diagnostic codes and eligibility criteria used to select the patients eligible for each database are listed in the Appendix. For SEER and NPCR, only malignant cases diagnosed on or after January 1st, 2011 were included for all the analyses (benign, uncertain behavior, carcinoma in situ, secondary malignancy cancers were excluded). For the Flatiron Health databases, patients diagnosed between January 1, 2011 and May 31, 2019, were included (except those missing diagnosis year and/or birth year). In order to compare the particular data segments common across all three databases (the most recent SEER and NPCR data releases reach through 2016 as initial diagnosis year), descriptive analyses were performed not only for all patients available for analysis across the entire time frame of January 2011 - May 2019 in the Flatiron Health databases, but also in the subset available for analysis from January 2011 - December 2016. As sensitivity analyses to address potential biases related to temporal drifts, we performed separate comparisons for the patient subgroups who had stage IV disease at diagnosis in each cancer type, for whom survival times would be expected to be shorter and the date of diagnosis would be expected to be closer to the database entry point. The potential biases to address were twofold: (i) as noted above, SEER and NPCR only collect specific incident disease data points, whereas Flatiron Health databases include both incident and prevalent cases. Therefore, Flatiron Health databases may receive patients at the time of diagnosis but also patients with initial diagnosis dates in the past; these cases may have long intervening periods between the initial diagnosis date and the date of entry into the Flatiron Health database, introducing a potential bias for patient characteristics associated with longer survival times (when compared with strictly incident cases in cancer registries); (ii) in addition, temporal trends where certain patient characteristics (i.e., sex, age) may be associated with cancer diagnoses during discrete time periods and can affect distributions depending on diagnosis year. ### Analyses Case-level data for patients in SEER were extracted from the SEER 18 November 2018 data submission dataset to the SEER Program by using the Case Listing Session feature in SEER*Stat software (Version 8.3.6, Information Management Services, Inc., Silver Spring, MD) and processed by using R 3.6.1. For patients in NPCR, case listing is not publicly available in the US Cancer Statistics public use SEER*Stat dataset, and case-level data cannot be accessed or downloaded. Frequencies by demographic and clinical characteristics for all malignant cases were calculated in SEER*Stat software using the November 2018 data submission. ## RESULTS Among the 2.2 million patients with cancer in the Flatiron Health database as of May 2019, 201,570 were included in this analysis, as well as 1,719,277 and 6,308,342 cases from the SEER and NPCR, respectively. The disease-specific databases vary in size, depending on the incidence and prevalence of the disease, and in the case of Flatiron Health, on the length of time the database has been active. The largest comparisons corresponded to 55,554 vs 273,742 and 903,355 patients with NSCLC, and the smallest to 1,116 vs 18,148 and 72,575 for patients with FL. Tables 2–20 present the comparisons for the following tumor types: advanced urothelial (or bladder), metastatic breast cancer, early breast cancer, CLL, metastatic colorectal cancer, DLBCL, FL, advanced gastric/esophageal carcinoma, HCC, advanced head and neck cancer, advanced melanoma, malignant pleural mesothelioma, MM, advanced NSCLC, ovarian carcinoma, metastatic pancreatic carcinoma, metastatic prostate cancer, metastatic RCC, and SCLC. View this table: [Table 2.](http://medrxiv.org/content/early/2020/03/18/2020.03.16.20037143/T2) Table 2. Characteristics for patients with bladder cancer View this table: [Table 3.](http://medrxiv.org/content/early/2020/03/18/2020.03.16.20037143/T3) Table 3. Characteristics for patients with metastatic breast cancer in the Flatiron Health database, compared with patients with breast cancer (any stage) in SEER and NPCRa View this table: [Table 4.](http://medrxiv.org/content/early/2020/03/18/2020.03.16.20037143/T4) Table 4. Characteristics for patients with early breast cancer in the Flatiron Health database, compared with patients with breast cancer (any stage) in SEER and NPCRa View this table: [Table 5.](http://medrxiv.org/content/early/2020/03/18/2020.03.16.20037143/T5) Table 5. Characteristics for patients with chronic lymphocytic leukemia View this table: [Table 6.](http://medrxiv.org/content/early/2020/03/18/2020.03.16.20037143/T6) Table 6. Characteristics for patients with colorectal cancer View this table: [Table 7.](http://medrxiv.org/content/early/2020/03/18/2020.03.16.20037143/T7) Table 7. Characteristics for patients with diffuse large B-cell lymphoma View this table: [Table 8.](http://medrxiv.org/content/early/2020/03/18/2020.03.16.20037143/T8) Table 8. Characteristics for patients with follicular lymphoma View this table: [Table 9.](http://medrxiv.org/content/early/2020/03/18/2020.03.16.20037143/T9) Table 9. Characteristics for patients with esophageal/gastric carcinoma View this table: [Table 10.](http://medrxiv.org/content/early/2020/03/18/2020.03.16.20037143/T10) Table 10. Characteristics for patients with hepatocellular carcinoma View this table: [Table 11.](http://medrxiv.org/content/early/2020/03/18/2020.03.16.20037143/T11) Table 11. Characteristics for patients with head and neck cancer View this table: [Table 12.](http://medrxiv.org/content/early/2020/03/18/2020.03.16.20037143/T12) Table 12. Characteristics for patients with melanoma View this table: [Table 13.](http://medrxiv.org/content/early/2020/03/18/2020.03.16.20037143/T13) Table 13. Characteristics for patients with malignant pleural mesothelioma View this table: [Table 14.](http://medrxiv.org/content/early/2020/03/18/2020.03.16.20037143/T14) Table 14. Characteristics for patients with multiple myeloma View this table: [Table 15.](http://medrxiv.org/content/early/2020/03/18/2020.03.16.20037143/T15) Table 15. Characteristics for patients with non-small cell lung cancer (NSCLC) View this table: [Table 16.](http://medrxiv.org/content/early/2020/03/18/2020.03.16.20037143/T16) Table 16. Characteristics for patients with ovarian carcinoma View this table: [Table 17.](http://medrxiv.org/content/early/2020/03/18/2020.03.16.20037143/T17) Table 17. Characteristics for patients with pancreatic carcinoma View this table: [Table 18.](http://medrxiv.org/content/early/2020/03/18/2020.03.16.20037143/T18) Table 18. Characteristics for patients with prostate cancer View this table: [Table 19.](http://medrxiv.org/content/early/2020/03/18/2020.03.16.20037143/T19) Table 19. Characteristics for patients with renal-cell carcinoma View this table: [Table 20.](http://medrxiv.org/content/early/2020/03/18/2020.03.16.20037143/T20) Table 20. Characteristics for patients with small-cell lung cancer ## DISCUSSION Results reported in this descriptive study provide an overview of the originating sources, data collection methods, and comparative population characteristics for three oncology-specific RWD sources in the US: the Flatiron Health, SEER, and NPCR data. We focused our comparisons on baseline demographic and clinical variables at the time of initial cancer diagnosis that describe the populations included in these data sources, based on the data elements commonly found in all three of them. Each of these three data sources relies on different collection approaches (Table 1), and our findings reveal population differences likely stemming from those distinct collection strategies. These differences in data collection methods and resulting populations should be considered when determining whether a dataset is fit-for-use for a particular research question and can help to contextualize research results obtained when using each data source. To further assist in that contextualization, this discussion highlights some of the potential underlying explanations for the differences observed. The distribution of patients according to sex/gender in the three data sources was comparable, but there were noticeable differences for other variables. Regarding regional distribution, Flatiron Health and the NPCR data were most closely aligned to the regional population distribution in the most recent US census (24). Due to its design, SEER data diverges the most from the census, particularly overrepresenting the West, while the Flatiron Health database provides a convenience sample that is slightly weighted towards the South and underrepresents the Western region. The three data sources had differences in age distribution, most notably a lower proportion of patients over 80 years at diagnosis in the Flatiron Health database. The ultimate reason for this discrepancy is probably ingrained in the different information sources that feed each one of them. State registries collect information regardless of patients’ site of care and from death certificates and autopsy reports (25), while Flatiron Health databases accrue information only via oncology clinics. By focusing on specialized care, Flatiron Health has limited reach into general hospice or other geriatric care settings, where a substantial number of elderly patients may be referred before they complete two visits to an oncology clinic (therefore excluding them from eligibility into Flatiron Health databases). There was one exception to this finding: prostate cancer, which may be a second cancer diagnosis for some elderly patients who could have already been entered into the Flatiron Health database at the time of a prior cancer diagnosis. For information on race/ethnicity, the different data collection approaches result in expected differences in completeness and in population distribution across the three databases. The proportion of incomplete records for race/ethnicity in Flatiron Health data is greater than in the other databases. During routine oncology care in the US (i.e., in the source clinics for Flatiron Health), collection of race/ethnicity data is not mandatory or incentivized; on the other hand, the registries feeding both SEER and NPCR have a mandate to reach certain levels of completeness for this variable and thus, this information is collected both directly from self-reports and indirectly using algorithms (26). In addition, Flatiron Health relies on self-reported information by patients, which adds complexity and variability to an information category that is in constant evolution within a broader social context. These challenges probably contribute to the apparent lower completeness of this variable in the Flatiron Health databases. Furthermore, due to the purposeful design of the SEER program, there is an overrepresentation of certain groups compared to the US census (24, 27), a finding consistent with prior representativeness studies (28–30). Lastly, a combination of differences in sources and in data collection approaches across the three data sources leads to noticeable differences in their information about AJCC disease stage at diagnosis. In some diseases, particularly in HCC, malignant pleural mesothelioma, RCC, SCLC, prostate cancer, and DLBCL, detailed AJCC disease stage information is missing from Flatiron Health databases to a substantially larger extent than in SEER, although that incompleteness is mitigated in the simplified category metastatic/non-metastatic disease. To understand that finding, it is important to note that Flatiron Health data are generated from a pipeline of medical oncology EHR-derived data, where stage information is mostly as documented in unstructured notes by the treating oncology team. In contrast, registries rely on multiple sites of care as sources, and disease stage is intentionally entered into their databases via mandated calculation and coding by trained tumor registrars (31, 32). Ultimately, these fundamental differences lead to idiosyncratic fluctuations in information completeness in EHR-derived vs systematically-collected data. For instance, clinical scenarios where medical oncologists tend to be involved in initial diagnosis (when staging takes place) are more likely to have a complete capture of initial staging in the medical oncology EHR. To wit, compare cancers commonly diagnosed at an advanced stage (e.g., SCLC) or that are eligible for systemic adjuvant therapy from early stages (e.g., breast cancer) with diseases where the medical oncologist tends to be less involved in initial diagnosis (e.g., RCC, which is often managed surgically upon initial diagnosis). Completeness/incompleteness of EHR-derived data is also affected by practical patterns of clinical documentation; as staging algorithms become increasingly complex, practitioners may tend to be more attentive to staging specifics in settings where the link between staging and treatment is more crucial (i.e., local or locally-advanced settings, where patients are candidates for multi-modality therapy), and less stringent in other settings (i.e., advanced disease) where staging information is less critical to clinical decision making. For example, AJCC staging is not a key consideration for initial HCC treatment decisions; clinical and laboratory data to assess underlying liver functional status and inform potential transplant eligibility are far more clinically relevant. Treating clinicians may prefer to document and rely on clinically actionable, non-AJCC staging systems for the routine management of some diseases, like HCC and SCLC, resulting in less AJCC-based information available in those databases. In conclusion, the disease-specific Flatiron Health databases provide deep demographic, clinical, and treatment data models derived from EHR information. Several of the data elements in the Flatiron Health databases cannot be found in SEER or NPCR, such as date of metastatic diagnosis and sites of metastatic disease, comprehensive standard biomarker status, longitudinal treatment sequences, and disease progression dates. Within the portfolio of data elements commonly found across the three databases, comparing Flatiron Health to SEER and NPCR shows that Flatiron Health has a regional distribution closer to the general US census than SEER, an overall lower proportion of patients older than 80 years at diagnosis, less complete racial information, and disease-dependent variability in the capture of staging data. These differences stem from the originating EHR-source and from the rules for data capture and processing. Investigators should consider these inter-database demographic differences when designing studies and interpreting results obtained with Flatiron Health data, and when contextualizing their findings relative to SEER- or NPCR-based research. ## Data Availability Data supporting the findings of this study have been originated by Flatiron Health, Inc. These de-identified data can be made available upon request, and are subject to a license agreement with Flatiron Health; interested researchers should contact 'DataAccess@flatiron.com' to determine licensing terms ![Figure 1.](http://medrxiv.org/https://www.medrxiv.org/content/medrxiv/early/2020/03/18/2020.03.16.20037143/F1.medium.gif) [Figure 1.](http://medrxiv.org/content/early/2020/03/18/2020.03.16.20037143/F1) Figure 1. Schema of the structure of the Flatiron Health databases RD=research database; EDM=enhanced data mart ## TABLES **Note** Shaded areas are groups with no reported results. NA= not available from the source data, or numbers under the reporting suppression value to preserve patient confidentiality ## APPENDIX View this table: [Table A1.](http://medrxiv.org/content/early/2020/03/18/2020.03.16.20037143/T21) Table A1. Overview of diagnostic and histology codes used to define eligibility, and of additional entry criteria and definitions of ‘metastatic at diagnosis’. View this table: [Table A2.](http://medrxiv.org/content/early/2020/03/18/2020.03.16.20037143/T22) Table A2. Overview of variable definitions and steps taken to align variables across the different databases. View this table: [Table A3.](http://medrxiv.org/content/early/2020/03/18/2020.03.16.20037143/T23) Table A3. Distribution of patient characteristics excluding patients with unknown/missing status across the categories evaluated ## Acknowledgements Authors wish to thank Julia Saiz, PhD, and Cody Patton from Flatiron Health, for editorial support. We also wish to thank Mary Elizabeth O’Neil, MPH from the CDC DCPC, and Angela Mariotto, PhD and Donna R. Rivera, PharmD, MSc. from the DCCPS at the NIH NCI for their review and valuable feedback. ## Footnotes * **Funding:** This study was sponsored by Flatiron Health, Inc. (Flatiron Health), which is an independent subsidiary of the Roche group. * **Disclosures:** All authors are employees of Flatiron Health, Inc., which is an independent subsidiary of the Roche group, and own stock in Roche. S.S.B., L.L. own equity in Flatiron Health. * Received March 16, 2020. * Revision received March 16, 2020. * Accepted March 18, 2020. * © 2020, Posted by Cold Spring Harbor Laboratory This pre-print is available under a Creative Commons License (Attribution 4.0 International), CC BY 4.0, as described at [http://creativecommons.org/licenses/by/4.0/](http://creativecommons.org/licenses/by/4.0/) ## REFERENCES 1. 1.US Food and Drug Administration (b). Framework for FDA’s real-world evidence program. December 2018. Accessed at [https://www.fda.gov/media/120060/download](https://www.fda.gov/media/120060/download) on December 23, 2019 2. 2.Eichler H-G, Bloechl-Daum B, Broich K et al. Data rich, information poor: Can we use electronic health records to create a learning healthcare system for pharmaceuticals? Clin Pharmacol Ther. 2019;105: 912–922. 3. 3.National Cancer Institute. Surveillance, Epidemiology, and End Results Program. Overview of the SEER Program. Accessed at [https://seer.cancer.gov/about/overview.html](https://seer.cancer.gov/about/overview.html) on February 18, 2020 4. 4.Center for Disease Control and Prevention. National Program of Cancer Registries. Accessed at [https://www.cdc.gov/cancer/npcr/index.htm](https://www.cdc.gov/cancer/npcr/index.htm) on February 18, 2020 5. 5.Health Information Technology (HITECH Act). Accessed at [https://www.healthit.gov/sites/default/files/hitech\_act\_excerpt\_from\_arra\_with\_index.pdf](https://www.healthit.gov/sites/default/files/hitech\_act\_excerpt\_from_arra_with_index.pdf) on February 18 2020 6. 6.American Society of Clinical Oncology. The state of cancer care in America, 2015: a report by the American Society of Clinical Oncology. J Oncol Practice. 2015; 11:79–113 7. 7.The Office of the national Coordinator for Health Information Technology. 2016 Report to congress on health IT progress: examining the hitech era and the future of health it. Accessed at [https://www.healthit.gov/sites/default/files/2016\_report\_to\_congress\_on\_healthit\_progress.pdf](https://www.healthit.gov/sites/default/files/2016\_report\_to\_congress_on_healthit_progress.pdf) on February 18 2020 8. 8.Parikh RB, Feld EK, Galsky MD et al. First-line immune checkpoint inhibitor use in cisplatin-eligible patients with advanced urothelial carcinoma: results from a real-world analysis. Futur Oncol. 2019;16:4341–4345. 9. 9.Parikh RB, Adamson BJS, Khozin S et al. Association between FDA label restriction and immunotherapy and chemotherapy use in bladder cancer. JAMA. 2019;322:1209–1211. 10. 10.Khozin S, Miksad RA, Adami J et al. Real-world progression, treatment, and survival outcomes during rapid adoption of immunotherapy for advanced non-small cell lung cancer. Cancer. 2019;125:4019–4032. 11. 11.Feld EK, Harton J, Meropol NJ et al. Effectiveness of first-line immune checkpoint blockade versus carboplatin-based chemotherapy for metastatic urothelial cancer. Eur Urol. 2019;76:524–532. 12. 12.Steuten LM, Goulart BHL, Meropol NJ, et al. Cost effectiveness of multigene panel sequencing for patients with advanced non-small-cell lung cancer. JCO Clin Cancer Informatics. 2019;3:1–. [CrossRef](http://medrxiv.org/lookup/external-ref?access_num=10.1200/CCI.18.00057&link_type=DOI) 13. 13.Bagley SJ, Talento S, Mitra N et al. Comparative effectiveness of carboplatin/pemetrexed with versus without bevacizumab for advanced nonsquamous non-small cell lung cancer. J Natl Compr Cancer Netw. 2019;17:469–477 14. 14.Singal G, Miller PG, Agarwala V et al. Association of patient characteristics and tumor genomics with clinical outcomes among patients with non-small cell lung cancer using a clinicogenomic database. JAMA. 2019;321:1391–1399. [PubMed](http://medrxiv.org/lookup/external-ref?access_num=http://www.n&link_type=MED&atom=%2Fmedrxiv%2Fearly%2F2020%2F03%2F18%2F2020.03.16.20037143.atom) 15. 15.Parikh RB, Galsky MD, Gyawali B et al. Trends in checkpoint inhibitor therapy for advanced urothelial cell carcinoma at the end of life: Insights from real- world practice. Oncologist. 2019;24:397–399. 16. 16.Riaz F, Presley CJ, Chiang AC et al. Disparities in broad-based genomic sequencing for patients with advanced non-small cell lung cancer. J Geriatr Oncol. 2019;10:669–672. 17. 17.Winfree KB, Torres AZ, Zhu YE et al. Treatment patterns, duration, and outcomes of pemetrexed maintenance therapy in patients with advanced NSCLC in a real-world setting. Curr Med Res Opin. 2018;35:817–827. 18. 18.Presley CJ, Tang D, Soulos PR et al. Association of broad-based genomic sequencing with survival among patients with advanced non-small cell lung cancer in the community oncology setting. JAMA. 2018;320:469–477. 19. 19.Curtis MD, Griffith SD, Tucker MG et al. Development and validation of a high- quality composite real- world mortality endpoint. Health Serv Res. 2018;53:4460–4476. [PubMed](http://medrxiv.org/lookup/external-ref?access_num=http://www.n&link_type=MED&atom=%2Fmedrxiv%2Fearly%2F2020%2F03%2F18%2F2020.03.16.20037143.atom) 20. 20.Center for Disease Control and Prevention. National Vital Statistics System. Accessed at [https://www.cdc.gov/nchs/nvss/index.htm](https://www.cdc.gov/nchs/nvss/index.htm) on February 18, 2020 21. 21.Edge SB, Byrd DR, Compton CC, et al., editors: AJCC cancer staging manual (7th ed). New York, NY: Springer; 2010. 22. 22.Greene FL, Byrd DR, Brookland RK, et al., editors: AJCC cancer staging manual (8th ed). New York, NY: Springer; 2017. 23. 23.American Joint Committee on Cancer. Collaborative stage data collection system. Accessed at [http://www.cancerstaging.org/cstage/Pages/default.aspx](http://www.cancerstaging.org/cstage/Pages/default.aspx) on February 19, 2020 24. 24.United States Census Bureau. Accessed at [https://www.census.gov/quickfacts/fact/map/US/POP010210](https://www.census.gov/quickfacts/fact/map/US/POP010210) on February 18, 2020 25. 25.National Cancer Institute. SEER training modules. Accessed at [https://training.seer.cancer.gov/abstracting/](https://training.seer.cancer.gov/abstracting/) on February 18, 2020 26. 26.National Cancer Institute. Surveillance Epidemiology and End Results Program. Race recode changes. Accessed at [https://seer.cancer.gov/seerstat/variables/seer/race\_ethnicity/](https://seer.cancer.gov/seerstat/variables/seer/race_ethnicity/) on February 18, 2020. 27. 27.National Cancer Institute. Surveillance Epidemiology and End Results Program. Population Characteristics. Accessed at [https://seer.cancer.gov/registries/data.html](https://seer.cancer.gov/registries/data.html) on March 11, 2020 28. 28.Warren JL, Klabunde CN, Schrag D et al. Overview of the SEER-Medicare data: content, research applications, and generalizability to the United States elderly population. Med Care. 2002;40(8 Suppl):IV–18. [CrossRef](http://medrxiv.org/lookup/external-ref?access_num=10.1097/01.MLR.0000020942.47004.03&link_type=DOI) 29. 29.Nattinger AB, McAuliffe TL, Schapira MM. Generalizability of the surveillance, epidemiology, and end results registry population: factors relevant to epidemiologic and health care research. J Clin Epidemiol. 1997;50:939–945. [CrossRef](http://medrxiv.org/lookup/external-ref?access_num=10.1016/S0895-4356(97)00099-1&link_type=DOI) [PubMed](http://medrxiv.org/lookup/external-ref?access_num=9291879&link_type=MED&atom=%2Fmedrxiv%2Fearly%2F2020%2F03%2F18%2F2020.03.16.20037143.atom) [Web of Science](http://medrxiv.org/lookup/external-ref?access_num=A1997XU61900009&link_type=ISI) 30. 30.Kuo TM, Mobley LR. How generalizable are the SEER registries to the cancer populations of the USA? Cancer Causes Control. 2016;27: 1117–26. 31. 31.National Cancer Institute. Surveillance Epidemiology and End Results Program. Cancer Stage Variable Documentation. Accessed at [https://seer.cancer.gov/analysis/stage.html](https://seer.cancer.gov/analysis/stage.html) on February 18, 2020 32. 32.National Cancer Institute. Surveillance Epidemiology and End Results Program. Registry Operations. Accessed at [https://seer.cancer.gov/registrars/](https://seer.cancer.gov/registrars/) on February 19, 2020. 33. 33.NPCR and SEER Incidence – U.S. Cancer Statistics 2001-2016 Public Use Database – Data Standards and Data Dictionary. Accessed at [https://www.cdc.gov/cancer/uscs/public-use](https://www.cdc.gov/cancer/uscs/public-use) on February 19, 2020 34. 34.Printz C. Changes underway for SEER: Program leaders work to increase the breadth and depth of information. Cancer. 2015;121: 3183–3184. 35. 35.Griffith SD, Tucker M, Bowser B et al. Generating real-world tumor burden endpoints from electronic health record data: comparison of RECIST, radiology-anchored, and clinician-anchored approaches for abstracting real-world progression in non-small cell lung cancer. Adv Ther. 2019; 36, 2122–2136.