ABSTRACT
The representativeness of Real-world Data is often assumed, but findings will rarely generalise to the target population when the potential outcomes under treatment are influenced by variables that also cause selection into a study. We assess the extent of selection bias arising through each selection process in a de-identified nationwide US Clinico-Genomic Database Non-Small Cell Lung Cancer (NSCLC) cohort using two referent populations: a superset of all NSCLC patients in the Flatiron Health network and the National Cancer Institute’s Surveillance, Epidemiology and End Results cancer registrations. Although Standardised Differences indicated differences in individual covariates between the sample and referent populations, the conditional distributions of selection were alike, and the Generalizability Indices suggested that the results are generalizable (≥ 0.96 on a proportional scale of 0–1). Estimates of Real-world Overall Survival in a population weighted to be representative did not differ from naïve estimates in the unweighted cohort. We conclude with a counterfactual analysis showing that the Sample and Population Average Treatment Effects were concordant in an example with a Generalizability Index of 0.97. The Tipton Generalizability Index provides a quantitative assessment of the generalizability of findings that can be used to determine the influence of selection biases.
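As an illustration of the 0–1 scale mentioned above, the Tipton Generalizability Index is commonly computed as a Bhattacharyya coefficient between the distributions of estimated selection propensity scores in the sample and in the referent population. The following is a minimal sketch under stated assumptions, not the authors' implementation: the propensity scores are assumed to have been estimated already (for example, by a logistic regression of sample membership on covariates), and the function name `generalizability_index` and the choice of 10 equal-width bins are hypothetical.

```python
import math
from typing import List, Sequence


def generalizability_index(
    sample_scores: Sequence[float],
    population_scores: Sequence[float],
    n_bins: int = 10,  # assumption: equal-width bins on [0, 1]
) -> float:
    """Histogram-based Bhattacharyya coefficient between two sets of
    selection propensity scores; 1 means identical binned distributions,
    0 means no overlap."""
    if not sample_scores or not population_scores:
        raise ValueError("both score sequences must be non-empty")
    if n_bins < 1:
        raise ValueError("n_bins must be at least 1")

    def bin_proportions(scores: Sequence[float]) -> List[float]:
        counts = [0] * n_bins
        for s in scores:
            if not 0.0 <= s <= 1.0:
                raise ValueError("propensity scores must lie in [0, 1]")
            # A score of exactly 1.0 is placed in the last bin.
            counts[min(int(s * n_bins), n_bins - 1)] += 1
        return [c / len(scores) for c in counts]

    p_sample = bin_proportions(sample_scores)
    p_population = bin_proportions(population_scores)
    # Sum over bins of the geometric mean of the two bin proportions.
    return sum(math.sqrt(a * b) for a, b in zip(p_sample, p_population))


# Toy scores (invented for illustration only, not study data).
toy_sample = [0.15, 0.15, 0.65, 0.65]
toy_population = [0.15, 0.65, 0.65, 0.65]
print(round(generalizability_index(toy_sample, toy_population), 3))
```

In this toy case the two groups occupy the same bins in slightly different proportions, so the index falls just below 1; groups whose scores share no bins would give 0.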
Competing Interest Statement
All authors were employees of AstraZeneca at the time of writing and may hold stock options.
Funding Statement
The study was funded and undertaken by AstraZeneca.
Author Declarations
I confirm all relevant ethical guidelines have been followed, and any necessary IRB and/or ethics committee approvals have been obtained.
Yes
The details of the IRB/oversight body that provided approval or exemption for the research described are given below:
The Institutional Review Board at AstraZeneca waived ethical approval for this work according to Standard Operating Procedure.
I confirm that all necessary patient/participant consent has been obtained and the appropriate institutional forms have been archived, and that any patient/participant/sample identifiers included were not known to anyone (e.g., hospital staff, patients or participants themselves) outside the research group so cannot be used to identify individuals.
Yes
I understand that all clinical trials and any other prospective interventional studies must be registered with an ICMJE-approved registry, such as ClinicalTrials.gov. I confirm that any such study reported in the manuscript has been registered and the trial registration ID is provided (note: if posting a prospective study registered retrospectively, please provide a statement in the trial ID field explaining why the study was not registered in advance).
Yes
I have followed all appropriate research reporting guidelines, such as any relevant EQUATOR Network research reporting checklist(s) and other pertinent material, if applicable.
Yes
Footnotes
Revised with a justification for including the ML-E dataset and an example of how the method can be used when estimating a causal estimand.
Data Availability Statement
The data that support the findings of this study originated from Flatiron Health, Inc. Requests for data sharing by license or by permission for the specific purpose of replicating results in this manuscript can be submitted to dataaccess{at}flatiron.com and cgdb-fmi{at}flatiron.com. Surveillance, Epidemiology and End Results (SEER) data are publicly available with a signed data-use agreement [https://seer.cancer.gov/data].
Abbreviations
- AJCC: American Joint Committee on Cancer
- ASD: Absolute Standardized Difference
- CGDB: Clinico-Genomic Database [cohort]
- IPW: Inverse Probability Weighting
- ML-E: Machine Learning-Extracted [cohort]
- NGS: Next Generation Sequencing
- NSCLC: Non-Small Cell Lung Cancer
- PATE: Population Average Treatment Effect
- rwOS: Real-world Overall Survival
- SATE: Sample Average Treatment Effect
- SEER: Surveillance, Epidemiology, and End Results [cohort]