Abstract
Introduction Idiopathic pulmonary fibrosis (IPF) is a rare lung disease characterised by progressive scarring in the alveoli. IPF can be defined in population studies using electronic healthcare records (EHR), but recent genetic studies of IPF using EHR have shown attenuated effect sizes for known genetic risk factors compared with clinically-derived datasets, suggesting misclassification of cases.
Methods We used EHR (ICD-10, Read (2 & 3)) and questionnaire data to define IPF cases in UK Biobank, and evaluated these definitions using association results for the largest genetic risk variant for IPF (rs35705950-T, MUC5B). We further evaluated the impact of exclusions based on co-occurring codes for non-IPF pulmonary fibrosis and restricting codes according to changes in diagnostic practice.
Results Odds ratio (OR) estimates for the association of rs35705950-T with IPF, defined using EHR and questionnaire data in UK Biobank, were significant and ranged from 2.06 to 3.09, lower than those reported using clinically-derived IPF datasets (95% confidence interval: 3.74, 6.66). Code-based exclusions of cases gave effect estimates slightly closer to those previously reported, but sample sizes were substantially reduced.
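The comparison in the Results can be checked with a short calculation. The sketch below uses only the figures quoted above; the function names are illustrative, and the implied point estimate assumes the reported interval is a Wald interval symmetric on the log-odds scale, which the abstract does not state.

```python
import math

# Figures quoted in the Results paragraph.
UKB_OR_RANGE = (2.06, 3.09)   # lowest and highest UK Biobank OR estimates
CLINICAL_CI = (3.74, 6.66)    # 95% CI from clinically-derived IPF datasets


def implied_point_estimate(ci_low: float, ci_high: float) -> float:
    """Geometric mean of the CI bounds.

    This equals the OR point estimate only if the interval is symmetric
    on the log-odds scale (assumed here, not stated in the abstract).
    """
    return math.sqrt(ci_low * ci_high)


def is_below_interval(estimate: float, ci_low: float) -> bool:
    """True if an OR estimate lies below the lower bound of the interval."""
    return estimate < ci_low


clinical_or = implied_point_estimate(*CLINICAL_CI)
all_below = all(is_below_interval(o, CLINICAL_CI[0]) for o in UKB_OR_RANGE)

print(f"Implied clinical OR (assumed log-symmetric CI): {clinical_or:.2f}")
print(f"All UK Biobank estimates below the clinical CI: {all_below}")
```

Even the largest UK Biobank estimate (3.09) lies below the lower bound of the clinical interval (3.74), which is the attenuation the abstract describes.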
Discussion We show that none of the UK Biobank IPF codes replicate the effect size for the association of rs35705950-T with IPF risk observed in clinically-derived IPF datasets. Further code-based exclusions also did not bring effect estimates to the level expected. Whilst the apparently larger sample sizes available for IPF from general population cohorts may be of benefit, future studies should take these limitations of the case definition into account.
What is already known on this topic UK Biobank is a very large prospective cohort that can be utilised to increase sample sizes for studies of rare diseases such as idiopathic pulmonary fibrosis (IPF). However, effect size estimates for genetic risk factors for IPF in UK Biobank and other general population cohorts, when defining cases using electronic healthcare records (EHR), are smaller than those estimated from clinically-derived IPF datasets.
What this study adds Using Hospital Episode Statistics (HES) data, primary care data, death registry data and self-report data in UK Biobank, we used the association of rs35705950-T, the largest genetic risk factor for IPF, with IPF to evaluate code-based definitions of IPF. We show that none of the available IPF codes replicates the effect size for rs35705950-T on IPF risk that is observed in clinically-derived IPF datasets.
How this study might affect research, practice or policy Research using large general population cohorts and datasets for observational studies of IPF should take these limitations of EHR definitions of IPF into consideration.
Competing Interest Statement
LVW reports current and recent research funding from GSK, Genentech and Orion Pharma, and consultancy for Galapagos. RGJ is a trustee of Action for Pulmonary Fibrosis and reports personal fees from AstraZeneca, Biogen, Boehringer Ingelheim, Bristol Myers Squibb, Chiesi, Daewoong, Galapagos, Galecto, GlaxoSmithKline, Heptares, NuMedii, PatientMPower, Pliant, Promedior, Redx, Resolution Therapeutics, Roche, Veracyte and Vicore. JKQ has received grants from The Health Foundation, MRC, GSK, Bayer, BI, AUK-BLF, HDR UK, Chiesi and AZ and personal fees for advisory board participation or speaking fees from GlaxoSmithKline, Boehringer Ingelheim, AstraZeneca, Chiesi, Insmed and Bayer.
Funding Statement
LVW holds a GSK/Asthma+Lung UK Chair in Respiratory Research (C17-1). LVW and RGJ are supported by MRC Programme grant MR/V00235X/1. RJA is an Action for Pulmonary Fibrosis Research Fellow. RGJ is supported by a National Institute for Health Research (NIHR) Research Professorship (NIHR reference RP-2017-08-ST2-014). LMK was funded by a Medical Research Council (MRC) PhD studentship (MR/N013913/1). MDT is supported by a Wellcome Trust Investigator Award (WT202849/Z/16/Z). The research was partially supported by the NIHR Leicester Biomedical Research Centre; the views expressed are those of the author(s) and not necessarily those of the National Health Service (NHS), the NIHR, or the Department of Health. This research has been conducted using the UK Biobank Resource under application 77050. This research used the SPECTRE High Performance Computing Facility at the University of Leicester.
Author Declarations
I confirm all relevant ethical guidelines have been followed, and any necessary IRB and/or ethics committee approvals have been obtained.
Yes
The details of the IRB/oversight body that provided approval or exemption for the research described are given below:
UK Biobank has approval by the Research Ethics Committee (REC) under approval number 16/NW/0274.
I confirm that all necessary patient/participant consent has been obtained and the appropriate institutional forms have been archived, and that any patient/participant/sample identifiers included were not known to anyone (e.g., hospital staff, patients or participants themselves) outside the research group so cannot be used to identify individuals.
Yes
I understand that all clinical trials and any other prospective interventional studies must be registered with an ICMJE-approved registry, such as ClinicalTrials.gov. I confirm that any such study reported in the manuscript has been registered and the trial registration ID is provided (note: if posting a prospective study registered retrospectively, please provide a statement in the trial ID field explaining why the study was not registered in advance).
Yes
I have followed all appropriate research reporting guidelines and uploaded the relevant EQUATOR Network research reporting checklist(s) and other pertinent material as supplementary files, if applicable.
Yes
Data Availability
UK Biobank data is publicly accessible upon approval of an application through www.ukbiobank.ac.uk.