Abstract
Auxiliary variables are used in multiple imputation (MI) to reduce bias and increase efficiency. However, these variables are often themselves incomplete. We explored how missing data in auxiliary variables influenced estimates obtained from MI. We implemented a simulation study with three different missing data mechanisms for the outcome. We then examined the impact of increasing proportions of missing data, and of different missingness mechanisms, for the auxiliary variable on the bias of an unadjusted linear regression coefficient and on the fraction of missing information. We illustrated our findings with an applied example in the Avon Longitudinal Study of Parents and Children. We found that where complete records analyses were biased, increasing proportions of missing data in auxiliary variables, under any missing data mechanism, reduced the ability of MI including the auxiliary variable to mitigate this bias. Where there was no bias in the complete records analysis, inclusion of a missing not at random auxiliary variable in MI introduced bias of potentially important magnitude (up to 17% of the effect size in our simulation). The quantity and nature of missing data in auxiliary variables need careful consideration when selecting them for use in MI models.
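The three missingness mechanisms considered for the auxiliary variable can be illustrated with a minimal sketch. This is not the authors' simulation code (their scripts are in the repository listed under Data Availability); the function name `simulate_missing_auxiliary`, the variable names, the coefficients and the latent-threshold approach to generating missingness are all assumptions chosen for illustration only.

```python
import numpy as np


def simulate_missing_auxiliary(n=10_000, mechanism="MCAR", prop_missing=0.3, seed=0):
    """Hypothetical illustration: exposure x, outcome y, auxiliary variable z,
    with z set to missing under a chosen mechanism."""
    rng = np.random.default_rng(seed)

    # Assumed data-generating model (coefficients are arbitrary).
    x = rng.normal(size=n)            # exposure
    y = 0.5 * x + rng.normal(size=n)  # outcome
    z = 0.7 * y + rng.normal(size=n)  # auxiliary variable, correlated with y

    # What the chance of z being missing depends on.
    if mechanism == "MCAR":
        score = np.zeros(n)  # nothing: missing completely at random
    elif mechanism == "MAR":
        score = x            # a fully observed variable
    elif mechanism == "MNAR":
        score = z            # the value of z itself
    else:
        raise ValueError(f"unknown mechanism: {mechanism}")

    # Latent propensity = score + logistic noise; the highest-propensity
    # records are set to missing so the target proportion is reached.
    latent = score + rng.logistic(size=n)
    missing = latent > np.quantile(latent, 1 - prop_missing)

    z_obs = np.where(missing, np.nan, z)
    return x, y, z, z_obs


if __name__ == "__main__":
    for mech in ("MCAR", "MAR", "MNAR"):
        _, _, z, z_obs = simulate_missing_auxiliary(mechanism=mech)
        print(
            mech,
            "proportion missing:", round(float(np.isnan(z_obs).mean()), 3),
            "full mean of z:", round(float(z.mean()), 3),
            "observed mean of z:", round(float(np.nanmean(z_obs)), 3),
        )
```

Under the MNAR setting the observed values of `z` are a selected subset of its distribution, which is the situation in which the abstract reports that including the auxiliary variable in MI can introduce bias.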
Competing Interest Statement
The authors have declared no competing interest.
Funding Statement
PMD, EC, RAH, RC, KT and JH work in the Medical Research Council Integrative Epidemiology Unit at the University of Bristol, which is supported by the UK Medical Research Council and the University of Bristol (Grant ref: MC_UU_00032/02). EC is supported by the UK Medical Research Council (Grant ref: MR/V020641/1). RAH is supported by a Sir Henry Dale Fellowship that is jointly funded by the Wellcome Trust and the Royal Society (Grant ref: 215408/Z/19/Z). For the purpose of Open Access, the author has applied a CC BY public copyright licence to any Author Accepted Manuscript version arising from this submission. The MRC and Wellcome (Grant refs: 217065/Z/19/Z; 076467/Z/05/Z) and the University of Bristol provide core support for ALSPAC. This publication is the work of the authors, and PMD will serve as guarantor for the contents of this paper. A comprehensive list of grant funding is available on the ALSPAC website (http://www.bristol.ac.uk/alspac/external/documents/grant-acknowledgements.pdf). Linked education records were funded by the Wellcome Trust, the MRC and the Department for Education and Skills (Grant refs: 092731; EOR/SBU/2002/121).
Author Declarations
I confirm all relevant ethical guidelines have been followed, and any necessary IRB and/or ethics committee approvals have been obtained.
Yes
The details of the IRB/oversight body that provided approval or exemption for the research described are given below:
Ethical approval for the applied example (project B4170, searchable on https://proposals.epi.bristol.ac.uk/) was obtained from the ALSPAC Ethics and Law Committee and the Local Research Ethics Committees - http://www.bristol.ac.uk/alspac/researchers/research-ethics/.
I confirm that all necessary patient/participant consent has been obtained and the appropriate institutional forms have been archived, and that any patient/participant/sample identifiers included were not known to anyone (e.g., hospital staff, patients or participants themselves) outside the research group so cannot be used to identify individuals.
Yes
I understand that all clinical trials and any other prospective interventional studies must be registered with an ICMJE-approved registry, such as ClinicalTrials.gov. I confirm that any such study reported in the manuscript has been registered and the trial registration ID is provided (note: if posting a prospective study registered retrospectively, please provide a statement in the trial ID field explaining why the study was not registered in advance).
Yes
I have followed all appropriate research reporting guidelines, such as any relevant EQUATOR Network research reporting checklist(s) and other pertinent material, if applicable.
Yes
Data Availability
The ALSPAC data used in the applied example cannot be shared publicly for ethical reasons. The study website contains details of all available data through a fully searchable data dictionary (http://www.bristol.ac.uk/alspac/researchers/our-data/). The scripts and folder structure used to run the applied example analysis and simulation study can be found online at https://github.com/pmadleydowd/Missing_auxiliary_variables. Datasets for the simulation study are found within this repository.
Abbreviations
- ALSPAC: Avon Longitudinal Study of Parents and Children
- CDF: cumulative distribution function
- CRA: complete records analysis
- DAG: directed acyclic graph
- FCS: fully conditional specification
- FMI: fraction of missing information
- IQ: intelligence quotient
- KS4: key stage 4
- MAR: missing at random
- MCAR: missing completely at random
- MI: multiple imputation
- MNAR: missing not at random