RT Journal Article
SR Electronic
T1 Reducing Information and Selection Bias in EHR-Linked Biobanks via Genetics-Informed Multiple Imputation and Sample Weighting
JF medRxiv
FD Cold Spring Harbor Laboratory Press
SP 2024.10.28.24316286
DO 10.1101/2024.10.28.24316286
A1 Salvatore, Maxwell
A1 Kundu, Ritoban
A1 Du, Jiacong
A1 Friese, Christopher R
A1 Mondul, Alison M
A1 Hanauer, David
A1 Lu, Haidong
A1 Pearce, Celeste Leigh
A1 Mukherjee, Bhramar
YR 2024
UL http://medrxiv.org/content/early/2024/10/29/2024.10.28.24316286.abstract
AB Electronic health records (EHRs) are valuable for public health and clinical research but are prone to many sources of bias, including missing data and non-probability selection. Missing data in EHRs is complex due to potential non-recording, fragmentation, or clinically informative absences. This study explores whether polygenic risk score (PRS)-informed multiple imputation for missing traits, combined with sample weighting, can mitigate missing data and selection biases in estimating disease-exposure associations. Simulations were conducted for missing completely at random (MCAR), missing at random (MAR), and missing not at random (MNAR) conditions under different sampling mechanisms. PRS-informed multiple imputation showed generally lower bias, particularly when combined with sample weighting. For example, in biased samples of 10,000 with exposure and outcome MAR data, PRS-informed imputation had lower percent bias (3.8%) and better coverage rate (0.883) compared to PRS-uninformed (4.5%; 0.877) and complete case analyses (10.3%; 0.784) in covariate-adjusted, weighted, multiple imputation scenarios. In a case study using Michigan Genomics Initiative (n=50,026) data, PRS-informed imputation aligned more closely with a sample-weighted All of Us-derived benchmark than analyses ignoring missing data and selection bias.
Researchers should consider leveraging genetic data and sample weighting to address biases from missing data and non-probability sampling in biobanks.
Competing Interest Statement: The authors have declared no competing interest.
Funding Statement: This work was funded by National Cancer Institute grant P30CA046592 and the Training, Education, and Career Development Graduate Student Scholarship of the University of Michigan Rogel Cancer Center.
Author Declarations: The institutional review board of the University of Michigan Medical School gave ethical approval for this work (HUM00155849).
Patient confidentiality prevents the sharing of data publicly.
However, the data underlying the study’s results are available from the Michigan Genomics Initiative at https://precisionhealth.umich.edu/ourresearch/michigangenomics/ for researchers who meet the criteria for confidential data access. The code used to conduct analyses in this paper is publicly available at https://github.com/maxsal/exprs_imputation.