RT Journal Article SR Electronic T1 Building a Best-in-Class De-identification Tool for Electronic Medical Records Through Ensemble Learning JF medRxiv FD Cold Spring Harbor Laboratory Press SP 2020.12.22.20248270 DO 10.1101/2020.12.22.20248270 A1 Murugadoss, Karthik A1 Rajasekharan, Ajit A1 Malin, Bradley A1 Agarwal, Vineet A1 Bade, Sairam A1 Anderson, Jeff R. A1 Ross, Jason L. A1 Faubion, William A. A1 Halamka, John D. A1 Soundararajan, Venky A1 Ardhanari, Sankar YR 2020 UL http://medrxiv.org/content/early/2020/12/23/2020.12.22.20248270.abstract AB The natural language portions of an electronic health record (EHR) communicate critical information about disease and treatment progression. However, the presence of personally identifying information in this data constrains its broad reuse. In the United States, the Health Insurance Portability and Accountability Act of 1996 (HIPAA) provides a de-identification standard for the removal of protected health information (PHI). Despite continuous improvements in methods for the automated detection of PHI over time, the residual identifiers in clinical notes continue to pose significant challenges - often requiring manual validation and correction that is not scalable to generate the amount of data needed for modern machine learning tools. In this paper, we describe an automated de-identification system that employs an ensemble architecture, incorporating attention-based deep learning models and rule based methods, supported by heuristics for detecting PHI in EHR data. Upon detection of PHI, the system transforms these detected identifiers into plausible, though fictional, surrogates to further obfuscate any leaked identifier. We evaluated the system with a publicly available dataset of 515 notes from the I2B2 2014 de-identification challenge and a dataset of 10,000 notes from the Mayo Clinic. We compared our approach with other existing tools considered best-in-class. The results indicated a recall of 0.992 and 0.994 and a precision of 0.979 and 0.967 on the I2B2 and the Mayo Clinic data, respectively.Competing Interest StatementJeff R. Anderson, John D. Halamka, and William A. Faubion Jr. do not have any conflicts of interest in this project. Bradley Malin is a contracted consultant of the Mayo Clinic. Karthik Murugadoss, Ajit Rajasekharan, Vineet Agarwal, Sairam Bade, Jason L. Ross, Venky Soundararajan, and Sankar Ardhanari are employees of and have a financial interest in nference. Mayo Clinic and nference may stand to gain financially from the successful outcome of the research.Funding StatementNo external fundingAuthor DeclarationsI confirm all relevant ethical guidelines have been followed, and any necessary IRB and/or ethics committee approvals have been obtained.YesThe details of the IRB/oversight body that provided approval or exemption for the research described are given below:This research was conducted with approval from the Mayo Clinic Institutional Review Board 20-003235 - CDAP: Data De-identification Methods Development. All analysis of EHRs was performed in the privacy-preserving environment secured and controlled by the Mayo Clinic. nference and the Mayo Clinic subscribe to the basic ethical principles underlying the conduct of research involving human subjects as set forth in the Belmont Report and strictly ensures compliance with the Common Rule in the Code of Federal Regulations (45 CFR 46) on Protection of Human Subjects. For more information, please visit https://www.mayo.edu/research/institutional-review-board/overviewAll necessary patient/participant consent has been obtained and the appropriate institutional forms have been archived.YesI understand that all clinical trials and any other prospective interventional studies must be registered with an ICMJE-approved registry, such as ClinicalTrials.gov. I confirm that any such study reported in the manuscript has been registered and the trial registration ID is provided (note: if posting a prospective study registered retrospectively, please provide a statement in the trial ID field explaining why the study was not registered in advance).YesI have followed all appropriate research reporting guidelines and uploaded the relevant EQUATOR Network research reporting checklist(s) and other pertinent material as supplementary files, if applicable.Yes2014 I2B2 data is publicly available subject to signed safe-usage and research only. The Mayo EHR clinical notes are not publicly available at this time.