Building a Best-in-Class Automated De-identification Tool for Electronic Health Records Through Ensemble Learning
=================================================================================================================

* Karthik Murugadoss
* Ajit Rajasekharan
* Bradley Malin
* Vineet Agarwal
* Sairam Bade
* Jeff R. Anderson
* Jason L. Ross
* William A. Faubion, Jr.
* John D. Halamka
* Venky Soundararajan
* Sankar Ardhanari

## Abstract

The natural language portions of electronic health records (EHRs) communicate critical information about disease and treatment progression. However, the presence of personally identifiable information (PII) in this data constrains its broad reuse. Despite continuous improvements in methods for the automated detection of PII, the presence of residual identifiers in clinical notes requires manual validation and correction. However, manual intervention is not a scalable solution for large EHR datasets. Here, we describe an automated de-identification system that employs an ensemble architecture, incorporating attention-based deep learning models and rule-based methods, supported by heuristics for detecting PII in EHR data. Upon detection of PII, the system transforms these detected identifiers into plausible, though fictional, surrogates to further obfuscate any leaked identifier. We evaluated the system with a publicly available dataset of 515 notes from the I2B2 2014 de-identification challenge and a dataset of 10,000 notes from the Mayo Clinic. In comparison with other existing tools considered best-in-class, our approach outperforms them with a recall of 0.992 and 0.994 and a precision of 0.979 and 0.967 on the I2B2 and the Mayo Clinic data, respectively. The automated de-identification system presented here can enable the generation of de-identified patient data at the scale required for modern machine learning applications to help accelerate medical discoveries.

## Introduction

The widespread adoption of electronic health records (EHRs) by healthcare systems has enabled digitization of patient health journeys. While the structured elements of EHRs (e.g., health insurance billing codes) have been relied upon to support the business of healthcare and front office applications for decades, the unstructured text (e.g., history & physical notes and pathology reports) contains far richer and more nuanced information about patient care, supporting novel research1–5. However, this text often contains personally identifiable information (PII) as defined in the Health Insurance Portability and Accountability Act of 1996 (HIPAA), such as a personal name, phone number, or residential address6. As a consequence, such data has limited reuse for secondary purposes7.

HIPAA permits data derived from EHRs to be widely shared and used when it is de-identified. Under the HIPAA Privacy Rule, de-identification can be accomplished in several ways. The most straightforward is the Safe Harbor implementation, which necessitates removal of an enumerated list of 18 categories of direct identifiers (e.g., Social Security Number) and quasi-identifiers (e.g., date of service).

Implementing a scalable method for de-identification has several competing requirements. First, from a regulatory perspective, it must achieve extremely high recall, in that it needs to detect nearly all instances of PII. Second, from a clinical utility perspective, it must achieve extremely high precision, so that we maximize the correctness of biomedical research performed. And, third, the approach needs to be cost effective, so that millions of records can be de-identified in a reasonable amount of time. The traditional approach of manual detection of PII is expensive, time consuming and prone to human error8,9, which makes automated de-identification a more promising alternative10,11.

Several recent advances in natural language processing (NLP) have created an opportunity to build accurate and scalable automated de-identification systems. First, transfer learning of autoregressive and autoencoder models12 for a supervised task such as named entity recognition (NER) requires very little labelled data, reducing human effort and error. Second, attention-based deep learning models, such as transformers13, allow for the non-sequential processing of text and enable the generation of rich contextualized word representations. Third, semantic segmentation algorithms generate a subword-based vocabulary14,15 which can capture out-of-vocabulary words. Finally, the traditional transformer architecture has been improved upon through bidirectional encoder representations from transformers (BERT)16 and similar technologies that jointly train a *masked language model* (MLM) pre-training objective and a *next sentence prediction* task. BERT has set the stage for learning context-independent representations of terms in text, and training context-sensitive models that transform those representations into context-aware representations based on the occurrence of a term in a sentence. We leverage these advances to support de-identification, which we formulate as a named entity recognition problem.

In this paper, we integrate a collection of approaches, blending the beneficial aspects of modern deep learning along with rules and heuristics, to create a best-in-class approach to automated de-identification. The system transforms each detected PII instance into a suitable surrogate to mitigate the risk that any residual PII can be used to re-identify patients (**Fig. 1**). The nference de-identification tool can be accessed at [https://academia.nferx.com/deid/](https://academia.nferx.com/deid/).

![Fig. 1.](http://medrxiv.org/https://www.medrxiv.org/content/medrxiv/early/2021/02/23/2020.12.22.20248270/F1.medium.gif)

[Fig. 1.](http://medrxiv.org/content/early/2021/02/23/2020.12.22.20248270/F1) Automated de-identification of EHRs involves two steps: (a) Detecting PII entities and (b) Transforming them by replacement with suitable surrogates.

## Results

We first compare the performance of the nference de-identification system with other methods on the I2B2 2014 dataset17. The resulting models are evaluated using precision, recall and F1-scores (formulation provided in the Supplementary Methods) for NER on several groups of PII as defined in **Table 1**. We then compare the performance of these models on a substantially larger and more diverse dataset from the Mayo Clinic and perform a deeper dive into the types of errors, distribution of errors per physician note and the distribution of errors per note type. It should be noted that this analysis focuses solely on the performance of detecting PII instances and does not address the risk of re-identification based on the semantics of any instances that the system fails to detect, an issue that is beyond the scope of this study.

[Table 1:](http://medrxiv.org/content/early/2021/02/23/2020.12.22.20248270/T1) The list of entities covered by each group of direct and quasi-identifiers. It should be noted that groups B and C encompass entities beyond HIPAA Safe Harbor.

### Performance on the 2014 I2B2 de-identification dataset

The I2B2 2014 De-identification and Heart Disease Risk Factors challenge17 is a publicly available dataset of clinical documents with annotated PII elements. This dataset consists of a training set of 792 clinical notes and a test set of 515 clinical notes. We compared the performance of our approach on the 2014 I2B2 test set with six other established de-identification tools: the method proposed by Dernoncourt et al.
that blends conditional random fields (CRFs) and artificial neural networks (ANNs)18, Scrubber19, Physionet8, Philter20, MIST21 and NeuroNER22. The results are provided in **Table 2**.

First, we cite the conditional random field and artificial neural network approach (CRF+ANN)18 scores against the group A entities (HIPAA only) as reported in their paper. We also directly report the results for Scrubber, Physionet, and Philter from prior publications20 without performing an empirical analysis because the dataset (2014 I2B2) and the set of PII entities are the same as those used in our investigation. We trained MIST using sentences from the I2B2 training corpus (see Supplementary Methods and **Supplementary Table 3**). We downloaded and used a pre-trained model for NeuroNER (see Supplementary Methods). We present the performance of these methods on group B (see **Table 1**) entities, which we use as the basis of our comparison.

[Table 2:](http://medrxiv.org/content/early/2021/02/23/2020.12.22.20248270/T2) Performance of de-identification methods on the 2014 I2B2 test corpus. The results for Scrubber, Physionet, Philter and the CRF+ANN method are based on previous publications. The MIST method required training and, thus, was trained on the 2014 I2B2 training dataset. We used a pre-trained model for NeuroNER. The two versions of the nference approach were fine-tuned on (i) only the Mayo dataset and (ii) both the Mayo and I2B2 datasets.

We present two versions of the nference system. The first version was fine-tuned only on Mayo data and did not utilize any characteristics of the I2B2 training data. When evaluated with group B, this model achieved a precision, recall, and F1 score of 0.961, 0.988, and 0.974, respectively. The second version of our system involved fine-tuning our model with sentences from the I2B2 training set.
We could not incorporate inclusion lists and sentence templates associated with the I2B2 data since the dataset is small (see Methods section for details). The precision, recall, and F1 score increased to 0.979, 0.992, and 0.985, respectively. Precision and recall per identifier type are provided in **Supplementary Table 4**.

### Performance on the Mayo test dataset

The Mayo Clinic dataset consisted of 10,000 randomly sampled notes from a corpus of 104 million notes corresponding to 477,000 patients’ EHR records. The evaluation performed on the Mayo test dataset was based on identifiers defined by group C since this group best represented the distribution of PII in the dataset.

The performance of the de-identification methods (in terms of precision, recall and F1) is presented in **Table 3**. The nference method performed best with precision, recall, and F1 scores of 0.967, 0.994, and 0.979, respectively. Compared to the performance on the I2B2 dataset (precision 0.979, recall 0.992), we see improved recall (increase of 0.002) and a reduced precision value (decrease of 0.012). NeuroNER achieves precision, recall and F1 scores of 0.928, 0.933 and 0.931, respectively. The F1 scores of Scrubber, Physionet and Philter were lower than those achieved on the I2B2 dataset. Among these three methods, Philter demonstrates a relatively high recall of 0.918. Closely following Philter, the MIST model achieves a recall of 0.889 with overall performance similar to that on the I2B2 dataset.

[Table 3:](http://medrxiv.org/content/early/2021/02/23/2020.12.22.20248270/T3) Precision, Recall and F1-Score of various de-identification methods on the Mayo test dataset. These methods were evaluated against group C entities.

### Error analysis on the Mayo dataset

We further investigated cases in the Mayo dataset where the nference de-identification model failed to detect the PII element completely (i.e., false negatives). This occurred at a rate of 0.6% (see **Table 4**).
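As a worked check of how the reported metrics relate to one another, the sketch below assumes the standard definitions of precision, recall, and F1 (the authors' exact formulation is given in their Supplementary Methods, which is not reproduced here). The counts `tp`, `fp`, and `fn` are hypothetical and were chosen only so that recall equals 0.994; they are not the counts from the Mayo test set.

```python
def precision_recall_f1(tp: int, fp: int, fn: int) -> tuple[float, float, float]:
    """Standard token-level metrics from true/false positive and false negative counts."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


def f1_from_scores(precision: float, recall: float) -> float:
    """F1 is the harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)


# Hypothetical counts (assumption): 1,000 true PII tokens, 6 of them missed.
tp, fp, fn = 994, 34, 6
precision, recall, f1 = precision_recall_f1(tp, fp, fn)

# The false negative rate is the complement of recall.
miss_rate = 1 - recall

print(f"precision={precision:.3f} recall={recall:.3f} f1={f1:.3f} miss_rate={miss_rate:.3f}")
print(f"F1 implied by the reported I2B2 scores: {f1_from_scores(0.979, 0.992):.3f}")
```

Under these definitions, a recall of 0.994 corresponds directly to the 0.6% false negative rate quoted above, and the reported I2B2 precision and recall (0.979 and 0.992) imply an F1 of about 0.985, consistent with the value given in the text.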
Across the 10,000 notes considered in the test set, there were 848 instances of false negative errors. Accounting for duplicate occurrences of the same sentence, there were 797 unique error instances. We grouped these instances based on the type of identifier. In **Table 4**, the prevalence of each error category is shown in the second column, while the third column represents the contribution of each category to the error in recall (sums to 0.6%).

[Table 4:](http://medrxiv.org/content/early/2021/02/23/2020.12.22.20248270/T4) Prevalence and examples of types of false negatives encountered by the nference de-identification system when applied on the Mayo test set. The entity highlighted in bold indicates the word or phrase that the system failed to detect.

The most prevalent error was in the recognition of entities pertaining to clinic locations (208 out of 797). Many of these were due to partially identified phrases (e.g., “*Room 7A*” was missed in “*Out of Southwest Building Room 7A*”). The second most prevalent error type was in dates, with 183 false negatives. The third most prevalent error category was in doctor/nurse names and initials, with 169 false negatives. Abbreviations and shorthand used by providers (typically while signing off on a clinical note) contributed to the errors in this category.

Ambiguous instances of PII also resulted in false negatives. These were cases in which a human reader would have difficulty deciding whether the text is PII. An example of this is the word *tp* in the phrase “*Comment: 03-12-2005 08:04:12 - verified tp*”. We found that 26% of errors were those in which the nurse abstractors themselves did not agree on the characterization of PII (Cohen’s Kappa for errors, at 0.7453, was lower than for non-errors), pointing to the inherent ambiguity.

### Distribution of errors per note

We further investigated the rate at which errors in detecting PII (false negatives) occurred on a per note level.
As shown in **Table 5**, the error instances were distributed across 637 notes. Furthermore, we see that a majority of false negatives are spread evenly across the notes (525 out of 637 notes, or 82.4%, contain a single error). For each subsequent error rate, we computed the coverage of PII entities. Here, coverage represents the fraction of PII present in the subset of notes up to the corresponding error rate.

[Table 5:](http://medrxiv.org/content/early/2021/02/23/2020.12.22.20248270/T5) Distribution of number of errors per note. PII coverage represents the fraction of PII present in the subset of notes up to the corresponding error rate. Average number of error types denotes the number of distinct error types (such as date errors or name errors) per note.

Even for notes with a large number of errors (more than six), the number of distinct error types is between two and three. This illustrates that most of the errors are of the same type and are an artifact of repetition of text within a note. For example, in the note with ten errors, eight of the instances were related to location while the remaining two were related to date. Examples of the errors pertaining to location here are “*Location of INR sample : Other: Smallville Other: Smallville Other: Smallville*”, “*Recommend Recheck : Other: 04/01/2017 Smallville Other: 04/01/2017 Smallville*”, and “*Recommend Recheck : Other: 04/01/2017 Smallville Other: 04/01/2017 Smallville Other: 04/01/2017 Other: 04/01/2017 Smallville*”. Here, the set of location errors all pertain to the same location “Smallville”, which illustrates how the effective amount of identifiable content is substantially smaller than suggested by the raw count. The date presented (“04/01/2017”) was successfully detected. Both the date and location have been replaced with synthetic values for the purpose of this example.

### Distribution of note types

In the Mayo test set, a physician note is associated with a note type (e.g.
progress note, emergency visit, telephone encounter). Given that the structure and semantics of these note types vary greatly from each other, we analyze the enrichment of errors across them. From the 637 notes with errors, we found 134 distinct note types with at least 1 error. The top 14 note types with the highest error content are listed in **Table 6**. Notes of the type “Anti Coag Service Visit Summary” contain the highest rate of errors (22 out of 26 sampled notes) followed by “Electrocardiogram” (19 out of 30 sampled notes).

[Table 6:](http://medrxiv.org/content/early/2021/02/23/2020.12.22.20248270/T6) Distribution of number of errors per note type. The proportion of sampled notes for a given type that contain at least one error is presented in the last column. This indicates in which note type an error is more likely to occur.

## Methods

### Usage of Mayo Clinic Dataset

The Mayo EHR dataset is based on data from 477,000 patients that originated from multiple EHR data systems (including Epic and Cerner) spanning over 20 years. The dataset includes 104 million physician notes that capture the healthcare journey of patients in addition to structured tables containing lab test measurements, diagnosis information, orders, and medicine administration records. This research was conducted with approval from the Mayo Clinic Institutional Review Board.

We randomly sampled 10,000 notes, which were reduced to the set of unique sentences. This yielded a test set of 172,102 sentences. These were subsequently annotated by six Mayo Clinic nurse abstractors to create a ground truth label for every word and/or phrase. Each sentence was annotated by at least two different nurse abstractors. The inter-annotator agreement on labelling a token as PII had a Cohen’s Kappa of 0.9694 (see Supplementary Methods for details). An additional set of 10,000 notes was selected to fine-tune the models.
We manually annotated 61,800 unique sentences from these notes to create a tagged fine-tuning set. See Supplementary Methods for more details.

### Detection of PII entities

The ensemble architecture described in this section leverages state-of-the-art attention-based deep learning models in conjunction with rules harvested from the data (each of which is described below) to handle semi-structured text (**Fig. 2**).

![Fig. 2.](http://medrxiv.org/https://www.medrxiv.org/content/medrxiv/early/2021/02/23/2020.12.22.20248270/F2.medium.gif)

[Fig. 2.](http://medrxiv.org/content/early/2021/02/23/2020.12.22.20248270/F2) Sentence-based inclusion lists and template matching prune out sentences that either 1) lack PII or 2) contain PII in specific well-defined patterns. An ensemble of attention-based neural networks identifies complementary features across different PII types. For each entity type, multiple model versions (*v1*, *v2*, ..., *vN*) are used in tandem. Additionally, pattern recognition modules and structured EHR content from matched patients support the anonymization process. The results from each component of the ensemble are aggregated to yield the original note labelled with PII tags.

There are several salient features of this approach that are worth noting.

### Hybrid Deep Learning Models

The newer breed of attention-based deep learning models, in conjunction with transfer learning, allows for faster tuning of these models with significantly smaller sets of labeled data for detecting PII identifiers. We use pre-trained language models based on the BERT16 architecture that are then fine-tuned for detecting (a) personal names, (b) organizations, (c) locations, and (d) ages. We employed the bert-base-cased model ([https://huggingface.co/bert-base-cased](https://huggingface.co/bert-base-cased)) through the HuggingFace/Transformers ([https://github.com/huggingface/transformers](https://github.com/huggingface/transformers)) library.
This is a case-sensitive English language pre-trained model based on the BERT architecture and trained using a masked language modelling (MLM) objective. The fine-tuning process involves training the pre-trained language model on a named entity recognition task using a training set of annotated sentences. We used a total of 61,800 tagged example sentences to fine-tune the models. We fine-tuned each transformer model with a maximum sequence length of 256 (after tokenization) over 4 epochs. We use a training batch size of 32 and a learning rate of 5e-5 with a warmup proportion of 0.4. We then evaluated the model on a validation dataset and computed the accuracy. We performed the fine-tuning and model validation processes in an iterative manner (see Supplementary Methods and **Supplementary Table 1** for complete implementation details).

Identifiers such as names, locations, organizations and ages are well suited to a statistical entity recognition method because such a method can use the context of the surrounding text to disambiguate the entity type of a word. By contrast, pattern matching rules are significantly hampered in this respect. It would be hard, for instance, to detect “Glasgow” as a medical term in “He had no helmet and his Glasgow Score was 6” and as a location in “Mr. Smith had visited his family in Glasgow” using lookup dictionaries.

However, we use patterns to deterministically tag reasonably well-defined PII identifiers, which are almost entirely context independent and unambiguous. This category includes dates and times, phone and pager numbers, clinical IDs and numeric identifiers, email, URLs, IP addresses, and vehicle numbers. In addition, harvested sentence templates (described further below) are relied upon to deterministically tag PII instances matched by the template patterns. Our methods apply to content in both structured (e.g., lab comments) and free form text (e.g., progress notes).
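To make the idea of deterministic pattern-based tagging concrete, here is a minimal sketch using Python's standard `re` module. The regular expressions, the `PATTERNS` table, and the `tag_patterns` helper are illustrative assumptions covering only a few of the identifier types listed above; they are not the rules used by the system described in this paper, and real clinical text would need far broader coverage.

```python
import re

# Illustrative patterns only (assumption): a handful of context-independent identifier types.
PATTERNS = {
    "DATE": re.compile(r"\b\d{1,2}[/-]\d{1,2}[/-]\d{2,4}\b"),
    "TIME": re.compile(r"\b\d{1,2}:\d{2}(?::\d{2})?(?:\s?[AP]M)?\b", re.IGNORECASE),
    "PHONE": re.compile(r"\b\d{3}[-. ]\d{3}[-. ]\d{4}\b"),
    "EMAIL": re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+"),
}


def tag_patterns(text: str) -> list[tuple[int, int, str]]:
    """Return (start, end, label) character spans for every pattern match, sorted by position."""
    spans = []
    for label, pattern in PATTERNS.items():
        for match in pattern.finditer(text):
            spans.append((match.start(), match.end(), label))
    return sorted(spans)


# Synthetic example sentence; all values are made up.
sample = "Electronically signed on 01/02/1980 at 12:12 PM CST, call 507-555-0100"
tagged = [(sample[start:end], label) for start, end, label in tag_patterns(sample)]
print(tagged)
```

Because such identifiers have a fixed surface form, a match can be tagged without looking at the surrounding words, which is exactly what makes the same approach unsuitable for ambiguous terms such as “Glasgow”.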
Additionally, it should be noted that we designed our method to detect and transform information about those who provide care, such as physicians, nurses, and pharmacies. Though this is not required by HIPAA Safe Harbor, it allows healthcare organizations to protect the identities of their employees as well.

### Ensemble of models framework and iterative fine tuning

Given the regulatory necessity of extremely high recall for de-identification, we aggregate the results of multiple models trained for the same PII type. Our ensemble involved employing at least one individual model for names, organizations, locations and ages (see **Supplementary Table 2**). An additional text normalized model was also trained and utilized for names. In this respect, if a term is detected as PII by any of the models for that type, then it is tagged.

A divide and conquer approach has been implemented that harnesses the power of multiple models to identify PII or extract meaningful entities (**Fig. 2**). In contrast to a “one size fits all” model, this approach enables each individual model to be fine-tuned to learn different (and complementary) features of the unstructured EHR data, as has been done in prior de-identification systems23. For instance, one model focuses on identifying people’s names while another is geared towards addresses and locations. Furthermore, there are additional models corresponding to cased and uncased variants of the raw data (referred to as “*Name Model 1*” and “*Name Model 2*” in **Fig. 2**). Each model here corresponds to an attention-based deep neural network.

One advantage of carving out the entity space to be handled individually by separate models is that each model needs to learn only the distribution of entities of a specific type as opposed to all entities. However, this introduces a challenge in resolving terms in a sentence that have conflicting and/or ambiguous entity types.
These conflicts are resolved in the aggregation phase of our ensemble, where a simple voting threshold of one claim is employed (i.e., an entity is considered PII even if only one model in the system tags it as such). Since the majority of the components in the ensemble are designed to detect complementary features, we are able to improve recall without much loss of precision.

### Integrating databases as part of core model

We use publicly available databases of names, locations, and addresses to supplement the model fine-tuning process. First names with supporting gender information were downloaded from the US Census database. Cities across the US as well as lists of hospitals were obtained from Wikipedia. These public databases were used to augment training of our models. In addition, patient-specific information from structured EHRs, including patient names and residential addresses, is used to augment the model training and to match against PII in the text.

### Sentence-based inclusion list

Clinical note corpora contain a large number of repeated sentences. These stem from various processes, including automated reminders (e.g., “*Please let your doctor know if you have problems taking your medications*”), repeated phrases in the writing style of physicians (e.g., “*Rubella: Yes*”, “*Pain symptoms: No*”) or shared elements in the clinical notes such as section headers (e.g., “*History of Present Illness*”). From the corpus of physician notes from the Mayo Clinic, a set of 1,600 sentences that did not contain PII was incorporated into an “inclusion list”. This inclusion list was further expanded with a set of 25,000 sentences containing medically relevant entities, such as disease or drug names (see Supplementary Methods for details on how the inclusion list was constructed). This has the added benefit of improving the precision of the de-identification system because it reduces the risk of the neural network models misclassifying these important entities as PII.
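The interplay between the inclusion list and the voting threshold of one can be sketched as follows. This is a simplified illustration under stated assumptions, not the authors' implementation: the function `detect_pii`, the `Span` and `Tagger` types, and the two toy taggers (plain string lookups standing in for fine-tuned models) are all hypothetical.

```python
from typing import Callable, Iterable

Span = tuple[int, int, str]  # (start, end, label) in character offsets
Tagger = Callable[[str], Iterable[Span]]


def detect_pii(sentence: str, taggers: list[Tagger], inclusion_list: set[str]) -> set[Span]:
    """Skip sentences known to be free of PII; otherwise take the union of all taggers' spans."""
    if sentence.strip() in inclusion_list:
        return set()
    spans: set[Span] = set()
    for tagger in taggers:
        # Voting threshold of one: a span proposed by any single tagger is kept.
        spans.update(tagger(sentence))
    return spans


def make_toy_tagger(term: str, label: str) -> Tagger:
    """Hypothetical stand-in for a fine-tuned model: tags the first occurrence of a fixed term."""
    def tagger(sentence: str) -> list[Span]:
        start = sentence.find(term)
        return [(start, start + len(term), label)] if start >= 0 else []
    return tagger


taggers = [make_toy_tagger("Smith", "NAME"), make_toy_tagger("Glasgow", "LOCATION")]
inclusion_list = {"History of Present Illness", "Rubella: Yes"}

sentence = "Mr. Smith had visited his family in Glasgow"
found = detect_pii(sentence, taggers, inclusion_list)
print(sorted((sentence[start:end], label) for start, end, label in found))
```

Taking the union favors recall, since a span survives if any component proposes it, while the inclusion list protects precision by keeping known PII-free sentences away from the models altogether.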
Additionally, sentences marked as being devoid of PII during the validation phase in the iterative fine-tuning process are also added to the inclusion list (see Supplementary Methods).

### Auto-Generating templates using statistical NER models

In addition to exact sentences with high prevalence, there are also a large number of PII-containing sentences that can be mapped to a template (e.g., *“Electronically signed by: SMITH, JOHN C on 01/02/1980 at 12:12 PM CST”* maps to a template of the form “*Electronically signed by: , on at