Scaling text de-identification using locally augmented ensembles

Karthik Murugadoss; Saivikas Killamsetty; Deeksha Doddahonnaiah; Nakul Iyer; Michael Pencina; Jeffrey Ferranti; John Halamka; Bradley A. Malin; Sankar Ardhanari

doi:10.1101/2024.06.20.24308896

Abstract

The natural language text in electronic health records (EHRs), such as clinical notes, often contains information that is not captured elsewhere (e.g., degree of disease progression and responsiveness to treatment) and is thus invaluable for downstream clinical analysis. However, to make such data available for broader research purposes in the United States, personally identifiable information (PII) is typically removed from the EHR in accordance with the Privacy Rule of the Health Insurance Portability and Accountability Act (HIPAA). Automated de-identification systems that mimic human accuracy in identifier detection can enable access, at scale, to more diverse de-identified datasets, thereby fostering robust findings in medical research to advance patient care.

The best-performing of these systems employ language models that require time and effort to retrain or fine-tune on newer datasets in order to achieve consistent results, and to revalidate on older datasets. Hence, there is a need to adapt text de-identification methods to datasets across health institutions. Given the success of foundational large language models (LLMs), such as ChatGPT, in a wide array of natural language processing (NLP) tasks, they seem a natural fit for identifying PII across varied datasets.

In this paper, we introduce locally augmented ensembles, which adapt an existing PII detection ensemble method trained at one health institution to other institutions. The adaptation uses institution-specific dictionaries to capture location-specific PII and to recover medically relevant information that was previously misclassified as PII. We augment an ensemble model created at Mayo Clinic and test it on a dataset of 15,716 clinical notes at Duke University Health System. We further compare the task-specific, fine-tuned ensemble against LLM-based prompt engineering solutions on the 2014 i2b2 and 2003 CoNLL NER datasets for prediction accuracy, speed, and cost.
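The dictionary-based augmentation described above can be illustrated with a minimal sketch. It is not the authors' implementation: the `Span` type, the `augment` function, the matching rules, and the example terms are all hypothetical, and the base ensemble's predictions are simply passed in as a list. The sketch assumes two institution-specific dictionaries, one mapping local PII terms to labels and one listing medical terms that should not be redacted.

```python
import re
from dataclasses import dataclass


@dataclass(frozen=True)
class Span:
    start: int  # inclusive character offset
    end: int    # exclusive character offset
    label: str


def find_terms(text, terms):
    """Yield (start, end, term) for case-insensitive whole-word matches."""
    for term in terms:
        pattern = r"(?<!\w)" + re.escape(term) + r"(?!\w)"
        for m in re.finditer(pattern, text, flags=re.IGNORECASE):
            yield m.start(), m.end(), term


def augment(text, base_spans, local_pii, medical_allowlist):
    """Hypothetical post-processing of a base ensemble's PII spans.

    1. Drop base spans that lie entirely inside an allowlisted medical term.
    2. Add spans for local PII terms not already covered by a kept span.
    """
    allowed = [(s, e) for s, e, _ in find_terms(text, medical_allowlist)]
    kept = [
        sp for sp in base_spans
        if not any(s <= sp.start and sp.end <= e for s, e in allowed)
    ]
    for s, e, term in find_terms(text, local_pii):
        inside_allowed = any(a <= s and e <= b for a, b in allowed)
        overlaps = any(s < sp.end and sp.start < e for sp in kept)
        if not inside_allowed and not overlaps:
            kept.append(Span(s, e, local_pii[term]))
    return sorted(kept, key=lambda sp: sp.start)


# Illustrative input: the base model (assumed) flags "Parkinson" as a name
# and misses a local street name.
text = "Seen at Erwin Road clinic for Parkinson's disease follow-up."
name_start = text.index("Parkinson")
base_spans = [Span(name_start, name_start + len("Parkinson"), "NAME")]
result = augment(
    text,
    base_spans,
    local_pii={"Erwin Road": "LOCATION"},
    medical_allowlist={"Parkinson's disease"},
)
```

In this example, `result` contains a single `LOCATION` span for the street name, and the disease name is no longer marked as PII. Keeping the dictionaries outside the model is what would let an institution adapt the system without retraining, under the assumptions stated above.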

On the Duke notes, our approach achieves a recall of 0.996 and a precision of 0.982, compared with 0.989 and 0.979, respectively, without the augmentation. Our results indicate that LLMs may require significant prompt engineering effort to reach the levels attained by ensemble approaches. Further, given the current state of technology, LLMs are at least 3 times slower and 5 times more expensive to operate than the ensemble approach.
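To put the reported recall values in perspective, the short sketch below shows how precision and recall follow from match counts and how a change in recall translates into a change in the share of missed identifiers. The function names are hypothetical, and the relative reduction is a figure derived here from the reported recalls, not one stated by the authors.

```python
def precision_recall(tp, fp, fn):
    """Precision and recall from true-positive, false-positive, false-negative counts."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall


def relative_miss_reduction(recall_before, recall_after):
    """Fraction by which the missed-identifier rate (1 - recall) shrinks."""
    miss_before = 1.0 - recall_before
    miss_after = 1.0 - recall_after
    return (miss_before - miss_after) / miss_before


# Reported recall on the Duke notes: 0.989 without and 0.996 with augmentation.
reduction = relative_miss_reduction(0.989, 0.996)
print(f"Relative reduction in missed identifiers: {reduction:.1%}")
```

Under this reading, the miss rate falls from 1.1% to 0.4% of identifiers, a relative reduction of roughly 64%.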

Competing Interest Statement

M.P., J.F., and J.H. do not have any competing interests in this project. B.M. is a contracted consultant of Mayo Clinic and Duke University Health System. The authors on this article from nference have a financial interest in nference. A patent application has been submitted by K.M. and S.A. Mayo Clinic, Duke University Health System and nference may stand to gain financially from the successful outcome of the research.

Funding Statement

No external funding.

Author Declarations

I confirm all relevant ethical guidelines have been followed, and any necessary IRB and/or ethics committee approvals have been obtained.

Yes

The details of the IRB/oversight body that provided approval or exemption for the research described are given below:

All work was carried out in compliance with Mayo Clinic and Duke University Health System IRB-approved protocols. The Mayo Clinic and Duke IRBs reviewed and approved this research study protocol to ensure that it complies with applicable regulations, meets commonly accepted ethical standards, follows institutional policies, and adequately protects research participants.

I confirm that all necessary patient/participant consent has been obtained and the appropriate institutional forms have been archived, and that any patient/participant/sample identifiers included were not known to anyone (e.g., hospital staff, patients or participants themselves) outside the research group so cannot be used to identify individuals.

Yes

I understand that all clinical trials and any other prospective interventional studies must be registered with an ICMJE-approved registry, such as ClinicalTrials.gov. I confirm that any such study reported in the manuscript has been registered and the trial registration ID is provided (note: if posting a prospective study registered retrospectively, please provide a statement in the trial ID field explaining why the study was not registered in advance).

Yes

I have followed all appropriate research reporting guidelines, such as any relevant EQUATOR Network research reporting checklist(s) and other pertinent material, if applicable.

Yes

Data Availability

- The 2014 i2b2 dataset and the CoNLL 2003 dataset are publicly available datasets subject to signed safe usage for research purposes.
- The Duke and Mayo clinical notes are not publicly available at this time.