Abstract
Background Medical research with real-world clinical data can be challenging due to privacy requirements. Ideally, patient data are handled in a fully pseudonymised or anonymised way. However, this can make it difficult for medical researchers to access and analyze large datasets or to exchange data between hospitals. De-identifying medical free text is particularly difficult due to the diverse documentation styles and the unstructured nature of the data. However, recent advancements in natural language processing (NLP), driven by the development of large language models (LLMs), have revolutionized the ability to extract information from unstructured text.
Methods We hypothesize that LLMs are highly effective tools for extracting patient-related information, which can subsequently be used to de-identify medical reports. To test this hypothesis, we conduct a benchmark study using eight locally deployable LLMs (Llama-3 8B, Llama-3 70B, Llama-2 7B, Llama-2 70B, Llama-2 7B “Sauerkraut”, Llama-2 70B “Sauerkraut”, Mistral 7B, and Phi-3-mini) to extract patient-related information from a dataset of 100 real-world clinical letters. We then remove the identified information using our newly developed LLM-Anonymizer pipeline.
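The two-step idea described in the Methods (extract patient-related strings first, then remove them from the text) can be illustrated with a minimal sketch. It is not the LLM-Anonymizer implementation: the function name `redact`, the mask character, and the example letter are hypothetical, and the list `extracted` merely stands in for whatever an LLM would return for a given letter.

```python
import re

def redact(text, identifiers, mask="█"):
    """Mask every occurrence of each extracted identifier, preserving text length."""
    # Process longer strings first so a full name is masked before its parts.
    for item in sorted(set(identifiers), key=len, reverse=True):
        if not item.strip():
            continue  # ignore empty extractions
        # re.escape treats the identifier literally (dots in dates, etc.).
        pattern = re.compile(re.escape(item), flags=re.IGNORECASE)
        text = pattern.sub(lambda m: mask * len(m.group(0)), text)
    return text

# Fictitious letter; `extracted` is a stand-in for the LLM output (assumption).
letter = "Patient Anna Schmidt, born 01.02.1960, was admitted on 03.04.2023."
extracted = ["Anna Schmidt", "01.02.1960"]
redacted = redact(letter, extracted)
print(redacted)
```

Masking with one character per original character keeps the text length unchanged, which makes a character-level comparison against a manual annotation straightforward.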
Results Our results demonstrate that the LLM-Anonymizer, when used with Llama-3 70B, achieved a success rate of 98.05% in removing text characters carrying personal identifying information. When evaluating the performance in relation to the number of characters manually identified as containing personal information and identifiable characteristics, our system missed only 1.95% of personal identifying information and erroneously redacted only 0.85% of the characters.
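As an illustration of how such character-level rates can be computed, the sketch below compares a manually annotated mask with a predicted redaction mask. It rests on one assumption about the wording above: all three rates, including the erroneous-redaction rate, are taken relative to the number of characters manually marked as personal information. The function name `char_metrics` and the toy masks are hypothetical.

```python
def char_metrics(gold_mask, pred_mask):
    """Character-level redaction rates relative to manually annotated characters.

    gold_mask[i] is True if character i was manually marked as personal information;
    pred_mask[i] is True if the pipeline redacted character i.
    """
    if len(gold_mask) != len(pred_mask):
        raise ValueError("masks must cover the same characters")
    removed = sum(g and p for g, p in zip(gold_mask, pred_mask))
    missed = sum(g and not p for g, p in zip(gold_mask, pred_mask))
    wrongly_redacted = sum(p and not g for g, p in zip(gold_mask, pred_mask))
    gold_total = removed + missed
    if gold_total == 0:
        raise ValueError("no annotated personal information in this text")
    return {
        "success_rate": removed / gold_total,
        "miss_rate": missed / gold_total,
        # Assumption: same denominator as the other two rates.
        "erroneous_redaction_rate": wrongly_redacted / gold_total,
    }

# Toy example: 4 annotated characters, 3 removed, 1 missed, 1 extra redaction.
gold = [True] * 4 + [False] * 6
pred = [True] * 3 + [False] + [True] + [False] * 5
metrics = char_metrics(gold, pred)
print(metrics)
```

Under this definition the success rate and the miss rate sum to one, which matches the reported pair of 98.05% and 1.95%.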
Conclusion We provide our full LLM-based Anonymizer pipeline under an open-source license with a user-friendly web interface that operates on local hardware and requires no programming skills. This tool has the potential to significantly facilitate medical research by enabling the secure and efficient de-identification of clinical free-text data on-premises, thereby addressing key challenges in medical data sharing.
Competing Interest Statement
JNK declares consulting services for Owkin, France; DoMore Diagnostics, Norway; Panakeia, UK; AstraZeneca, UK; Scailyte, Switzerland; Mindpeak, Germany; and MultiplexDx, Slovakia. Furthermore, he holds shares in StratifAI GmbH, Germany, has received a research grant from GSK, and has received honoraria from AstraZeneca, Bayer, Eisai, Janssen, MSD, BMS, Roche, Pfizer and Fresenius. ICW has received honoraria from AstraZeneca. CBW has received honoraria from Amgen, Bayer, BMS, Chugai, Celgene, Falk, GSK, Janssen, Ipsen, MSD, Merck, Roche, Servier, SIRTeX and Taiho; has served on advisory boards for Bayer, BMS, Celgene, GSK, Incyte, Janssen, Lilly, MSD, Servier, Shire/Baxalta, Rafael Pharmaceuticals, RedHill and Roche; and has received travel support from Bayer, Celgene, Janssen, RedHill, Roche, Servier and Taiho and research grants (institutional) from Roche. Furthermore, he serves as faculty for the European Society of Medical Oncology (ESMO), Deutsche Krebshilfe (DKH) and Arbeitsgemeinschaft internistische Onkologie (AIO), is a member of the EU Commission expert group Mission Board for cancer, and is a member of the forum #Zukunftsstrategie of the German government. MW declares honoraria from Lilly, Boehringer Ingelheim, SYNLAB, Janssen, Merck Serono, GWT, Amgen and Novartis; consulting services for Bristol-Myers Squibb, Novartis, Lilly, Boehringer Ingelheim, ISA Pharmaceuticals, Amgen, immatics, Bayer and ImCheck therapeutics; research funding from Roche; and travel support from Pfizer, Bristol-Myers Squibb, AstraZeneca, Amgen, GEMoaB, Sanofi/Aventis, immatics and Merck Serono. No other potential conflicts of interest are declared by any of the authors.
Funding Statement
This study did not receive any funding.
Author Declarations
I confirm all relevant ethical guidelines have been followed, and any necessary IRB and/or ethics committee approvals have been obtained.
Yes
The details of the IRB/oversight body that provided approval or exemption for the research described are given below:
Ethics committee of Technical University Dresden, reference number BO-EK-400092023.
I confirm that all necessary patient/participant consent has been obtained and the appropriate institutional forms have been archived, and that any patient/participant/sample identifiers included were not known to anyone (e.g., hospital staff, patients or participants themselves) outside the research group so cannot be used to identify individuals.
Yes
I understand that all clinical trials and any other prospective interventional studies must be registered with an ICMJE-approved registry, such as ClinicalTrials.gov. I confirm that any such study reported in the manuscript has been registered and the trial registration ID is provided (note: if posting a prospective study registered retrospectively, please provide a statement in the trial ID field explaining why the study was not registered in advance).
Yes
I have followed all appropriate research reporting guidelines, such as any relevant EQUATOR Network research reporting checklist(s) and other pertinent material, if applicable.
Yes
Data Availability
The data used in this study contain sensitive patient information and will therefore not be published. We provide fictitious examples of clinical letters in the Supplement.