Abstract
Named Entity Recognition (NER) is highly relevant in the clinical field because it allows information such as diagnoses or medical procedures to be extracted from unstructured data (doctor’s letters, vignettes, etc.) and coded according to international classification systems. Language models should therefore be trained to recognize and classify these items accurately. Although Large Language Models (LLMs) like ChatGPT can recognize medical entities in texts, they do not perform this task reliably. Whereas a variety of resources exist to assist with this task in English, other languages, such as German, lack appropriate language models. This study presents a methodology for generating high-quality, fully synthetic datasets and the implementation of a workflow for identifying and classifying diseases, co-diseases, and medical procedures in clinical narratives in oncology.
Competing Interest Statement
FEM works for QuiBiQ GmbH. JGDO works for PerMerdiQ GmbH and QuiBiQ GmbH.
Funding Statement
This project was supported by the Ministry for Economics, Labor and Tourism of Baden-Wuerttemberg, Germany, via grant agreement number BW1_1456 (AI4MedCode).
Author Declarations
I confirm all relevant ethical guidelines have been followed, and any necessary IRB and/or ethics committee approvals have been obtained.
Yes
I confirm that all necessary patient/participant consent has been obtained and the appropriate institutional forms have been archived, and that any patient/participant/sample identifiers included were not known to anyone (e.g., hospital staff, patients or participants themselves) outside the research group so cannot be used to identify individuals.
Yes
I understand that all clinical trials and any other prospective interventional studies must be registered with an ICMJE-approved registry, such as ClinicalTrials.gov. I confirm that any such study reported in the manuscript has been registered and the trial registration ID is provided (note: if posting a prospective study registered retrospectively, please provide a statement in the trial ID field explaining why the study was not registered in advance).
Yes
I have followed all appropriate research reporting guidelines, such as any relevant EQUATOR Network research reporting checklist(s) and other pertinent material, if applicable.
Yes
Data Availability
All data produced in the present study, which are fully synthetic, are available upon reasonable request to the authors.