Abstract
Advancing health interoperability can significantly benefit health research, including phenotyping, clinical trial support, and public health surveillance. Federal agencies, including ONC, CDC, and CMS, have been collaborating to promote interoperability by adopting Fast Healthcare Interoperability Resources (FHIR). However, the heterogeneous structures and formats of health data present challenges when transforming Electronic Health Record (EHR) data into FHIR resources. This challenge becomes more significant when critical health information is embedded in unstructured data rather than well-organized structured formats. Previous studies relied on multiple separate rule-based or deep learning-based NLP tools to complete the FHIR resource transformation, an approach that demands substantial development costs, extensive training data, and meticulous integration of the individual tools. In this study, we assessed the ability of large language models (LLMs) to transform clinical narratives into HL7 FHIR resources. We developed FHIR-GPT specifically for the transformation of clinical texts into FHIR medication statement resources. In our experiments using 3,671 snippets of clinical texts, FHIR-GPT demonstrated an exact match rate of over 90%, surpassing the performance of existing methods. FHIR-GPT improved the exact match rates of existing NLP pipelines by 3% for routes, 12% for dose quantities, 35% for reasons, 42% for forms, and over 50% for timing schedules. Our findings provide the foundations for leveraging LLMs to enhance health data interoperability. Future studies will aim to build upon these successes by extending the generation to additional FHIR resources.
Introduction
Interoperability enhances the ability of healthcare providers to deliver safe, effective, and patient-focused care. It also offers novel avenues for individuals and caregivers to access electronic health data for care coordination and management1. The promotion of interoperability has become an integral aspect of various health initiatives, spanning from ensuring health equity to responding to public health emergencies2. Federal agencies, including the Office of the National Coordinator of Health IT (ONC)1, the Centers for Disease Control and Prevention (CDC)3, and the Centers for Medicare & Medicaid Services (CMS)4, collaborate to promote interoperability through the adoption of Fast Healthcare Interoperability Resources (FHIR), a next-generation interoperability standard developed by the Health Level 7 (HL7®) standards development organization5. FHIR is specifically designed to facilitate the swift and efficient exchange of health data. FHIR has seen growing adoption in the modeling and integration of both structured and unstructured data for various health research purposes. Its applications range from developing computational phenotyping6–8 to supporting clinical trials9–12 and building surveillance systems13,14. We refer readers to two review papers15,16 for further insights into FHIR applications.
Transforming health data into the FHIR format presents a challenge, as health organizations each have their own infrastructure, standards, and formats for generating, storing, and organizing health data17. This challenge becomes more significant when critical health information is embedded in unstructured data rather than well-organized structured formats. Both the academic and commercial sectors have worked on transforming unstructured data into FHIR resources. In academic research, Hong et al.18 integrated clinical NLP tools, including cTAKES19, MedXN20, and MedTime20, to extract clinical entities from corresponding document sections and standardize them into FHIR resources. Wang et al. developed Opioid2FHIR21, a system that employs multiple deep learning-based natural language processing (NLP) techniques for opioid information extraction and normalization. In the commercial domain, Google Cloud has released the Healthcare Natural Language API22, capable of converting medical text input into FHIR resources. Amazon Comprehend Medical23 can extract and normalize medical concepts into clinical vocabulary, although it lacks the ability to map all extracted information to FHIR resources. Azure Health Data24 is proficient at converting semi-structured data into FHIR resources but does not handle free-text unstructured input. All the above FHIR transformation tools chain multiple NLP tools in sequence. These include a Named Entity Recognition (NER) tool for extracting medical concepts, a relation extraction tool for identifying relations related to a target concept, a normalization tool for standardizing the extracted concepts into vocabularies, and a reconciliation tool for integrating the normalized concepts into a valid FHIR format. The development and training of each NLP tool is resource-intensive and demands a significant amount of time and data.
Creating a pipeline that integrates multiple NLP tools requires substantial computational resources, annotated data, and human effort. Furthermore, errors introduced at each stage propagate to the next, so the accuracy of the conversion decreases as the transformation progresses along the pipeline.
Therefore, we propose harnessing pre-trained large language models (LLMs) to streamline the existing approach, which relies on a pipeline of multiple NLP tools, and to facilitate the transformation of free-text input into FHIR resources. Our contributions can be summarized as follows:
- We manually annotated a dataset containing 3,671 snippets extracted from discharge summaries, along with their corresponding transformed MedicationStatement resources. To the best of our knowledge, this represents the largest and cleanest human-annotated dataset of free-text to FHIR resource transformation pairs.
- We demonstrated that LLMs, especially FHIR-GPT, are able to outperform the existing NLP methods in transforming free text into FHIR resources when evaluated by the exact match rate.
Results
The annotation results are presented in Table 1. In summary, we annotated a total of 3,671 pairs of free-text to FHIR MedicationStatement resource transformations. The free-text input was derived from discharge summaries for 280 admissions. The input snippets average approximately 66 characters in length, with a relatively high standard deviation of 65. The annotated resources encompass 625 distinct medications in 26 different forms and are associated with 354 different reasons, as well as 16 administration routes. These elements display varying levels of availability, ranging from approximately 30% for reasons to 65% for timing schedules. SNOMED CT is the most commonly used terminology system and was applied to medication, form, route, and reason, while HL7’s own code set was used for timing schedules. The annotated resources in JSON format have an average of 58.2 objects (standard deviation = 16.2) and an average depth of 6.7 (standard deviation = 0.5).
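The structural statistics above (number of objects and nesting depth) can be illustrated with a short sketch. The counting convention here is an assumption, since the text does not define it: every JSON object (dictionary) counts as one object, and depth is the deepest nesting of containers (objects or arrays). The helper name json_stats and the sample resource are hypothetical.

```python
import json

def json_stats(node):
    """Return (object_count, max_depth) for a parsed JSON value.

    Assumed convention (not taken from the paper): each dict counts as one
    object, and both dicts and lists add one level of depth.
    """
    if isinstance(node, dict):
        children, own = list(node.values()), 1
    elif isinstance(node, list):
        children, own = node, 0
    else:
        return 0, 0  # scalars contain no objects and add no depth

    count, depth = own, 0
    for child in children:
        child_count, child_depth = json_stats(child)
        count += child_count
        depth = max(depth, child_depth)
    return count, depth + 1

# Hypothetical, heavily truncated resource used only to exercise the helper.
sample = '{"resourceType": "MedicationStatement", "reason": [{"concept": {"text": "headache"}}]}'
objects, depth = json_stats(json.loads(sample))
print(objects, depth)
```

Averaging these two values over all annotated resources would give summary figures of the kind reported in the paragraph above, although a different counting convention would shift the numbers.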
The transformation results are presented in Table 2. In summary, transformation with GPT-4, namely FHIR-GPT, achieved an exact match rate of over 0.90 for all elements, outperforming both baseline models and all other LLMs. Specifically, when compared to existing NLP pipelines, FHIR-GPT improved the exact match rate by 3% for routes, 12% for dose quantities, 35% for reasons, 42% for forms, and over 50% for timing schedules. Across LLMs, accuracy tended to increase with parameter count: GPT-4, reported to have approximately 1.7 trillion parameters, outperformed the 180-billion-parameter Falcon model and the 70-billion-parameter Llama-2 model. Among all elements, the most challenging ones for both LLMs and existing methods are timing schedules and reasons. Timing schedules, consisting of 10 objects, require calculations and inferences (e.g., inferring the duration based on frequency and distribution), while reasons involve relationship extraction and handling cardinality, as a medication can be taken for more than one reason.
Methods
This section describes the technical details of data annotation, LLM usage, and the evaluation process. The workflow is illustrated in Figure 1.
Data Annotation
The HAPI FHIR public test server25 hosts millions of examples of converted FHIR resources. However, we are unable to retrieve the source data from which they were converted. To the best of our knowledge, there is no large publicly available dataset in the FHIR standard that has been generated from clinical notes. Therefore, we decided to annotate a dataset that contains both free-text input and structured output in FHIR resources. The latter serves as the ground truth against which we evaluate the performance of the LLMs in FHIR transformation.
We manually annotated the medication-related clinical narratives to adhere to the MedicationStatement resource as per FHIR v6.0.0: R6 implementation guide26. According to the official FHIR definition, a MedicationStatement indicates that a patient may currently be taking a medication, has taken it in the past, or will take it in the future. This transformation holds particular significance because many medication-related details, such as the reasons for administration and dosage instructions, are often absent from structured data. Clinical notes within the Electronic Health Record (EHR) system are frequently the sole available source from which this information can be retrieved and converted into a standardized format. The MedicationStatement covers various medication-related content, including dosage, schedule, reason, form, route, strength, and more. For detailed examples of the elements in the MedicationStatement resource, please refer to Table 1.
The clinical text input is obtained from the discharge summaries in the MIMIC-III dataset27. The 2018 n2c2 medication extraction challenge28, essentially a named entity recognition task, provided mentions of medications and the word spans of the medications’ associated entities (including drug routes, frequencies, durations, adverse effects, forms, strengths, dosages, and reasons) within the discharge summaries in the MIMIC-III dataset. All entities were manually annotated by clinical experts. We extracted text snippets, each containing mentions of one medication and all its associated entities, from the discharge summaries. We also included some buffer words from the original discharge summaries before and after the extracted word spans to ensure that these snippets are complete sentences. These extracted snippets, each related to a specific medication, serve as input for both annotations and transformations.
The human annotation for transformation to the FHIR standard consists of three key steps. The first step involved identifying the elements associated with each medication. We addressed this task by re-using the expert annotations from the n2c2 dataset, which pinpoint the word spans of each element. The second step required standardizing the elements from free text into clinical terminology coding systems. The elements were linked to different coding systems, and Table 1 describes which code systems were used. Notably, the medication name was encoded in three distinct coding systems. Initially, the medication name was mapped to the patient’s prescription table in MIMIC-III, where NDC codes were provided. Input data for which the medication name could not be mapped to the patient’s prescription table were excluded from the dataset. Subsequently, NDC codes were mapped to RxNorm codes and SNOMED CT Medication Codes using the APIs provided by the RxNav toolkit29. For all other elements, such as reasons, routes, and forms, the SNOMED CT coding system was primarily used, unless HL7.org provided its own code set. The transformation of these codes relied primarily on manual lookup. We looked up the display names, codes, and other SNOMED CT terminology details from the SNOMED CT Browser, International Edition30. The third step involved assembling the identifiers, codes, texts, extensions, and structures into a complete MedicationStatement resource. Throughout the study, we used the JSON format. The converted FHIR medication statements were validated with the official FHIR validator31 to ensure compliance with FHIR standards, including structure, datatypes, cardinalities, code sets, display names, etc.
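The NDC-to-RxNorm mapping in the second step can be sketched as follows. This is a minimal illustration, not the annotation code used in the study. It assumes the public RxNav REST endpoint rxcui.json with the idtype=NDC and id query parameters, and a response of the form {"idGroup": {"rxnormId": [...]}}; both should be checked against the current RxNav documentation. The function names and the placeholder identifier are hypothetical.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen

# Assumed base URL of the public RxNav REST service.
RXNAV_BASE = "https://rxnav.nlm.nih.gov/REST"

def ndc_lookup_url(ndc):
    """Build the lookup URL for one NDC code (assumed endpoint and parameters)."""
    return f"{RXNAV_BASE}/rxcui.json?" + urlencode({"idtype": "NDC", "id": ndc})

def parse_rxcuis(payload):
    """Pull the RxNorm identifiers out of a decoded response.

    Assumed response shape: {"idGroup": {"rxnormId": ["..."]}}.
    Returns an empty list when the NDC code has no mapping.
    """
    return payload.get("idGroup", {}).get("rxnormId", [])

def ndc_to_rxcuis(ndc, timeout=10):
    """Query the service over the network and return the mapped RxNorm IDs."""
    with urlopen(ndc_lookup_url(ndc), timeout=timeout) as response:
        return parse_rxcuis(json.load(response))

# Offline demonstration with a made-up payload; "12345" is a placeholder,
# not a real RxNorm identifier.
print(parse_rxcuis({"idGroup": {"rxnormId": ["12345"]}}))
```

Keeping URL construction and response parsing separate from the network call makes the mapping logic testable offline. Snippets whose NDC code returns an empty list could then be flagged for manual review.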
The annotation tasks were conducted by Y.L. and H.W., who worked together to resolve ambiguities and uncertainties. We will make the annotated dataset available to the public for authorized use upon paper acceptance.
Large Language Models
The LLMs we experimented with include OpenAI GPT-432, Llama-2-70B33, and Falcon-180B34. We accessed the GPT-4 APIs through the Azure OpenAI service, as recommended by the responsible-use guidelines for MIMIC data. The specific model we used is gpt-4-32k in its 2023-05-15 version. To enhance efficiency, we made multiple asynchronous API calls. We deployed Llama-2-70B and Falcon-180B on our HIPAA-compliant firewalled local servers with multiple GPU backends. GPTQ35 quantization was used to accelerate inference for Llama-2-70B and Falcon-180B.
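The asynchronous calling pattern mentioned above can be sketched with Python's asyncio. This is an illustrative outline under stated assumptions, not the client code used in the study: call_model stands in for whatever coroutine sends one snippet to the model endpoint, fake_model is a placeholder that makes no network request, and max_concurrency is a hypothetical cap on in-flight requests.

```python
import asyncio

async def transform_all(snippets, call_model, max_concurrency=8):
    """Send every snippet to call_model concurrently, at most max_concurrency at a time.

    Results are returned in the same order as the input snippets.
    """
    semaphore = asyncio.Semaphore(max_concurrency)

    async def worker(snippet):
        async with semaphore:  # limit the number of simultaneous requests
            return await call_model(snippet)

    return await asyncio.gather(*(worker(s) for s in snippets))

async def fake_model(snippet):
    """Placeholder for a real API call; yields control, then echoes in upper case."""
    await asyncio.sleep(0)
    return snippet.upper()

results = asyncio.run(transform_all(["aspirin 81 mg po daily", "lasix 40 mg iv"], fake_model))
print(results)
```

Bounding concurrency with a semaphore is one common way to stay within service rate limits; a real client would also need retry and timeout handling, which this sketch omits.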
We instructed these LLMs to transform the free-text entries into MedicationStatements conforming to the FHIR standard, using few-shot prompts. Each clinical snippet was individually input into the LLMs to generate the MedicationStatement resource. We used five separate prompts to instruct the LLM to transform the free-text input into the elements of a MedicationStatement resource: medication details (such as drug name, strength, and form), route, timing, dosage, and reason, respectively. All few-shot prompts adhered to a template with the following order: task instructions, expected output FHIR templates in JSON format, 4-5 examples of transformations, a comprehensive list of codes from which the model could make selections, and the input text to be transformed. As there was no fine-tuning or domain-specific adaptation in our experiments, we initially had the LLM generate the FHIR format for a small subset of the dataset (N=∼100). Then, we manually reviewed the discrepancies between the LLM-generated FHIR output and our human annotations. Common mistakes were identified and used to refine the prompts. The prompts differed slightly between LLMs, as different LLMs may be sensitive to different prompts. Note that we did not have access to comprehensive lists of NDC, RxNorm, and SNOMED Medication codes for all medication names, or of SNOMED Finding codes for reasons. We did not instruct the LLMs to look up the SNOMED codes for the ’medication’ and ’reason’ elements, as the complete list of SNOMED CT Medication and Finding codes, numbering in the thousands or more, exceeds the token limits of LLMs. Instead, we instructed them to identify the contexts mentioned in the input text and convert them into the appropriate JSON format.
For instance, the expected output is {"reason": [{"concept": {"text": "headache"}}]} rather than the more detailed {"reason": [{"concept": {"text": "headache", "coding": [{"system": "SNOMED", "code": "25064002", "display": "Headache"}]}}]}. For other code sets, such as SNOMED CT Form codes, numbering in the hundreds, we allowed LLMs to code them directly. Please see the appendix for prompts.
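To make the prompt template order concrete, the sketch below assembles one few-shot prompt from its five parts: task instructions, output template, examples, allowed codes, and input text. It is a hypothetical illustration and does not reproduce the prompts in the appendix. The function names, the instruction wording, the section labels, and the example snippets are all assumptions, and only one example is shown where the study used 4-5. The two SNOMED CT route codes are included for illustration only.

```python
import json

def build_prompt(instructions, output_template, examples, allowed_codes, input_text):
    """Assemble a few-shot prompt: instructions, template, examples, codes, input."""
    example_block = "\n\n".join(
        f"Input: {text}\nOutput: {json.dumps(resource)}" for text, resource in examples
    )
    code_block = "\n".join(f"{code} | {display}" for code, display in allowed_codes)
    return "\n\n".join([
        instructions,
        "Expected output template:\n" + json.dumps(output_template, indent=2),
        "Examples:\n" + example_block,
        "Allowed codes:\n" + code_block,
        f"Input: {input_text}\nOutput:",
    ])

def route(code, display):
    """Hypothetical output fragment for the route element."""
    return {"route": {"coding": [
        {"system": "http://snomed.info/sct", "code": code, "display": display}
    ]}}

prompt = build_prompt(
    instructions="Extract the administration route from the input and return it as JSON.",
    output_template=route("<code>", "<display>"),
    examples=[("Aspirin 81 mg PO daily", route("26643006", "Oral route"))],
    allowed_codes=[("26643006", "Oral route"), ("47625008", "Intravenous route")],
    input_text="Vancomycin 1 g IV q12h",
)
print(prompt)
```

Ending the prompt with an open "Output:" line is one way to cue the model to emit only the JSON fragment. For elements whose code lists are too long to enumerate, such as medications and reasons, the allowed_codes section would be left out and the template would carry only the text field.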
Evaluation
We compared the transformed resources with the outputs from two existing approaches: NLP2FHIR18 and the Google Healthcare Natural Language (NL) API22. The transformation results from both approaches lacked some elements covered by our human annotation and LLM generation. NLP2FHIR was built on a previous version of the FHIR implementation guide, and the Google Healthcare NL API primarily standardized concepts to UMLS CUIs rather than to the SNOMED CT codes used in our annotations and LLM transformations. We made adaptations and conversions to ensure a fair comparison. We deployed the NLP2FHIR pipeline on our HIPAA-compliant firewalled local servers. We accessed the Google Healthcare NL API through the Google Cloud Healthcare API, which is also compliant with HIPAA regulations.
When evaluating the FHIR resources generated by the LLMs, our initial step was to verify that the output was in valid JSON format. Once the output passed the JSON format check, our primary evaluation criterion was the exact match rate. This criterion required that the resources generated by the LLMs exactly matched the human annotations in all aspects, including structures, codes, and cardinality. Previous studies treated the transformation as a Named Entity Recognition (NER) task and reported word-span F1, precision, and recall scores. We did not use these metrics because they may overlook the essential aspects of inferring and standardizing the content based on context. Exact identification of the word span does not guarantee that the correct corresponding codes can be identified or that the accurate FHIR schema can be derived.
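The two-stage evaluation described above (JSON validity check, then exact match) can be sketched as follows. This is a minimal illustration, not the evaluation script used in the study. It assumes that an output failing the JSON check counts as a non-match and that comparison is plain equality of the parsed structures, which ignores key order within objects but is sensitive to the order of array items. The function name exact_match_rate is hypothetical.

```python
import json

def exact_match_rate(predictions, references):
    """Fraction of raw model outputs that parse as JSON and equal the annotation.

    predictions: list of raw output strings from the model.
    references:  list of annotated resources as parsed JSON (dicts).
    Assumption: invalid JSON is scored as a miss rather than excluded.
    """
    if len(predictions) != len(references):
        raise ValueError("predictions and references must have the same length")
    if not references:
        return 0.0

    matches = 0
    for raw, reference in zip(predictions, references):
        try:
            predicted = json.loads(raw)
        except json.JSONDecodeError:
            continue  # failed the JSON format check
        if predicted == reference:  # structure, codes, and cardinality must all agree
            matches += 1
    return matches / len(references)

gold = {"reason": [{"concept": {"text": "headache"}}]}
outputs = [
    '{"reason": [{"concept": {"text": "headache"}}]}',  # exact match
    '{"reason": [{"concept": ',                         # invalid JSON
    '{"reason": []}',                                   # valid JSON, wrong content
]
print(exact_match_rate(outputs, [gold, gold, gold]))
```

Under this criterion a prediction with the correct text but a missing or extra reason entry is scored as wrong, which is why it is stricter than span-level F1.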
Conclusion
In this study, we provided the foundations for leveraging LLMs to enhance health data interoperability by transforming free-text input into FHIR resources. The FHIR-GPT model not only requires no training but also improves transformation accuracy over existing methods. Future studies will aim to build upon these successes by extending the generation to additional FHIR resources and comparing the performance of more LLMs.
Data Availability
https://mimic.mit.edu/docs/iii/ https://portal.dbmi.hms.harvard.edu/projects/n2c2-nlp/
Footnotes
Emails: yikuan.li{at}northwestern.edu, hanyin.wang{at}northwestern.edu, halid.yerebakan{at}siemens-healthineers.com, yoshihisa.shinagawa{at}siemens-healthineers.com, yuan.luo{at}northwestern.edu