Evaluating machine learning approaches for multi-label classification of unstructured electronic health records with a generative large language model
======================================================================================================================================================

* Dinithi Vithanage
* Chao Deng
* Lei Wang
* Mengyang Yin
* Mohammad Alkhalaf
* Zhenyua Zhang
* Yunshu Zhu
* Alan Christy Soewargo
* Ping Yu

## Abstract

Multi-label classification of unstructured electronic health records (EHR) poses challenges due to the inherent semantic complexity of textual data. Advances in natural language processing (NLP) using large language models (LLMs) show promise in addressing these issues. Identifying the most effective machine learning method for EHR classification in real-world clinical settings is crucial. Therefore, this experimental research aims to test the effect of zero-shot and few-shot learning prompting strategies, with and without parameter-efficient fine-tuning (PEFT) of the LLM, on the multi-label classification of an EHR data set. The labels tested span four clinical classification tasks: agitation in dementia, depression in dementia, frailty index, and malnutrition risk factors. We utilise unstructured EHR data from residential aged care facilities (RACFs), employing the Llama 2-Chat 13B-parameter model as our generative AI-based LLM. Performance evaluation includes accuracy, precision, recall, and F1 score, supported by non-parametric statistical analyses. Results indicate that, with the same prompting template, zero-shot and few-shot learning each reach the same level of performance across the four clinical tasks. Without PEFT, few-shot learning outperforms zero-shot learning. The study emphasises that fine-tuning significantly enhances the effectiveness of both zero-shot and few-shot learning. After PEFT, the performance of zero-shot learning reaches the same level as that of few-shot learning.
The analysis underscores that LLMs with PEFT for specific clinical tasks maintain their performance across diverse clinical tasks. These findings offer crucial insights into LLMs for researchers, practitioners, and stakeholders utilising LLMs in clinical document analysis.

Keywords

* Natural language processing
* Large language models
* Electronic health records
* Machine learning
* Multi-label classification

## 1 Introduction

A substantial number of medical predictive models have been trained, tested, and published, yet the majority of them have never been deployed in the clinical setting, a gap that has been coined “a last mile problem” [1]. This is because most of these predictive models rely on structured health data, whereas much important clinical information is captured in free-text clinical notes, which introduces complexity for model development and deployment.

Electronic health records in residential aged care facilities in Australia are digitised systems designed to collect, store, and display data about clients’ demographics, medical diagnoses, assessments, progress notes, charts, and forms [1]. As in other healthcare settings [2], besides the structured diagnosis data, much important clinical information in RACFs is captured in unstructured, narrative, free-text nursing progress notes. Because free text is a more expressive and natural way for care staff to record care encounters and communicate among team members, these notes are updated often and are the closest to a real-time reflection of an older person’s health condition. Therefore, effectively extracting information from unstructured clinical notes in EHR is important to support clinical decision-making, improve aged care quality, and advance translational research.

Multi-label classification of free-text data is a specialised area in machine learning and NLP. Multi-label classification refers to the task of assigning multiple labels or categories to a single input instance.
It involves the automated extraction of entities, concepts, events, and their relations from unstructured text [4]. This is a challenging task because text data often has different meanings and interpretations [3] and because it requires precise and expeditious information extraction tools [7]. Despite the advancement of various transformer-based encoder-type language models, e.g., various BERT models, clinical NLP remains a labour-intensive process that demands a substantial amount of expertise and human effort to prepare the training data [2, 4]. This limitation has hindered the effective application of earlier NLP techniques to information extraction from the unstructured, free-text EHR.

The recent advancements in LLMs, such as GPT variants, T5, OPT, and Llama [5], have demonstrated the ability of these decoder models to generate text that is not only human-like but also surpasses human-level performance in certain tasks [5]. These models, when combined with machine learning techniques like pre-training, fine-tuning, and prompt-based learning [6], offer transformative potential for NLP, enabling the development of automated and adaptable systems that can extract valuable insights from the free-text EHR. This marks a significant step towards the goal of integrating health predictive models into real-world clinical systems.

We are still in the early days of applying generative AI-based LLMs to extract clinical insights from the free-text EHR. While LLMs have shown potential in answering clinical questions [7–9] and extracting clinical data from public health data sets [10], their practical application in specific clinical tasks within real-world clinical settings using clinical data remains limited [4, 8, 11].
It is yet to be determined whether prompt engineering for LLMs can meet the stringent safety standards required for healthcare applications, given their tendency to generate outputs that may contain disinformation, misinformation, bias or hallucinations [8, 12]. The optimal prompting strategies for healthcare information extraction, whether zero-shot or few-shot learning, in various contexts remain unclear. Therefore, this research aims to investigate the differential effect of zero-shot and few-shot learning prompting strategies on multi-label classification across diverse clinical domains.

Understanding prompting behaviour is crucial for the safe and effective deployment of LLMs in healthcare settings. A prompt is an input a user enters to instruct an LLM to autonomously generate sequential output [13]. An LLM uses pattern matching to identify the relationships between the words, phrases, and concepts in the prompt, connects these with the patterns learned during its previous training, and uses natural language generation to respond in a human-understandable format. Prompts enable the model to adapt to and comprehend specific information in a new domain, leveraging the knowledge stored within pre-trained models like Llama 2, thereby expanding the model’s applicability and effectiveness. Prompt learning reduces the need to introduce new parameters or to extensively retrain the model using labelled data for various tasks, and thus improves efficiency and reduces the computational resources required for machine learning. There are different formats of prompt-based learning. In this study, we test zero-shot and few-shot learning.

### 1.1 Zero-shot learning

Zero-shot learning uses a single-prompt instruction to direct LLMs on specific NLP tasks, directly applying previously trained models to predict both seen and unseen classes without using any labelled training instances [14].
Zero-shot learning has achieved impressive performance in a variety of NLP tasks, such as summarisation, dialogue generation, and question-answering [8]. Ge et al. use zero-shot learning to extract six data elements from patients’ abdominal imaging reports using an API implementation of the OpenAI GPT-3.5 turbo LLM, achieving a high overall accuracy of 88.9%. They find that the accuracy of zero-shot learning decreases with more complex use cases. Their findings demonstrate the feasibility of using general-purpose LLMs to extract structured information from clinical data with minimal technical expertise.

### 1.2 Few-shot learning

Few-shot learning, also termed in-context learning, refers to the ability of LLMs to perform tasks guided by a small set of representative examples provided in the prompt [15, 16]. These in-context examples not only teach the LLM the mapping from inputs to outputs but also activate the LLM’s parametric knowledge [17]. Requiring only a handful of labelled training examples is a clear advantage of few-shot learning, making it data-efficient and accessible to knowledge domain users without expertise in machine learning [15]. Few-shot learning is particularly useful in situations where annotating text data is inconvenient or expensive. By providing just a few examples, domain experts can quickly create a generative AI system for a new task. Importantly, few-shot learning does not change the underlying model weights [18]. This allows for efficient adaptation to new tasks without risking the loss of previously learned information. However, the performance of few-shot learning varies and is highly task-dependent [15]. Its accuracy is also sensitive to the choice of prompt templates and in-context examples [17]. Prior research finds that using in-context examples that are semantically similar to those with prior success can significantly enhance the performance of few-shot learning [19].
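The practical difference between the two prompting formats can be illustrated with a minimal sketch. It assumes a generic instruction-plus-note layout; the function `build_prompt`, the instruction wording, and the example notes are all hypothetical and invented for illustration, and they are not the prompts used in this study (those are listed in Table 3).

```python
from typing import Sequence


def build_prompt(
    instruction: str,
    note: str,
    examples: Sequence[tuple[str, list[str]]] = (),
) -> str:
    """Build a zero-shot prompt (no examples) or a few-shot prompt (with examples).

    The layout is a hypothetical illustration, not the template of the study.
    """
    parts = [instruction.strip()]
    # Few-shot: each labelled example is shown to the model inside the prompt.
    for example_note, labels in examples:
        parts.append(f"Note: {example_note}\nLabels: {', '.join(labels)}")
    # The note to classify always comes last, with the label slot left open.
    parts.append(f"Note: {note}\nLabels:")
    return "\n\n".join(parts)


# Invented instruction and notes, for illustration only.
INSTRUCTION = "List every agitation symptom mentioned in the nursing note."
EXAMPLES = [
    ("Resident was shouting at staff during lunch.", ["shouting"]),
    ("Resident refused meals and paced the corridor.", ["refusing meals", "pacing"]),
]
NEW_NOTE = "Resident kept wandering into other rooms overnight."

zero_shot_prompt = build_prompt(INSTRUCTION, NEW_NOTE)
few_shot_prompt = build_prompt(INSTRUCTION, NEW_NOTE, EXAMPLES)
print(few_shot_prompt)
```

In both cases the model weights are untouched; only the text of the prompt differs, which is why the choice of template and in-context examples matters so much for few-shot performance.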
### 1.3 Parameter-efficient fine-tuning

Fine-tuning involves modifying the LLM, or the parameters used to train the LLM, to improve the model’s response to the same prompt [13]. Fine-tuning changes a model’s weights and, thus, the model’s behaviour, so that it performs better at a specific task. Full fine-tuning updates all layers of the pre-trained model, which can be computationally expensive and may lead to catastrophic forgetting, i.e., the model forgets the knowledge it gained during pre-training. Thus, it may significantly increase the computational resources and the computational skill sets required. Parameter-efficient fine-tuning only fine-tunes a small number of (extra) parameters while freezing most parameters of the pre-trained LLM. It thus overcomes the computational resource constraint and the catastrophic forgetting observed in the full-scope fine-tuning of LLMs.

Low-Rank Adaptation (LoRA) is a PEFT technique designed to improve training efficiency for LLMs. It freezes the weights of the pre-trained LLM and inserts low-rank decomposition matrices into the transformer layers. Previous research has demonstrated that LoRA can allow the fine-tuning process to focus on crucial parameters specific to the target task or domain, thus optimising the model’s performance without extensive resource requirements or overfitting concerns. By focusing on PEFT, LoRA minimises the dependency on extensive labelled data for model optimisation, which maximises the utility of available data, making the fine-tuning process more effective and feasible in scenarios with limited annotated datasets [20].

Extracting symptoms of various geriatric diseases is important for early diagnosis, personalised treatment, and improving patient outcomes. To date, there is no report of an effective tool that executes this multi-label classification task accurately and reliably from free-text notes in an EHR system.
Therefore, this study focuses on a comparative analysis of the performance of prompt engineering with and without PEFT in multi-label classification. In this study, we include four clinical tasks with careful consideration of the following factors: (1) the information is recorded in the free-text nursing progress notes; (2) the information meets aged care information needs; and (3) the research team has curated labelled datasets to allow model training, validation, and testing to evaluate the performance of the model. We identified four clinical tasks: agitation in dementia, depression in dementia, frailty index and malnutrition risk factors (see Table 1). The number of labels per task ranges from 13 to 83.

[Table 1:](http://medrxiv.org/content/early/2024/06/27/2024.06.24.24309441/T1) Clinical tasks for multi-label classification.

There is a lack of prior research on the difference in performance between zero-shot and few-shot learning for the same clinical classification task and on the effect of PEFT on the tasks. As healthcare demands high safety standards for machine learning, it is imperative to conduct experimental comparisons of the performance of various machine learning methods. In this research, we focus on the comparison of zero-shot and few-shot learning, with and without PEFT, on multi-label clinical classification tasks. We design experiments to test the research hypotheses (see Table 2).

[Table 2:](http://medrxiv.org/content/early/2024/06/27/2024.06.24.24309441/T2) Research hypotheses in the study.

## 2 Methodology

We conduct the experiment in seven stages: generative AI-based large language model selection, data set selection, data preprocessing, designing prompting templates for zero-shot and few-shot learning in each clinical task, machine learning methods execution, model performance evaluation and statistical analysis.
### 2.1 Ethics approval

The Human Research Ethics Committee of the University of Wollongong approved the study (Ethics Number 2019/159).

### 2.2 Generative AI-based LLM selection

We select the Llama 2-Chat 13B-parameter model as the generative AI-based LLM. The selection considers the following factors: (1) it was a leading open-source model with favourable reviews at the time of the experiment; (2) practical considerations regarding the availability of GPU resources; (3) feasibility for local server deployment, convenience and control over usage; (4) compliance with health data privacy regulations in Australia; (5) the presence of diverse variants spawned through fine-tuning, including Alpaca, Baize, Koala, and Vicuna [5, 21]. We obtain the Llama 2-Chat 13B-parameter model from the Hugging Face repository ([https://huggingface.co/meta-llama/Llama-2-7b-chat-hf](https://huggingface.co/meta-llama/Llama-2-7b-chat-hf)).

### 2.3 Data set selection

De-identified demographic data and free-text nursing progress notes are collected for the same population of older people living in 40 RACFs in New South Wales, Australia, for the period of 2019 to 2021. Residential aged care facilities are the equivalent of long-term care facilities in the USA. The dataset encompasses over 890,000 records of 3,528 de-identified individuals. The structured demographic information includes a masked sequence number for client de-identification, age, and gender. The unstructured nursing notes include nursing assessment and progress reporting. They document clients’ daily activities, care staff’s clinical observations, assessments of clients’ care needs (including risk factors), and carer interventions.

### 2.4 Data preprocessing

Text preprocessing involves the removal of URLs and non-textual characters, such as extra delimiters and empty spaces in the dataset.
We choose not to exclude stop words because many of them, like “a,” “be,” “very,” “should,” etc., hold semantic relevance to the content [22].

### 2.5 Designing prompting templates for zero-shot and few-shot learning in each clinical task

First, we select prompt-based training via zero-shot and few-shot learning. We adopt the template developed by Abdallaha et al. [23] to construct our prompt (see Figure 1).

![Figure 1:](http://medrxiv.org/https://www.medrxiv.org/content/medrxiv/early/2024/06/27/2024.06.24.24309441/F1.medium.gif) [Figure 1:](http://medrxiv.org/content/early/2024/06/27/2024.06.24.24309441/F1) Figure 1: Prompt template adapted from Abdallaha et al. [23]

The final prompts used in our experiment are listed in Table 3. The example results generated from the final prompts are showcased in Supplementary Table 1.

[Table 3:](http://medrxiv.org/content/early/2024/06/27/2024.06.24.24309441/T3) Prompts used in the study.

### 2.6 Machine learning methods execution

We select prompt-based learning on Llama 2 and Llama 2 with PEFT to test the LLM’s ability to adapt, generalise, and optimise performance in clinical multi-domain classification tasks.

#### 2.6.1 Experimental setup

##### Prompt-based learning with zero-shot and few-shot learning on Llama 2

The experiment is conducted on two NVIDIA GeForce GTX 1080 Ti GPUs, each equipped with 11 GB of memory. Our software environment is Ubuntu 18.04, the programming language is Python 3.10.0, and the deep learning framework is PyTorch 2.0.0. We employ Llama 2 without PEFT, utilising the zero-shot and few-shot learning prompts outlined in Table 3. To prevent model contamination, we approach each clinical task (see Table 1) in two distinct steps. Initially, we employ the Llama 2 model, downloaded directly from the Hugging Face repository, for zero-shot learning. Afterwards, we download a second copy of the same model from the same repository to conduct few-shot learning.
We perform multiple iterations of zero-shot learning and few-shot learning with Llama 2 in order to test the research hypotheses outlined in Table 2, using a test dataset of 100 nursing notes in each clinical task. The maximum token limit employed in Llama 2 is 4096, as none of the test notes exceeds this token count. The model iteratively processes each note within this defined token limit during testing.

##### Parameter Efficient Fine-tuning with LoRA on Llama 2

We use the PEFT method to fine-tune the Llama 2 model. The experiment is conducted on four NVIDIA GeForce GTX 1080 Ti GPUs, each equipped with 11 GB of memory, employing the hyperparameter settings shown in Table 4. Our software environment is Ubuntu 18.04, the programming language is Python 3.10.0, and the deep learning framework is PyTorch 2.0.0. The instruction data points are employed in the PEFT process.

[Table 4:](http://medrxiv.org/content/early/2024/06/27/2024.06.24.24309441/T4) Hyperparameters used in PEFT

The maximum token limit is set to 4096, the maximum number of tokens that the Llama 2 model can take, which is large enough to encompass the tokens of each nursing note. During the fine-tuning process, the model iteratively processes each note within the defined token limit. We randomly divide the labelled data presented in Table 5 for each clinical task into 90% training and 10% validation data sets. First, we apply PEFT to the training data, i.e., free-text nursing notes for each clinical task (see Table 5). We ensure that no free-text note in the labelled dataset is used for more than one labelled clinical task in the PEFT processes.
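The random 90%/10% division and the no-overlap check described above can be sketched as follows. This is a minimal illustration under stated assumptions: the function names, the fixed seed, and the note identifiers are hypothetical, and the study does not report how the split was implemented.

```python
import random


def split_train_validation(notes, train_fraction=0.9, seed=0):
    """Randomly split one task's labelled notes into training and validation sets."""
    shuffled = list(notes)
    # A fixed seed (an assumption here) makes the random split reproducible.
    random.Random(seed).shuffle(shuffled)
    cut = round(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]


def notes_shared_between_tasks(task_to_notes):
    """Return the note ids that appear under more than one clinical task."""
    first_task_seen = {}
    shared = set()
    for task, notes in task_to_notes.items():
        for note_id in notes:
            if first_task_seen.setdefault(note_id, task) != task:
                shared.add(note_id)
    return shared


# Hypothetical note ids; real identifiers and counts are given in Table 5.
labelled = {
    "agitation": [f"A{i}" for i in range(200)],
    "depression": [f"D{i}" for i in range(200)],
}
assert not notes_shared_between_tasks(labelled)

train, validation = split_train_validation(labelled["agitation"])
print(len(train), len(validation))
```

Keeping the check separate from the split makes it easy to run once over all four tasks before any fine-tuning starts.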
[Table 5:](http://medrxiv.org/content/early/2024/06/27/2024.06.24.24309441/T5) Number of labelled data and file size for each clinical task

Subsequently, separate test datasets of 100 nursing notes are employed for model performance evaluation for each clinical task in order to test the research hypotheses outlined in Table 2. This dedicated test dataset is also employed for assessing the prompt-based learning method of Llama 2 without PEFT, facilitating a comprehensive comparison and analysis. We use the prompts delineated in Table 3 to evaluate the test data. First, we conduct zero-shot learning with PEFT Llama 2 across the four test datasets for each clinical task. This is followed by few-shot learning with PEFT Llama 2 across the same four test datasets for each clinical task. To ensure that the few-shot learning does not benefit from the residual effect of the previous zero-shot learning in the test process, we download the original Llama 2 model from the Hugging Face repository for training in the fine-tuning process.

### 2.7 Model performance evaluation

We calculate accuracy, precision, recall and F1 score to assess each model’s performance for the four clinical tasks. The annotated ground truth is curated by the research team. Each annotation is independently corroborated by at least two, and sometimes three, domain experts. We compare the machine learning output to the annotated ground truth, employing exact and semantic matching criteria. In our approach, an extracted entity or phrase that overlaps with the text and shares the exact meaning of the entity type or phrase in the annotated ground truth is considered a true positive response [2]. For example, suppose the original text in the annotated ground truth is “shouting” as a symptom of agitation in dementia, and the model’s extracted output is “shouting”; the model output is judged as a true positive because it exactly matches the annotated ground truth.
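The exact-matching criterion and the scores derived from it can be sketched as below. This is a minimal illustration, assuming that phrases are compared after lower-casing and whitespace normalisation; the function names and sample phrases are hypothetical. Accuracy is left out because the text does not state how it is defined for this extraction task.

```python
def normalise(phrase: str) -> str:
    """Lower-case and collapse whitespace so trivial differences do not block a match."""
    return " ".join(phrase.lower().split())


def score_note(predicted, ground_truth):
    """Count exact matches for one note and derive precision, recall and F1."""
    pred = {normalise(p) for p in predicted}
    gold = {normalise(g) for g in ground_truth}
    tp = len(pred & gold)  # in the model output and in the ground truth
    fp = len(pred - gold)  # in the model output only
    fn = len(gold - pred)  # in the ground truth only
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)) if precision + recall else 0.0
    return {"tp": tp, "fp": fp, "fn": fn,
            "precision": precision, "recall": recall, "f1": f1}


# Invented phrases for illustration only.
result = score_note(predicted=["Shouting", "pacing"],
                    ground_truth=["shouting", "hitting"])
print(result)
```

A semantic-matching variant would replace the set intersection with a similarity judgement between phrases; that step is not sketched here because the text does not specify how similarity was assessed.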
Since Llama 2 is a generative model capable of producing text that conveys semantic meaning, we also apply semantic similarity matching to assess true positive results. For example, suppose the original text in the annotated ground truth is “reject meals” as a symptom of agitation in dementia, and the model’s extracted output is “refuse meals”; the model output is judged as a true positive because its meaning matches the annotated ground truth. The words/phrases that are output by the model but not in the annotated ground truth are considered false positives, and the words/phrases that are in the annotated ground truth but not in the Llama 2 model output are considered false negatives [2]. The example results generated from the evaluation criteria are showcased in Supplementary Table 2.

### 2.8 Statistical analysis

As the accuracy, precision, recall and F1 score, serving as measurement indicators, are continuous variables that do not meet the normality assumption required for parametric tests, we utilise the non-parametric Kruskal-Wallis test for comparing results across three or more independent groups and the Mann-Whitney U test for comparing two independent groups to test the hypotheses, as suggested by previous research [24, 25]. A difference is considered statistically significant if the p-value is smaller than 0.05.

## 3 Results

### 3.1 Results of testing Hypothesis 1: Zero-shot and few-shot learning with similar prompting templates have different levels of performance when applied to multi-label classification for different clinical tasks

To evaluate Hypothesis 1, we undertake the following comparisons among the four clinical tasks: (1) the performance of zero-shot learning without PEFT; (2) the performance of zero-shot learning with PEFT; (3) the performance of few-shot learning without PEFT; and (4) the performance of few-shot learning with PEFT.
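Each of these four comparisons involves four independent groups (one per clinical task), which is the setting for the Kruskal-Wallis test named in Section 2.8. The sketch below computes the tie-corrected H statistic with the standard library only and compares it with the chi-square critical value at the 0.05 level. The scores are invented for illustration, the function names are hypothetical, and the chi-square approximation is rough for small groups; the study does not report which software was used.

```python
from collections import Counter
from itertools import chain

# Chi-square critical values at alpha = 0.05, keyed by degrees of freedom.
CHI2_CRITICAL_005 = {1: 3.841, 2: 5.991, 3: 7.815}


def average_ranks(values):
    """Map each distinct value to its 1-based rank, averaging ranks across ties."""
    ordered = sorted(values)
    ranks = {}
    start = 0
    while start < len(ordered):
        end = start
        while end < len(ordered) and ordered[end] == ordered[start]:
            end += 1
        # Tied values occupy ranks start+1 .. end; assign their mean.
        ranks[ordered[start]] = (start + 1 + end) / 2
        start = end
    return ranks


def kruskal_wallis_h(groups):
    """Return the tie-corrected Kruskal-Wallis H statistic for independent groups."""
    pooled = list(chain.from_iterable(groups))
    n = len(pooled)
    ranks = average_ranks(pooled)
    rank_term = sum(sum(ranks[v] for v in g) ** 2 / len(g) for g in groups)
    h = 12.0 / (n * (n + 1)) * rank_term - 3 * (n + 1)
    tie_sizes = Counter(pooled).values()
    correction = 1 - sum(t ** 3 - t for t in tie_sizes) / (n ** 3 - n)
    return h / correction if correction > 0 else 0.0


# Invented F1 scores per task, for illustration only.
f1_by_task = {
    "agitation": [0.62, 0.70, 0.66, 0.71],
    "depression": [0.55, 0.61, 0.58, 0.64],
    "frailty": [0.57, 0.60, 0.63, 0.59],
    "malnutrition": [0.65, 0.69, 0.67, 0.72],
}
h_statistic = kruskal_wallis_h(list(f1_by_task.values()))
degrees_of_freedom = len(f1_by_task) - 1
print(h_statistic, h_statistic > CHI2_CRITICAL_005[degrees_of_freedom])
```

An H value below the critical value corresponds to p > 0.05, i.e., no statistically significant difference among the tasks.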
#### 3.1.1 Comparing the performance of zero-shot learning without PEFT for four clinical tasks Figure 2 compares the evaluation results of zero-shot learning among the four clinical classification tasks without PEFT. There is no statistically significant difference in accuracy, precision, recall, and F1 score between these classification tasks when utilising zero-shot learning without PEFT (p >0.05). However, there is a trend that the classification tasks related to agitation in dementia and malnutrition risk factors perform better than those related to frailty index and depression in dementia. ![Figure 2:](http://medrxiv.org/https://www.medrxiv.org/content/medrxiv/early/2024/06/27/2024.06.24.24309441/F2.medium.gif) [Figure 2:](http://medrxiv.org/content/early/2024/06/27/2024.06.24.24309441/F2) Figure 2: Comparative evaluation of zero-shot learning for clinical multi-label classification tasks without PEFT. #### 3.1.2 Comparing the performance of zero-shot learning with PEFT for four clinical tasks Figure 3 compares the evaluation results of zero-shot learning for the four clinical classification tasks with PEFT. Once again, no statistically significant difference is found in accuracy, precision, recall, and F1 score between these classification tasks when utilising zero-shot learning with PEFT (p >0.05). However, there is a trend that the classification tasks related to agitation in dementia and malnutrition risk factors perform better than those related to frailty index and depression in dementia. ![Figure 3:](http://medrxiv.org/https://www.medrxiv.org/content/medrxiv/early/2024/06/27/2024.06.24.24309441/F3.medium.gif) [Figure 3:](http://medrxiv.org/content/early/2024/06/27/2024.06.24.24309441/F3) Figure 3: Comparative evaluation of zero-shot learning for the four clinical multi-label classification tasks with PEFT. 
#### 3.1.3 Comparing the performance of few-shot learning without PEFT for the four clinical tasks No statistically significant difference is found in accuracy, precision, recall, and F1 score between these tasks when utilising few-shot learning without PEFT (Figure 4, p >0.05). However, the same trend as above was found, i.e., the classification tasks related to agitation in dementia and malnutrition risk factors perform better than those related to frailty index and depression in dementia. ![Figure 4:](http://medrxiv.org/https://www.medrxiv.org/content/medrxiv/early/2024/06/27/2024.06.24.24309441/F4.medium.gif) [Figure 4:](http://medrxiv.org/content/early/2024/06/27/2024.06.24.24309441/F4) Figure 4: Comparative evaluation of few-shot learning for clinical multi-label classification tasks implemented without PEFT. #### 3.1.4 Comparing the performance of few-shot learning with PEFT for the four clinical tasks Again, no statistically significant difference is found in accuracy, precision, recall, and F1 score among the four clinical tasks (Figure 5, p >0.05); however, the same trend as above is observed; i.e., the classification tasks related to agitation in dementia and malnutrition risk factors perform better than those related to frailty index and depression in dementia. ![Figure 5:](http://medrxiv.org/https://www.medrxiv.org/content/medrxiv/early/2024/06/27/2024.06.24.24309441/F5.medium.gif) [Figure 5:](http://medrxiv.org/content/early/2024/06/27/2024.06.24.24309441/F5) Figure 5: Comparative evaluation of few-shot learning for clinical multi-label classification tasks with PEFT. ### 3.2 Results of testing Hypothesis 2: Few-shot learning performs better than zero-shot learning for the multi-label classification of the same clinical tasks without PEFT To evaluate Hypothesis 2, we undertake the following comparison: the performance of zero-shot and few-shot learning without PEFT for all four clinical tasks. 
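This comparison involves two independent groups, which is the setting for the Mann-Whitney U test named in Section 2.8. The sketch below computes the U statistic by direct pair counting and a two-sided p-value from the normal approximation, without tie or continuity correction. The scores are invented for illustration and the function names are hypothetical; the study does not report which software was used.

```python
import math


def mann_whitney_u(group_a, group_b):
    """Return (U_a, U_b), counting a tie between a pair as half a win."""
    u_a = sum((x > y) + 0.5 * (x == y) for x in group_a for y in group_b)
    u_b = len(group_a) * len(group_b) - u_a
    return u_a, u_b


def two_sided_p_value(u, n_a, n_b):
    """Approximate two-sided p-value from the normal approximation to U."""
    mean_u = n_a * n_b / 2
    sd_u = math.sqrt(n_a * n_b * (n_a + n_b + 1) / 12)
    z = (u - mean_u) / sd_u
    return math.erfc(abs(z) / math.sqrt(2))


# Invented accuracy scores, for illustration only.
zero_shot_scores = [0.52, 0.55, 0.49, 0.58, 0.51, 0.54]
few_shot_scores = [0.68, 0.71, 0.66, 0.73, 0.70, 0.69]

u_zero, u_few = mann_whitney_u(zero_shot_scores, few_shot_scores)
p_value = two_sided_p_value(u_zero, len(zero_shot_scores), len(few_shot_scores))
print(u_zero, u_few, p_value < 0.05)
```

For small samples an exact test is preferable to the normal approximation; the approximation is used here only to keep the sketch short.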
Compared with zero-shot learning, few-shot learning adaptation significantly improves model accuracy, precision, recall and F1 score in the multi-label clinical classification task (Figure 6, p < 0.05). The level of improvement is as follows: an 18% increase in model accuracy, an 18% increase in precision, a 25% increase in recall, and a 28% increase in F1 score.

![Figure 6.](http://medrxiv.org/https://www.medrxiv.org/content/medrxiv/early/2024/06/27/2024.06.24.24309441/F6.medium.gif) [Figure 6.](http://medrxiv.org/content/early/2024/06/27/2024.06.24.24309441/F6) Figure 6. Comparative evaluation of zero-shot learning versus few-shot learning for all four clinical classification tasks without PEFT.

Few-shot learning adaptation significantly improves model accuracy in each multi-label clinical classification task (Figure 7, p < 0.05). The level of improvement ranges from 15% for agitation in dementia and for malnutrition risk factors, through 17% for frailty index, to the highest of 19% for depression in dementia. Few-shot learning adaptation significantly improves model precision in each multi-label clinical classification task (p < 0.05). The level of improvement ranges from 15% for agitation in dementia and for malnutrition risk factors, through 17% for frailty index, to the highest of 19% for depression in dementia. Few-shot learning adaptation significantly improves model recall in each multi-label clinical classification task (p < 0.05). The level of improvement ranges from 17% for malnutrition risk factors, through 18% for agitation in dementia, to the highest of 20% for frailty index and for depression in dementia. Finally, few-shot learning adaptation significantly improves model F1 score in each multi-label clinical classification task (p < 0.05). The level of improvement ranges from 17% for malnutrition risk factors and for agitation in dementia, through 18% for frailty index, to the highest of 20% for depression in dementia.
![Figure 7.](http://medrxiv.org/https://www.medrxiv.org/content/medrxiv/early/2024/06/27/2024.06.24.24309441/F7.medium.gif) [Figure 7.](http://medrxiv.org/content/early/2024/06/27/2024.06.24.24309441/F7) Figure 7. Comparative evaluation of zero-shot learning versus few-shot learning for each clinical classification task without PEFT.

### 3.3 Results of testing Hypothesis 3: Parameter-efficient fine-tuning can improve both zero-shot and few-shot learning performance

To evaluate Hypothesis 3, we undertake the following comparisons: (1) the performance of the model of zero-shot learning without PEFT and with PEFT for all four tasks; (2) the performance of the model of few-shot learning without PEFT and with PEFT for all four tasks.

#### 3.3.1 Comparing the performance of the model of zero-shot learning without PEFT and with PEFT for all four tasks

Zero-shot learning with PEFT adaptation significantly improves model accuracy, precision, recall and F1 score in the multi-label clinical classification tasks (Figure 8, p < 0.05). The levels of improvement are as follows: a 37% increase in model accuracy, a 37% increase in precision, a 35% increase in recall, and a 33% increase in F1 score.

![Figure 8:](http://medrxiv.org/https://www.medrxiv.org/content/medrxiv/early/2024/06/27/2024.06.24.24309441/F8.medium.gif) [Figure 8:](http://medrxiv.org/content/early/2024/06/27/2024.06.24.24309441/F8) Figure 8: Comparative evaluation of zero-shot learning without PEFT versus with PEFT for all four clinical classification tasks. Note: ‘-PEFT’ denotes without PEFT, ‘+PEFT’ denotes with PEFT.

Zero-shot learning with PEFT adaptation significantly improves model accuracy in each multi-label clinical classification task (Figure 9, p < 0.05). The level of improvement ranges from 23% for frailty index, through 29% for malnutrition risk factors and 30% for depression in dementia, to the highest of 38% for agitation in dementia.
Zero-shot learning with PEFT adaptation significantly improves model precision in each multi-label clinical classification task (p < 0.05). The level of improvement ranges from 23% for frailty index, through 29% for malnutrition risk factors and 30% for depression in dementia, to the highest of 38% for agitation in dementia. Zero-shot learning with PEFT adaptation significantly improves model recall in each multi-label clinical classification task (p < 0.05). The level of improvement ranges from 26% for depression in dementia, through 29% for frailty index and 32% for malnutrition risk factors, to the highest of 36% for agitation in dementia. Zero-shot learning with PEFT adaptation significantly improves model F1 score in each multi-label clinical classification task (p < 0.05). The level of improvement ranges from 28% for depression in dementia, through 31% for malnutrition risk factors and 33% for frailty index, to the highest of 36% for agitation in dementia.

![Figure 9:](http://medrxiv.org/https://www.medrxiv.org/content/medrxiv/early/2024/06/27/2024.06.24.24309441/F9.medium.gif) [Figure 9:](http://medrxiv.org/content/early/2024/06/27/2024.06.24.24309441/F9) Figure 9: Comparative evaluation of zero-shot learning without PEFT versus zero-shot learning with PEFT for each clinical classification task. Note: ‘-PEFT’ denotes without PEFT, ‘+PEFT’ denotes with PEFT.

#### 3.3.2 Comparing the performance of the model of few-shot learning without PEFT and few-shot learning with PEFT for all four multi-label clinical classification tasks

Few-shot learning with PEFT adaptation significantly improves model accuracy, precision, recall and F1 score in all four multi-label clinical classification tasks (Figure 10, p < 0.05). The levels of improvement are as follows: a 15% increase in model accuracy, a 15% increase in precision, a 23% increase in recall, and a 24% increase in F1 score.
![Figure 10:](http://medrxiv.org/https://www.medrxiv.org/content/medrxiv/early/2024/06/27/2024.06.24.24309441/F10.medium.gif)

[Figure 10:](http://medrxiv.org/content/early/2024/06/27/2024.06.24.24309441/F10) Comparative evaluation of few-shot learning without PEFT versus few-shot learning with PEFT for various clinical classification tasks. Note: ‘-PEFT’ denotes without PEFT, ‘+PEFT’ denotes with PEFT.

Few-shot learning with PEFT adaptation significantly improves model accuracy in each multi-label clinical classification task (Figure 11, p < 0.05). The improvement ranges from 9% for frailty index, through 10% for malnutrition risk factors, to a high of 11% for both agitation in dementia and depression in dementia.

Few-shot learning with PEFT adaptation significantly improves model precision in each multi-label clinical classification task (p < 0.05). The improvement ranges from 9% for frailty index, through 10% for malnutrition risk factors, to a high of 11% for both agitation in dementia and depression in dementia.

Few-shot learning with PEFT adaptation significantly improves model recall in each multi-label clinical classification task (p < 0.05). The improvement ranges from 13% for both frailty index and malnutrition risk factors, through 15% for agitation in dementia, to a high of 17% for depression in dementia.

Few-shot learning with PEFT adaptation significantly improves model F1 score in each multi-label clinical classification task (p < 0.05). The improvement ranges from 13% for malnutrition risk factors, through 15% for both agitation in dementia and frailty index, to a high of 20% for depression in dementia.
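The significance statements above (p < 0.05) come from non-parametric tests. As a rough illustration of how such a two-group comparison can be run, the sketch below implements a two-sided Mann-Whitney U test with a normal approximation, tie correction and continuity correction. It assumes that this is the kind of test applied to the with-PEFT and without-PEFT groups, which this section does not spell out; the function name and the score lists are hypothetical and are not results from the study.

```python
import math
from typing import Sequence, Tuple


def mann_whitney_u(x: Sequence[float], y: Sequence[float]) -> Tuple[float, float]:
    """Two-sided Mann-Whitney U test (normal approximation).

    Returns (U, p). The approximation is coarse for very small samples,
    where exact tables would normally be preferred.
    """
    n1, n2 = len(x), len(y)
    if n1 == 0 or n2 == 0:
        raise ValueError("both samples must be non-empty")

    # Pool the observations, remembering which group each came from.
    pooled = sorted([(v, 0) for v in x] + [(v, 1) for v in y])
    n = n1 + n2

    rank_sum_x = 0.0
    tie_term = 0.0
    i = 0
    while i < n:
        j = i
        while j < n and pooled[j][0] == pooled[i][0]:
            j += 1
        t = j - i                      # size of this group of tied values
        avg_rank = (i + 1 + j) / 2.0   # mean of ranks i+1 .. j
        rank_sum_x += avg_rank * sum(1 for k in range(i, j) if pooled[k][1] == 0)
        tie_term += t ** 3 - t
        i = j

    u1 = rank_sum_x - n1 * (n1 + 1) / 2.0
    u = min(u1, n1 * n2 - u1)

    mean_u = n1 * n2 / 2.0
    var_u = n1 * n2 / 12.0 * ((n + 1) - tie_term / (n * (n - 1)))
    if var_u <= 0:
        return u, 1.0  # all observations identical: no evidence of a difference

    # Continuity correction, clamped so that z is never negative.
    z = max(abs(u - mean_u) - 0.5, 0.0) / math.sqrt(var_u)
    p = math.erfc(z / math.sqrt(2.0))  # two-sided p-value
    return u, p


# Hypothetical per-run F1 scores (illustrative values only).
without_peft = [0.61, 0.58, 0.63, 0.60, 0.59, 0.62, 0.57, 0.64]
with_peft = [0.88, 0.91, 0.86, 0.90, 0.89, 0.92, 0.87, 0.93]

u_stat, p_value = mann_whitney_u(without_peft, with_peft)
print(f"U = {u_stat}, p = {p_value:.4f}")
```

When the two groups do not overlap at all, as in this made-up example, U is zero and the p-value falls well below 0.05; when the groups are indistinguishable, the p-value approaches 1.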
![Figure 11:](http://medrxiv.org/https://www.medrxiv.org/content/medrxiv/early/2024/06/27/2024.06.24.24309441/F11.medium.gif)

[Figure 11:](http://medrxiv.org/content/early/2024/06/27/2024.06.24.24309441/F11) Comparative evaluation of few-shot learning without PEFT versus few-shot learning with PEFT for each clinical classification task. Note: ‘-PEFT’ denotes without PEFT, ‘+PEFT’ denotes with PEFT.

### 3.4 Results of testing Hypothesis 4: Zero-shot learning reaches the same level of performance as few-shot learning for the same clinical task with PEFT

To evaluate Hypothesis 4, we undertake the following comparison: the performance of zero-shot and few-shot learning with PEFT for all four tasks.

No statistically significant difference is found in accuracy, precision, recall, and F1 score between zero-shot and few-shot learning with PEFT in the multi-label clinical classification tasks (Figure 12, p > 0.05), although few-shot learning tends to score slightly higher than zero-shot learning.

![Figure 12:](http://medrxiv.org/https://www.medrxiv.org/content/medrxiv/early/2024/06/27/2024.06.24.24309441/F12.medium.gif)

[Figure 12:](http://medrxiv.org/content/early/2024/06/27/2024.06.24.24309441/F12) Comparative evaluation of zero-shot learning versus few-shot learning for various clinical classification tasks with PEFT. Note: ‘-PEFT’ denotes without PEFT, ‘+PEFT’ denotes with PEFT.

Overall, there is no statistically significant difference between zero-shot and few-shot learning for each clinical task (see Figure 13). The slight variation in values is task-specific, and there is no overall pattern.

![Figure 13:](http://medrxiv.org/https://www.medrxiv.org/content/medrxiv/early/2024/06/27/2024.06.24.24309441/F13.medium.gif)

[Figure 13:](http://medrxiv.org/content/early/2024/06/27/2024.06.24.24309441/F13) Comparative evaluation of zero-shot learning versus few-shot learning for each clinical classification task with PEFT.
Note: ‘-PEFT’ denotes without PEFT, ‘+PEFT’ denotes with PEFT.

### 3.5 Results of testing Hypothesis 5: Fine-tuning for one clinical task impacts model performance across other clinical tasks

To evaluate Hypothesis 5, we undertake the following comparisons: (1) the performance of a clinical task-specialised PEFT model with zero-shot learning and its impact across other clinical tasks with zero-shot learning; (2) the performance of a clinical task-specialised PEFT model with few-shot learning and its impact across other clinical tasks with few-shot learning.

#### 3.5.1 Comparing the performance of a clinical task-specialised PEFT model with zero-shot learning and its impact on other clinical tasks trained via zero-shot learning

No significant difference is found in accuracy, precision, recall, and F1 score for the group of other clinical tasks between zero-shot learning with and without PEFT on one clinical task (see Table 6, p > 0.05).

[Table 6:](http://medrxiv.org/content/early/2024/06/27/2024.06.24.24309441/T6) Comparing the impact of zero-shot learning on one clinical task with or without PEFT on other clinical tasks trained via zero-shot learning.

No significant difference is found in accuracy, precision, recall, and F1 score for each of the other individual clinical tasks between zero-shot learning with and without PEFT on one clinical task (Table 7, p > 0.05).

[Table 7:](http://medrxiv.org/content/early/2024/06/27/2024.06.24.24309441/T7) Comparing the impact of zero-shot learning on one clinical task with or without PEFT on each other clinical task trained via zero-shot learning.
#### 3.5.2 Comparing a clinical task-specialised PEFT model performance with few-shot learning and its impact on other clinical tasks trained via few-shot learning

No significant difference is found in accuracy, precision, recall, and F1 score for the group of the other three clinical tasks between few-shot learning with and without PEFT on one clinical task (see Table 8, p > 0.05).

[Table 8:](http://medrxiv.org/content/early/2024/06/27/2024.06.24.24309441/T8) Comparing the impact of few-shot learning on one clinical task with or without PEFT on a group of other clinical tasks trained via few-shot learning.

No significant difference is found in accuracy, precision, recall, and F1 score for each of the other individual clinical tasks between few-shot learning with and without PEFT on one clinical task (Table 9, p > 0.05).

[Table 9:](http://medrxiv.org/content/early/2024/06/27/2024.06.24.24309441/T9) Comparing the impact of few-shot learning on one clinical task with or without PEFT on each other clinical task trained via few-shot learning.

## 4 Discussion

This study explores the impact of zero-shot and few-shot prompt learning strategies, both with and without PEFT, on multi-label classification across four clinical tasks: agitation in dementia, depression in dementia, frailty index and malnutrition risk factors. To achieve this, five research hypotheses were formulated, and experimental designs were implemented to test them rigorously. Three of the five hypotheses are confirmed and two are rejected.

Our findings do not support Hypothesis 1, which proposed that zero-shot and few-shot learning with similar prompting templates have different levels of performance when applied to multi-label classification for different clinical tasks. Our study consistently found no statistically significant difference among the four clinical tasks.
We also observe that two clinical tasks, agitation in dementia and malnutrition risk factors, achieved slightly higher accuracy, precision, recall and F1 score than the other two, frailty index and depression in dementia, in all tests. This slight difference may be attributable to Llama 2 having seen more training data relevant to the former two clinical tasks than to the latter two.

Our finding supports Hypothesis 2, which proposes that few-shot learning performs better than zero-shot learning for the multi-label classification of the same clinical task without PEFT. Few-shot domain adaptation reduces false positives and false negatives while increasing true positives in classification tasks. These results imply that exposing an LLM to initial information from the target domain can significantly improve the model’s performance in classification tasks.

Our finding supports Hypothesis 3, which posits that PEFT can improve both zero-shot and few-shot learning performance. The outcomes underscore the performance improvement gained from fine-tuning. When PEFT is applied within a specified domain, it reduces false positives and false negatives while increasing true positives in information extraction tasks unique to that domain. This strategy adapts the model’s parameters more effectively to the domain, thereby enhancing the model’s overall performance on the multi-label classification tasks within that domain.

Our finding supports Hypothesis 4, which proposes that zero-shot learning reaches the same level of performance as few-shot learning for the same clinical task after PEFT. Our study consistently indicates no performance difference between zero-shot and few-shot learning when the model has undergone PEFT.
Because the model has already been exposed to the domain through fine-tuning, the additional exposure provided by few-shot learning does not significantly affect its performance.

Our findings do not support Hypothesis 5, which proposes that fine-tuning for one clinical task impacts model performance across other clinical tasks. Instead, our study indicates that fine-tuning for a specific task does not significantly hinder the model’s performance when it is applied to another classification task. The rationale lies in the methodology of PEFT, which trains only a selective subset of the pre-trained model’s parameters: only the parameters most relevant to the new task are identified and updated during training. This suggests that the model can be effectively tailored to a specific clinical task without compromising its effectiveness in handling diverse tasks. This adaptability underscores the potential of the PEFT approach within the LLM for various clinical tasks. Another factor behind this result may be that the four clinical tasks are similar in nature.

This study has three notable limitations. Firstly, our study encompasses four multi-label clinical classification tasks. However, we recognise that these tasks may not represent a diverse spectrum of clinical scenarios. To address this, we intend to broaden our scope by incorporating additional clinical classification tasks into our study. Secondly, the study’s scope is limited to classification. Although we have examined four multi-label clinical classification tasks, we plan to expand our research in the future by incorporating other types of task, such as question answering, summarisation, and relation extraction. This expansion would offer a more comprehensive view of the LLM’s performance on EHR data. Thirdly, the limitation relates to the selection of evaluation metrics.
This research solely utilises accuracy, precision, recall, and F1 score as the primary evaluation metrics. In future studies, we will broaden our evaluation process to encompass metrics such as calibration, robustness, fairness, bias, toxicity, and efficiency [26]. Diversifying our evaluation criteria will provide a more comprehensive and nuanced assessment of the model’s performance across various dimensions. This will enhance the reliability and applicability of our findings in real-world EHR applications.

## 5 Conclusion

This study compares the performance of zero-shot and few-shot learning on multi-label clinical classification tasks and the impact of PEFT on their performance. Our findings indicate that the same prompting template (either zero-shot or few-shot) achieves the same level of performance when it is applied to different multi-label classification tasks. Few-shot learning outperforms zero-shot learning without PEFT in classification tasks, while zero-shot learning achieves the same performance level as few-shot learning in classification tasks with PEFT. Furthermore, the study reveals that PEFT notably enhances both zero-shot and few-shot learning performance. Our analysis demonstrates that fine-tuning LLMs for a particular clinical task does not significantly compromise the model’s performance when applied to other clinical tasks. These insights emphasise the adaptability and effectiveness of PEFT within LLMs for various clinical tasks.

## Supporting information

Supplemental Table [[supplements/309441_file02.docx]](pending:yes)

## Data Availability

Data is not available.

## Author Approval

All authors have reviewed and approved the final version of this manuscript.

## Author Contributions

Conception of idea: [Dinithi Vithanage], [Ping Yu]. Literature search and data analysis: [Dinithi Vithanage], [Ping Yu]. Critical review and revision: [Dinithi Vithanage], [Ping Yu], [Lei Wang], [Chao Deng].
Data acquisition: [Mengyang Yin]. Data annotation: [Mengyang Yin], [Mohammad Alkhalaf], [Zhenyua Zhang], [Yunshu Zhu], [Alan Christy Soewargo].

## Declarations

### Availability of Data and Material

Data is not available.

### Funding Statement

The authors did not receive support from any organization for the submitted work.

### Competing Interest

The authors have no conflicts of interest to declare.

### Consent for Publication

All authors have approved the manuscript and agree with its submission to Journal of Healthcare Informatics Research.

### Supplementary Material

The Supplementary Material for this article can be found from Supplementary Material.pdf.

### Statements and Declarations

This manuscript has not been published elsewhere and is not under consideration by another journal. All authors have approved the manuscript and agree with its submission to the Journal of Healthcare Informatics Research.

## Acknowledgements

This research is a part of a PhD project supported by the University of Wollongong, Australia and the University Grant Commission, Sri Lanka, and the authors would like to acknowledge and thank them.

* Received June 24, 2024.
* Revision received June 24, 2024.
* Accepted June 27, 2024.
* © 2024, Posted by Cold Spring Harbor Laboratory

This pre-print is available under a Creative Commons License (Attribution 4.0 International), CC BY 4.0, as described at [http://creativecommons.org/licenses/by/4.0/](http://creativecommons.org/licenses/by/4.0/)

## References

1. Jiang, L.Y., et al., Health system-scale language models are all-purpose prediction engines. Nature, 2023. 619(7969): p. 357–362.
2. Bhate, N.J., et al., Zero-shot Learning with Minimum Instruction to Extract Social Determinants and Family History from Clinical Notes using GPT Model. arXiv preprint arXiv:2309.05475, 2023.
3. Dhingra, L.S., et al., Cardiovascular Care Innovation through Data-Driven Discoveries in the Electronic Health Record. Am J Cardiol, 2023. 203: p. 136–148.
4. Ge, J., et al., A comparison of large language model versus manual chart review for extraction of data elements from the electronic health record. medRxiv, 2023.
5. Ji, B., VicunaNER: Zero/Few-shot Named Entity Recognition using Vicuna. arXiv preprint arXiv:2305.03253, 2023.
6. Yu, H., et al., Open, Closed, or Small Language Models for Text Classification? arXiv preprint arXiv:2308.10092, 2023.
7. Singhal, K., et al., Large language models encode clinical knowledge. Nature, 2023. 620(7972): p. 172–180.
8. Zakka, C., et al., Almanac: Retrieval-augmented language models for clinical medicine. Res Sq, 2023.
9. Lee, P., S. Bubeck, and J. Petro, Benefits, Limits, and Risks of GPT-4 as an AI Chatbot for Medicine. N Engl J Med, 2023. 388(13): p. 1233–1239.
10. Goel, A., et al., Llms accelerate annotation for medical information extraction. in Machine Learning for Health (ML4H). 2023. PMLR.
11. Wornow, M., et al., The shaky foundations of large language models and foundation models for electronic health records. NPJ Digit Med, 2023. 6(1): p. 135.
12. Yu, P., et al., Leveraging generative AI and large language models: A comprehensive roadmap for healthcare integration. Healthcare, 2023. 11, DOI: 10.3390/healthcare11202776.
13. Shah, M., Prompt engineering vs. fine tuning: Which approach is right for your enterprise generative AI strategy? 2023; Available from: [https://www.prophecy.io/blog/prompt-engineering-vs-fine-tuning-which-approach-is-right-for-your-enterprise-generative-ai-strategy](https://www.prophecy.io/blog/prompt-engineering-vs-fine-tuning-which-approach-is-right-for-your-enterprise-generative-ai-strategy).
14. Liu, P., et al., Pre-train, prompt, and predict: A systematic survey of prompting methods in natural language processing. ACM Computing Surveys, 2023. 55(9): p. 1–35.
15. Fu, H.Y., et al., Estimating large language model capabilities without labeled test data. arXiv preprint arXiv:2305.14802, 2023.
16. Brown, T., et al., Language models are few-shot learners. Advances in neural information processing systems, 2020. 33: p. 1877–1901.
17. Lee, Y., et al., Crafting in-context examples according to LMs’ parametric knowledge. arXiv preprint arXiv:2311.09579, 2023.
18. Williams, K., Building confidence in LLM outputs: Approaches to increase confidence in content generated by large language models. 2023.
19. Rubin, O., J. Herzig, and J. Berant, Learning to retrieve prompts for in-context learning. arXiv preprint arXiv:2112.08633, 2021.
20. Ding, N., et al., Parameter-efficient fine-tuning of large-scale pre-trained language models. Nature Machine Intelligence, 2023. 5(3): p. 220–235.
21. Nguyen, T.T., C. Wilson, and J. Dalins, Fine-tuning llama 2 large language models for detecting online sexual predatory chats and abusive texts. arXiv preprint arXiv:2308.14683, 2023.
22. Chai, C.P., Comparison of text preprocessing methods. Natural Language Engineering, 2023. 29(3): p. 509–553.
23. Abdallah, A., et al., Amurd: annotated multilingual receipts dataset for cross-lingual key information extraction and classification. arXiv preprint arXiv:2309.09800, 2023.
24. Breslow, N., A generalized Kruskal-Wallis test for comparing K samples subject to unequal patterns of censorship. Biometrika, 1970. 57(3): p. 579–594.
25. Nachar, N., The Mann-Whitney U: A test for assessing whether two independent samples come from the same distribution. Tutorials in quantitative Methods for Psychology, 2008. 4(1): p. 13–20.
26. Liang, P., et al., Holistic evaluation of language models. arXiv preprint arXiv:2211.09110, 2022.