Abstract
Anxiety, depression, and other mental health conditions are affecting millions of people worldwide each year. However, limited access to mental health professionals and the stigma surrounding mental illness often deter individuals from seeking help. Many areas, especially rural and underserved communities, face a significant shortage of mental health professionals, making it difficult for individuals to access timely support and treatment. Traditional therapy can be expensive, time-consuming, and intimidating, discouraging individuals from seeking care and delaying essential treatment. The goal of this project is to harness the power of large language models
In the medical domain, large language models (LLMs) have the potential to significantly enhance clinical practice by assisting with tasks such as diagnostic support, therapeutic interventions, and summarization. However, these models often generate inaccurate responses, or “hallucinations,” when faced with queries they cannot effectively handle, raising concerns in the medical community. To address the limitations of LLMs, Retrieval-Augmented Generation (RAG) was leveraged to enhance their performance. By integrating external knowledge sources such as ICD-10-CM guidelines and psychiatric diagnostic manuals, RAG enables LLMs to retrieve relevant information in real time to support their predictions. This study examines whether LLMs can understand and accurately predict mental health-related medical codes from clinical notes. These codes are crucial for clinical documentation and treatment planning. We tested several LLMs (e.g., GPT, LLaMA, Gemini-Pro) enhanced with reliable resources like ICD-10-CM guidelines to evaluate their ability to identify and understand mental health terms and ICD-10-CM codes in psychiatric clinical notes. Our findings reveal that current models lack a robust understanding of the meaning and nuances of these codes, limiting their reliability for mental health applications. This underscores the need for improved strategies to represent and integrate these complex alphanumeric codes within LLMs. Enhancing their capability to accurately process mental health terminologies would make LLMs more reliable and trustworthy tools for mental health professionals, ultimately supporting better care and outcomes for patients.
Introduction
Mental health is a global concern in public health today. According to the US National Institute of Mental Health (NIMH)1, more than one in five U.S. adults live with a mental illness (59.3 million in 2022; 23.1% of the U.S. adult population. A large part of managing mental health involves natural language, including the assessment of symptoms and signs of mental health disorders and interventions such as talk therapies2 The textual and conversational data from these interactions offer a valuable resource for improving research and practical approaches to mental health care. One way to prevent serious mental conditions is to identify the symptoms in earlier stages. Natural Language Processing (NLP), a field of computer science focused on enabling computers to process and interpret unstructured natural language data, has shown great potential in supporting mental health-related tasks. These applications include identifying specific mental health conditions3, developing emotion-support chatbots4, and assisting with interventions5. NLP leverages data from diverse sources, such as clinical records6 and social media platforms7,8. In mental healthcare, Large Language Models (LLMs), a recent breakthrough in NLP, have opened new possibilities for innovative mental healthcare9,10. LLMs have the potential to streamline processes by quickly analyzing patient data, summarizing therapy sessions, and assisting with complex diagnostic challenges, saving valuable time for both users and healthcare providers11. In recent studies, Psy-LLM12, Mental-LLM13, MentalLLaMA14, and MindWatch15, were used to generate predictions or counselling based on input prompts related to mental health conditions. This signifies a growing interest in leveraging generative LLMs and prompt learning for mental health-related tasks.
Clinical notes in the psychiatry domain contain a wide range of information that helps mental health professionals assess, diagnose, treat, and monitor patients with mental health conditions. These notes serve as a detailed record of patient encounters, providing essential insights into the patient’s mental state, history, and treatment progress. LLMs have a high potential to analyze psych clinical notes and identify diagnoses in an automated way. However, studies on LLMs for diagnosing psychiatric disorders via automatic coding remain limited. To the best of our knowledge, this is the first study that evaluates LLMs as an automated ICD code detection method using clinical notes from the psychiatry domain.
ICD coding often faces errors and inconsistencies due to variations in how coders interpret and apply the coding rules and criteria or the omission of relevant codes. To tackle these challenges, automatic ICD coding has gained prominence as a research focus in clinical medicine, utilizing machine learning and NLP algorithms20,21. Large language models (LLMs), such as GPT models, Llama, and Gemini, have been also explored for this task22,23. However, these studies highlight several challenges, including LLMs’ limited domain-specific knowledge and vocabulary, the complexity of multi-label and long-tail code spaces, and vulnerability to noisy or adversarial inputs22,24
This study evaluates the baseline coding performance of various LLMs, including GPT-4, Gemini-1.5-Flash, and LLaMa-70b Chat. It also investigates the effectiveness of different LLM configurations, including retrieval-augmented generation (RAG)25 and zero-shot prompting in detecting ICD-10 codes in the context of real-world clinical narratives or notes from the EHR. Using a systematic approach, the models were instructed to generate medical codes based on their descriptions. This assessment aims to identify areas for improvement in code-mapping accuracy and provide insights for future advancements.
II Methodology
A Data Preparation
We extracted Clinical Modification (ICD-10-CM) codes from the University of Illinois health system for mental health disorders and diseases. We have randomly chosen 50 clinical notes for our experiment. All sensitive information was removed to maintain privacy. Our data was stored on secure HIPAA-compliant servers. We used the GPT-4o model provided by Microsoft’s Azure OpenAI service (www.portal.azure.com), Gemini-1.5 Flash within a secured Google cloud environment and Llama3.2 in a locally secured server to generate AI-based predictions (the probability) for the diagnoses of mental health disorders. Clinical diagnoses made by qualified psychiatrists were treated as gold standards (target) for evaluation.
B Model Configurations
We evaluated several LLM configurations to compare their performance on ICD-10 code detection tasks:
Retrieval-Augmented Generation (RAG)25: It is a framework that combines the strengths of two core components in Natural Language Processing (NLP): retrieval and generation. In the context of psychiatry, where specialized and accurate knowledge is crucial, RAG leverages external knowledge sources like ICD-10-CM guidelines to generate contextually relevant and highly informed responses.
Zero-Shot26: Models were applied without prior training on ICD-10-specific tasks, relying on general language capabilities.
C Models
The following LLMs were selected based on their prominence and applicability:
LLAMA 3.227
Developed by Meta, Llama 3.2 is a collection of multilingual large language models designed to process both text and images. It includes models with 1 billion and 3 billion parameters optimized for on-device applications, as well as vision models with 11 billion and 90 billion parameters capable of handling multimodal inputs. This versatility makes Llama 3.2 suitable for various applications, from augmented reality to document analysis.
Gemini-1.5-Flash28
Released by Google DeepMind, Gemini 1.5 Flash is a lightweight, multimodal model optimized for speed and efficiency. It can process inputs such as audio, images, video, and text, generating text responses suitable for tasks like code generation, data extraction, and text editing. The model is designed for high-frequency tasks where performance and cost-efficiency are paramount
GPT-4o29
Developed by OpenAI, GPT-4 is a large multimodal model capable of processing both text and image inputs to generate text outputs. It excels in tasks requiring advanced reasoning, complex instruction, understanding, and creativity. GPT-4 has been integrated into various applications, including chatbots and content generation tools, demonstrating its versatility and robustness.
D Evaluation Metrics
Performance was measured using the following metrics:
Jaccard Similarity
Measures overlap between predicted and actual ICD-10 codes. The Jaccard coefficient is calculated by dividing the size of the intersection of two sets by the size of their union:
Macro Precision30
It is the proportion of true positive results (correct predictions) out of all the results that were predicted as positive. In this work, for multilabel classification, the precision was calculated independently for each label and then averaged across labels by treating all labels equally.
Macro Recall30
Recall is calculated independently for each label and then averaged across labels. where L is the total number of labels. This approach treats each label equally, regardless of its frequency in the dataset.
F1 Score30
The harmonic mean of Precision and Recall offers a balanced metric.
Hamming Loss30
Reflects the fraction of labels that are incorrectly predicted.
Completely Different
Refers to instances where the predicted set of labels has no overlap with the actual set of labels.
III Results
Zero-Shot Configuration
GPT-4o emerged as the top-performing model, achieving the highest scores across all key metrics, including Jaccard Similarity (0.674), Precision (0.793), Recall (0.717), and F1 Score (0.730), with the lowest Hamming Loss (0.032) and only 12 completely different outputs. This indicates that GPT-4o is robust even without additional context or task-specific training. Google demonstrated moderate performance, achieving Jaccard Similarity (0.573) and F1 Score (0.612), though it struggled with 26 completely different outputs and higher Hamming Loss (0.042) than GPT-4o. LLaMA showed the weakest performance, with the lowest scores for Jaccard Similarity (0.509), Precision (0.559), and F1 Score (0.536). Additionally, it produced the highest number of completely different outputs (39) and exhibited a higher Hamming Loss (0.043) compared to GPT-4o and Google.
RAG Configuration
GPT-4o maintained its lead under the RAG setup, with metrics showing slight improvements or stability: Jaccard Similarity (0.664), Precision (0.786), Recall (0.705), F1 Score (0.720), and Hamming Loss (0.032). The model also reduced its number of completely different outputs to 13, highlighting its effectiveness in leveraging additional context. Google exhibited noticeable improvement in the RAG configuration, with increased Jaccard Similarity (0.578), Precision (0.646), and F1 Score (0.624). However, it still produced 24 completely different outputs, reflecting some inconsistency in predictions. LLaMA showed limited improvement in the RAG configuration, with Jaccard Similarity (0.506), Precision (0.544), and F1 Score (0.531) remaining low. It also had the highest Hamming Loss (0.047) and the most completely different outputs (41), suggesting minimal benefit from the additional context provided by RAG. The model also reduced its number of completely different outputs to 13, highlighting its effectiveness in leveraging additional context.
Detection of the higher level of code
Clinical notes might not always contain enough detail to assign the full ICD-10 code accurately. Focusing on three digits ensures the evaluation reflects what the notes actually provide, avoiding over-reliance on assumptions. We have evaluated three models with zero-shot prompting techniques for a general category of diseases or conditions. Comparison results with fine-grained evaluation are shown in Table 2.
For a higher level of code detection, GPT-4o achieved the highest Jaccard Similarity (0.77), Precision (0.87), Recall (0.84), and F1 Score (0.83). It also showed the lowest number of Completely Different predictions (4). These results demonstrated the strongest ability to predict ICD codes, particularly in higher-level evaluations.
Human Evaluation
The evaluation of the model (GPT4o) against human-coded ICD diagnoses in mental health clinical notes revealed several discrepancies. While the GPT-4o model demonstrated the ability to pick up relevant codes based on text patterns and keywords in the assessment/plan, it frequently missed certain codes identified by clinicians. For example, the model often failed to capture multiple codes that clinicians identified, including recurrent depressive disorders (e.g., F33.2, F33.41), anxiety disorders (e.g., F41.9, F41.1), and substance use disorders (e.g., F11.90, F12.90, F10.21).In many cases, the model assigned the primary code correctly but failed to include additional codes that the doctor deemed relevant, potentially missing conditions. The model occasionally assigns codes for conditions that are part of the family history/initial diagnosis which the doctor disregarded at a later stage(e.g. F43.9 F41.1 F10.10 F17.200) from which the doctor only mentioned F41.1 Also, the model occasionally assigned incorrect codes based on partial matches in text, such as assigning F41.1 (Generalized Anxiety Disorder) for mentions of anxiety, even when the clinician specified a more precise diagnosis (e.g., F43.23 (Adjustment Disorder with Mixed Anxiety and Depressed Mood)).
IV Discussion
A Analysis of Results
We evaluated the performance of various Large Language Models (LLMs), including GPT-4o, Google, and LLaMA, for ICD coding tasks under two configurations: Zero-Shot and Retrieval-Augmented Generation (RAG). The evaluation highlights significant differences in the models’ capabilities for ICD coding tasks: GPT-4o consistently outperformed Google and LLaMA in both configurations, demonstrating superior precision, recall, and F1 scores, as well as the lowest Hamming Loss. Its ability to maintain high performance under both Zero-Shot and RAG conditions underscores its robustness and adaptability. Google displayed moderate performance, benefiting from the RAG setup but still falling short of GPT-4o’s reliability and consistency. LLaMA lagged in all metrics, with minimal gains from RAG, suggesting that it is not well-suited for medical coding tasks in its current state. The RAG configuration improved the performance of all models to varying degrees, indicating that integrating external knowledge sources can enhance LLMs’ ability to handle complex and domain-specific tasks. In addition, GPT-4o consistently outperformed Google and LLaMA across all metrics in both zero-shot and higher-level evaluations.
B Limitations
Due to the sensitive nature of the information and the cost associated with the computing environment, we limited our evaluation to 50 notes, which may impact the comprehensiveness of the assessment. In the RAG approach, we have only used one guideline, ICD-10-CM, which may have a limited impact on retrieval.
C Implications for Future Work
Future research should explore further customization of LLMs with domain-specific training data, particularly in psychiatry. Other reliable documents, such as APA guidelines, UMLS code mapping may improve the detection accuracy.
V Conclusion
This study demonstrates the potential of large language models for detecting ICD-10 codes in psychiatry notes, with GPT-4o (RAG) showing the highest overall performance across most metrics. The findings indicate that LLMs, especially when combined with retrieval-augmented techniques, could support automated ICD-10 coding in clinical practice. However, achieving exact matches remains challenging, and further refinement of multi-agent and RAG methods could lead to even greater accuracy. This work underscores the transformative potential of LLMs in healthcare, particularly for psychiatry applications where language complexity is high.
Data Availability
All data produced in the present study are available upon reasonable request to the authors
VII Ethical Considerations
All personal and sensitive information within the dataset has been anonymized to safeguard individuals’ identities and privacy. Data is stored securely and accessed only by authorized personnel. This study has been approved as exempt under the University of Illinois Chicago’s Institutional Review Board (IRB) protocol.
VI. Acknowledgements
The authors would like to thank Dr Smalheiser and two medical coder experts from the Psychiatry department at the University of Illinois Chicago for their invaluable feedback.