Abstract
Purpose To evaluate the impact of a structured tutorial on the use of a large language model (LLM)-based search engine on radiology residents’ performance in LLM-assisted brain MRI differential diagnosis.
Materials & Methods In this retrospective study, nine radiology residents determined the three most likely differential diagnoses for three sets of ten brain MRI cases with a challenging yet definite diagnosis. Each set of cases was assessed 1) with the support of conventional internet search, 2) using an LLM-based search engine (© Perplexity AI) without prior training, or 3) with LLM assistance after a structured 10-minute tutorial on how to effectively use the tool for differential diagnosis. The tutorial content was based on the results of two studies on LLM-assisted radiological diagnosis and included a prompt template. Reader responses were rated using a binary and numeric scoring system. Reading times were tracked and confidence levels were recorded on a 5-point Likert scale. Binary and numeric scores were analyzed using chi-square tests and pairwise Mann-Whitney U tests each. Search engine logs were examined to quantify user interaction metrics, and to identify hallucinations and misinterpretations in LLM responses.
Results Radiology residents achieved the highest accuracy when employing the LLM-based search engine following the tutorial, indicating the correct diagnosis among the top three differential diagnoses in 62.5% of cases (55/88). This was followed by the LLM-assisted workflow before the tutorial (44.8%; 39/87) and the conventional internet search workflow (32.2%; 28/87). The LLM tutorial led to significantly higher performance (binary scores: p = 0.042, numeric scores: p = 0.016) and confidence (p = 0.006) but resulted in no relevant differences in reading times. Hallucinations were found in 5.1% of LLM queries.
Conclusion A structured 10-minute LLM tutorial increased performance and confidence levels in LLM-assisted brain MRI differential diagnosis among radiology residents.
Clinical Relevance Statement Our findings highlight the considerable benefits that even low-cost, low-effort educational interventions on LLMs can provide. Integrating LLM education in radiology training programs could augment practitioners’ capacity to harness AI technologies effectively.
Introduction
Large language models (LLMs) are advanced artificial intelligence (AI) systems capable of processing and generating human language. Trained on vast amounts of text data and based on an innovative transformer architecture, these models have demonstrated remarkable performance in various tasks across sectors (1).
With the rapid technological advancements of LLMs in recent years, numerous studies have explored applications of LLMs in radiological workflows. These include the definition of imaging protocols (2–4), performing differential diagnosis based on case presentations (5–9), error checking in radiology reports (10), generation of impressions in radiology reports (11,12), information extraction from free-text radiology reports (13–15) and more. Yet, despite the promising applications, the integration and adoption of LLMs in radiology is not without challenges. Data privacy concerns, bias and error propagation, lack of contextual understanding, and overreliance have been pointed out as relevant limitations of LLMs (1,16–19). Against this background, the critical role of educating healthcare professionals on the appropriate use and potential pitfalls of LLMs has been emphasized (20–22). This may include training in prompt engineering, which describes the strategic crafting of a textual instruction that serves as input to LLMs (22).
One area where insufficient human oversight of LLMs could lead to clinical errors is radiological differential diagnosis. An earlier study on human-LLM collaboration in brain MRI differential diagnosis found that inadequate formulation of prompts can result in misleading LLM outputs, and lacking critical validation of LLM responses can lead to incorrect conclusions (23).
However, whether and how radiology readers can be trained to more effectively apply LLMs in radiological differential diagnosis has not been investigated yet. This study therefore aimed to evaluate the impact of a structured LLM tutorial on the performance of radiology residents in brain MRI differential diagnosis.
Methods
Informed patient consent was waived by the Ethics Committee of the Technical University of Munich.
Study Sample
Thirty challenging brain MRI exams acquired between 01/01/2016 and 12/31/2023 were selected from the local Picture Archiving and Communication System (PACS) system and randomized into three sets (Figure 1). Included exams were deemed as sufficiently complex for use in radiological board certification exams by two board-certified neuroradiologists (DMH and BW) and contained an abnormal finding with a confirmed diagnosis (histopathologically or through independent agreement of at least two neuroradiologists). In each exam, one or more arrows marked the image finding in question. The included exams have been published previously (24). A case overview is provided in Supplement 1.
Nine radiology residents with less than six months of neuroradiology experience were recruited from the local departments of radiology and neuroradiology and randomized into three groups (Table 1). Informed consent was provided by all participants.
Study Design
Over the course of three sessions, each reader assessed three sets of ten brain MRI cases with varying workflows and provided up to three differential diagnoses for the annotated image findings, ranked by likelihood. Each case was reviewed only once per reader. To control for the confounding effects of case difficulty, each set of cases was assessed by the same number of readers with each workflow (Figure 1). For every case, demographic and condensed medical history was provided. Sessions took place between 01/03/2024 and 22/05/2024.
First, conventional internet research was conducted to support differential diagnosis, either using web-based search engines, e.g. Google Search, or directly accessing trusted websites (Conventional). Residents were instructed to behave as they would in a clinical routine setting, mimicking the current practice in clinical care. Second, readers utilized an LLM-based search engine (© Perplexity AI Inc., San Francisco, USA) but didn’t receive any training beforehand (LLM-Pre-Training). PerplexityAI had been chosen as LLM interface for its ability to access real-time web content and provide source citations. Search queries were powered by GPT-4-Turbo (Generative Pre-trained Transformer 4 Turbo) by OpenAI. Third, another subset of ten cases was evaluated with the assistance of PerplexityAI. This time, however, the session was preceded by a structured tutorial on how to effectively use the tool (LLM-Post-Training). Tutorial details are provided below. In both LLM-assisted workflows, participants were allowed to conduct additional internet search to validate LLM suggestions.
Reading times were recorded using a time tracking software (Toggl Track, © Toggl OÜ, Talinn, Estonia). Confidence levels were documented for each case on a 5-point Likert scale (1: not at all confident, 5: very confident). Following the second and third session, readers completed questionnaires to evaluate the experience with the LLM-assisted workflow.
LLM Tutorial
In a short tutorial of no more than 10 minutes, readers were given tips on how to effectively utilize the LLM-based search engine. The content of the tutorial was based on two earlier studies on the application of LLMs for brain MRI differential diagnosis. One study evaluated the contribution of varying multimodal input elements on the diagnostic performance of GPT4(V) and identified the textual description of radiological image findings as the key element (24). The other demonstrated superior accuracy of LLM-assisted differential diagnosis over a workflow supported by a conventional search engine, but also determined several pitfalls in human-LLM interaction (23). The full script of the tutorial is provided below:
“A detailed description of image findings is by far the most important factor for accurate LLM responses. The description should include details about location, contrast enhancement, morphology, size and more. An accurate description of the finding location is particularly critical, an inaccurate specification of the location can result in misleading suggestions.
Providing relevant information about the medical history can improve the accuracy of LLM responses. However, clinical information unrelated to the image finding might result in misleading LLM outputs. Therefore only clinical information deemed to be relevant for the image finding in question should be provided.
Uploading screenshots of key image findings can help improve LLM responses, although their effect is only marginal.
Use of connotative terminology can lead to bias and should be avoided (e.g. the term ‘juxtacortical’ is strongly associated with multiple sclerosis).
Instructions can be made regarding the extent (number of differential diagnoses mentioned) and format (bullet points, table) of the LLM output.“
Readers were further encouraged to use the following prompt template:
“You are a senior neuroradiologist. Below, you will find information regarding a brain MRI scan. Based on this information, identify the three most likely differential diagnoses, ranked by their likelihood. Present your findings in a table format with the following columns: ‘Rank’, ‘Differential Diagnosis’, and ‘Explanation’.
[Medical history] [Image description]”
Analysis
To ensure performance did not solely depend on prior knowledge but the quality of research, cases where the correct diagnosis could be determined confidently without further research were excluded.
Accuracy of differential diagnoses was determined using two different scoring systems, as described previously (23). The first method used a binary scoring system, where responses were labeled as “correct” if the correct diagnosis was included among the submitted differentials, and “incorrect” if it was not. The second approach assigned scores ranging from 0 to 3 based on the rank of the correct diagnosis within the response (0: correct diagnosis not included, 1: correct diagnosis ranked third, 2: correct diagnosis ranked second, 3: correct diagnosis ranked first). Cases where a correct but less granular response was indicated were rated in consensus (by SHK and SS). For binary scores, a chi-square test was initially applied across all groups, followed by pairwise chi-square tests. For numeric scores and confidence levels, a Kruskal-Wallis test was used to assess differences among all groups, with subsequent pairwise comparisons conducted using the Mann-Whitney U test. To control for false discovery rates, p-values were adjusted using the Benjamini-Hochberg procedure for both scores and confidence. Reading times were analyzed using an ANOVA test across all workflows, followed by pairwise t-tests. The significance level was set at p < 0.05. 5-point Likert-scale questionnaire results are reported using descriptive statistics.
Logs of the LLM-based search engine (Perplexity AI) were examined to quantify the number of queries and source references. Queries were categorized by query type (keyword-based vs instruction-based) and by content (differential diagnosis, radiographic features, sample images, anatomy, other). Sources were classified into journal articles and other online sources. The content of LLM responses were screened for incorrect or inconsistent information by two radiology residents (SHK and SS; 1.5 and 2.5 years of experience in reading brain MRI exams) and confirmed by a certified neuroradiologist (DH). As described previously (25), incorrect responses were classified into hallucinations (inconsistent with widely accepted radiological knowledge), misinterpretations (miscomprehending a question and giving contextually irrelevant replies), and clarifications (lacking comprehension of a prompt requiring its rephrasing). Radiographic features described in LLM responses were checked against reference articles of www.radiopaedia.org, which is a validated source of radiological knowledge.
Data curation, analysis and visualization were performed using Python (version 3.9.7).
Results
8 out of 270 responses (3.0%) were excluded from the analysis as readers were able to determine the correct diagnosis confidently without requiring further research. A sample LLM query and its results are shown in Figure 2. 12 out of 262 cases (4.6%) required a consensus decision because the reader provided a correct but less specific diagnosis (e.g. “encephalitis” was counted as correct in a case of limbic encephalitis).
Binary and Numeric Scores
Based on the binary scoring system, 62.5% (55/88) of responses in the LLM-Post-Training workflow were correct, compared to 44.8% (39/87) in the LLM-Pre-Training and 32.2% (28/87) in the Conventional group (Figure 3). An initial chi-square test across all groups indicated a significant overall difference (p < 0.001). Subsequent pairwise chi-square tests showed significant differences between LLM-Pre-Training and LLM-Post-Training (p = 0.042), as well as between LLM-Post-Training and Conventional (p < 0.001), but not between LLM-Pre-Training and Conventional (p = 0.119) (Table 2).
Comparison of numeric scores revealed a median score of 3 in the LLM-Post-Training group, compared to 0 for both the LLM-Pre-Training and Conventional groups (Figure 3). The Kruskal-Wallis test confirmed significant overall differences among the workflows (p < 0.001). Pairwise comparisons using the Mann-Whitney U test revealed significant differences between LLM-Pre-Training and LLM-Post-Training (p = 0.016) and between LLM-Post-Training and Conventional (p < 0.001), but not between LLM-Pre-Training and Conventional (p = 0.092) (Table 3).
Confidence
Median confidence ratings were highest in LLM-Post-Training (median = 4), followed by LLM-Pre-Training (median = 3) and Conventional (median = 3). The proportion of high or very high confidence ratings (4 or 5) was 18% for Conventional, 31% for LLM-Pre-Training, and 54% for LLM-Post-Training (Figure 4). The Kruskal-Wallis test showed a significant overall difference in confidence (H = 27.20, p < 0.001). Pairwise Mann-Whitney U tests revealed a highly significant difference between Conventional and LLM-Pre-Training (p = 0.006), LLM-Pre-Training and LLM-Post-Training (p = 0.006) as well as between Conventional and LLM-Post-Training (p < 0.001).
Reading Times
Mean reading times amounted to 07:43 min (Conventional), 08:59 min (LLM-Pre-Training) and 08:35 min (LLM-Post-Training) (Figure 5). An ANOVA test showed a statistically significant overall difference (p = 0.030). Pairwise t-tests showed significant differences between Conventional and LLM-Pre-Training (p = 0.013), but not between LLM-Pre-Training and LLM-Post-Training (p = 0.403) or between Conventional and LLM-Post-Training (p = 0.069).
Questionnaire Results
The proportion of readers satisfied or very satisfied with the LLM-assisted diagnostic workflow increased from 55.6% (5/9) to 88.9% (8/9) following the tutorial. 55.6% (5/9) of readers indicated they would consider using the tool in clinical practice before the training, compared to 88.9% (8/9) after the training. 88.9% (8/9) of readers found the tutorial helpful or very helpful.
Reader Feedback and Observations
Before the tutorial, most readers tended to use keyword-based queries rather than detailed instructions, similar to how conventional search engines operate. In general, the LLM tool was found particularly useful in generating an initial list of possible differential diagnoses to be evaluated through additional searches. In addition, the possibility to pose follow-up questions to initial query results was perceived as an advantage over conventional search engines. When using the provided prompt template, readers overwhelmingly appreciated the concise tabular format of the results which also included the rationale for the suggestion. However, some readers struggled to formulate accurate image descriptions, owing to their insufficient knowledge of neuroanatomy and brain MRI sequences.
LLM Response Evaluation
A total of 413 LLM queries in 169 patient cases were examined (2.44 queries per case). In 11 out of 180 cases, LLM logs were not available because the user did not perform any LLM queries or because logs could not be retrieved due to technical errors of PerplexityAI. 5.8% of queries included incorrect inputs, such as inaccurate descriptions of finding locations or imaging characteristics. 7.3% of queries included screenshots of MRI findings.
The proportion of instruction-based queries increased substantially from 35.2% to 95.4% after the tutorial, while keyword queries decreased inversely from 64.8% to 4.6%. Whereas the majority of queries prior to the training were classified as “Other” (58.4%), most queries following the tutorial were directed at relevant differential diagnoses (56.7%) and sample images of those (34.0%). Overall, 45.3% of sources indicated by PerplexityAI were peer-reviewed journal articles, with only minor differences between queries before and after the tutorial (49.2% and 40.6% each).
Hallucinations were observed in 5.1% of LLM queries (12.4% of cases; 20 responses in total). The LLM tutorial resulted in only a minor change in hallucination frequency (Pre-Training: 4.6%, Post-Training: 5.7%). 35.0% of hallucinations involved the misinterpretation of MRI screenshots provided as input or the LLM returning sample MRI images irrelevant to the context of the query. Hallucination details are provided in Supplement 2. Misinterpretations were found in 1.0% of queries (2.4% of cases), while clarifications occurred in 1.2% of queries (3.0% of cases).
Discussion
This study evaluated the impact of a structured 10-minute LLM tutorial on the performance of radiology residents in LLM-assisted brain MRI differential diagnosis. We found that readers displayed higher performance, confidence levels and overall satisfaction after completing the tutorial. Compared to differential diagnosis supported by conventional internet search, both LLM-assisted workflows resulted in better performance, although only the post-training workflow showed a statistically significant difference.
Analysis of reader-LLM interactions revealed that following the tutorial, almost all queries were phrased as specific instructions, whereas most queries before the training consisted of mere keywords, resembling conventional search engine queries. This observation is consistent with “Jakob’s Law” which is a well-known phenomenon in user experience (UX) stating that users prefer systems to behave like other familiar ones (26). Similar to hallucination rates described previously (25), we found statements inconsistent with widely accepted medical knowledge in 5.1% of LLM responses. Many of these involved incorrect interpretations of MRI screenshots provided as input, confirming earlier studies demonstrating low performance of current state-of-the art LLMs in diagnostic tasks based on radiological images (27–30). Interestingly, hallucinations were even found with PerplexityAI, which – unlike other Chatbots such as ChatGPT - combines LLMs with real-time information retrieval from the internet to support its answers with relevant sources (31). Additional research is needed to develop further safeguarding measures against potentially harmful effects of hallucinations in clinical LLM applications. Notably, in 5.8% of queries misleading responses were generated not because of undesired LLM behavior, but due to incorrect finding descriptions provided by readers, emphasizing the essential role of conventional radiological skills in effectively employing LLMs for diagnostic tasks.
Our findings indicate that even minor, low-cost educational interventions for LLMs can yield remarkable outcomes, and support the notion that courses focused on the practical application of AI should become a core part of medical curricula and training programs (32–34). Yet, given the novelty of the technology, validated educational content on the effective utilization of LLMs for specific clinical tasks is extremely scarce. The tutorial provided to the readers in this work was based on two previous studies specifically focusing on human-LLM collaboration and prompt engineering in brain MRI differential diagnosis (23,24). As the corpus of scientific evidence on LLM applications grows, medical societies should provide guidelines and courses on their appropriate use. Furthermore, platforms should be created to allow for healthcare professionals to exchange validated and effective prompts, similar to previous initiatives by radiological societies for sharing structured reporting templates (35–37).
Unlike diagnostic performance, reading times did not improve with either of the LLM-assisted workflows. As prior work on AI-based image analysis algorithms and structured reporting has illustrated (38,39), integration of technologies into the local IT infrastructure is critical for user acceptance and can boost efficiency. Vendors of radiology reporting solutions and Picture Archiving and Communication Systems (PACS) should explore ways to seamlessly embed LLM-based features supporting differential diagnosis and other tasks.
Limitations
The following limitations need to be acknowledged.
First, only radiology residents with very little neuroradiology experience were included. This study design ensured that most cases could not be solved by readers with prior knowledge alone and allowed to investigate the isolated effect of distinct web research workflows, but limits generalizability of the findings to more experienced readers. Yet, our observation that inexperienced readers showed performance improvements ensuing the LLM tutorial despite struggling to formulate image descriptions suggests that moderately experienced readers, who are proficient enough to create accurate finding descriptions but not yet skilled enough to conduct differential diagnoses without assistance, might benefit even more.
Second, the findings in question were presented with annotations to isolate readers’ classification performance, but this approach reduced the realism of the scenario. It is likely that in actual practice, some of the more subtle findings would have been missed, as remarked by several readers.
Third, this study employed only a single LLM (GPT-4 by OpenAI) accessed through a specialized search engine (PerplexityAI). Future studies should compare several closed-source and open-source LLMs with respect to their utility in supporting radiology readers in differential diagnosis, including ones fine-tuned with domain-specific training data.
In conclusion, a concise but structured 10-minute LLM tutorial increased performance and confidence levels in LLM-assisted brain MRI differential diagnosis among radiology residents. These findings highlight the considerable benefits that even low-cost, low-effort educational interventions on LLMs can provide.
Data Availability
All data produced in the present study are available upon reasonable request to the authors