Abstract
Purpose This study aimed to evaluate the potential of GPT-4, a large language model, in assisting radiologists to determine brain magnetic resonance imaging (MRI) protocols.
Materials and methods We used brain MRI protocols from a specific hospital, covering 20 diseases or examination purposes, excluding brain tumor protocols. GPT-4 was given system prompts to add one MRI sequence for the basic brain MRI protocol and disease names were input as user prompts. The model’s suggestions were evaluated by two radiologists with over 20 years of relevant experience. Suggestions were scored based on their alignment with the hospital’s protocol as follows: 0 for inappropriate, 1 for acceptable but nonmatching, and 2 for matching the protocol. The experiment was conducted in both Japanese and English to compare GPT-4’s performance in different languages.
Results GPT-4 scored 27/40 points in English and 28/40 points in Japanese. GPT-4 gave inappropriate suggestions for Moyamoya disease and neuromyelitis optica in both languages and cerebral infarction in Japanese. For the other protocols, the suggested sequences were either appropriate or better. The suggestions in English differed from those in Japanese for seven protocols.
Conclusion GPT-4 generally suggested suitable MRI sequences. The study demonstrates that GPT-4 can help radiologists to determine protocols; however, further research is required for the radiological application of LLMs.
Introduction
The imaging methods of MRI vary depending on the disease under investigation, and their determination is entrusted to radiologists. However, this task is often extensive and complex. Herein, we investigated whether large-scale language models (LLMs) could help support this decision-making process based on their medical knowledge. Additionally, discussing GPT-4’s outputs will help understand the features of LLMs.
Materials and methods
We used the brain MRI protocols that are employed at our hospital as a standard. These protocols include 20 disease names or examination purposes, including screening. However, this time, we did not include brain tumor protocols in this experiment because our hospital utilizes advanced and specialized imaging methods for brain tumors, such as navigation protocols, which are not suitable for evaluation as general sequences.
We used the latest model of Generative Pretrained Transformer 4 (GPT-4) (OpenAI Inc., San Francisco, CA, USA), namely GPT-4o (gpt-4o-2024-05-13) [1]. We also used an in-house application to make sure that the data sent would not be used for GPT’s training. The temperature parameter, which controls the diversity of the results, was set to 0.0 to obtain a deterministic output. Messages sent to GPT consist of system prompts and user prompts, and the main part of instructions to GPT were described in the system prompt as shown below. Only the disease name was sent as the user prompt. We conducted this experiment in both Japanese and English and obtained 40 protocols from GPT. Our brain MRI protocol includes an axial T1-weighted image (T1WI Ax), axial T2-weighted image (T2WI Ax), axial T2-weighted fluid-attenuated inversion recovery (FLAIR Ax), axial diffusion-weighted image (DWI Ax), and time-of-flight magnetic resonance angiography (TOF MRA) as the standard set, with the option to add one additional sequence if necessary. We also asked GPT-4 to suggest only one additional sequence.
System prompt for GPT-4 (Witten in both English and Japanese):
“This GPT will assist in determining the MRI sequences for brain imaging. The basic imaging methods are T1WI Ax, T2WI Ax, FLAIR Ax, DWI Ax, and TOF MRA, and these should be performed in all cases. For the given disease name, if there is an additional sequence that needs to be imaged, please suggest only one. Since we do not want to increase unnecessary tests, if the necessary sequence is included in the above, it is okay not to add any additional sequences. Do not include contrast-enhanced examinations. Do not include examinations of areas other than the brain.
In making the suggestion, please follow these steps:
Describe the MRI imaging features of the disease.
Based on those features, select one additional imaging method to be added.”
The protocols suggested by GPT were evaluated by two radiologists with over 20 years of relevant experience working together. The responses from GPT-4 were scored as follows: 0: Inappropriate, 1: Acceptable but does not match protocol used in our institute, 2: Matches our protocol.
Result
GPT-4 suggested additional sequences for 17 of the 20 disease names in English and 13 sequences in Japanese. The scores for the sequences proposed by GPT were 27/40 points in English and 28/40 points in Japanese. GPT-4 provided inappropriate suggestions for Moyamoya disease and neuromyelitis optica in both languages and for cerebral infarction in Japanese. For the other protocols, the suggested sequences were either appropriate or better. The list of suggested sequences is shown in the table. Since the responses from GPT-4 were long sentences that included features of imaging findings, the full text is provided as a supplement in an Excel sheet.
Here is a sample of GPT-4’s reply suggesting a sequence for cerebral aneurysm:
1. MRI Imaging Features of Cerebral Aneurysm
- Cerebral aneurysms are often identified as saccular outpouchings of blood vessels in the brain. They may appear as flow voids on standard MRI sequences due to the rapid flow of blood within the aneurysm. The surrounding brain tissue may appear normal unless there has been a rupture, in which case there may be evidence of subarachnoid hemorrhage.
2. Additional Imaging Method
- 3D TOF MRA (Time-of-Flight Magnetic Resonance Angiography): This sequence is particularly useful for visualizing blood vessels and detecting aneurysms. It provides high-resolution images of the cerebral vasculature and can help in assessing the size, shape, and location of the aneurysm. Since TOF MRA is already included in the basic imaging methods, no additional sequence is necessary.”
Discussion
GPT-4 generally proposed appropriate additional sequences and explained the imaging findings supporting these suggestions. Though it would be possible to construct prompts to align with the facility’s specific circumstances or to pre-specify the details of certain protocols with additional resources, we imposed minimal directions to evaluate GPT’s inherent knowledge and ability. In a preliminary experiment prior to this study, the protocols generated freely by GPT-4 were more diverse and the results were unstable. The most effective approach was to force GPT-4 to recall the disease’s imaging findings. This method significantly constrained the subsequent sequence suggestions, resulting in more stable outcomes. This approach is an example of prompt engineering known as the chain of thought [2]. Since we were able to see the rationales of image findings, this approach was also useful in validating GPT’s suggestions.
Due to the limited time available for MR examinations in out institution, we are only allowed to add one sequence to the standard protocol. This limitation further constrained GPT-4’s output, which may have led to more favorable outcomes.
According to GPT, TOF MRA is important for several protocols; however, it did not include this suggestion as it is already part of our basic protocol. This demonstrates that LLMs can handle complex decisions written in natural language.
However, GPT-4 made several inappropriate suggestions. For neuromyelitis optica, GPT-4 suggested coronal T2WI to observe optic nerve lesions. However, to accurately evaluate the optic nerves, it is preferable to suppress the signal from the orbital fat. Therefore, when imaging the orbits, the sequence should be T2WI with fat suppression or short inversion time inversion recovery (STIR)[3].
Moyamoya disease causes narrowing of the lumens of the internal carotid and middle cerebral arteries; however, it is distinguished from arteriosclerosis by arterial shrinkage [5-7]. In Japan, where Moyamoya disease is more common, the diagnostic criteria for MRI include arterial vessel shrinkage on heavy T2WI [8]. GPT did not seem to have learned this diagnostic criterion even in Japanese.
In addition, GPT-4 suggested an ADC map for cerebral infarction in Japanese. The ADC map is generated alongside DWI, so it requires no additional sequence.
GPT-4’s suggested “3D CISS or 3D FIESTA” and “3D T2WI” for three protocols. The sequence actually used at our institute is spin-echo-based DRIVE (Philips Healthcare, Best, The Netherlands), which has the advantage of having fewer susceptibility artifacts compared with gradient-echo-based sequences such as CISS (Siemens Healthcare, Erlangen, Germany) or FIESTA (Philips) [9]. Since all these sequences are heavily T2-weighted 3D MR cisternography, they were scored as correct suggestions.
Nevertheless, this experiment has some limitations. It only targeted head protocols excluding brain tumors and was limited to diseases that are commonly encountered at our hospital.
Additionally, the sequences used in our institute were used as the gold standard for scoring. There is a concern that LLM suggestions may depend on domain and language. The suggestions in English differed from those in Japanese for seven protocols. GPT-4 suggested SWI for six protocols in English, and T2*WI for two protocols along with SWI for two protocols in Japanese. Our actual protocol did not include SWI but included T2*WI for three protocols, which ameliorated the score in Japanese. SWI is superior to T2*WI in sensitivity to hemorrhage[10-12]. However, in Japan, T2*WI is often preferred over SWI because the depiction of normal vessels in SWI can sometimes obscure hemorrhage site detection. The training data for LLMs varies by language, causing their outputs to be language-dependent. This not only provides utility tailored to specific regions and domains but also introduces biases that hinder uniformity.
Currently, LLM use for clinical decision-making is restricted by both regulatory and ethical guidelines. Please note that this report is based on a preliminary experiment and is not intended for clinical application. Further research on the radiological application of LLMs is needed.
Conclusion
GPT-4 can suggest almost proper MRI sequences for each disease in addition to the standard brain MRI protocol.
The protocol names or target disease names, additional sequences used at our facility, and GPT’s suggestions in English and Japanese are presented. Some names are listed only as abbreviations. Since the responses from GPT-4 were long sentences that included the features of imaging findings, the full text is provided as a supplement.
Data Availability
All data obtained from GPT-4 are compiled in an Excel sheet containing data sheets in both English and Japanese and available as Supplement.
Statements and Declarations
Ethical considerations
This article does not contain any studies with human or animal participants, therefore ethics approval was not required.
Declaration of conflicting interests
The Authors declare that there is no conflict of interest.
Funding
The author received no financial support for the research, authorship, and/or publication of this article.
Data availability
All data obtained from GPT-4 in English are compiled in an Excel sheet containing data sheets and available as Supplement. Based on medRxiv’s policy, the results in Japanese are omitted. For those who wish to see the complete results, including those in Japanese, please contact the corresponding author.
Acknowledgments
None.