Abstract
The rapid progress in artificial intelligence, machine learning, and natural language processing has led to the emergence of increasingly sophisticated large language models (LLMs) enabling their use in various applications, including medicine and healthcare. The study aimed to evaluate the performance of two LLMs: ChatGPT (based on GPT-3.5) and GPT-4, on the Medical Final Examination (MFE). The models were tested on three editions of the MFE from: Spring 2022, Autumn 2022, and Spring 2023. The accuracies of both models were compared and the relationships between the correctness of answers with the index of difficulty and discrimination power index were investigated. The study demonstrated that GPT-4 outperformed GPT-3.5 in all three examinations. GPT-4 was able to pass all versions of MFE with mean accuracy of 80.7%, and GPT-3.5 managed to pass 2 out of 3 versions of the exam with mean accuracy of 56.6%. Despite its performance, GPT-4 score was lower than the average score of a medical student. There was a significant positive and negative correlation between the correctness of the answers and the index of difficulty and discrimination power index, respectively, for both models in all three exams. These findings contribute to the growing body of literature on the utility of LLMs in medicine. They also suggest an increasing potential for the usage of LLMs in terms of medical education and decision-making support.
1. Introduction
The rapid advancements in artificial intelligence (AI), machine learning (ML) and natural language processing (NLP) methods have paved the way for the development of large language models (LLMs), which possess an unprecedented ability to understand and generate human-like texts. These models have demonstrated remarkable performance in various tasks, spanning from sentiment analysis, machine translation, to text summarization and question-answering [1], [2]. As a result, the potential application of LLMs in various domains, including medicine along with healthcare, is a topic of significant interest [3]. Recently, the AI topic has gained in even more general popularity thanks to the ChatGPT chatbot introduction for the public [4].
ChatGPT is a LLM developed by OpenAI and initially released on the 30th of November 2022 on the website https://chat.openai.com/. ChatGPT became the fastest-growing application in history, as it gained 1 million users in 5 days and 100 million users just after 2 months after the initial launch. The first release of the service was based on the 3.5 version of the generative pre-trained transformer (GPT) model. The model was trained using Reinforcement Learning from Human Feedback technique with Proximal Policy Optimization (PPO) [5]. The training procedure contained three steps: (1) supervised learning where the AI trainer indicated the desired response, (2) training a reward model based on the ranking of different outputs, and finally (3) optimizing the policy against the reward model using the PPO. On the 14th of March 2023, the newest version of the GPT model (GPT-4) was also released. Access to this model was restricted only to the premium users of the OpenAI chatbot (one can become a premium user only by taking out the subscription). GPT-3.5 and GPT-4 training data were cut off in September 2021, so those models were not exposed to the newest data. Both GPT-3.5 and GPT-4 performance was validated on the Massive Multitask Language Understanding (MMLU) test [6]. GPT-4 model outperformed other models not only in the English version but also after translation of the test to other languages (even those rarely used like Latvian, Welsh, or Swahili) [7].
In order to incorporate GPT-3.5/GPT-4 into a specific field it needs to be further validated in the field-specific tests. In medicine, the expertise of healthcare professionals is crucial in ensuring accurate diagnosis, effective treatment, and patients’ safety. To maintain a high standard of medical practice, rigorous assessment methods, such as different medical final examinations, are employed to evaluate the competency of medical graduates before they begin practicing independently. Such examinations cover a wide range of medical knowledge, including theoretical concepts, clinical reasoning, and practical skills, making it a suitable benchmark for evaluating the performance of LLMs in the medical domain [8], [9]. Results of validation analysis of GPT-3.5 on numerous medical examinations have been recently published. GPT-3.5 and GPT-4 were already validated on, to our knowledge, several national medical tests like the United States Medical Licensing Examination (USMLE) [9], Japanese [8], and Chinese National Medical Licensing Examinations [10], [11], on couple of medical benchmark databases like MedQA, PubMedQA, MedMCQA, and ABMOSS [12]–[14]. GPT-3.5 was also evaluated in terms of its usability in the decision-making process. Rao et al. reported that GPT-3.5 achieved over 88% accuracy by being validated using the questionnaire regarding the breast cancer screening procedure [15]. However, to the best of our knowledge, there have been no studies yet presenting the capabilities of GPT-3.5 and GPT-4 in terms of European-based medical final examinations.
In this paper, we hence aimed to investigate the utility of GPT-3.5 and GPT-4 in the context of Polish Medical Final Examination. By evaluating the LLMs’ performance on the examination and comparing it to real medical graduates’ results, we seek to better understand their potential as a tool for medical education and clinical decision support as well as an improvement of the GPT technology which comes with the newest version of the model.
2. Methodology
Polish Medical Final Examination (Lekarski Egzamin Końcowy, or LEK, in Polish), which is necessary to complete medical education under Polish law and to pass to apply for the license to practice medicine in Poland (and based on the Directive 2005/36/EC of the European Parliament also in European Union). The exam is a test comprising 200 questions with 5 options to choose and only a single correct answer. In order to pass a test, it is required to obtain at least 56% of correct answers [16]. As both models were trained on the data until September 2021 it was decided to evaluate their performance on 3 editions of the Polish Medical Final Examination – Spring 2022 (S22), Autumn 2022 (A22), and Spring 2023 (S23). All questions from the previous editions of the examination are available online, along with the average results of medical graduates, detailing overall results, results of graduates who took the exam for the first time, those who graduated in the last 2 years, and those who graduated more than 2 years ago [17]. Besides the content of the question, the correct answers and answer statistics like the index of difficulty (ID) and discrimination power index (DPI) were published. Those indexes were calculated according to the equations presented below [18]:
where n is the number of examinees in each of the extreme groups (27% of the participants with the best results and 27% with the worst results in the entire test), Ns - the number of correct answers to the analyzed task in the group with the best results, Ni - the number of correct answers for the analyzed task in the group with the worst results. The index of difficulty takes values from 0 to 1, where 0 means that the task is extremely difficult and 1 means that the task is extremely easy. The discrimination power index assumes values from -1 (for extremely badly discriminating tasks) to 1 (for extremely well discriminating tasks).
In the case of GPT-4, each question was taken as an input to the prompt of the models with all available answers A-E. Final answers from all prompts were stored in Appendix 1. In the case of GPT-3.5, an API provided by OpenAI was used in order to accelerate the process of obtaining answers, as a model gpt-3.5-turbo was used [19]. All responses were stored in the form of a CSV file and provided in Appendix 2. From each response, the final answer was obtained and saved to the Excel file. If the answer was ambiguous, then the given question was treated as not answered (in other words – incorrectly answered).
The accuracy of both models for each test was calculated by dividing the number of correct answers by the number of all questions, which had the correct answer provided. As some questions were invalidated due to inconsistency with the latest knowledge, there were no correct answers for these questions, thus the number of correct answers was divided by number smaller than 200. Questions which contained image were also excluded. Moreover, the Pearson correlation coefficient was calculated, and the Mann–Whitney U test was conducted to investigate the relationships between the correctness of the answers, the index of difficulty and the discrimination power index. The overall scores for each examination obtained by LLMs were also compared to the average score obtained by medical graduates who took the exam in the given editions. All questions were asked between the 29th of March and the 13th of April 2023 (ChatGPT March 23 version). The significance level was set at the level of 0.05. For the usage of API, calculations, statistical inference, and visualizations Python 3.9.13 was used.
3. Results
GPT-3.5 managed to pass 2 out of 3 versions while GPT-4 was able to pass all three versions of the exam. The detailed results obtained by both models are presented in Table 1 and visualized in Figure 1.
Number of correct answers of GPT-3.5 and GPT-4 for each of the undertaken examinations. In the brackets, the number of questions with the given answers and percentage accuracy is provided next to the exam version and the number of correct answers respectively.
Comparison of the performance of both models along with passing score and average medical graduate score for all three examinations.
There was a significant positive correlation between the correctness of the answers and the index of difficulty as well significant difference between the index value for correct and incorrect answers in the case of all three exams for both models. There was also a negative correlation and significant difference between the correctness of the answers and discrimination power index in the case of the A22 and S23 exams for both models. The results are presented in Table 2 and Table 3 for the index of difficulty and discrimination power index, respectively. The boxplots of the index values depending on the correctness of the answers were visualized in Figure 2 for the index of difficulty, and Figure 3 for the discrimination power index.
Results of the correlation analysis with Pearson correlation coefficient and obtained p-value given in the brackets along with p-value obtained from the Mann–Whitney U test comparing the values of the index of difficulty for correct and incorrect answers.
Results of the correlation analysis with Pearson correlation coefficient and obtained p-value given in the brackets along with p-value obtained from the Mann–Whitney U test comparing the values of the discrimination power index for correct and incorrect answers.
Boxplots of the index of difficulty for the correct and incorrect answers for all three versions of the examination.
Boxplots of the discrimination power index for the correct and incorrect answers for all three versions of the examination.
4. Discussion
GPT-4 consistently outperformed GPT-3.5 in terms of the number of correct answers and accuracy across three Polish Medical Final Examinations. Our main finding indicates a significant improvement in the scope of medical knowledge represented by the GPT-4 model compared to the previous version, GPT-3.5. For both versions of the model, there is a significant correlation between the accuracy of the answers given and the difficulty of medical issues, indicating still a lack of in-depth knowledge in this area. Additionally, a negative correlation and significant difference were found between the correctness of the answers and the discrimination power index for both models in the A22 and S23 exams. The low DPI indicates a high number of incorrect answers given by subjects who obtained high overall results, which might be caused by more depth knowledge and overcomplicated reasoning. Thus, this negative correlation might be a sign of the simplicity of the model’s reasoning or the ability to simplify tasks in terms of the medical questions. GPT-4 models scored slightly worse than the average of the medical students in the respective editions of the examination, which was equal to 84.8%, 84.5% and 83.0% for S22, A22 and S23 respectively. The newest version of GPT obtained better results than the average of students who graduated more than 2 years ago in terms of S22 and better than students who took the exam A22 as their first one. The average result of students who graduated less than 2 years before the examination was always higher than the results obtained by both LLMs.
Our results on the European-based medical final examination are line with other studies conducted on different tests and languages from North America and Asia, which indicated the improvement of the medical knowledge possessed by GPT LLMs alongside with the development of the consecutive versions. Kung et al. evaluated the performance of GPT-3.5 on the USMLE, where GPT-3.5 outperformed its predecessor (GPT-3) with a score near or passing the threshold of 60% accuracy, which is required to pass the exam [9]. Recently, GPT-4 model was also evaluated on USMLE by Nora et al. The newest version of the GPT model outperformed GPT-3.5 with the improvement of its accuracy by over 30 percentage points [12]. In this study, GPT-4 turned out to be superior compared to its previous version and Flan-PaLM 540B model [13] in evaluation on other medical benchmarks like MedQA, PubMedQA and MedMCQA. In the study performed by Gilson et al., GPT-3.5 was confronted with the commonly used ABMOSS medical question database and 120 free questions from the National Board of Medical Examiners (NBME) [14]. GPT-3.5 outperformed IntructGPT and GPT-3 models in terms of accuracy by at least 4.9% and 24%, respectively. As shown by Kasai et al., GPT-4 was also able to pass the Japanese Medical Licensing Examinations again outperforming GPT-3.5 [8]. This study also highlighted the relationship between the correctness of the answers given by the LLM and the difficulty of the questions, which was also reported in our results. Study performed by Mihalache et al. presented that GPT-3.5 performed the best in the general medicine questions, while obtained the worst results in the specialized questions [20]. Bhayana et al. demonstrated that GPT-3.5 exhibited superior performance on questions required low-level thinking comparing to those which require high-level thinking [21]. Moreover, model struggled with questions involving description of imaging findings, calculation and classification, and applying concepts. Recently, Google and DeepMind presented their LLM PaLM 2 and its medical domain-specific finetuned MedPaLM 2 [22], [23]. The performance of GPT-4 and MedPaLM 2 on USMLE, PubMedQA, MedMCQA and MMLU appears to be very similar, where both GPT-4 and MedPaLM 2 were superior to each other in an equal number of tests evaluated. In this comparison, it is worth noticing that GPT-4 is a general-purpose model and was not explicitly finetuned for the medical domain.
There may be several potential reasons for the imperfect performance and providing incorrect answers by the tested models. First of all, both models are general-purpose LLMs that are capable of answering questions from various fields and are not dedicated to medical applications. Secondly, these models were tested on exams in Polish, for which there is less training data than for other languages like English. The accuracy of GPT-4 on Massive Multitask Language Understanding (MMLU) benchmark was 3.4% lower for Polish compared to English [7]. Both problems can be addressed by fine-tuning the models, that is, further training them in terms of medical education and the Polish language. As was shown in other studies, a finetuning of LLMs can further increase the accuracy in terms of answering medical questions [24]–[26]. Currently, OpenAI does not provide finetuning options for GPT-3.5 and GPT-4, but in the future, when this feature becomes available, it is also planned to explore the capabilities of those models after the finetuning on a medical dataset.
We believe that appearance of such powerful tools might have a significant impact on the shape of the medicine of tomorrow. Accurate and validated LLMs with broad medical knowledge can be beneficial for medical students in terms of self-learning e.g., by generating tailored learning materials, improving physician-patient communication by simulating conversations and clinical reasoning by providing step-by-step explanations of medical cases [27]. This influence will not be restricted to the education, but also it might be useful in terms of taking a medical note from a transcript, summarization of test results or decision-making support [4], [27]–[31]. Also, recent study has shown that chatbot responses were prefer over the physician responses on a social media forum, which shows that AI may significantly improve the quality of medical assistance provided online [32]. However, it is also extremely important to check authenticity of the responses generated by GPT model, as it might “hallucinate”, especially regarding provided references [33]–[35]. Alongside other researchers, we believe that LLMs although they need to be approached with caution, are not a threat to physicians, but can be a valuable tool and will be used more widely in the near future [4], [36], [37]. As of now, it is necessary to remember, that still a human should be at the end of the processing chain.
While the results of this study demonstrated the potential utility of AI language models in the medical field, several limitations should be acknowledged. First of all, the study focused solely on the Polish Final Medical Examination, which may limit the generalizability of the findings to other medical examinations or languages. What is more, PFME is an A-E test, which means that in some cases the correct answers could be by chance not as the result of the knowledge possessed by the models. Moreover, although GPT-4 outperformed GPT-3.5, the overall accuracy of both models was still suboptimal and worse than the average for medical students. This emphasizes the need for further improvements in LLMs before they can be reliably deployed in medical settings e.g., for self-learning or decision-making support.
In conclusion, this study highlights the advances in AI language models’ performance on medical examinations, with GPT-4 demonstrating superior performance compared to GPT-3.5. However, there is still considerable room for improvement in their overall accuracy. Future research should focus on finetuning of those models and exploring their potential applications in various medical fields, such as diagnostic assistance, clinical decision support, and medical education. Further tests of LLMs could also include more open questions with evaluation by physicians without prior knowledge of the origins of the answers (if it was created by LLM or a human being).
5. Declaration of competing interest
The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.