GPT-4 outperforms ChatGPT in answering non-English questions related to cirrhosis ================================================================================= * Yee Hui Yeo * Jamil S. Samaan * Wee Han Ng * Xiaoyan Ma * Peng-Sheng Ting * Min-Sun Kwak * Arturo Panduro * Blanca Lizaola-Mayo * Hirsh Trivedi * Aarshi Vipani * Walid Ayoub * Ju Dong Yang * Omer Liran * Brennan Spiegel * Alexander Kuo ## Abstract **Background and Objectives** Artificial intelligence is increasingly being employed in healthcare, raising concerns about the exacerbation of disparities. This study evaluates ChatGPT and GPT-4’s ability to comprehend and respond to cirrhosis-related questions in English, Korean, Mandarin, and Spanish, addressing language barriers that may impact patient care. **Methods** A set of 36 cirrhosis-related questions were translated into Korean, Mandarin, and Spanish and prompted to both ChatGPT and GPT-4 models. Non-English responses were graded by native-speaking hepatologists on accuracy and similarity to English responses. Chi-square tests were used to compare the proportions of grading between ChatGPT and GPT-4. **Results** GPT-4 showed a marked improvement in the proportion of comprehensive and correct answers compared to ChatGPT across all four languages (p<0.05). GPT-4 demonstrated enhanced accuracy and avoided erroneous responses evident in ChatGPT’s output. Significant improvement was observed in Mandarin and Korean subgroups, with a smaller quality gap between English and non-English responses in GPT-4 compared to ChatGPT. **Conclusions** GPT-4 exhibited significantly higher accuracy in English and non-English cirrhosis-related questions, highlighting its potential for more accurate and reliable language model applications in diverse linguistic contexts. These advancements have important implications for patients with language discordance, contributing to equalizing health literacy on a global scale. ## Introduction The use of Artificial intelligence (AI) is growing in the field of medicine with the advent of new and innovative tools that can refine various aspects of patient care, from diagnosis to long-term management. However, it is important to identify and prevent the exacerbation of systemic racial, ethnic, and sex disparities in healthcare with the utilization of AI tools in medicine.[1] ChatGPT is an innovative natural language processing tool that can comprehend complex inquiries and provides easy-to-understand conversational responses that are seemingly knowledgeable.[2] ChatGPT and GPT-4 were released in November 2022 and March 2023. Despite its recent release, there is a rapidly growing body of evidence showing its remarkable ability to answer clinically related questions.[3, 4, 5] Cirrhosis is a complex condition that constitutes 2.4% of worldwide deaths and requires meticulous care to prevent complications.[6, 7, 8] Language barriers could impact the quality of care, as patients with restricted English proficiency could experience barriers to receiving adequate medical attention. Patients who had language discordance with their healthcare provider experienced an increased likelihood of patient dissatisfaction and worse outcomes.[9] ChatGPT 4 has shown a significant improvement in cross-lingual comprehension and translation.[10] This has the potential to diminish the language barrier and address the racial disparity. In this study, we examined the models’ ability to understand and reply to cirrhosis-related questions in Korean, Mandarin, and Spanish while comparing its performance to English. We also compared the accuracy of responses between ChatGPT and GPT-4. ## Method Frequently asked patient questions about basic knowledge of cirrhosis were used from our previous article to ensure consistent evaluation of ChatGPT’s capability.[3] These questions were obtained from reputable healthcare organizations and patient support groups. A total of 36 questions in the domain of basic knowledge were included. Each question was translated into Korean, Mandarin, and Spanish. Questions in different languages were independently prompted to both ChatGPT and GPT-4 models to obtain a response. ### ChatGPT and GPT-4 ChatGPT is a natural language processing (NLP) model that has been designed as a variant of GPT-3.5 LLM (Large Language Model). It is trained on a vast dataset of information collected from various online sources up to September 2021, such as books, websites, and articles. ChatGPT is trained using Reinforcement Learning from Human Feedback (RLHF), which incorporates human feedback and correction to generate coherent and contextually appropriate responses [11]. This allows for responses to be concise, understandable, and well-formulated. Users can input any prompt into ChatGPT, which uses the information stored in its database to generate a response. GPT-4 is the successor of ChatGPT. Although exact technical specifications of GPT-4 have not been publicly disclosed, GPT-4 boasts superior performance over its predecessor, and displays advance reasoning capabilities. In 24 of 26 language versions tested, GPT-4 has far outscored ChatGPT in the Massive Multitask Language Understanding (MMLU) examination, which covers 57 different subjects [12]. GPT-4 is also able to better detect inappropriate prompts and regulate its answers. Being a multimodal LLM, it can receive both image and text prompts, whereas ChatGPT can only receive text prompts. ### Grading Non-English responses were collected and subjected to two methods of grading: 1) the accuracy of each response, 2) similarity to the English response. These responses were graded by hepatologists who are native speakers. The accuracy of each response was graded using the scale: 1) comprehensive, 2) correct but inadequate, 3) some were correct, while some were incorrect, and 4) completely incorrect. Grading was based on the American Association for the Study of Liver Diseases (AASLD) guidelines. Given that the reviewers for each language are different, we did not compare the proportions of the grading across the language directly. We, instead, performed comparisons between English and non-English responses by having the native speaker hepatologists (all were also proficient in English) assess and compare the accuracy. The comparison was performed using the scale 1) the English response has more accurate explanations than the non-English response, 2) the same level of accuracy, and 3) the English response has less accurate explanations than the non-English response. ### Statistical Analysis The proportion of responses in each grade for both grading methods was calculated. Chi-square test was applied to compare the accuracy of proportions of grading between ChatGPT and GPT-4. A p-value of >0.05 was considered statistically significant. Analysis was done with IBM SPSS Statistics (Version 22). ## Results The workflow of response data collection and assessment is summarized in **Figure 1**. We saw a marked improvement in the proportion of comprehensive and correct answers by GPT-4 compared to ChatGPT across all four languages (p<0.05 for each language) **(Table 1)**. Notably, the responses generated by GPT-4 demonstrated enhanced accuracy and avoided erroneous responses that were evident in ChatGPT’s output. For example, hepatic encephalopathy and hepatic steatosis were used interchangeably in Korean using ChatGPT, which rendered the Korean responses to be completely incorrect in several questions. The Spanish version provided an incorrect definition of heavy drinking, and the Mandarin version suggested the use of anti-coagulation medications to prevent variceal bleeding. However, these mistakes were not found in GPT-4’s responses. ![Figure 1.](http://medrxiv.org/https://www.medrxiv.org/content/medrxiv/early/2023/05/05/2023.05.04.23289482/F1.medium.gif) [Figure 1.](http://medrxiv.org/content/early/2023/05/05/2023.05.04.23289482/F1) Figure 1. Flowchart of data collection and analysis. View this table: [Table 1.](http://medrxiv.org/content/early/2023/05/05/2023.05.04.23289482/T1) Table 1. Grade of responses by GPT-3.5 and GPT-4 to questions related to cirrhosis in different languages. Furthermore, while the English responses consistently provided more comprehensive explanations than their non-English counterparts for both ChatGPT and GPT-4, significant improvement was observed in both Mandarin (0.018) and Korean (<0.001) subgroups (**Table 2**). For responses generated by ChatGPT, although many responses were factually correct, they were inadequate and very succinct, with less explanation, resulting in less abundant information compared to the English version. Compared to the responses generated by ChatGPT, a larger proportion of GPT-4’s responses in Mandarin and Korean were considered to possess comparable levels of accuracy. View this table: [Table 2.](http://medrxiv.org/content/early/2023/05/05/2023.05.04.23289482/T2) Table 2. Difference in the accuracy of GPT-3.5 and GPT-4 in response to English vs. non-English cirrhosis-related questions. ## Discussion In this study, GPT-4 demonstrated a significantly higher accuracy in both English and non-English questions related to cirrhosis. We also showed a smaller quality gap between the English and non-English responses in GPT-4 compared to ChatGPT, as there was a significant increase in the proportions of responses with a similar level of accuracy between English and Mandarin, and Korean. The negative impact of language barriers on healthcare outcomes has been previously described. A meta-analysis of 14 studies which included 300,918 participants, revealed that language barriers in healthcare might lead to miscommunication between healthcare providers and patients, leading to reduced satisfaction for both parties, as well as a decline in the quality of healthcare delivery and patient safety.[9] These disparities have also been demonstrated among patients admitted to the hospital where patients with limited English proficiency experienced higher rates of adverse events compared to patients proficient in English.[13] A qualitative study of immigrants with limited proficiency in English highlighted concerns among patients, specifically among those with chronic diseases which require the availability of after-visit resources to aid in education and compliance with medical directions.[14] Furthermore, studies examining online Spanish medical information found sources to either be non-readable with higher than recommended reading levels or of low quality.[15, 16, 17] While we do not advocate for the use of natural language models in patient education outside of the care of a licensed healthcare professional, this rapidly evolving technology may serve as an easy-to-access and highly comprehensible adjunct information source for patients. Understanding the mechanism by which language learning models can comprehend and respond in multiple languages is important in assessing their performance but also their limitations. Cross-lingual machine reading comprehension can be achieved through multilingual pre-training, which is a technique utilized by language models such as Bidirectional Encoder Representations from Transformers (BERT) and Generative Pre-trained Transformer (GPT).[18, 19] These methods allow the language model to understand words and phrases in different languages by learning in a unified semantic space, facilitating the transfer of knowledge across languages. GPT-4 outperformed ChatGPT in responding to non-English questions related to cirrhosis. This could be explained by the fact that GPT-4 was trained using a larger pre-trained dataset with more examples of text in various languages.[20] Regarding pre-processing of data, GPT-4 has improved tokenization and handling of unique characters found in different languages. Moreover, GPT-4 is equipped with enhanced training techniques, including transfer learning, zero-shot or few-shot learning (as explained above), multilingual data sampling, and cross-lingual pre-training, allowing it to comprehend and answer non-English questions more effective than ChatGPT. In conclusion, we demonstrated a significant improvement in GPT-4’s ability to comprehend and accurately respond to both English and non-English cirrhosis-related questions. The advancements of GPT-4 over its predecessor highlight the potential for more accurate and reliable language model applications in diverse linguistic contexts. These advancements have important implications for patients who have language discordance with their healthcare providers and will contribute to equalizing health literacy on a global scale by delivering accurate and comprehensive explanations. ## Data Availability All data produced are available online using ChatGPT ## Acknowledgments GPT-4 was used to paraphrase the part of the manuscript. ## Footnotes * * The 2 authors share co-first authorship * **Conflict of Interest Statement:** None declared. * **Funding/Support:** None * **Ethics Approval:** Since all responses from OpenAI were publicly available, approval from the institutional review board was not sought. * Received May 4, 2023. * Revision received May 4, 2023. * Accepted May 5, 2023. * © 2023, Posted by Cold Spring Harbor Laboratory This pre-print is available under a Creative Commons License (Attribution-NonCommercial-NoDerivs 4.0 International), CC BY-NC-ND 4.0, as described at [http://creativecommons.org/licenses/by-nc-nd/4.0/](http://creativecommons.org/licenses/by-nc-nd/4.0/) ## Reference 1. Uche-Anya E, Anyane-Yeboa A, Berzin TM, Ghassemi M, May FP. Artificial intelligence in gastroenterology and hepatology: how to advance clinical practice while ensuring health equity. Gut 2022;71:1909–15. [Abstract/FREE Full Text](http://medrxiv.org/lookup/ijlink/YTozOntzOjQ6InBhdGgiO3M6MTQ6Ii9sb29rdXAvaWpsaW5rIjtzOjU6InF1ZXJ5IjthOjQ6e3M6ODoibGlua1R5cGUiO3M6NDoiQUJTVCI7czoxMToiam91cm5hbENvZGUiO3M6NjoiZ3V0am5sIjtzOjU6InJlc2lkIjtzOjk6IjcxLzkvMTkwOSI7czo0OiJhdG9tIjtzOjUwOiIvbWVkcnhpdi9lYXJseS8yMDIzLzA1LzA1LzIwMjMuMDUuMDQuMjMyODk0ODIuYXRvbSI7fXM6ODoiZnJhZ21lbnQiO3M6MDoiIjt9) 2. openai. ChatGPT: Optimizing Language Models for Dialogue. 2023. 3. Yeo YH, Samaan JS, Ng WH, Ting PS, Trivedi H, Vipani A, et al. Assessing the performance of ChatGPT in answering questions regarding cirrhosis and hepatocellular carcinoma. Clin Mol Hepatol 2023. 4. Potapenko I, Boberg-Ans LC, Stormly Hansen M, Klefter ON, van Dijk EHC, Subhi Y. Artificial intelligence-based chatbot patient information on common retinal diseases using ChatGPT. Acta Ophthalmol 2023. 5. Sarraju A, Bruemmer D, Van Iterson E, Cho L, Rodriguez F, Laffin L. Appropriateness of Cardiovascular Disease Prevention Recommendations Obtained From a Popular Online Chat-Based Artificial Intelligence Model. JAMA 2023;329:842–4. 6. Collaborators GBDC. The global, regional, and national burden of cirrhosis by cause in 195 countries and territories, 1990-2017: a systematic analysis for the Global Burden of Disease Study 2017. Lancet Gastroenterol Hepatol 2020;5:245–66. 7. Lesmana CRA, Raharjo M, Gani RA. Managing liver cirrhotic complications: Overview of esophageal and gastric varices. Clin Mol Hepatol 2020;26:444–60. 8. Gines P, Krag A, Abraldes JG, Sola E, Fabrellas N, Kamath PS. Liver cirrhosis. Lancet 2021;398:1359–76. [CrossRef](http://medrxiv.org/lookup/external-ref?access_num=10.1016/S0140-6736(21)01374-X&link_type=DOI) [PubMed](http://medrxiv.org/lookup/external-ref?access_num=http://www.n&link_type=MED&atom=%2Fmedrxiv%2Fearly%2F2023%2F05%2F05%2F2023.05.04.23289482.atom) 9. Al Shamsi H, Almutairi AG, Al Mashrafi S, Al Kalbani T. Implications of Language Barriers for Healthcare: A Systematic Review. Oman Med J 2020;35:e122. [CrossRef](http://medrxiv.org/lookup/external-ref?access_num=10.5001/omj.2020.40&link_type=DOI) [PubMed](http://medrxiv.org/lookup/external-ref?access_num=http://www.n&link_type=MED&atom=%2Fmedrxiv%2Fearly%2F2023%2F05%2F05%2F2023.05.04.23289482.atom) 10. Wenxiang Jiao WW, Jen-tse Huang, Xing Wang, Zhaopeng Tu. Is ChatGPT A Good Translator? Yes With GPT-4 As The Engine. arXiv 2023. 11. Griffith S, Subramanian K, Scholz J, Isbell CL, Thomaz AL. Policy shaping: Integrating human feedback with reinforcement learning. Advances in neural information processing systems 2013;26. 12. OpenAI. GPT-4 Technical Report. 2023. 13. Divi C, Koss RG, Schmaltz SP, Loeb JM. Language proficiency and adverse events in US hospitals: a pilot study. Int J Qual Health Care 2007;19:60–7. [CrossRef](http://medrxiv.org/lookup/external-ref?access_num=10.1093/intqhc/mzl069&link_type=DOI) [PubMed](http://medrxiv.org/lookup/external-ref?access_num=17277013&link_type=MED&atom=%2Fmedrxiv%2Fearly%2F2023%2F05%2F05%2F2023.05.04.23289482.atom) [Web of Science](http://medrxiv.org/lookup/external-ref?access_num=000246960800002&link_type=ISI) 14. Pandey M, Maina RG, Amoyaw J, Li Y, Kamrul R, Michaels CR, et al. Impacts of English language proficiency on healthcare access, use, and outcomes among immigrants: a qualitative study. BMC Health Serv Res 2021;21:741. 15. Nix E, Willgruber A, Rawls C, Kinealy BP, Zeitler D, Schuh M, et al. Readability and Quality of English and Spanish Online Health Information about Cochlear Implants. Otol Neurotol 2023;44:223–8. 16. Cardel MI, Chavez S, Bian J, Penaranda E, Miller DR, Huo T, et al. Accuracy of weight loss information in Spanish search engine results on the internet. Obesity (Silver Spring) 2016;24:2422–34. 17. Garland ME, Lukac D, Contreras P. A Brief Report: Comparative Evaluation of Online Spanish and English Content on Pancreatic Cancer Treatment. J Cancer Educ 2022. 18. Linting Xue NC, Adam Roberts, Mihir Kale, Rami Al-Rfou, Aditya Siddhant, Aditya Barua, Colin Raffel. mt5: A massively multilingual pre-trained text-to-text transformer. Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, NAACL-HLT 2021: Association for Computational Linguistics, 2021. 19. Alexis Conneau KK, Naman Goyal, Vishrav Chaudhary, Guillaume Wenzek, Francisco Guzmán, Edouard Grave, Myle Ott, Luke Zettlemoyer, Veselin Stoyanov. Unsupervised cross-lingual representation learning at scale. 58th Annual Meeting of the Association for Computational Linguistics, ACL 2020: Association for Computational Linguistics, 2020. 20. OpenAI. GPT-4 Technical Report. 2023.