Summary
Background ChatGPT (Chat Generative Pre-trained Transformer) has initiated widespread conversation across various human sciences. We here performed a concise review combined with a SWOT (strengths, weaknesses, opportunities, threats) analysis of ChatGPT's potential in natural science, including medicine.
Methods This is a concise review of literature published in PUBMED from 01.12.2022 to 31.03.2023. The only search term used was “ChatGPT”. Publication metrics (author, journal, and subdisciplines thereof) as well as findings of the SWOT analysis are presented.
Findings Of 178 studies in total, 160 could be evaluated. The average impact factor was 4·423 (0–96·216), and the average publication speed was 16 days (0–83 days). Of all articles, there were 77 editorials, 43 essays, 21 studies, six reviews, six case reports, six news items, and one meta-analysis. Strengths of ChatGPT include well-formulated expression as well as the ability to formulate general contexts flawlessly and comprehensibly, whereas the time-limited scope as well as the need for correction by experts were identified as weaknesses and threats. Opportunities include assistance in formulating medical issues for non-native speakers as well as the chance to be involved in the development of such AI in a timely manner.
Interpretation Artificial intelligences such as ChatGPT will revolutionize more than just the medical publishing landscape. One of the biggest dangers in this is uncontrolled use, so we would do well to establish control and security measures at an early stage.
Evidence before this study Since its release in November 2022, only a few randomized controlled trials using ChatGPT have been published. To date, the majority of data stems from short notes or communications. Given the enormous interest (and also the potential for misuse), we conducted a PUBMED literature search to create the most comprehensive evidence base currently available. We searched PUBMED for publications in English or German including the term “ChatGPT” from 01.12.2022 until 31.03.2023. In order not to risk any bias of evidence, all related publications were screened initially.
Added value of this study This is the most concise review of ChatGPT to date. By means of a SWOT analysis, readers and researchers gain comprehensive insight into the strengths, weaknesses, opportunities and threats of ChatGPT, especially in the context of medical literature.
Implications of all the available evidence Our review may well serve as a starting point for further research on the topic in order to create more evidence as well as strict regulations and policies for dealing with ChatGPT.
Introduction
Chat Generative Pre-trained Transformer (ChatGPT) is a 175-billion-parameter natural language processing model that can generate conversation-style responses to user input. Since its launch by OpenAI, ChatGPT has stimulated widespread conversation and momentum across various human sciences, including medicine. It enables humans to interact and discuss with an artificial intelligence chatbot on a broad range of topics. It moved into the center of interest at the latest when it became known that ChatGPT was able to pass the United States Medical Licensing Examination, performing at a level comparable to a third-year medical student.1 Preliminary evidence exists that ChatGPT has promising applications across the clinical workflow.2 Moreover, ChatGPT and similar generative AI have specific applications within medical education, including clinical vignette generation and communications training with AI virtual patients;2–4 however, it also poses several challenges that universities need to carefully address.5 Tested on the verbal scale of the Wechsler Adult Intelligence Scale (WAIS), an intelligence quotient of 155 was demonstrated for ChatGPT.6 When asked, ChatGPT itself explains its role in medical literature in the following way: “As an AI language model, ChatGPT’s role in medical literature is to provide information and insights related to healthcare and medicine based on its knowledge of the subject matter. ChatGPT can help researchers and healthcare professionals by providing quick access to information on a wide range of medical topics, including symptoms, diagnoses, treatments, and outcomes. ChatGPT can also assist in generating summaries and abstracts of medical articles and papers, making it easier for healthcare professionals to keep up with the latest developments in their field. Additionally, ChatGPT can assist in developing natural language processing applications that can improve the efficiency and accuracy of medical data analysis and interpretation.
However, it’s important to note that ChatGPT is not a substitute for medical professionals and should not be used to make medical decisions. It is always recommended to consult with a licensed healthcare provider for medical advice and treatment.”
Although these functions are promising and tempting, the technology is still in its early stages, making the program vulnerable to misuse and abuse. Since its release, over 200 papers have addressed issues using the search term “ChatGPT”. The aim of this review was to analyze the role of ChatGPT in medical literature during the first four months after its release and to put its use into various contexts. In order to provide an overview, to find out more about its potential, and to create hypotheses for further research, we undertook this concise review on the role of ChatGPT, including a SWOT (strengths, weaknesses, opportunities and threats) analysis to define its potential, especially for medical literature.
Methods
Study design
Search strategy and selection criteria
References for this Review were identified through searches of PubMed with the search term “ChatGPT” from 01-12-2022 until 31-03-2023. Only fully retrievable papers published in English or German were reviewed. The final reference list was generated on the basis of originality and relevance to the broad scope of this Review.
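The search described above can be expressed as a reproducible query. The following is a minimal sketch, assuming the search is rerun through the NCBI E-utilities ESearch endpoint rather than the PubMed web interface (the text does not state which was used). The function name `build_pubmed_query` and the `retmax` value are illustrative assumptions, and the block only builds the URL without sending a request.

```python
from urllib.parse import urlencode

# Public NCBI E-utilities ESearch endpoint.
ESEARCH_URL = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"


def build_pubmed_query(term, start, end, retmax=500):
    """Return an ESearch URL restricted to a publication-date window.

    Hypothetical helper for illustration; dates use the YYYY/MM/DD format.
    """
    params = {
        "db": "pubmed",        # search the PubMed database
        "term": term,          # the single search term used in the review
        "datetype": "pdat",    # filter on publication date
        "mindate": start,
        "maxdate": end,
        "retmax": retmax,      # assumed upper bound on returned IDs
        "retmode": "json",
    }
    return f"{ESEARCH_URL}?{urlencode(params)}"


url = build_pubmed_query("ChatGPT", "2022/12/01", "2023/03/31")
print(url)
```

Language restrictions and the manual screening for retrievability are not captured by this query and would still need to be applied by hand. Hit counts may also differ from those reported here, because PubMed indexing changes over time.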
Data analysis
Only complete data sets published in English or German with respect to the above criteria were included in this study. Additional exclusion criteria included incomplete or non-retrievable data sets as well as articles written completely by but not about ChatGPT. All accessible publications were evaluated by the author team according to the following specifications.
Articles
Articles were primarily classified according to the specifications of PUBMED. For better comprehensibility, a “studies” category was created, defined as “a method of research in which a problem is identified, relevant data are gathered, a hypothesis is formulated, and the hypothesis is empirically tested”. All identified articles were scanned for “qualitative” (collection of text-based data, e.g. interviews, focus groups, usually hypothesis-generating) vs. “quantitative” (collection of number-based data, e.g. measurements, questionnaires with associated statistics, usually hypothesis-testing) content. We also chose to distinguish “mixed method research” (combination of qualitative and quantitative content) and “reviews and meta-analyses”. Furthermore, empirical data, based on a (proprietary) database, were distinguished from non-empirical data (including anything without a database). Article content was analyzed as to whether the article merely reported on the use of ChatGPT or was itself drafted partially or fully with ChatGPT. In this context, attention was also paid to the correlation between the share of ChatGPT in the preparation of the manuscript (not at all, partially, completely) and the achievable impact factor. In order to better compare the course of the number of published papers, an article count was displayed by week and compared to the weekly article release during the COVID-19 outbreak.
Journals
Journals publishing articles on ChatGPT were evaluated regarding title, discipline of natural science, actual impact factor, open access vs. traditional publishing and publishing speed (incl. preprint server).
Authors
To obtain more information about authors publishing on the topic, the number of first and last authorships other than the index publication was determined for the years 2020 through 2022. Additionally, the specialty, if identifiable via PUBMED or the full text (ORCID ID), was also reported.
SWOT analysis
During the screening of all evaluated articles, quotes on strengths, weaknesses, opportunities and threats mentioned within the articles were collected. Subsequently, the items identified were evaluated in a Delphi round, agreed upon and assigned in keyword form to one of the four components of the SWOT analysis.
Statistical analysis
The primary endpoint was the extent of data collection on the role of ChatGPT in medical literature, defined by author, article and journal type. Secondary endpoints included strengths, weaknesses, opportunities and threats of ChatGPT use in medical literature. All data were analyzed on a descriptive basis. Data are means ± SD unless otherwise stated. Statistical analysis was performed using Microsoft Excel for Office 365 (Version 16.16.27) and PSPP, Version 1.6.2 (https://www.gnu.org/software/pspp/). Student’s t-test, Levene’s test, and Mann-Whitney-U test were applied as appropriate. A p < 0.05 was considered to represent statistical significance.
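To illustrate one of the tests named above, the following is a minimal sketch of the Mann-Whitney U statistic in plain Python. It is not the authors' analysis, which was run in Excel and PSPP. The two impact-factor lists are hypothetical values invented for illustration, and the sketch computes only the U statistics, not the p-value.

```python
def rank_with_ties(values):
    """Return 1-based ranks; tied values share the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values equal to the one at position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def mann_whitney_u(group_a, group_b):
    """Return (U_a, U_b) for two independent samples."""
    n_a, n_b = len(group_a), len(group_b)
    ranks = rank_with_ties(list(group_a) + list(group_b))
    rank_sum_a = sum(ranks[:n_a])
    u_a = rank_sum_a - n_a * (n_a + 1) / 2
    u_b = n_a * n_b - u_a
    return u_a, u_b


# Hypothetical impact factors (NOT data from this study).
with_chatgpt = [2.1, 4.4, 5.1, 11.3]
without_chatgpt = [0.0, 3.4, 5.1, 6.8, 9.2]
print(mann_whitney_u(with_chatgpt, without_chatgpt))
```

A rank-based test is a natural choice for impact factors because their distribution is strongly skewed. In practice, a statistics package would also be used to obtain the p-value.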
Results
From 01.12.2022 until 31.03.2023, a total of 178 papers using the search term “ChatGPT” were published in PUBMED, of which six papers appeared in December 2022, sixteen in January, 68 in February and 88 in March 2023. After a thorough human review, eighteen papers had to be excluded: 11 papers because they were written with but not about ChatGPT, four papers because they were not retrievable as full text, and two papers because they were written in neither English nor German. One paper was just an erratum note. Figure 1 shows the PRISMA flow chart of ChatGPT-related publications.
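The screening arithmetic in this paragraph can be checked with a short sketch. All counts are taken from the text above; the dictionary keys and variable names are illustrative.

```python
# Papers identified per month (counts from the Results text).
identified_by_month = {
    "2022-12": 6,
    "2023-01": 16,
    "2023-02": 68,
    "2023-03": 88,
}

# Exclusions by reason (counts from the Results text).
exclusions = {
    "written with but not about ChatGPT": 11,
    "full text not retrievable": 4,
    "neither English nor German": 2,
    "erratum note": 1,
}

identified = sum(identified_by_month.values())
excluded = sum(exclusions.values())
evaluated = identified - excluded

print(identified, excluded, evaluated)  # prints: 178 18 160
```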
Articles
The vast majority of all articles were brief statements like editorials or letters to the editor (48·1%, n = 77). Essays or commentaries (26·9%, n = 43) represented the second largest portion of the articles. Studies (13·1%, n = 21), reviews, news, and case reports (each 3·8%, each n = 6), or meta-analyses (0·6%, n = 1) were less frequent. No randomized controlled trial could be identified. Figure 2 shows the distribution of article types according to the specifications of PUBMED. Of all articles, 80% (n = 128) contained non-empirical and 20% (n = 32) contained empirical data. Of these again, 6·9% (n = 11) were of qualitative, 8·8% (n = 14) were of quantitative, and 1·9% (n = 3) of mixed nature. Regarding the share of ChatGPT within the article, 11·9% (n = 19) of all articles were written at least partially with ChatGPT. The average impact factors are displayed in table 1.
In order to illustrate scientific interest in the topic, as measured by number of publications, figure 3 shows the comparison to the number of COVID-19 publications during the first 12 weeks in 2020.
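As a quick consistency check on the reported shares, each percentage can be recomputed from the article counts over the 160 evaluated papers. This is a minimal sketch; the category labels are shortened for illustration and the counts are those given in the text.

```python
# Article counts by type (from the Results text).
article_counts = {
    "editorials or letters": 77,
    "essays or commentaries": 43,
    "studies": 21,
    "reviews": 6,
    "news": 6,
    "case reports": 6,
    "meta-analysis": 1,
}

total = sum(article_counts.values())


def share(count, denominator):
    """Percentage of the denominator, rounded to one decimal place."""
    return round(100 * count / denominator, 1)


shares = {kind: share(n, total) for kind, n in article_counts.items()}

for kind, pct in shares.items():
    print(f"{kind}: {pct}% (n = {article_counts[kind]})")
print("total:", total)
```

Recomputing the shares in this way makes it easy to spot a percentage that does not match its count.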
Journals
Publishing journals showed a wide range of scientific disciplines. Figure 4a shows an overview of the specialty distribution of the journals. The current impact factor of the represented journals ranged from 0 to 96·216 with a median of 5·144 (IQR 3·352-11·325). Overall, 45·6% (n = 73) of all articles were published “traditionally”, in contrast to 54·4% (n = 87) that were published as “open access”. Of those two groups, 5% of “traditional” and 11% of “open access” publications were provided on preprint servers in advance. Data on publication speed were accessible for 33·1% of all evaluated articles. The average time to publication was 16 days, ranging from 4 to 83 days.
Authors
Authors had a median of five first authorships (IQR 1-12, range 0-94) and a median of one last authorship (IQR 0-6, range 0-61) in the years 2020-2022. Their areas of expertise spanned all medical specialties and extended to science journalism, bioinformatics, nursing as well as humanities, economics and law. Figure 4b gives an overview of the specialty distribution of first authors publishing on ChatGPT.
SWOT analysis
We were able to detect over 400 quotes in which information was provided on strengths, weaknesses, opportunities and threats. Of those, by far the most were related to weaknesses and the fewest to opportunities. Quotes on strengths and threats were mentioned less frequently than those on weaknesses. Among the most prevalently cited weaknesses were limited abilities,9,10 lack of accuracy/correctness,11,12 citation problems,13,14 and need for verification.15,16 Strengths, on the other hand, included reduced workload,11,17–19 data summarization,20 and high-quality results.12,21–25 Amongst the threats captured most frequently were plagiarism/hallucination, scientific misconduct and ethical concerns,22,26–28 whereas major opportunities were seen in supporting different faculties.21,29,30 Due to the variability in the mentions, we decided to use a semiqualitative analysis. Results, conclusions and suggestions can be found in Figure 5.
Discussion
This is the most comprehensive review of ChatGPT to date, summarizing all articles published in PUBMED since its introduction in November 2022 through end of March 2023. In addition to a whole series of metric results, ChatGPT is also critically reviewed in the context of a SWOT analysis. To the best of our knowledge, no similarly comprehensive study on the topic exists to date.
Concerning the article types, it is interesting to see that so far no randomized controlled trial has been published about ChatGPT, which on the one hand would certainly be difficult to accomplish, but on the other hand is urgently needed. The majority of articles were of a shorter nature (editorials, letters, features, essays or commentaries).
Journals from the ranks of clinical medicine have published the most articles on ChatGPT, followed by education and others. This resembles the results from authorship. Both aspects, i.e. authorship and journal, show the wide application potential for ChatGPT across many specialist areas, as would be expected from an LLM. When considering the impact factor of the journals, it is interesting to see that some articles were published in journals with no impact factor, although even highly reputable fundamental research journals such as Science or Nature as well as clinical journals such as BMJ or Lancet took up the topic. This shows how important ChatGPT is considered to be in science, education and clinical work. In the 160 evaluated papers (138 published in journals with an impact factor), there was no significant difference in impact factor between papers that were and were not written at least in part with ChatGPT.
Despite the extensive application possibilities of ChatGPT in many medical, but also non-medical fields, the publication frequency increased rather sluggishly during the observation period, which is somewhat contradictory to the rather spectacular successes of LLMs.1 Because ChatGPT is also an event of global significance, we deliberately chose the pandemic for comparison; almost certainly, however, the global health significance of the pandemic was a stronger trigger to address the issue, although no relevant difference was seen during the first four weeks. Interestingly, the publication speed, where ascertainable, of 16 days (4-83) was significantly faster than described in other studies on biomedical journals.31–33 Besides the spectrum of journals in which ChatGPT publications appear, the proportion of preprint and open access articles could also be decisive in this context. Online publishing has been identified to be strongly associated with reduced submission-to-publication time in multivariate analysis.33 As a limitation, it must be mentioned that data on publication speed were only available for about a third of all articles. However, in combination with the higher proportion of quantitative and non-empirical data, we assume all four factors (including open access and preprints) contributed to the fast publication times.
It is difficult to draw a portrait of the authors on this topic due to the distribution pattern as well as the frequency of publication. However, the majority seems to originate from fields of “clinical medicine”, which means working with real patients. Education, a frequently mentioned and predestined area of application for ChatGPT, was less present. It should be noted that many authors were not “inexperienced in publishing”, but certainly broke new ground with their publications on the topic. That it seems worthwhile to deal with ChatGPT is shown by the average impact factor that could be achieved with a publication on this topic, whereby it was apparently irrelevant whether the article was written with or without the help of ChatGPT. The median impact factor of 5·144 (with and without the help of ChatGPT) is in a range reached by only about 12·9% of journals in a comparison of 13,000 selected scientific journals in 27 major research categories.34 In addition, although it seems tempting to have at least parts of one’s manuscript created with the help of AI, only just under 12% (18/160) made use of it, or at least indicated they had. Clearly recognizing that AI was used for assistance is among the most frequently cited SWOT quotes, but more on that later. So far, however, the use of ChatGPT is not clearly superior or inferior in terms of the impact to be achieved.
Interestingly, ChatGPT was most likely to be identified as having weaknesses in our SWOT analysis, well ahead of strengths, which followed in second place. According to the distribution frequency of the SWOT citations, many authors seem to prefer describing weaknesses and strengths, but derive far fewer perspectives or ideas for the further handling or development of ChatGPT from their findings. Our assertion is supported by the fact that threats were cited only about two thirds as often, and opportunities only about half as often, as weaknesses. A SWOT analysis is originally defined as a “strategic planning and strategic management technique used to help a person or organization identify Strengths, Weaknesses, Opportunities, and Threats related to business competition or project planning”.35 Through a SWOT analysis, favorable and unfavorable internal and external factors for achieving the objectives of a venture are to be identified. Its advantages include usability and being a “tried-and-true tool” of strategic analysis; points of criticism include limitations like preoccupation leading to neglect, inconsistent compliance with the analysis and domination by certain team members.36–39 In order to overcome some of these shortcomings, quotes were analyzed in a modified Delphi process; furthermore, as we intended our SWOT analysis as a starting point for discussion, we considered it just the right tool for analyzing ChatGPT in its early stages and possibly giving some ideas on how to move on from here, particularly in a rapidly changing environment.
So far, ChatGPT has been used to write essays, pass exams, translate knowledge for various peer groups, or write comments on the most diverse topics. During this application it became clear that ChatGPT is apparently “knowledge limited” (until 2021) and that source information or even various facts can be fictitious, which can only be detected by people with appropriate expertise.
The publications on the topic so far contribute to the fact that, on the one hand, the AI can be improved accordingly, but on the other hand, fields have already been identified in which ChatGPT can presumably be applied safely. These applications include the summarization of large data sets and tasks that benefit from the quite high linguistic quality of the generated texts.
Overall, caution must be exercised when using ChatGPT, as in several cases sources have been freely invented (hallucination) or copied (plagiarism), and thus the accuracy of the texts created by ChatGPT must always be questioned.
Areas in which ChatGPT should certainly not be applied include, among others, the writing of scientific papers with references, the writing of CVs, or the writing of speeches. In all of these areas, it has already been shown that ChatGPT formulated at least partially, and sometimes completely, fictitious passages that did not stand up to review.
From our point of view after a concise review, ChatGPT currently has more the status of a toy to explore than of a reliable tool for scientific work. This does not have to be a bad thing at all, because even through a playful encounter, further strengths could be worked out and further weaknesses could be reported back to the programmers and then improved. In any case, however, monopolization is to be avoided, since this could, under certain circumstances, result in major disadvantages such as a ban on viewing source code or increasing commercialization with ethical-economic imbalances throughout the world. So far, over 30 alternatives to ChatGPT exist, including OpenAI playground, Jasper Chat, Bard or Bing AI.40 In an ideal environment, however, such large-scale software would best be open-source.
Another problem of major concern is the ability to detect scientific output by AI. Existing AI detector software like GPTZero (https://gptzero.me) or related products like GLTR, GPTKit, OpenAI or Output Detector are, for example, based on scanning for perplexity (rather lower in AI) and burstiness (likewise rather lower in AI).41 Their most obvious limitation is that texts are not analyzed for context, but only for writing patterns, which allows the AI to remain undetected. First data on an artificial intelligence output detector, a plagiarism detector, and blinded human reviewers show promising results: most generated abstracts were detected using the AI output detector, with an AUROC of 0·94.42 Still, further derivatives like GPT-3 or GPT-4 are already “waiting in the wings”. It only remains to hope that detection software not only keeps up with, but also overtakes, the rapid development, always in combination with an alert and suspicious human mind.
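The two signals mentioned above can be sketched in a few lines. This is a minimal illustration, assuming that per-token log-probabilities have already been obtained from some language model. Perplexity is computed in the standard way, as the exponential of the negative mean log-probability. Burstiness is taken here to be the spread of per-sentence perplexity, which is an assumption for illustration and not the documented formula of any particular detector. All input values are hypothetical.

```python
import math
from statistics import pstdev


def perplexity(token_log_probs):
    """Exponential of the negative mean natural-log token probability."""
    return math.exp(-sum(token_log_probs) / len(token_log_probs))


def burstiness(sentences_log_probs):
    """Population standard deviation of per-sentence perplexity.

    Assumed definition for illustration only.
    """
    return pstdev([perplexity(s) for s in sentences_log_probs])


# Hypothetical log-probabilities for two short texts (three sentences each).
uniform_text = [[math.log(0.5)] * 4, [math.log(0.5)] * 5, [math.log(0.5)] * 3]
varied_text = [[math.log(0.9)] * 4, [math.log(0.1)] * 5, [math.log(0.5)] * 3]

print(burstiness(uniform_text))  # identical sentence perplexities, no spread
print(burstiness(varied_text))   # sentence perplexities differ, spread > 0
```

Both quantities depend only on how predictable the wording is to the scoring model, not on whether the content is true, which is why such detectors cannot check a text for context.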
How is the phenomenon being dealt with worldwide?
ChatGPT is an AI specialized in written conversations, which makes its application imaginable in almost every area. Its potential has highlighted an absence of any concrete regulation. As of April 2023, ChatGPT is not available in China, nor in various countries with heavy internet censorship like North Korea, Iran and Russia. In China, it is not officially blocked, but OpenAI does not allow users there to sign up. Interestingly, several large tech companies in China are developing alternatives.43 Italy became the first western country to ban ChatGPT, and various other western governments like Germany, Great Britain or Canada are currently exploring how to regulate AI. The U.S. has not yet proposed any formal rules to bring oversight to AI technology.43
How is the phenomenon being dealt with in medicine?
During the review, the threat of “remaining undetected” and the associated lack of reproducibility were mentioned.30 Suggested solutions include the consistent mention of any involvement of ChatGPT,21 not as an author,44,45 but rather as an “acknowledgement”.46 This issue has lately been addressed by the World Association of Medical Editors (WAME), which clearly states that “AI cannot be an author” and comments on the responsibility of the human authors and on reproducibility.47 This is analogously also found in the criteria of the International Committee of Medical Journal Editors (ICMJE).48 Major publishers have begun to adopt those recommendations in their own policies.49 Other sources recommended the inclusion of AI output detectors in the editorial process and clear disclosure if these technologies are used.42
Artificial intelligence has always fired the human imagination, as can be seen from famous movies like Star Trek, Star Wars, Terminator or Aliens, always associated with a resonating, undefined fear that AI may (will?) “overtake” us one day, with potentially deleterious consequences. Despite these easily visualizable and seemingly apocalyptic dangers, one should not condemn the nearly unlimited and fascinating possibilities of artificial intelligence; rather, strict and clear regulation on many levels is necessary to fully leverage its potential. Maybe we should keep a low profile, since ChatGPT itself points out at least some of its weaknesses:
“As an artificial intelligence language model, I do not have a role in the discussion about ChatGPT in the medical literature. However, I can provide information and answer questions related to my capabilities and limitations as a language model, as well as share insights on how natural language processing technology is being applied in healthcare and medical research.”
Data Availability
All data produced in the present study are available upon reasonable request to the authors.
Ethics and reporting
All Authors declare that all relevant ethical guidelines have been followed, all necessary IRB and/or ethics committee approvals have been obtained, all necessary patient/participant consent has been obtained and the appropriate institutional forms archived.
Authors statement
All listed authors declare that all four criteria for authorship in the ICMJE Recommendations are met individually. All authors confirm that they had full access to all the data in the study and accept responsibility to submit for publication. All authors claim responsibility and accountability for the originality, accuracy, and integrity of the work. In particular, AI and AI-assisted technologies were used twice as examples and clearly marked as such in the manuscript.
Contributors
All listed authors contributed in the following ways: Literature search (DG, SN, CW, YR, LR, JE, FB, TS), creation of figures (DG, TS), study design (DG, SN, CW, YR, LR, JE, FB, TS), data collection (DG, SN, CW, YR, LR, JE, FB, TS), data analysis (DG, SN, CW, YR, LR, JE, FB, TS), data interpretation (DG, SN, CW, YR, LR, JE, FB, TS), writing (DG, SN, LR, JE, FB, TS).
Role of the funding source
Institutional funding only; no external funding was utilized for the preparation of the manuscript.
Declarations of interest
We declare no competing interests.
Data sharing
Without exception, all data collected for the study, including individual participant data and a data dictionary defining each field in the set, will be made available to others. Additional, related documents will be available with publication and will be made available upon reasonable request by email to the corresponding author.
Acknowledgement
The authors would like to thank Dr. rer. nat. Christian Burisch, state of North-Rhine Westphalia, for his help in statistics and the drafting of the manuscript.
Footnotes
Source of funding None.