PT - JOURNAL ARTICLE
AU - Mudrik, Aya
AU - Tsur, Abraham
AU - Nadkarni, Girish N
AU - Efros, Orly
AU - Glicksberg, Benjamin S
AU - Soffer, Shelly
AU - Klang, Eyal
TI - Leveraging Large Language Models in Gynecologic Oncology: A Systematic Review of Current Applications and Challenges
AID - 10.1101/2024.08.08.24311699
DP - 2024 Jan 01
TA - medRxiv
PG - 2024.08.08.24311699
4099 - http://medrxiv.org/content/early/2024/08/09/2024.08.08.24311699.short
4100 - http://medrxiv.org/content/early/2024/08/09/2024.08.08.24311699.full
AB - Rationale and Objectives: Over the past year, studies have been conducted to evaluate the performance of Large Language Models (LLMs), such as ChatGPT, in the field of gynecologic oncology. This review aims to analyze the applications and risks associated with using LLMs in this specialized field. Materials and Methods: This systematic review was performed in adherence to the Preferred Reporting Items for Systematic Reviews and Meta-Analyses (PRISMA) guidelines, incorporating elements from the diagnostic test accuracy extension and the CHARMS checklist for reviews of prediction models. A systematic literature search was executed on July 17, 2024, across the PubMed, Web of Science, and Scopus databases. We focused on identifying original research that integrates LLMs with gynecologic oncology. We assessed the risk of bias using the adapted QUADAS-2 criteria. Results: Our search identified eight studies that met our criteria, focusing on healthcare education, clinical practice, and medical code generation. These studies revealed variability in ChatGPT’s performance across different applications. It excelled in genetic testing and counseling, achieving a 97% accuracy rate. However, its performance in cervical cancer prevention was less robust, with an accuracy of 83%. While one study demonstrated ChatGPT’s high adherence to quality guidelines, another noted that established guidelines significantly outperformed ChatGPT’s outputs. 
Additionally, code generation using tools like Google Bard and RoBERTa has shown potential to improve accuracy in clinical predictions and quality assurance. For example, Natural Language Processing (NLP) assisted by RoBERTa (based on Google’s BERT model) has improved the prediction of residual disease in women with advanced epithelial ovarian cancer following cytoreductive surgery. Despite these advancements, challenges related to consistency, specificity, and personalization persist, underscoring the necessity for continuous enhancement of these technologies. Conclusion: LLMs demonstrate inconsistent performance in gynecologic oncology. These findings emphasize the need for continuous evaluation of these models before they are implemented clinically. Competing Interest Statement: The authors have declared no competing interest. Funding Statement: This study did not receive any funding. Author Declarations: I confirm all relevant ethical guidelines have been followed, and any necessary IRB and/or ethics committee approvals have been obtained. Yes. I confirm that all necessary patient/participant consent has been obtained and the appropriate institutional forms have been archived, and that any patient/participant/sample identifiers included were not known to anyone (e.g., hospital staff, patients or participants themselves) outside the research group so cannot be used to identify individuals. Yes. I understand that all clinical trials and any other prospective interventional studies must be registered with an ICMJE-approved registry, such as ClinicalTrials.gov. 
I confirm that any such study reported in the manuscript has been registered and the trial registration ID is provided (note: if posting a prospective study registered retrospectively, please provide a statement in the trial ID field explaining why the study was not registered in advance). Yes. I have followed all appropriate research reporting guidelines, such as any relevant EQUATOR Network research reporting checklist(s) and other pertinent material, if applicable. Yes. All data produced in the present work are contained in the manuscript.