PT  - JOURNAL ARTICLE
AU  - Hasheminasab, Seyed Alireza
AU  - Jamil, Faisal
AU  - Afzal, Muhammad Usman
AU  - Khan, Ali Haider
AU  - Ilyas, Sehrish
AU  - Noor, Ali
AU  - Abbas, Salma
AU  - Cheema, Hajira Nisar
AU  - Shabbir, Muhammad Usman
AU  - Hameed, Iqra
AU  - Ayub, Maleeha
AU  - Masood, Hamayal
AU  - Jafar, Amina
AU  - Khan, Amir Mukhtar
AU  - Nazir, Muhammad Abid
AU  - Jamil, Muhammad Asaad
AU  - Sultan, Faisal
AU  - Khalid, Sara
TI  - Assessing equitable use of large language models for clinical decision support in real-world settings: fine-tuning and internal-external validation using electronic health records from South Asia
AID  - 10.1101/2024.06.05.24308365
DP  - 2024 Jun 05
TA  - medRxiv
PG  - 2024.06.05.24308365
4099  - http://medrxiv.org/content/early/2024/06/05/2024.06.05.24308365.short
4100  - http://medrxiv.org/content/early/2024/06/05/2024.06.05.24308365.full
AB  - Objective: Fair and safe large language models (LLMs) hold the potential for clinical task-shifting which, if done reliably, can benefit overburdened healthcare systems, particularly in resource-limited settings and for traditionally overlooked populations. However, this powerful technology remains largely understudied in real-world contexts, particularly in the Global South. This study aims to assess whether openly available LLMs can be used equitably and reliably for processing medical notes in real-world settings in South Asia. Methods: We used publicly available medical LLMs to parse clinical notes from a large electronic health records (EHR) database in Pakistan. ChatGPT, GatorTron, BioMegatron, BioBERT and ClinicalBERT were tested for bias when applied to these data, after fine-tuning them on a) the publicly available clinical datasets I2B2 and N2C2 for medical concept extraction (MCE) and emrQA for medical question answering (MQA), and b) the local EHR dataset. For MCE, models were applied to clinical notes with 3-label and 9-label formats; for MQA, they were applied to medical questions. Internal and external validation performance was measured for a) and b) using F1, precision, recall, and accuracy for MCE and BLEU and ROUGE-L for MQA. Results: LLMs not fine-tuned on the local EHR dataset performed poorly, suggesting bias, when externally validated on it. Fine-tuning the LLMs on the local EHR data improved model performance. Specifically, the 3-label precision, recall, F1 score, and accuracy for the dataset improved by 21-31%, 11-21%, 16-27%, and 6-10%, respectively, amongst GatorTron, BioMegatron, BioBERT and ClinicalBERT. As an exception, ChatGPT performed better on the local EHR dataset by 10% for precision and 13% for each of recall, F1 score, and accuracy. 9-label performance trends were similar. Conclusions: Publicly available LLMs, predominantly trained in Global North settings, were found to be biased when used in a real-world clinical setting.
Fine-tuning them on local data and clinical contexts can help improve their reliable and equitable use in resource-limited settings. Close collaboration between clinical and technical experts can help ensure that this powerful technology is made accessible to resource-limited, overburdened settings responsibly and without bias, and is used in ways that are safe, fair, and beneficial for all. Competing Interest Statement: The authors have declared no competing interest. Funding Statement: This study was funded by the Bill & Melinda Gates Foundation (INV-062576). Author Declarations: I confirm all relevant ethical guidelines have been followed, and any necessary IRB and/or ethics committee approvals have been obtained. Yes. The details of the IRB/oversight body that provided approval or exemption for the research described are given below: IRB of Shaukat Khanum Memorial Cancer Hospital and Research Centre gave ethical approval for this work (Ethics Review Number EX-17-07-23-01). I confirm that all necessary patient/participant consent has been obtained and the appropriate institutional forms have been archived, and that any patient/participant/sample identifiers included were not known to anyone (e.g., hospital staff, patients or participants themselves) outside the research group so cannot be used to identify individuals. Yes. I understand that all clinical trials and any other prospective interventional studies must be registered with an ICMJE-approved registry, such as ClinicalTrials.gov. I confirm that any such study reported in the manuscript has been registered and the trial registration ID is provided (note: if posting a prospective study registered retrospectively, please provide a statement in the trial ID field explaining why the study was not registered in advance). Yes. I have followed all appropriate research reporting guidelines, such as any relevant EQUATOR Network research reporting checklist(s) and other pertinent material, if applicable. Yes. Aggregate data will be made freely available on the study website.
No patient-level data sharing is permitted as per ethics approval.