LLMs for analyzing open text in global health surveys: why children are not accessing vaccine services in DRC

Roy Burstein; Eric Mafuta; Joshua L. Proctor

doi:10.1101/2024.11.14.24317253

Abstract

This study evaluates the use of large language models (LLMs) to analyze free-text responses from large-scale global health surveys, using data from the Enquête de Couverture Vaccinale (ECV) household coverage surveys conducted in the Democratic Republic of the Congo (DRC) in 2020, 2021, 2022, and 2023 as a case study. We tested several LLM approaches, including zero-shot prompting, few-shot prompting, and fine-tuning, as well as a natural language processing approach using semantic embeddings, to categorize the reasons caregivers gave for not vaccinating their children. Performance ranged from 61.5% to 96% when tested against a curated benchmarking dataset drawn from the ECV surveys, with accuracy improving when the LLMs were fine-tuned or provided with examples for few-shot learning. We show that with as few as 20–100 examples, LLMs can achieve high accuracy in categorizing free-text responses. This approach offers significant opportunities for reanalyzing existing datasets and for designing surveys with more open-ended questions, providing a scalable, cost-effective solution for global health organizations. Despite challenges with closed-source models and computational costs, the study underscores LLMs’ potential to enhance data analysis and inform global health policy.
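As an illustration of the few-shot setup and benchmark scoring described above, the following is a minimal sketch and not the authors' implementation. The function names, the prompt wording, and the category labels in the usage example are hypothetical; the call to an actual LLM is deliberately left out, since the abstract does not specify a model or API.

```python
from typing import List, Sequence, Tuple


def build_few_shot_prompt(
    categories: Sequence[str],
    examples: Sequence[Tuple[str, str]],
    response: str,
) -> str:
    """Assemble a plain-text few-shot classification prompt.

    `examples` holds (free-text response, category) pairs that have
    already been labeled by hand. With an empty `examples` list this
    reduces to a zero-shot prompt.
    """
    lines: List[str] = [
        "Classify the caregiver's response into exactly one category.",
        "Categories: " + "; ".join(categories),
        "",
    ]
    for text, label in examples:
        lines += [f"Response: {text}", f"Category: {label}", ""]
    # The final, unlabeled response is the one the model should categorize.
    lines += [f"Response: {response}", "Category:"]
    return "\n".join(lines)


def accuracy(predicted: Sequence[str], benchmark: Sequence[str]) -> float:
    """Share of predictions that match the curated benchmark labels."""
    if len(predicted) != len(benchmark):
        raise ValueError("predicted and benchmark must have the same length")
    if not benchmark:
        raise ValueError("benchmark must not be empty")
    correct = sum(p == b for p, b in zip(predicted, benchmark))
    return correct / len(benchmark)


if __name__ == "__main__":
    # Hypothetical categories and responses, for illustration only.
    prompt = build_few_shot_prompt(
        categories=["Distance to clinic", "Vaccine unavailable"],
        examples=[("The health center is too far away", "Distance to clinic")],
        response="There were no vaccines when we arrived",
    )
    print(prompt)
```

The prompt string would be sent to whichever model is being evaluated, and the returned labels scored with `accuracy` against the hand-curated benchmark.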

Competing Interest Statement

The authors have declared no competing interest.

Funding Statement

This publication is based on research funded in part by the Gates Foundation, including modeling and analysis performed by the Institute for Disease Modeling at BMGF. This study uses data drawn from the DRC vaccine coverage surveys, also funded by UNICEF, USAID, and GAVI.

Author Declarations

I confirm all relevant ethical guidelines have been followed, and any necessary IRB and/or ethics committee approvals have been obtained.

Yes

The details of the IRB/oversight body that provided approval or exemption for the research described are given below:

The institutional review board of the Kinshasa School of Public Health Ethical committee gave ethical approval for the collection and use of survey data.

I confirm that all necessary patient/participant consent has been obtained and the appropriate institutional forms have been archived, and that any patient/participant/sample identifiers included were not known to anyone (e.g., hospital staff, patients or participants themselves) outside the research group so cannot be used to identify individuals.

Yes

I understand that all clinical trials and any other prospective interventional studies must be registered with an ICMJE-approved registry, such as ClinicalTrials.gov. I confirm that any such study reported in the manuscript has been registered and the trial registration ID is provided (note: if posting a prospective study registered retrospectively, please provide a statement in the trial ID field explaining why the study was not registered in advance).

Yes

I have followed all appropriate research reporting guidelines, such as any relevant EQUATOR Network research reporting checklist(s) and other pertinent material, if applicable.

Yes

Data Availability

Raw survey responses to the free-text question and all code for this study, in both Python and R, are available at https://github.com/InstituteforDiseaseModeling/AIAugmentedSurveyResponseCategorization

The copyright holder for this preprint is the author/funder, who has granted medRxiv a license to display the preprint in perpetuity. It is made available under a CC-BY 4.0 International license.