Abstract
Isolation of patients with communicable infectious diseases limits spread of pathogens but can be difficult to manage outside hospitals. Rwanda deployed a digital health service nationally to assist public health clinicians to remotely monitor and support SARS-CoV-2 cases via their mobile phones using daily interactive short message service (SMS) check-ins. We aimed to assess the texting patterns and communicated topics to understand patient experiences. We extracted data on all COVID-19 cases and exposed contacts who were enrolled in the WelTel text messaging program between March 18, 2020, and March 31, 2022, and linked demographic and clinical data from the national COVID-19 registry. A sample of the text conversation corpus was English-translated and labeled with topics of interest defined by medical experts. Multiple natural language processing (NLP) topic classification models were trained and compared using F1 scores. Best performing models were applied to classify unlabeled conversations. Total 33,081 isolated patients (mean age 33·9, range 0-100), 44% female, including 30,398 cases and 2,683 contacts) were registered in WelTel. Registered patients generated 12,119 interactive text conversations in Kinyarwanda (n=8,183, 67%), English (n=3,069, 25%) and other languages. Sufficiently trained large language models (LLMs) were unavailable for Kinyarwanda. Traditional machine learning (ML) models outperformed fine-tuned transformer architecture language models on the native untranslated language corpus, however, the reverse was observed of models trained on English-only data. The most frequently identified topics discussed included symptoms (69%), diagnostics (38%), social issues (19%), prevention (18%), healthcare logistics (16%), and treatment (8·5%). Education, advice, and triage on these topics were provided to patients. Interactive text messaging can be used to remotely support isolated patients in pandemics at scale. NLP can help evaluate the medical and social factors that affect isolated patients which could ultimately inform precision public health responses to future pandemics.
Author Summary We present the first application of NLP for categorizing text messages between patients and healthcare providers within a nationally scaled digital healthcare program. This study provides unique insights into the circumstances of home-based COVID-19 patients during the pandemic. Our trained topic classification models accurately categorized topics in both English and African language texts. Patients reported and discussed both medical and social issues with public healthcare providers. This approach has the potential to guide precision public health decisions and responses in future outbreaks, pandemics, and remote healthcare scenarios.
Introduction
The practice of isolating patients with communicable infectious diseases reduces transmission by separating infected individuals from healthy ones, and isolation at home allows healthcare systems to focus on managing severe cases in hospitals.1 However, isolated patients may have changes to their condition that require attention from medical professionals and can experience additional social stressors.2 Isolation protocols have even been considered a risk factor for mortality.3 Understanding the complexity of facility and home-based isolation remains a critical consideration in effective pandemic response planning.
At the pandemic outset, the Rwanda government COVID-19 Joint Task Force adopted a short message service (SMS)-based digital health service, WelTel, to remotely monitor and support community isolated cases and contacts of SARS-CoV-2 from the centralized national command post. Its use continued throughout the emergency phase of the pandemic as the key to their Home-Based Care (HBC) program.4 The WelTel platform was selected based on its history of use in HIV programs, prior clinic evidence, and importantly, its accessibility.5–7 Internet based tools and smartphone apps would have been insufficient since, as a lower-middle income country (LMIC), only 23% of the Rwanda population has internet-access; yet at least 90% of the population has access to cellular phones.8,9 As in other applications of the WelTel platform, throughout the COVID-19 response in Rwanda, automated ‘check-in’ messages were sent daily to isolated patients, to which they could respond to report on their status and ask questions. A manually performed topic classification of WelTel text messaging conversations with patients during HIV care previously demonstrated that these open-ended text messaging conversations contain wide ranging issues experienced by patients that could be acted upon.10
Advances in natural language processing (NLP) enables rapid computational topic analysis (classification and modelling) which can delineate large amounts of text into topics of interest.11 Topic classification models using machine-learning (ML) have been used in clinical settings to label text messages exchanged between healthcare providers and outpatients, but never to our knowledge in a national scale program.12–15 Unfortunately, the uneven distribution of global healthcare data has the potential to leave entire continents behind in the development of artificial intelligence (AI) for healthcare.”16
In this study, we aimed to analyze the complete corpus of SMS conversations between Rwanda’s public health clinicians and isolated COVID-19 patients and contacts during the first two years of the COVID-19 pandemic. We considered a clinical, social, and logistical perspective using topic classification to better delineate the needs and experiences of patients and the advice they were given, during isolation – a large-scale approach that could be used to guide decision-making in outbreak management.
Results
Participation in text messaging conversations
During the study period, Rwanda reported 131,190 cases of COVID-19, of whom 33,081 individuals (25%, 30,398 cases, 2,683 contacts) were registered in the WelTel texting service. The numbers of patients registered, and the occurrence of text-message conversations (with at least 3 interactive messages), appeared in waves similar to the COVID-19 incidence reported in the national registry (Fig. 1).
Of those registered, 6,021 patients (18%) used the texting service to generate 12,119 conversations (Table 1). Controlled for other demographic and clinical factors, the odds of generating a conversation among females compared to males was 0·83 (95%CI: 0·78-0·89; p<0·001). Age did not significantly influence the frequency of conversations (OR per year increase in age = 1·00, 95%CI: 0·99-1·00, p=0·17). Contacts were more likely to generate conversations than cases (OR 1·83 95%CI: 1·63-2·06, p<0·001). The proportion registered patients texting conversations varied across waves of the pandemic (Wave one: 12%; two: 37%; three: 25%; and four: 3%).
The median number of conversations per patient was one (IQR: 1-2, range: 1-18), and the median conversation length (# of discrete text messages) was three (IQR: 2-5, range: 2-44). Most conversations occurred in Kinyarwanda (8,183, 67%) or English (3,069, 25%), however other regional languages (French, Kiswahili, Ganda, Chewa/Nyanja) were also identified (<10%). Language proportions were similar to the proportion of the population considered literate in those languages in Rwanda.27
Comparison of NLP model performance
The NLP models tested classified most topics with a level of performance above our cut-off F1-score (Table 2). The traditional ML models performed better than fine-tuned transformer models on untranslated native language conversation data. Best performing models for each topic met the pre-determined F1-score cutoff of 0·7 for six of the nine topics meeting the sample size threshold of 100. Model performance was roughly correlated with the number of occurrences (within sample) of the given topic.
In our language experiments, fine-tuned transformer models generally performed better when using the English (translated) training and testing corpus, exceeding the F1 threshold in all but one topic.
Topic analysis of conversations
Human labeling (training corpus) and machine topic labeling (full corpus) identified the topics and subtopics of interest at roughly similar frequencies (Figure 2 and S3 Fig). Overall, medical topics were discussed more frequently than non-medical topics. The most frequent topic was symptoms (69%), the majority reporting a lack of symptoms (59%). Diagnostic methods (38%, e.g., receiving COVID-19 test results), prevention (18%, including non-pharmaceutical such as how to maintain isolation, and pharmaceutical such as vaccines), healthcare logistics (16%, e.g., asking where to get tested), and treatment (8%, including management of symptoms). Social topics (19%) included mention of culture or religion (5·5%), followed by discussing friends and family (4·9%). Manually labeled training data revealed important issues such as nutrition and food security (3·7%). Examples of text conversations are provided in Table 3.
Comparing the odds of topic discussion within the full corpus by patient demographic and clinical factors revealed minor differences that are reported in (S4 Fig).
Methods
Setting and digital health monitoring technology
Rwanda, a Central African nation with a population of 14 million, was internationally recognized for its prompt response to the COVID-19 pandemic.17 Their public health command implemented the WelTel text messaging service [www.weltelhealth.com] on March 18, 2020, coincident with detecting its first COVID-19 cases. This tool enabled public health staff to remotely monitor and support individuals who tested positive for, or were at high risk of contracting, SARS-CoV-2, and remained operational until the public health emergency was officially declared over.
The WelTel platform is a secure web-based application that healthcare providers can access from any internet-connected device. It interfaces with cellular networks enabling it to send both pre-set and manually input SMS messages to registered mobile phone users. It features a conversation dashboard that visualizes responses.
Following the initial facility-based quarantine of COVID-19 cases identified at border entry points, Rwanda launched its HBC program when community transmission became widespread. Under this program, mild cases and contacts were asked to stay home and be remotely monitored. Patients with significantly worsening clinical status were triaged to hospitals. Individuals were enrolled as ‘cases’ if they tested positive for SARS-CoV-2 or as ‘contacts’ if they were deemed high risk due to exposure. Patients without personal cell phone numbers were registered via a household member, friend, or neighbor.
The messaging protocol for COVID-19 monitoring involved sending bulk automated daily ‘check-in’ texts in Rwanda’s principal languages (Kinyarwanda, English, French) daily to registered patients throughout their isolation period. Patients could respond via text in their own words to indicate their status and/or ask questions. Replies indicating a problem or question were flagged for follow-up. Public health clinicians responded to flagged messages to provide information or advice, forming interactive text conversations.
Included datasets and description
Patient registration and all text communication data generated between March 18, 2020 and March 31, 2022 was exported from the WelTel database and linked to the Rwanda DHIS2 COVID-19 testing registry to obtain missing demographic or clinical data.18 Personal identifiers were removed. For the purposes of our study, a ‘conversation’ was defined as a text exchange consisting of at least one ‘incoming’ patient message containing three or more words (filtering out minimal replies e.g., ‘doing well’), together with at least one ‘outgoing’ public health clinician or automated system message (thus being interactive). Patient registration and numbers of conversations and their characteristics (# messages, language used) were quantified. Conversation usage was compared across patient sociodemographic and clinical characteristics (sex, age, province of residence, pandemic wave, and COVID-19 status (case vs. contact)) using a multivariable logistic regression model. Odds ratios (OR) are reported with 95% confidence intervals (CI) and p<0·05 was considered significant. Example conversations were selected to illustrate context.
Language translation and topic labelling
A list of topics and subtopics was developed through a three-round mini-Delphi process involving ten public health experts, clinicians, and researchers from Rwanda and Canada. This list was an expansion of a previously published list focused on clinical care and used to develop a conversation analysis and visualization tool, ConVIScope.12 The final list of topics and subtopics encompassed perspectives on public health, emergency response, and clinical care. Topic definitions are provided in (S1 Fig).
For the purpose of labeling a training dataset for topic classification models, a sample of 2,791 text conversations from March 18, 2020, to May 31, 2021, was extracted from the study period corpus. These conversations were analyzed for language use with Google’s CLD2 language detector. Non-English sample messages were translated into English by a compensated professional healthcare text translator proficient in Kinyarwanda, English, French, and Kiswahili.
The English-translated conversations were labeled by at least three trained members of the research team using a custom conversation annotation tool. Any conflicts in labeling were resolved by an external team member. The topic labels were then applied back to the corresponding conversations in their original language, where applicable.
Classification model training, testing, and application
We aimed to identify the best performing classification model for each topic and subtopic. Training set sample size constraints limited our classification experiments to topics with ≥100 occurrences. We employed the binary relevance strategy, which simplifies performance management by building a separate binary classifier for each topic. We experimented with both traditional ML algorithms and Transformer-based language models for each topic. We explored large language models (LLMs, e.g., GPT-4), however, our literature review revealed that, of the three pre-trained with any Kinyarwanda, the pre-training datasets for these models contained minimal Kinyarwanda text (≤0.01%). Investigations of the performance of these LLMs at NLP tasks in Kinyarwanda suggested they were insufficient to include.19
Traditional-ML classification models: For training traditional ML models, we explored five feature sets: binary and non-binary bag of words, TF-IDF (term-frequency inverse document frequency), and character n-gram (with/without TF-IDF). The character n-gram features were tried because Kinyarwanda is an agglutinative language. All combinations of one of these feature sets, and one classification algorithm, were tested. We evaluated logistic regression, ridge classifier, and random forest as classification algorithms. Hyperparameters for each classifier were tuned using random search in five-fold cross validation.20
Transformer-Based Language Models: We leveraged open-source pre-trained Transformer-based language models for Kinyarwanda, including AfriBERTa, AfroLM, AfroXLMR, and KinyaBERT.21–24 To fine-tune these models for our task, we employed the AdamW optimizer along with the binary cross-entropy loss function. Additionally, we incorporated dropout layers (with a probability of 0.0 for the first layer and 0.3 for subsequent layers) for regularization.
We compared the performance of each model for each topic using the validation set F1-score, with a minimum F1-score cut-off of 0.7. The highest F1-score models were applied to predict topics for the remaining 9,328 unlabeled, untranslated SMS conversations.
Additional Language experiments: Since our corpus contained a mix of high– and low-resource languages (e.g., English, French, Kinyarwanda) we explored how this might affect performance of the models by conducting identical experiments to above using only the English-translated version of the training corpus. We evaluated both traditional ML models (as above), and English pre-trained Transformer models (BERT, Longformer).25,26 F1 scores are reported for comparison.
See (S2 Fig.) for more hyperparameter and fine-tuning details.
Topic analysis and statistics
Counts and proportions of conversations containing each topic (full corpus), and subtopic (sampled corpus), are provided. Topic frequencies were compared by patient demographics and clinical metrics using multiple independent multivariable logistic regression models for each topic. Models incorporate all patient demographic and clinical characteristics and use complete cases. OR are reported with 95% CIs and p-values are adjusted for multiple testing using the Benjamini-Hochberg correction with a false discovery rate of 0·05. Classified topic categories could be batch filtered and manually read to understand the nature of the discussions that occurred within each topic.12 Data summarizations, visualizations, and statistical testing were produced or conducted using R version 4.2.1.
Discussion
To our knowledge, this is the first study to report on the use of NLP to investigate a broad range of topics from a complete corpus of conversational texts with patients in home isolation in a pandemic or outbreak setting at large scale. Through a combination of manual annotation and NLP techniques, we gained insights into patient experiences during COVID-19 in Rwanda by identifying key discussion topics between patients and public health clinicians. The sustained use of the text messaging service in Rwanda demonstrated the feasibility of using SMS-based communication for remote monitoring at scale in a region with limited internet access but widespread mobile phone penetration. Leveraging existing infrastructure allowed for efficient support for isolated individuals, regardless of location or access to healthcare facilities.
Our data captured the first two years of the Rwandan COVID-19 pandemic, including text conversations in Kinyarwanda, English, and other local languages with public health clinicians to report their status or seek advice. Topic classification revealed that medical topics, such as symptoms, diagnostics, prevention, and treatment, were commonly discussed, reflecting patient’s primary focus on their health status and obtaining medical guidance. However, social and lifestyle topics, including discussions about social support, healthcare logistics, diet/nutrition, finances, and service quality, also emerged as important aspects to patients during home isolation. The data underscores the need for holistic support systems for patient isolation programs. Additionally, while our topic classification approach rapidly quantified prevalent topics well, reflecting their relative importance to patients, it also enables the sorting of large datasets into smaller categories for easier manual evaluation by decision-makers. The identification of less common topics, such as food insecurity, may also provide actionable targets for public health responders, especially if it can be localized to communities.
NLP techniques are rapidly being introduced into health systems to understand unstructured data from large datasets, such as clinical notes in electronic medical records (EMRs), social media posts, or free-text survey responses.11 For example, topic classification has been employed to sort unstructured patient feedback from text messages.15 Still, manual qualitative analyses dominate. A manually labeled thematic analysis of 1,454 patient portal secure messaging conversations between cancer patients and health professionals in the US during COVID-19 found 26% of conversations were related to COVID-19 and identified topics related to changes in care plans, symptoms, risks, precautions and symptoms.28 In Northern Thailand manual topic classification of SMS exchanges with 213 home-isolated patients with diagnosed COVID-19 revealed several complexities of care including discussing symptoms or vital signs, medications, diet and exercise, mental stresses, information about COVID-19, and a high proportion of logistical assistance to navigate the health services.29 Manual labeling is resource intensive and generally limited to smaller data sets. One US-based study used NLP techniques to determine if patient-initiated secure messaging via a EMR portal was associated with positive testing, but this was a simple binary classification.13 To our knowledge, our study is the first to use NLP techniques to classify patient clinician texts into broad medical, social and logistical topics set by experts, and at a national scale.
Our study compared performance of several ML model approaches. Traditional ML models and transformer-based language models were both effective in classifying text message conversations into the predefined topics. While traditional ML models performed better with untranslated conversations, transformer models had superior performance when trained on English (translated) data, highlighting their limited performance on African languages.30 This observation is consistent with the knowledge that transformer architecture language models are likely to underperform on NLP tasks in low-resource languages until sufficient high-quality pre-training data is sourced and leveraged.31 Similarly, LLMs, which are also based on transformer architecture, currently contain too little, if any, training on Kinyarwanda and other low-resource languages to be useful in our context.19 A brief trial using Google Translate to machine translate Kinyarwanda text into English for our study was inaccurate. This highlights the importance of considering language-specific model performance when analyzing multilingual text data.32 Language consideration remains a significant challenge, with over 4,000 written languages worldwide. Pandemics affect everyone and AI tools must account low-resource languages to be equitable.
Our study has limitations. First, the supervised learning approach required the pre-definition and manual labeling of topics of interest which can be human resource intensive. Furthermore, the topics of interest or training samples annotated may evolve over time or change in different settings requiring continuously updating training datasets. For example, prevalent discussions of vaccines came later in the pandemic after most of our training data had been annotated. Including unsupervised ML models that discover topics as they emerge could help overcome this bias, but have their own flaws. Furthermore, while our dataset was similar in size to those used to train other topic classification models in healthcare, many subtopics of interest were not prevalent enough to develop models. Additionally, the training dataset was translated into English for labelling purposes, again, being resource intensive and potentially altering meaning in translation. Finally, training the models took time and conducting the NLP analyses afterwards meant the data was not available to decision-makers real time. Ideally, once models are trained and validated, they could be applied while an outbreak or pandemic is occurring and provide insights into patient and population issues that could be responded to while they are occurring. Together with location data, this type of analysis could contribute to precision public health responses.
Conclusion
Moving forward, future research should explore ways to further optimize the evaluation of text-based communication for greater insights into the patient care journey in pandemics, and clinical care. It must consider technology infrastructure available to populations globally and be inclusive of resource-limited languages.33,34 Efforts to improve language support and translation capabilities across diverse linguistic and cultural contexts will be essential to ensure equitable access to digital services by people in all regions. Ultimately, emerging research on models that detect, understand, and generate natural language, as well as development of tools that formulate predictions and recommendations for healthcare decision-makers, will achieve the higher accuracy, precision, and reliability required for use in routine healthcare.
Data Availability
The WelTel software was installed and run on the Rwanda government National Data Center and using the government contracted API (FDI) to connect to cellular networks. The Rwanda government remains the steward of all patient data produced or collected by the WelTel platform and the COVID-19 national data registry. Deidentified text conversation training data sets can be made available on request. All NLP models used were open source and references for NLP tools are provided in the Methods and Supplementary Material.
Supporting Information
Acknowledgements
We would like to thank the staff at the Rwanda Biomedical Center and work-learn students at the University of British Columbia who helped with language translation, annotation, and various aspects of background research and model building.