Abstract
Objective Healthcare websites allow patients to share their experiences with their treatments. Drug testimonials provide useful information for real-world evidence, particularly on the occurrence of side effects that may be under-reported. We investigated the potential of large language models (LLMs) for detecting signals of body weight change as an under-reported side effect of antidepressants in user-generated online content.
Materials and Methods A database of 8,000 user-generated comments about the 32 FDA-approved antidepressants was collected from healthcare social websites. These comments were manually annotated under the supervision of drug experts. Several pre-trained LLMs derived from BERT were fine-tuned to automatically classify comments describing weight gain, weight loss, or the absence of reference to a weight change. Zero-shot classification was also performed. Performance was evaluated on a test set by measuring the weighted precision, recall, F1-score and the prediction accuracy.
Results After fine-tuning, most of the BERT-derived LLMs showed weighted F1-scores above 97%. LLMs with a higher number of parameters used in zero-shot classification almost reached the same performance. The main source of prediction errors was comments in which the model falsely predicted weight gain or loss because the text mentioned these elements, but for a different molecule than the one the comment was written about.
Conclusion Even fine-tuned LLMs with a limited number of parameters showed promising results for the detection of adverse events from online patient testimonials, suggesting they can be used at scale for real-world evidence.
Introduction
Body weight changes are known adverse effects (AEs) of antidepressant use. Weight loss or gain is also a component of the clinical presentation of depression. Consequently, a change in body weight can be considered a potentially expected outcome in treated patients, and weight changes may not always be readily attributed to the treatment as an adverse effect. However, when they do occur, they can significantly impact patients and their therapeutic management, as these changes directly affect self-image and adherence to treatment. AEs like weight changes become more evident with long-term treatment. They are thus difficult to capture in clinical trials because of their short follow-up period (1). Their identification and correct quantification rely on further evaluation in real-world conditions, mostly through clinical reporting to pharmacovigilance. However, attributing long-term, delayed changes to a treatment is more difficult than demonstrating acute, immediate effects.
First-generation antidepressants are known to cause weight gain (2). These effects can also be observed with more recent treatments (3). It has been reported that the general trend over 10 years for people who have received an antidepressant treatment is weight gain (4). The results of cohort studies and meta-analyses also agree in identifying some antidepressants as carrying a higher risk (1). Molecules such as mirtazapine, amitriptyline, and paroxetine are generally associated with weight gain, while bupropion and fluoxetine seem to be more associated with weight loss (1). However, for many molecules, both weight gain and weight loss are reported, highlighting the variability in individual responses to these medications as well as the importance of temporal aspects, with effects that may differ between the start of treatment and the long term. For all these reasons, patient testimonials can be a valuable source of information to better describe the adverse effects of this family of drugs.
The democratization of the internet has led to the emergence of many platforms for free expression, including online health communities. These health-specialized sites allow patients to share their experiences with their treatment. These testimonials are a potential source of massive data for collecting information about the effects of treatments in real-world conditions (5). The diversity of information available in these testimonials is a direct result of a proactive and patient-centered information-sharing approach (6). It is conceivable that the adverse events with the most significant impact on patients are the most commented on, even if they are not the ones most discussed between patients and caregivers during medical consultations, and therefore not the ones most often reported. This source of information is an interesting alternative for adverse events that would be under-reported in traditional pharmacovigilance channels (7, 8). The main disadvantages of these data are that they are unstructured and linguistically very diverse, which complicates the automation of their analysis. Given the number of comments generated, processing by human operators is a long and tedious task. During the 2010s, natural language processing (NLP) tools (9) coupled with deep neural networks showed promising results for AE detection in text data (10), and a variety of architectures have been proposed to detect AEs in text sources (11–13). Automated AE screening has been applied to electronic health records of patients (14–17) and to Web data, including social networks like X (formerly Twitter) (6, 18–22). Most of these works aimed at identifying any mention of an AE and did not focus on the detection of a particular AE.
More recently, the development of transformer-based models has revolutionized NLP, and these models have become the new standard for many NLP tasks (23). Few recently published works make use of pre-trained transformer-based models for adverse drug event (ADE) extraction on informal texts (24), especially models based on bidirectional encoder representations from transformers (BERT) (21, 25–29). Self-attention (or QKV-attention) is the central mechanism of the Transformer, allowing the model to attend to different parts of the input sequence when making predictions and to learn long-range dependencies in that sequence. Transformer models can be trained on large datasets to learn language representations. Pre-trained models can then be used for zero-shot classification, or they can be fine-tuned for a specific classification task. In zero-shot classification, pre-trained models are used without any additional training examples for the new classes (30). Their performance relies on their ability to generalize classification properties from the language representation they captured during pre-training. In contrast, fine-tuning consists of further training a pre-trained model on a new dataset and for new classes to improve the model’s performance on this specific task (31).
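To make the self-attention mechanism concrete, the following is a minimal sketch of scaled dot-product attention, softmax(QK^T / sqrt(d_k)) V, written in plain Python. The toy embeddings are made-up numbers, and the learned projection matrices and multiple heads of a real Transformer are deliberately left out.

```python
import math


def softmax(row):
    """Numerically stable softmax over a list of floats."""
    peak = max(row)
    exps = [math.exp(x - peak) for x in row]
    total = sum(exps)
    return [e / total for e in exps]


def scaled_dot_product_attention(queries, keys, values):
    """Return (outputs, weights) for softmax(Q K^T / sqrt(d_k)) V.

    queries, keys and values are lists of row vectors (lists of floats).
    """
    d_k = len(keys[0])
    outputs, weights = [], []
    for q in queries:
        # Similarity of this query with every key, scaled by sqrt(d_k).
        scores = [
            sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d_k) for k in keys
        ]
        w = softmax(scores)
        weights.append(w)
        # Each output is a weighted average of the value vectors.
        outputs.append(
            [sum(wj * v[c] for wj, v in zip(w, values)) for c in range(len(values[0]))]
        )
    return outputs, weights


# Toy example: 3 tokens with 2-dimensional embeddings (illustrative values only).
tokens = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
attended, attention_weights = scaled_dot_product_attention(tokens, tokens, tokens)
```

Each row of `attention_weights` sums to 1 and indicates how much one token draws on every other token, which is how distant words in a review can influence each other's representation.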
Here, we explored the potential of different classification models to automatically recognize comment texts describing weight gain, weight loss, or the absence of reference to a weight change. In particular, we focused on BERT and BERT-related transformers, which have achieved state-of-the-art results on a wide range of NLP tasks (32). The novelty of BERT was to encode the context of a word from both the left and the right (bidirectional): BERT was designed to pre-train deep bidirectional representations from unlabeled text by jointly conditioning on both left and right context in all layers.
We also used RoBERTa, a robust re-implementation of BERT with some modifications to the key hyperparameters and minor embedding tweaks. It has been shown to outperform BERT on a variety of NLP tasks (33). RoBERTa uses a byte-pair encoding tokenizer instead of a WordPiece tokenizer to better handle rare and out-of-vocabulary words. It trains with a larger mini-batch size and for more steps to better learn long-range dependencies in text, and it removes the next sentence prediction (NSP) objective, found to be less effective than the masked language modeling (MLM) objective for pre-training BERT models. Although BERT and RoBERTa perform well, it can be difficult to run these large models on edge devices or to train or use them with limited computational resources.
DistilBERT is a distilled version of BERT that is smaller and faster (34). It is trained using a knowledge distillation procedure, in which a smaller model learns to mimic the predictions of a larger model, and it has been shown to achieve performance close to that of BERT. DistilBERT is a good choice for applications where speed, efficiency, and resources are important. Similarly, we explored SqueezeBERT, a lightweight and efficient transformer-based language model specifically designed for mobile and embedded devices (35). SqueezeBERT is based on the BERT architecture, but several modifications make it smaller and faster, including the use of grouped convolutions instead of fully-connected layers to reduce the number of parameters in the model.
BERT and these BERT-derived models are pre-trained on massive text corpora. They can be used pre-trained and only fine-tuned on a specific downstream task, including classification, without having to be trained from scratch (36). We also explored the performance of DistilBERT and DeBERTa in the context of zero-shot classification. DeBERTa uses disentangled attention for learning richer contextual representations of words and an enhanced mask decoder for more accurate predictions of masked words, giving this model state-of-the-art performance on a variety of natural language processing tasks (37).
Taken together, we aimed to explore the potential of large language models (LLMs) in detecting signals of changes in body weight as potential side effects of antidepressants in user-generated online content. Our investigation included fine-tuning and zero-shot classification using various BERT-based models. Our main motivation was to explore the ability of these models to identify signals within large datasets of user reviews. Additionally, we aimed to investigate their capacity to distinguish weight gain from weight loss, two closely related elements sharing similar syntax that require a more nuanced analysis and interpretation of the language used in the comments for accurate classification. This represents an additional step beyond signal detection, moving towards assessing the causality of a drug in the occurrence of adverse effects. Most of these models performed particularly well in this task.
Materials and Methods
Data collection
Data collection took place between August 2022 and November 2022. The data corresponded to user-generated online content that was publicly available at the time of data collection from four specialized websites (38): Drugs.com, WebMD, Everyday Health, and Ask a Patient. Drugs.com, WebMD and Everyday Health are websites that provide information about medications and health to the general public. Ask a Patient is a platform where patients can share their first-hand experiences with their prescribed drugs. Data from Drugs.com, WebMD and Everyday Health were scraped using R version 4.2.2 and the R packages rvest (39), httr (40), and xml2 (41), while data from Ask a Patient (and the PsyTAR database) were kindly provided by Askapatient.com. A random sample of 8,000 texts, stratified by site and drug, was extracted for manual annotation.
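As an illustration of stratified sampling by site and drug, the sketch below allocates draws to each (site, drug) stratum in proportion to its size. The proportional allocation rule, the field names, the seed and the toy corpus are all assumptions made for this example; the exact procedure used for the 8,000-text sample is not described here.

```python
import random
from collections import defaultdict


def stratified_sample(records, n_total, key, seed=0):
    """Draw about n_total records, allocating draws to strata proportionally.

    Because each stratum's share is rounded, the final size can differ
    slightly from n_total on real data.
    """
    rng = random.Random(seed)
    strata = defaultdict(list)
    for rec in records:
        strata[key(rec)].append(rec)
    sample = []
    for members in strata.values():
        # Proportional allocation; every non-empty stratum gets at least one draw.
        k = max(1, round(n_total * len(members) / len(records)))
        sample.extend(rng.sample(members, min(k, len(members))))
    return sample


# Hypothetical toy corpus: 100 reviews spread over 2 sites and 2 drugs.
corpus = []
for site, drug, count in [
    ("Drugs.com", "mirtazapine", 40),
    ("Drugs.com", "bupropion", 30),
    ("WebMD", "mirtazapine", 20),
    ("WebMD", "bupropion", 10),
]:
    corpus += [{"site": site, "drug": drug, "text": f"review {i}"} for i in range(count)]

sample = stratified_sample(corpus, n_total=10, key=lambda r: (r["site"], r["drug"]))
```

With this allocation, rarely reviewed drugs and smaller sites remain represented in the annotated sample instead of being crowded out by the largest strata.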
Data labeling
The annotation was performed by ten operators: two pharmacists, a general practitioner, two pharmacy students and five non-drug experts. The data set was randomly split into 10 sets, with overlap to allow for the measurement of inter-rater agreement. The drug the user was commenting on was identifiable to the rater. Each input text was assigned to one of three categories: weight gain, weight loss, or the absence of reference to a weight change. The label had to apply to the drug the user was commenting on (i.e., if weight change elements were described only for another drug mentioned in the text, such as a previous treatment, then the label was set to absence of reference to a weight change for the reviewed drug). When raters felt uncertain about the appropriate class, they could add a specific code so that the input text would be reviewed by a drug expert.
Data split
The data set was split into training (n = 6,000), validation (n = 1,000), and test (n = 1,000) datasets. The training and validation data sets were used to adjust the LLMs during the fine-tuning step. The test set was used to evaluate the performance of the model. It is important to note that the training, the validation and the test sets were defined identically for all models in order to facilitate comparisons.
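One simple way to obtain a split that is identical for every model is to shuffle the annotated texts once with a fixed random seed and slice the result. The sketch below assumes a plain random shuffle and an arbitrary seed of 42; how the original split was drawn is not specified in the text.

```python
import random


def split_dataset(items, n_train, n_valid, n_test, seed=42):
    """Shuffle once with a fixed seed, then slice into train/validation/test."""
    items = list(items)
    if n_train + n_valid + n_test != len(items):
        raise ValueError("split sizes must add up to the number of items")
    random.Random(seed).shuffle(items)
    train = items[:n_train]
    valid = items[n_train:n_train + n_valid]
    test = items[n_train + n_valid:]
    return train, valid, test


# Indices of the 8,000 annotated texts, split 6,000 / 1,000 / 1,000.
train_ids, valid_ids, test_ids = split_dataset(range(8000), 6000, 1000, 1000)
```

Storing the resulting index lists (or the seed) lets every model be fine-tuned and evaluated on exactly the same texts.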
Large Language Models
The pre-trained models were loaded by instantiating a given LLM configuration with the generic sequence classification model class available in the Hugging Face library (https://huggingface.co/) (42) for Python 3.8.
The different models for zero-shot classification were:
distilbert-base-uncased (6-layer, 768-hidden, 12-heads, 66 M parameters)
deberta-v3-base (12-layer, 768-hidden, 12-heads, 86 M parameters)
deberta-v3-large (24-layer, 1024-hidden, 16-heads, 304 M parameters)
and for fine-tuning:
bert-base-uncased (12-layer, 768-hidden, 12-heads, 110 M parameters)
roberta-base (12-layer, 768-hidden, 12-heads, 125 M parameters)
distilbert-base-uncased (6-layer, 768-hidden, 12-heads, 66 M parameters)
squeezebert-uncased (12-layer, 768-hidden, 12-heads, 51 M parameters)
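For readers unfamiliar with zero-shot classification, the sketch below shows one common route in the Transformers library: the `zero-shot-classification` pipeline, which scores each candidate label as a natural language inference hypothesis. Whether this exact route was used in the present work is an assumption, and the label wording and hypothesis template are hypothetical. The `model_name` argument is left to the caller because this pipeline expects a checkpoint that has been trained for natural language inference.

```python
# Hypothetical label wording and hypothesis template, chosen for illustration only.
CANDIDATE_LABELS = ["weight gain", "weight loss", "no weight change mentioned"]
HYPOTHESIS_TEMPLATE = "This review describes {}."


def build_hypotheses(labels=CANDIDATE_LABELS, template=HYPOTHESIS_TEMPLATE):
    """Return the sentences that the model scores against each review."""
    return [template.format(label) for label in labels]


def classify_zero_shot(text, model_name):
    """Return (best_label, score) for one review.

    model_name must point to an NLI-capable checkpoint. The import is kept
    inside the function so the helper above works without the library.
    """
    from transformers import pipeline  # third-party dependency

    classifier = pipeline("zero-shot-classification", model=model_name)
    result = classifier(
        text,
        candidate_labels=CANDIDATE_LABELS,
        hypothesis_template=HYPOTHESIS_TEMPLATE,
    )
    # The pipeline returns labels sorted by decreasing score.
    return result["labels"][0], result["scores"][0]


hypotheses = build_hypotheses()
```

No annotated example is needed in this setting, which is why the choice of label wording can matter for the results.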
The input texts corresponding to user reviews were lower-cased and tokenized with the appropriate pre-trained tokenizer, loaded through the AutoTokenizer class of the Transformers library, without any additional preprocessing steps. In all cases, the tokenized texts were padded and truncated to a fixed length of 512 tokens.
The models were trained over 5 epochs, using the AdamW optimizer (43) and a learning rate of 2 × 10⁻⁵ with batch sizes of 4 to 16, depending on the model. Accuracy and F1-score were monitored during training, as well as the training and validation loss. Models were trained for about 30-45 minutes with NVIDIA 1080 GPU acceleration.
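The tokenization and fine-tuning settings described above can be sketched with the Transformers `Trainer` API as follows. This is a minimal sketch, not the authors' training script: the label names, the `fine_tune` helper, the row format, the output directory and the batch size of 16 (one value within the reported 4 to 16 range) are assumptions. `Trainer` uses AdamW as its default optimizer.

```python
# Label names are assumptions; the three classes follow the annotation scheme.
LABELS = ["no weight change", "weight gain", "weight loss"]
LABEL2ID = {name: i for i, name in enumerate(LABELS)}
ID2LABEL = {i: name for name, i in LABEL2ID.items()}

# Pad and truncate every review to a fixed length of 512 tokens.
TOKENIZER_KWARGS = {"padding": "max_length", "truncation": True, "max_length": 512}

# 5 epochs, learning rate 2e-5; batch size 16 is one value in the 4-16 range.
TRAINING_KWARGS = {
    "num_train_epochs": 5,
    "learning_rate": 2e-5,
    "per_device_train_batch_size": 16,
    "per_device_eval_batch_size": 16,
}


def fine_tune(checkpoint, train_rows, valid_rows, output_dir="finetuned-model"):
    """Fine-tune a checkpoint such as "roberta-base" for 3-class classification.

    train_rows and valid_rows are lists of {"text": str, "label": str} dicts
    (a hypothetical input format). Third-party imports are kept inside the
    function so the settings above can be inspected without them.
    """
    from datasets import Dataset
    from transformers import (
        AutoModelForSequenceClassification,
        AutoTokenizer,
        Trainer,
        TrainingArguments,
    )

    tokenizer = AutoTokenizer.from_pretrained(checkpoint)
    model = AutoModelForSequenceClassification.from_pretrained(
        checkpoint, num_labels=len(LABELS), id2label=ID2LABEL, label2id=LABEL2ID
    )

    def to_dataset(rows):
        data = Dataset.from_dict(
            {
                # Lower-case explicitly, since not every tokenizer is uncased.
                "text": [row["text"].lower() for row in rows],
                "label": [LABEL2ID[row["label"]] for row in rows],
            }
        )
        return data.map(
            lambda batch: tokenizer(batch["text"], **TOKENIZER_KWARGS), batched=True
        )

    trainer = Trainer(
        model=model,
        args=TrainingArguments(output_dir=output_dir, **TRAINING_KWARGS),
        train_dataset=to_dataset(train_rows),
        eval_dataset=to_dataset(valid_rows),
    )
    trainer.train()
    return trainer
```

After training, `trainer.evaluate()` reports the loss on the validation rows, and `trainer.predict(...)` can be applied to a tokenized test set.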
Prediction performance
The models were evaluated using accuracy, weighted precision, recall, and F1-score, as well as the macro F1-score. Performance was measured on the predictions for the test dataset, which was unseen during training.
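The difference between weighted and macro averaging can be illustrated with a short, dependency-free sketch. The weighted F1-score averages per-class F1 values in proportion to class frequency, whereas the macro F1-score gives each class the same weight. The toy labels below (90 / 5 / 5) are made-up counts, and the predictor that always outputs the majority class is only there to show how the two averages diverge on imbalanced data.

```python
from collections import Counter


def per_class_f1(y_true, y_pred, labels):
    """Return {label: F1} computed one-vs-rest for each label."""
    scores = {}
    for label in labels:
        pairs = list(zip(y_true, y_pred))
        tp = sum(1 for t, p in pairs if t == label and p == label)
        fp = sum(1 for t, p in pairs if t != label and p == label)
        fn = sum(1 for t, p in pairs if t == label and p != label)
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        total = precision + recall
        scores[label] = 2 * precision * recall / total if total else 0.0
    return scores


def macro_f1(y_true, y_pred, labels):
    """Unweighted mean of per-class F1: every class counts equally."""
    scores = per_class_f1(y_true, y_pred, labels)
    return sum(scores.values()) / len(labels)


def weighted_f1(y_true, y_pred, labels):
    """Mean of per-class F1 weighted by the number of true instances."""
    scores = per_class_f1(y_true, y_pred, labels)
    support = Counter(y_true)
    return sum(scores[label] * support[label] for label in labels) / len(y_true)


def accuracy(y_true, y_pred):
    return sum(1 for t, p in zip(y_true, y_pred) if t == p) / len(y_true)


# Toy imbalanced test set and a predictor that always answers "none".
labels = ["none", "gain", "loss"]
y_true = ["none"] * 90 + ["gain"] * 5 + ["loss"] * 5
y_pred = ["none"] * 100

toy_accuracy = accuracy(y_true, y_pred)
toy_weighted = weighted_f1(y_true, y_pred, labels)
toy_macro = macro_f1(y_true, y_pred, labels)
```

On this toy example, accuracy and weighted F1 stay high while the macro F1 collapses, which is why the macro score is the more demanding indicator for the minority weight gain and weight loss classes.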
Results
Characteristics of the training data
The database included 80,594 comments about 32 different antidepressants, from four websites: Drugs.com (35.8 % [28,876]), WebMD (43.1 % [34,748]), Everyday Health (18.2 % [14,672]), and Ask a Patient (2.9 % [2,298]). Of these comments, 8,000 were randomly selected for manual annotation, stratified by website and antidepressant molecule. The annotation process was designed to allow for some overlap between text inputs so that inter-rater reliability could be estimated. Out of the 8,000 text inputs, 3,500 were evaluated twice, resulting in 96 label discrepancies (2.74 %) that were reconciled by an expert reviewer. This gave a Krippendorff’s alpha of 0.841. The distribution of labels in the dataset is presented by drug in Table S1. Following the data split, the training and validation sets comprised 6,000 and 1,000 labeled inputs, respectively, while 1,000 inputs were kept for testing. As expected, labels associated with gaining or losing weight were in the minority in the training and testing sets, with almost 90% of the labels corresponding to an absence of mention of change in body weight in the input text. This led to an accuracy of 0.897 in the validation dataset for a dummy classifier predicting the most represented class (see Table 1).
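For reference, Krippendorff's alpha for nominal labels can be computed from the coincidences between raters as alpha = 1 - D_o / D_e, where D_o is the observed and D_e the expected disagreement. The sketch below is a generic implementation on made-up ratings; it is not the code used to obtain the value reported above.

```python
from collections import Counter
from itertools import permutations


def krippendorff_alpha_nominal(units):
    """Krippendorff's alpha for nominal data.

    units is a list of lists, each holding the labels given to one text
    by the raters who saw it. Texts rated only once are ignored.
    """
    coincidences = Counter()
    for ratings in units:
        m = len(ratings)
        if m < 2:
            continue  # a single rating cannot be paired
        # Every ordered pair of ratings of the same text, weighted by 1/(m-1).
        for a, b in permutations(ratings, 2):
            coincidences[(a, b)] += 1 / (m - 1)

    totals = Counter()
    for (a, _b), weight in coincidences.items():
        totals[a] += weight
    n = sum(totals.values())

    # Both terms are left multiplied by n, which cancels in the ratio.
    observed = sum(w for (a, b), w in coincidences.items() if a != b)
    expected = sum(
        totals[a] * totals[b] for a in totals for b in totals if a != b
    ) / (n - 1)
    if expected == 0:
        raise ValueError("alpha is undefined when only one label is used")
    return 1 - observed / expected


# Made-up example: four texts rated twice, one disagreement.
toy_units = [["gain", "gain"], ["gain", "gain"], ["loss", "loss"], ["loss", "gain"]]
toy_alpha = krippendorff_alpha_nominal(toy_units)
```

Because alpha corrects for agreement expected by chance, it is more informative than the raw discrepancy rate when one label dominates the data, as is the case here.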
Models performance
BERT, RoBERTa, DistilBERT, and SqueezeBERT all performed well after fine-tuning, with BERT and RoBERTa performing best, each with a weighted F1-score of 0.976. Their macro F1-scores were 0.865 and 0.891, respectively. Precision and recall for each class are presented for all models in Table S2. DistilBERT performed poorly in zero-shot classification, while DeBERTa’s performance seemed to scale with the number of parameters: the large DeBERTa model (304 M parameters) came close to the smaller fine-tuned models. Confusion matrices, calculated on the same test set for all models, are available as Supplementary Materials in Figures S1 and S2. Analysis of the incorrectly predicted comments showed that the main source of errors was situations where a weight loss or weight gain was mentioned in the text but did not refer to the drug covered by the comment. All comments affected by prediction errors contained syntax elements mentioning weight gain or loss, changes in appetite, or nausea and vomiting. Prediction errors were mainly due to a lack of attention to temporality in the narrative, to the mention of weight without any indication of change, or to the expression of a desire to gain or lose weight, whether or not related to the medication. In addition, it is important to note that some prediction errors can be considered a posteriori as labeling errors. All of these misclassifications can be considered false positives, indicating that the sensitivity of detecting changes in body weight would have been higher if we had not required the AE to be related to the drug that the comment was about.
Discussion
In this work, we explored the potential of large language models (LLMs) for detecting signals of body weight change as a side effect of antidepressants.
A key point of this work is that we have built a valuable human-annotated database, controlled by experts, with a substantial size of 8,000 input texts to evaluate model performance. We have made our dataset publicly available to promote open science and enable other researchers to build upon our labeling efforts (doi to be released). Collecting data on health-focused sites was relevant because it is estimated that only 10% of medical content on general social networks includes information on AEs, compared to 20-25% on health-focused platforms, for all AEs combined (19). Focusing on a single AE, we observed that about 10% of the reviews of antidepressant drugs mentioned weight-related AEs, confirming that this AE might be a concern for patients taking these treatments. This high rate also confirmed that the choice of data source is important to reduce noise and identify AEs more easily.
In terms of model performance, the fine-tuned models were able to automatically distinguish situations of weight gain or loss associated with antidepressant treatment in user-generated online content. The performance of all the fine-tuned models was good, even for LLMs with a more limited number of parameters. We showed that LLMs can enable the detection of subtle signals within tens of thousands of comments pertaining to antidepressants, which could streamline the signal screening process. Zero-shot classification results are also very encouraging, especially with the large DeBERTa model. It has been demonstrated that pre-trained models with a larger number of parameters show better generalization (44–46). It will be interesting to explore even larger models in future work. Zero-shot approaches eliminate the need for annotation, making it possible to study a wide range of adverse events directly from extracted user reviews.
These models also tended to produce false positives. This is partly due to the annotation strategy, which focused only on mentions of body weight changes for the molecule being discussed in the review. Weight loss seems more difficult to classify and can be confused with weight gain, perhaps because these two classes share very similar syntactic elements. Although an expert review of identified signals will likely remain essential for assessing a drug’s causality in the occurrence of adverse effects, it will be interesting to investigate how these models can help contextualize the findings to support causality determination (47). Because online comments are often written with little contextual information, it is difficult to establish a causal relationship between drug use and the occurrence of an event. This might be the strongest limitation for the use of these data for pharmacovigilance.
Finally, extracting relevant medical information from online data depends on the quality of the data source. The ability of malicious social media bots to generate realistic comments raises doubts about the authenticity of these comments and reviews, making it challenging to distinguish genuine patient feedback (48–50). For this reason, data from web platforms must be interpreted carefully. Specialized tools to detect fake reviews will probably be needed as a safeguard. Provided that website policies can at least partially prevent fake comments and validate the credibility of their data, LLMs could be used at scale to provide insights for a better evaluation of medications in real-world conditions of use.
Funding information
This study did not receive any funding.
Data and Code availability
Labeled data are available as an open-source dataset on FigShare (DOI to be specified). Code to reproduce the tables and Figures S1 and S2 is available in this public repository (git address to be specified).
Acknowledgements
We acknowledge Maxime Alter, Furkan Erol, Ahmed Guendouz, Vitoria Morais-Brazil, Paule Nkeng, Inas Oulkaid-Mouaddan, Sofia Salaa and Chinar Salmanli for their valuable help in data annotation.