Abstract
Objectives Online health forums provide rich and untapped real-time data on population health. Through novel data extraction and natural language processing (NLP) techniques, we characterise the evolution of mental and physical health concerns relating to the COVID-19 pandemic among online health forum users.
Setting and design We obtained data from 739,434 posts by 53,134 unique users of three leading online health forums: HealthBoards, Inspire and HealthUnlocked, from the period 1st January 2020 to 31st May 2020. Using NLP, we analysed the content of posts related to COVID-19.
Primary outcome measures
Proportion of forum posts containing COVID-19 keywords
Proportion of forum users making their very first post about COVID-19
Number of COVID-19 related posts containing content related to physical and mental health comorbidities
Results Posts discussing COVID-19 and related comorbid disorders spiked in early- to mid-March, around the time lockdowns were implemented globally, prompting a large number of users to post on online health forums for the first time. The pandemic and the corresponding public response have had a significant impact on posters’ queries regarding mental health.
Conclusions We demonstrate it is feasible to characterise the content of online health forum user posts regarding COVID-19 and measure changes over time. Social media data sources such as online health forums can be harnessed to strengthen population-level mental health surveillance.
Strengths and limitations of this study
Analysing online health forum data using NLP revealed a substantial rise in activity which correlated with the onset of the COVID-19 pandemic.
Real-time data sources such as online health forums are essential for monitoring fluctuating population health and tailoring responses to daily pressures.
It is not yet possible to establish COVID-19 status or whether concerned posters have pre-existing mental or physical health issues, are recovered, or have become unwell for the first time.
Online health forums are help-seeking forums, which introduces self-selection bias.
Introduction
Measures to tackle the COVID-19 pandemic have resulted in unprecedented societal restrictions worldwide. The mental health impacts of these measures and accompanying socioeconomic stressors are likely to be extensive; identifying and quantifying these impacts is now an urgent priority.[1] For example, social distancing restrictions make it harder for individuals to maintain regular contact with friends and family, as well as with health and social care professionals. Furthermore, the psychological and emotional burden of the pandemic (and its consequences) may increase the risk of relapse or worsen existing mental health disorders. Conversely, mental disorders can increase susceptibility to infections.[2,3]
Real-world data from online resources may be extracted using natural language processing (NLP) techniques to provide automated, population-level health surveillance. These methods can be used to rapidly ascertain discussion related to COVID-19 and associated symptoms and comorbidities. NLP has previously been used to identify medically relevant information from web pages and analyse extracted text.[4,5] Applying these techniques to real world data sources such as social media and online forums may be used to supplement active data collection from participants in prospective observational research. Recent studies have applied this approach to Twitter, Facebook and Reddit data to forecast the emergence of depression and post-traumatic stress disorder,[6] predict depression in the general population,[7] identify mothers at risk of postpartum depression,[8] and investigate suicidal ideation.[9]
While social media platforms such as Twitter, Facebook and Reddit are commonly used, other internet resources such as online health forums have so far been neglected. Online health forums are enriched for health information and receive millions of posts each year, therefore providing untapped reservoirs of healthcare data at population level.
In a recent proof-of-concept study we demonstrated that data extracted from online health forums can be used to detect health discussion trends that correlate with real-life events.[10] Here, we use the same technology to analyse online health forum data discussing mental and physical health problems associated with the COVID-19 pandemic. We use NLP techniques to extract, from online health forum posts related to the COVID-19 pandemic, references to specific comorbid illnesses and to the direct and indirect impacts of the pandemic on mental or physical health.
Methods
Study Design and Setting
We obtained data from online health forums using NLP. Online forums are discussion websites where people hold conversations in the form of posted messages. A single conversation is called a thread: a chain of posts identified within a forum by a title and an individual URL. Clicking on the thread title opens the thread, which contains one or more posts; these may be from the user who started the thread (i.e. the original poster) or from other users who have replied within it. In this study, we analysed text data in thread titles and in individual posts within a thread. We analysed posts written in English only. Depending on the forum’s settings, users can post anonymously or must register with the forum to post messages, with most users opting not to use personally identifiable information to register their account. Registration may not be required for read-only access. Most forums recommend that users do not use personally identifiable information when posting. Online health forums specifically cover health topics and offer peer support for various health conditions.
We collected data posted from 1st January 2020 to 31st May 2020 on three major online health forums: HealthBoards (www.healthboards.com), Inspire (www.inspire.com) and HealthUnlocked (www.healthunlocked.com). These forums were chosen on the basis that they have global user coverage, include subforums on several aspects of healthcare, have a large user base contributing to regular activity on the forum, and allow information to be feasibly extracted using NLP.
HealthBoards was founded in California, USA in 1997 and offers patient to patient health support. Inspire, founded in 2005, is a US healthcare social network managing online support groups for patients and caregivers. HealthUnlocked is a British online health forum launched in 2011 with a similar offering to HealthBoards and Inspire. Registration and participation in all three forums are free of charge to users.
Analysis Using NLP
Definition of search terms
To investigate the potential impact of COVID-19 on users posting in online health forums, we classified threads and posts using groups of case-insensitive keywords. One group identified content related to the COVID-19 pandemic itself. Further groups identified medical treatment in an intensive care unit, physical symptoms as a direct consequence of COVID-19 infection, and mental health symptoms as a consequence of measures taken in response to the pandemic.
Search terms used to identify whether a thread or post was related to COVID-19 were ‘covid’, ‘covid-19’, ‘coronavirus’, ‘corona’, ‘sars-cov-2’, ‘sars-2’, ‘shielding’, ‘pandemic*’, ‘vulnerable’, ‘quarantine’, ‘lockdown’, ‘distancing’, ‘isolation’, ‘isolating’ where * indicates a wildcard search term.
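The search terms above can be turned into a single case-insensitive pattern. The following is a minimal sketch only, not the study's own code: it assumes that the wildcard `*` expands to any run of word characters (in the spirit of the `\w*` example in the Supplementary Material), and the helper names `term_to_regex` and `is_covid_related` are hypothetical.

```python
import re

# Search terms exactly as listed in the text; '*' marks a wildcard suffix.
COVID_TERMS = [
    "covid", "covid-19", "coronavirus", "corona", "sars-cov-2", "sars-2",
    "shielding", "pandemic*", "vulnerable", "quarantine", "lockdown",
    "distancing", "isolation", "isolating",
]


def term_to_regex(term: str) -> str:
    """Escape a term; a trailing '*' becomes a word-character wildcard (assumption)."""
    if term.endswith("*"):
        return re.escape(term[:-1]) + r"\w*"
    return re.escape(term)


# Longest terms first, so that 'covid-19' is tried before 'covid'.
_alternatives = "|".join(
    term_to_regex(t) for t in sorted(COVID_TERMS, key=len, reverse=True)
)

# Lookarounds require the match to stand as a whole word (assumed boundary rule).
COVID_PATTERN = re.compile(r"(?<!\w)(?:" + _alternatives + r")(?!\w)", re.IGNORECASE)


def is_covid_related(text: str) -> bool:
    """Return True if the text contains at least one COVID-19 search term."""
    return COVID_PATTERN.search(text) is not None
```

Under these assumptions, "pandemics" matches through the wildcard, whereas "coronary" does not match "corona" because the term must appear as a whole word.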
Table 1 provides the final keywords used to search posts within COVID-19 related threads; the Python coded search terms are provided in Supplementary Tables 2 and 3. We tested the specificity of keywords by searching for matches occurring before 1st January 2020. For threads, these were matches in the title and URL, while for posts, these were matches in the entire text (see Extracting and matching keywords below). Term incidence and excluded keywords are provided in Supplementary Table 5.
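The specificity check described above amounts to counting how often each candidate keyword matches content that predates the pandemic. A minimal sketch under stated assumptions follows: posts are represented as `(date, text)` pairs, whole-word matching uses `\b`, and the function name `pre_pandemic_hits` is hypothetical. The criterion used in the study to exclude a keyword is not stated in the text and is not modelled here.

```python
import re
from collections import Counter
from datetime import date

# Cut-off taken from the text: matches before this date cannot concern COVID-19.
PANDEMIC_START = date(2020, 1, 1)


def pre_pandemic_hits(posts, keyword_patterns):
    """Count, per keyword, the posts dated before PANDEMIC_START that match it.

    posts: iterable of (date, text) pairs (assumed representation).
    keyword_patterns: dict mapping a keyword label to its regular expression.
    """
    compiled = {
        label: re.compile(r"\b(?:" + pattern + r")\b", re.IGNORECASE)
        for label, pattern in keyword_patterns.items()
    }
    hits = Counter({label: 0 for label in compiled})
    for day, text in posts:
        if day >= PANDEMIC_START:
            continue  # only pre-pandemic content informs specificity
        for label, regex in compiled.items():
            if regex.search(text):
                hits[label] += 1
    return hits
```

A keyword with many pre-pandemic hits is in common use outside the COVID-19 context and is therefore less specific.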
Data pre-processing
Data obtained from different online health forums come in various formats. We standardised and normalised the data before analysing them. This included normalisation of Unicode strings and whitespace characters, standardisation of date and time, and standardisation of location through the GeoNames.org database.
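The text and date normalisation steps can be illustrated with the Python standard library. This is a sketch under assumptions: the normalisation form (NFKC), the treatment of timestamps without a time zone as UTC, and the function names are illustrative choices, not details given in the text. Location standardisation through GeoNames.org is not shown.

```python
import re
import unicodedata
from datetime import datetime, timezone


def normalise_text(raw: str) -> str:
    """Normalise Unicode (NFKC, an assumed choice) and collapse whitespace runs."""
    text = unicodedata.normalize("NFKC", raw)
    text = re.sub(r"\s+", " ", text)  # tabs, newlines, repeated spaces -> one space
    return text.strip()


def normalise_timestamp(raw: str, fmt: str) -> str:
    """Parse a forum-specific timestamp and return it as ISO 8601 in UTC.

    Timestamps without time zone information are assumed to be UTC.
    """
    parsed = datetime.strptime(raw, fmt)
    if parsed.tzinfo is None:
        parsed = parsed.replace(tzinfo=timezone.utc)
    return parsed.astimezone(timezone.utc).isoformat()
```

Each forum would supply its own `fmt` string, so that posts from all three sources end up on a common time axis before weekly aggregation.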
Extracting and matching keywords
We extracted the keywords in thread titles and post content using lemmatisation. For flexibility and efficiency, search terms in posts and thread titles were matched using regular expressions that accounted for both inflection and common spelling variants. Matching was case-insensitive and limited to whole words in the post content and thread title; when matching thread URLs, parts containing words were considered. To prevent spurious matches, words shorter than four letters (e.g. ICU) were considered valid matches only if they were delimited by non-word characters.
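The matching rules for text and for URLs can be sketched as follows. This is an illustrative reading of the rules described above, not the original implementation: the function name `compile_keyword` and its `in_url` and `short` flags are hypothetical, and the caller is assumed to know whether a keyword is shorter than four letters.

```python
import re


def compile_keyword(pattern: str, *, in_url: bool = False, short: bool = False) -> re.Pattern:
    """Compile a keyword pattern under the matching rules sketched in the text.

    Post content and thread titles: case-insensitive, whole words only.
    Thread URLs: the whole-word requirement is lifted, except for short
    keywords (fewer than four letters), which must be delimited by
    underscores or other non-alphanumeric characters.
    """
    if not in_url:
        return re.compile(r"\b(?:" + pattern + r")\b", re.IGNORECASE)
    if short:
        return re.compile(
            r"(?<![A-Za-z0-9])(?:" + pattern + r")(?![A-Za-z0-9])", re.IGNORECASE
        )
    return re.compile(pattern, re.IGNORECASE)
```

For example, `compile_keyword("icu")` matches "admitted to ICU" but not "difficult", and `compile_keyword("icu", in_url=True, short=True)` matches a URL fragment such as `my_icu_stay` but not `difficult-times`.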
Analysis of COVID-19 threads to identify changes in COVID-19 related user activity and physical and mental health associations over time
We identified the users contributing to a COVID-19 related thread in a given week. We then retrieved all the other posts made by the same authors in the previous, same and subsequent calendar weeks. We scanned such posts for physical symptom, mental health symptom or intensive care keywords as defined in Table 1, and recorded whether each of these topics was mentioned by the author during the time window. We performed this analysis to establish variations in the prevalence of concerns relating to physical symptom, mental health symptom and intensive care keywords over the course of the pandemic during 2020. Weekly counts were measured each Sunday for the previous week.
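The three-week window analysis can be sketched as below. This is a simplified illustration under assumptions: posts are dictionaries with the hypothetical keys `author`, `day`, `text` and `covid_thread`, weeks are taken to run from Monday to Sunday, and each topic is represented by a caller-supplied matching function standing in for the Table 1 keyword groups.

```python
from collections import defaultdict
from datetime import date, timedelta


def week_ending(day: date) -> date:
    """Return the Sunday closing the Monday-to-Sunday week containing `day`."""
    return day + timedelta(days=6 - day.weekday())


def weekly_topic_users(posts, topic_matchers):
    """Count, per week, the active COVID-19 thread users who mention each topic.

    A user is active in a week if they posted in a COVID-19 related thread.
    All of that user's posts in the previous, same and following week are
    scanned, irrespective of thread. Returns {week_end: {topic: user count}}.
    """
    texts_by_author_week = defaultdict(list)
    active_users = defaultdict(set)
    for post in posts:
        week = week_ending(post["day"])
        texts_by_author_week[(post["author"], week)].append(post["text"])
        if post["covid_thread"]:
            active_users[week].add(post["author"])

    counts = {}
    for week, authors in active_users.items():
        counts[week] = {topic: 0 for topic in topic_matchers}
        for author in authors:
            window = [
                text
                for offset in (-7, 0, 7)  # previous, same and following week
                for text in texts_by_author_week.get(
                    (author, week + timedelta(days=offset)), []
                )
            ]
            for topic, matches in topic_matchers.items():
                if any(matches(text) for text in window):
                    counts[week][topic] += 1
    return counts
```

Counting each user at most once per topic and week keeps the measure focused on the number of concerned users rather than on the volume of their posts.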
Analysis of thread titles
We inspected thread titles to identify how many mentioned a comorbidity in the title. We searched for terms related to autoimmune disorders, mental disorders or worry, cancer, cardiovascular problems or stroke, and diabetes as listed in Table 2; the Python coded search terms are provided in Supplementary Table 4.
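A sketch of the thread title analysis is given below. The patterns for autoimmune disorders, psychosis and infarction are the examples quoted in the Supplementary Material; their assignment to categories, the patterns for cancer and diabetes, and the function names are hypothetical placeholders and do not reproduce the full term lists of Table 2.

```python
import re

# One illustrative pattern per category; the real lists are in Table 2.
CATEGORY_PATTERNS = {
    "autoimmune": r"auto[_\s-]?immune",
    "mental": r"psycho[st]i[sc]",
    "cardiovascular": r"infarct\w*",
    "cancer": r"cancer\w*",    # hypothetical placeholder
    "diabetes": r"diabet\w*",  # hypothetical placeholder
}

COMPILED = {
    category: re.compile(r"\b(?:" + pattern + r")\b", re.IGNORECASE)
    for category, pattern in CATEGORY_PATTERNS.items()
}


def comorbidities_in_title(title: str) -> set:
    """Return the set of comorbidity categories mentioned in a thread title."""
    return {category for category, regex in COMPILED.items() if regex.search(title)}


def summarise_titles(titles):
    """Return the share of titles mentioning any, and two or more, categories."""
    found = [comorbidities_in_title(title) for title in titles]
    total = len(found)
    if total == 0:
        return {"any": 0.0, "two_or_more": 0.0}
    return {
        "any": sum(1 for cats in found if cats) / total,
        "two_or_more": sum(1 for cats in found if len(cats) >= 2) / total,
    }
```

Returning a set per title makes it straightforward to report both single comorbidities and titles that mention several at once.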
Analysis of first-time posters in a COVID-19 related thread
We analysed the first ever post published by a user to determine the proportion of first-time posters who started out by contributing to a COVID-19 related thread. We performed this analysis to determine the degree to which new users were motivated to make their first post in relation to the COVID-19 pandemic and how this varied over time during 2020.
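The first-time poster analysis can be sketched as follows. The sketch assumes that each user's full posting history is available as `(author, day, in_covid_thread)` tuples and that weeks end on Sunday; the function name is hypothetical. With day-level dates, ties between posts on the same day are resolved in favour of the first one encountered, so full timestamps would be preferable in practice.

```python
from collections import Counter
from datetime import date, timedelta


def first_post_covid_share(posts):
    """Return, per week, the share of first-time posters whose first ever post
    appeared in a COVID-19 related thread.

    posts: iterable of (author, day, in_covid_thread) tuples covering each
    user's full posting history (assumed representation).
    """
    first_post = {}
    for author, day, in_covid_thread in posts:
        if author not in first_post or day < first_post[author][0]:
            first_post[author] = (day, in_covid_thread)

    totals, covid_firsts = Counter(), Counter()
    for day, in_covid_thread in first_post.values():
        week = day + timedelta(days=6 - day.weekday())  # Sunday closing the week
        totals[week] += 1
        covid_firsts[week] += int(in_covid_thread)
    return {week: covid_firsts[week] / totals[week] for week in totals}
```

Users whose first post predates the study period are attributed to the week of that earlier post, so they do not inflate the number of first-time posters in 2020.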
Implementation and computation
All descriptive analyses were performed using bespoke software written in Python. An outline of the coding approach employed is included in the Supplementary Material.
Ethics and Data Sharing
We consulted and adhered to internet research guidelines from the Association of Internet Researchers [11] and the British Psychological Society (BPS) [12] to inform study development.
All data have been provided in aggregate form to protect the privacy of forum users. As we analysed data in aggregate form, it was not possible to seek individual user consent. However, users were aware that their data were available for anyone to view online by virtue of contributing to publicly available online health forums. Of note, the BPS guidelines advise that “valid consent should be obtained where it cannot be reasonably argued that online data can be considered ‘in the public domain’ or that undisclosed usage is justified on scientific value grounds”. This approach is consistent with similar studies examining healthcare related data from Twitter.[13,14]
Queen Mary University of London (QMUL) is registered as a data controller with the Information Commissioner’s Office (ICO; registration number: Z5507327), which covers all research activities undertaken at the university. All data were analysed on QMUL IT facilities, which employ a two-layer security model as per the university’s security policy.
Given licensing and privacy issues, it is not possible to publicly release the aggregate dataset generated from the three online health forums investigated. However, we welcome collaboration with other researchers and healthcare policy makers. Anyone interested in accessing the aggregate data and data analysis code should contact the guarantor (f.smeraldi@qmul.ac.uk).
Patient and public involvement
As the data were analysed in aggregate form it was not possible to involve individual forum users in the design or conduct of the study.
Results
Related posts and active threads
HealthUnlocked was the most frequently used forum, accounting for 97% of overall posts and 97% of posts mentioning COVID-19 in the thread title or post content during the study period (Table 3).
Weekly post count for HealthUnlocked peaked in mid-March. Post count for Inspire declined sharply in the last two weeks of March. Post count for HealthBoards declined slowly across the entirety of the observation period (Supplementary Figure 1).
Across all three forums, there were a total of 3,342 threads containing a COVID-19 keyword within the thread title or URL. These contained a total of 44,894 posts during the study period (1st January 2020 to 31st May 2020). A total of 35,581 posts (whether in COVID-19 related threads or otherwise) contained a COVID-19 keyword during the study period. The proportion of posts containing COVID-19 keywords increased rapidly across all forums in early March (Supplementary Figures 2 and 3), corresponding with the World Health Organisation’s declaration of COVID-19 as a pandemic on 11th March 2020. The smaller online forums (Inspire and HealthBoards) had a greater peak in percentage of total posts containing COVID-19 keywords. The total number of posts containing COVID-19 search terms declined from mid-April onwards.
Until late March, most posts about COVID-19 were written by users who had not yet posted on the topic: the proportion was over 90% at the beginning of the observation period and remained above 50% until the week ending 29th March. By the end of the observation period, the percentage of weekly posts in COVID-19 threads written by new entrants to the discussion had fallen to 30%, which is still sizeable. While many of these users may have posted before on the forum about other topics, Supplementary Figure 4 presents the proportion of posters whose very first post to a forum appeared in a COVID-19 related thread. This proportion peaked above 20% in the week ending 22nd March. Given that these forums cover a very broad spectrum of health topics, this is a remarkably high fraction. It includes both new joiners and users who were previously silent members of the forums, possibly for a long time (so-called “lurkers”), and who may have been spurred into a more active role by the pandemic.
Thread title analysis
Over a quarter of COVID-19 related thread titles mentioned another condition of interest (Table 4). After cancer and autoimmune diseases, mental health represented a major area of concern for online health forum users posting about COVID-19, comparable to respiratory and circulatory diseases (Table 2). Around 0.5% of thread titles mentioned two or more comorbidities.
User analysis
Posts in threads related to COVID-19 were analysed to determine the number of users contributing in each given week. For each active user, all posts in the previous, same and following calendar weeks were scanned irrespective of thread for mentions of physical symptom, mental health symptom or intensive care keywords. The number of active users mentioning each of these concerns peaked in the week ending 22nd March and subsequently declined but still remained elevated above the January baseline. In particular, users discussing mental health outnumbered users mentioning the other topics (Figure 1).
Discussion
Using a novel technique to analyse data from online health forums, we found a marked increase in posts related to COVID-19 across the observation period of 1st January 2020 to 31st May 2020. The frequency of these posts increased rapidly in early March 2020 corresponding with the World Health Organisation’s declaration of COVID-19 as a pandemic.
During this period, we found mental health symptom keywords were most frequently mentioned by authors of COVID-19 related posts (either contextually or in separate messages), followed by physical symptoms and intensive care keywords, suggesting that the pandemic and the public health response to it have had a significant impact on posters’ concerns regarding mental health. The marked increase in mental health symptom related posts in early March, when the WHO declared the COVID-19 pandemic, is consistent with preliminary worldwide data that show increases in anxiety and depression in response to the outbreak.
The mental health impacts of COVID-19 and associated physical distancing restrictions are likely to be extensive and wide-reaching. There is a growing body of evidence supporting the neuropsychiatric effects of coronavirus infections.[15] Restrictions fuel socioeconomic stressors such as unemployment, loneliness and financial burden, which are all implicated in the development of mental ill health.[16] Increased rates of bereavement, newfound caring responsibilities and interruptions to education are likely to be particularly stressful to children and young adults.[17]
A preliminary survey of 3,545 German respondents found evidence of substantial mental health burden from travel and physical distancing restrictions, including increased levels of stress, anxiety, depressive symptoms, sleep disturbance and irritability.[18] Worsening mental health has been confirmed in samples with both pre- and post-pandemic information for direct comparison: the Avon Longitudinal Study of Parents and Children (ALSPAC) found that probable anxiety disorder had doubled compared with pre-pandemic levels (26% vs 13%) and that wellbeing was lower, particularly in young people, women and those with pre-existing conditions.[19] The literature on social media mining for COVID-19 mental health related trends is limited. A study analysing the evolution of four emotions across Twitter (fear, anger, sadness and joy) was able to identify developing shared distress and topics of interest relating to those emotions.[20]
Our findings also suggest that mental and physical health concerns documented in online forum posts have levelled off following their peak in March 2020. The number of users active in COVID-19 threads who also wrote posts concerning mental health symptoms fell from a peak of 1,355 per week in March to 253 per week by the end of the observation period (compared to a mean of 30 per week in January), suggesting that as time went on most users had begun to adjust to the consequences of the pandemic. Other studies have identified a similar trend. An analysis of 10 million Google searches within the United States found large shifts in mental health symptom searches linked to stay-at-home orders across the week commencing 16th March 2020.[21] Searches for topics related to anxiety, negative thoughts about oneself and the future, insomnia and suicidal ideation increased dramatically prior to stay-at-home orders and levelled off once the orders were announced. These patterns were largely specific to searches for mental health related information and were not seen for physical conditions.
Over the entire period, on average 4% of first-time posters (over 20% in the peak period) made their very first contribution to the forum in a COVID-19 related thread. Furthermore, 77% of COVID-19 threads were started by users who had never posted about the topic before, and chose to start out by creating their own thread. A certain degree of motivation is required to take someone to the point of making that first post on a forum, and also for starting a thread; our finding suggests that the pandemic is driving users to engage more actively in community forum services in times of uncertainty.
Strengths and weaknesses
Online health forums are an important source of real-world, real-time, population-level data on people living through the COVID-19 pandemic. Online health forums also afford users anonymity to discuss aspects of their experience they might otherwise have been embarrassed or fearful to disclose in identifiable forms of social media. We have demonstrated that it is possible to automate information extraction from these posts using natural language processing, providing access to a rich reservoir of previously untapped real-world data from health-specific online resources.
Our approach was able to automatically extract data from a large sample of over 53,000 unique users at a fraction of the cost of previous approaches that have relied on social media individual participant recruitment and manual review of posts generating sample sizes in the low hundreds.[7] Some studies screened users on Twitter via depression symptom questionnaires and used their tweets to train depression onset classifiers.[6,22] Analogous approaches have been used with Facebook data.[8]
Our study has some limitations. At present it is difficult to establish whether concerned posters have pre-existing mental or physical health issues, have experienced confirmed COVID-19 illness themselves, are recovered, or have become unwell for the first time. Online health forums are help-seeking communities; this introduces self-selection bias in which individuals from disadvantaged backgrounds who do not have IT equipment/network connection to access online resources are under-represented and our results are therefore not generalisable to the entire population. Furthermore, as these forums have worldwide coverage we cannot isolate trends to one geographic region. However, future work could utilise the location data (see Data pre-processing in Methods) to explore this avenue.
Conclusions and future research
Publicly accessible sources of real-world data, such as online health forums analysed in this study, can strengthen population-level physical and mental health surveillance and provide a rapid and inexpensive means to inform public healthcare policy. We found that the majority of posts in online forum data related to COVID-19 concerned features related to mental health and that the peak in frequency of posts corresponded with the early phase of the pandemic, indicating the significant impact of COVID-19 on the mental health of susceptible populations.
As the pandemic evolves, further research using online forum data could improve our understanding of the long-term consequences of COVID-19 infection [23] and the longer-term socioeconomic consequences of travel and physical distancing restrictions that have been employed in many countries to manage viral transmission.[24,25] Analysis of real-world data, including social media and online health forums, could provide a useful insight into attitudes and perceptions towards novel therapeutics. This will be crucial to maximising uptake of effective preventative approaches such as mask-wearing, physical distancing, hygiene measures and potential vaccines.
Data Availability
Given licensing and privacy issues, it is not possible to release the dataset generated from the online health forums investigated in this study. However, we welcome collaboration with other researchers and healthcare policy makers. Anyone interested in accessing the aggregate data and data analysis code should contact the guarantor (f.smeraldi@qmul.ac.uk).
Supplementary Material
Methods - Search Terms and Keywords
Examples:
‘auto[_\s-]?immune’ will match all of “autoimmune”, “auto-immune” and “auto immune”.
‘psycho[st]i[sc]’ will match both “psychotic” and “psychosis”
‘infarct\w*’ will match “infarct”, “infarcts”, “infarction”, “infarctions” and “infarcted”
Matching is case-insensitive. When matching post content or thread titles, keywords have to appear as separate whole words; this requirement is lifted when matching thread URLs. To reduce spurious matches, keywords up to three letters long have to appear in URLs surrounded by underscores (_) or other non-alphanumeric characters in order to be counted as a match.
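The three example patterns above can be checked directly with Python's `re` module. The short sketch below only verifies the matches stated in the examples; the helper name `matches_all` is hypothetical.

```python
import re

# Patterns copied from the examples, with the strings each is said to match.
EXAMPLES = {
    r"auto[_\s-]?immune": ["autoimmune", "auto-immune", "auto immune"],
    r"psycho[st]i[sc]": ["psychotic", "psychosis"],
    r"infarct\w*": ["infarct", "infarcts", "infarction", "infarctions", "infarcted"],
}


def matches_all(pattern: str, words) -> bool:
    """Return True if the pattern fully matches every word, ignoring case."""
    regex = re.compile(pattern, re.IGNORECASE)
    return all(regex.fullmatch(word) is not None for word in words)
```

Using `fullmatch` mirrors the whole-word requirement for post content and thread titles: the pattern has to account for the entire word, not just a fragment of it.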
Methods - Coding Approach
Footnotes
Competing interests: All authors have completed the ICMJE uniform disclosure form at www.icmje.org/coi_disclosure.pdf and declare: RP has received funds from Janssen, Induction Healthcare and Holmusk outside the current study. The other authors declare no competing interests.
Ethics approval: Data for this study were drawn from publicly available online health forums and extracted in aggregate form for secondary data analysis rather than at individual user level. No individual user level data were retained, making it impossible to obtain informed consent. However, users were aware that their data were available for anyone to view online by virtue of contributing to publicly available online health forums. The data were analysed using the computing infrastructure based at Queen Mary University of London (QMUL) which employs a two-layer security model to maintain data privacy. QMUL is registered as a data controller with the Information Commissioner’s Office (ICO; registration number: Z5507327), which covers all research activities undertaken at the university.
Source of funding: RP has received support from a Medical Research Council (MRC) Health Data Research UK Fellowship (MR/S003118/1) and a Starter Grant for Clinical Lecturers (SGL015/1020) supported by the Academy of Medical Sciences, The Wellcome Trust, MRC, British Heart Foundation, Arthritis Research UK, the Royal College of Physicians and Diabetes UK. FS and CB were partly funded by an Alan Turing Institute (ATI) Fellowship and by an EPSRC COVID-19 Rapid Response Impact Acceleration Fund. Computational resources were partly funded by a Microsoft Azure Sponsorship through the ATI.
Role of funder: The views expressed are those of the authors and not necessarily those of the funders. The funders had no role in the design and conduct of the study; collection, management, analysis, and interpretation of the data; preparation, review, or approval of the manuscript; and decision to submit the manuscript for publication.
Data sharing: Given licensing and privacy issues, it is not possible to release the dataset generated from the online health forums investigated in this study. However, we welcome collaboration with other researchers and healthcare policy makers. Anyone interested in accessing the aggregate data and data analysis code should contact the guarantor (f.smeraldi@qmul.ac.uk).