Abstract
The opioid epidemic persists in the United States; in 2019, annual drug overdose deaths increased by 4.6% to 70,980, including 50,042 opioid-related deaths. The widespread abuse of opioids across geographies and demographics and the rapidly changing dynamics of abuse require reliable and timely information to monitor and address the crisis. Social media platforms include petabytes of participant-generated data, some of which offers a window into the relationship between individuals and their use of drugs. We assessed the utility of Reddit data for public health surveillance, with a focus on the opioid epidemic. We built a natural language processing pipeline to identify opioid-related comments and created a cohort of 1,689,039 geo-located Reddit users, each assigned to a city and state. We followed these users over a period of 10+ years and measured their opioid-related activity over time. We benchmarked the activity of this cohort against CDC overdose death rates for different drug classes and NFLIS drug report rates. Our Reddit-derived rates of opioid discussion strongly correlated with external benchmarks at the national, regional, and city levels. During the period of our study, kratom emerged as an active discussion topic; we analyzed mentions of kratom to understand the dynamics of its use. We also examined changes in opioid discussions during the COVID-19 pandemic; in 2020, many opioid classes showed marked increases in discussion patterns. Our work suggests the complementary utility of social media as a part of public health surveillance activities.
Introduction
The United States is currently experiencing an epidemic of opioid abuse. The number of opioid-related deaths has risen in almost every year since 1999, with the exception of 2018, and in 2019 no state experienced a significant decrease in opioid overdose deaths1. With no sign of the opioid epidemic ending, it is important that public health agencies be able to quickly identify, monitor, and address both established and emerging patterns of drug abuse.
Opioids are a class of compounds that interact with opioid receptors in the brain, most notably the mu opioid receptor. Opioids are either naturally occurring or synthesized, and can differ greatly in their ability to be agonistic or antagonistic to mu opioid receptor activity. Opioids that are agonists for the mu opioid receptor have the greatest pain relief benefits, but are also those with the highest abuse potential. Despite concerns about the abuse potential, opioids are currently the most effective means of treating acute pain and as such will likely remain a key component of the clinical toolkit. Therefore, it is necessary to monitor population-level rates of opioid use and abuse to balance benefits and risks.
To that end, the U.S. spends billions of dollars every year to conduct public health surveillance, and this surveillance relies on the integration of data from many different sources. For instance, the National Institute on Drug Abuse (NIDA) has several existing systems for tracking rates of drug abuse at the population level. NIDA’s National Drug Early Warning System (NDEWS) integrates information on drug use from community epidemiologists at satellite offices around the country2. NIDA’s Monitoring the Future effort conducts annual surveys of tens of thousands of students from hundreds of schools, asking questions about drug use experiences in the recent and distant past. The National Vital Statistics program at the CDC collects data on deaths from a large number of regional entities within the United States, identifying causes of mortality, including drug overdoses3. The National Forensic Laboratory Information System (NFLIS) is a DEA program that monitors national drug patterns through reports from state and local forensic labs4.
These systems are accepted by research and policy communities; they continue to grow in scale and scope, and have established benchmarks, methods, and resources for analysis. Despite these benefits, these systems are not perfectly able to monitor ongoing and developing drug epidemics. They suffer from several limitations: (1) they are not real-time, (2) they rely on an expert’s interpretation of another person’s health (e.g., that of epidemiologists, doctors, or social workers), (3) they are sparse in terms of geographic, demographic, and/or disease coverage, (4) they are tabular, with a constrained ability to provide understanding of observed trends, (5) they rely on data collection methods that can be biased, and (6) they are often hypothesis-driven5.
The U.S. Food and Drug Administration (FDA) and others have recently identified social media as a potentially useful source of real-world evidence for drug monitoring6,7. However, public health efforts have been slow to adopt social media surveillance due to limitations and challenges currently posed by the data. Social media datasets can be very large, and most of their content is of no relevance to public health, so manipulating and analyzing the data requires additional expertise and resources8. Additionally, evidence of the utility of social media data is still relatively limited compared to evidence for traditional methods of public health surveillance.
Social media also offers benefits. It provides access to an observational data source that is potentially orthogonal to the data streams currently used in public health surveillance, and the pseudo-anonymous nature of these platforms can make some users more inclined to discuss stigmatized behaviors9. Many individuals are turning to these platforms to discuss their health and health issues10. In addition, the data is unbiased by expert interpretation, as it often consists of first-person reports. Users’ posts often include additional details about their lives or environments in their discussions of health behaviors and conditions, providing contextual information often lacking from tabular analysis. Finally, social media data is longitudinal, potentially providing a vivid story about public health over time.
The majority of research on drug and disease surveillance in social media has focused on the Twitter platform, although some studies make use of Reddit, Facebook, Instagram, or smaller online discussion forums. A review article by Sarker et al. found over 1000 articles since 2012 that are related to social media and drug abuse monitoring11. Narrowing the focus to geolocated drug abuse monitoring, Katsuki et al. found an association between prescription drug abuse and illicit online pharmacies on Twitter12. Chary et al. used a set of keywords to scan Twitter for opioid mentions and filtered the resulting mentions based on semantic distance to a set of manually labeled tweets. These tweets were grouped by geolocation, and the total count for each state was normalized by the total number of tweets collected from that state. They found a strong correlation between normalized Twitter frequency values and opioid use statistics from NSDUH13. A recent study by Sarker et al. followed 9,006 Twitter posts from Pennsylvania and demonstrated that social media posts processed through a machine learning pipeline generated substrate-level abuse-indicating tweet rates at the county level that were correlated with county-level opioid overdose death rates14. Studies that specifically use Reddit data include a study by Pandrekar et al. that investigated the opioid discussion on Reddit by analyzing 51,537 opioid-related posts, performing topic modelling, and finding the psychological categories of the opioid posts15. Chancellor et al. used Reddit posts about opioid recovery and discovered potential alternative treatments in those posts16.
In this paper, we use data from the social media platform Reddit to conduct public health surveillance of the opioid epidemic. Reddit, a pseudo-anonymous platform, was founded in 2005 and is the 6th most visited website in the United States. User activity on the Reddit platform takes place in topic-oriented subreddits, denoted with a leading ‘r/’, where users can congregate to discuss a particular topic of interest. Community moderators and users work to keep discussions focused on that particular topic. For example, ‘r/addiction’ is a subreddit where community members discuss and support each other in overcoming their addictions. In 2020, there were over 52 million daily active users on Reddit, who contributed over 303 million posts, 2 billion comments, and 49.2 billion upvotes17. Unlike other social media platforms like Twitter and Facebook, the majority of the historical activity on Reddit is archived and publicly available18. Overall, 11% of adults in the U.S. report using Reddit. The user base is skewed towards males, younger individuals, Whites and Hispanics, and those who have at least some college education19. While these demographics differ from those of the United States overall and would be limiting in the study of drugs used primarily in older populations, they may be sufficient for monitoring the rate of drug abuse in these groups.
In this study, we create a cohort of 1,689,039 geo-located users and follow them over a period of 10+ years, measuring drug-related discussion over time. We demonstrate that Reddit can be useful as a data source for conducting timely surveillance of the opioid epidemic. We examine the overall discussion rate of opioids on the Reddit platform, examining both individual drugs and drugs grouped by opioid class. We demonstrate that our social media-derived opioid rates are correlated with CDC overdose death rates and NFLIS drug report rates across regional, state, and city analyses. We examine the relatively recent rise in discussions of kratom, a lesser-known opioid agonist that has seen increasing usage in the United States. We make progress towards real-time drug abuse surveillance by examining recent changes in Reddit opioid discussion rates during the COVID-19 pandemic. Our work demonstrates the potential utility of monitoring digital cohorts for changes in drug abuse rates, which could supplement existing epidemiological surveillance systems.
Methods
Datasets
Reddit opioid comment database
Reddit comments and their associated metadata from January 2006 to September 2020 were downloaded from Pushshift.io and the Reddit API18. We focused on opioids, but also extracted comments for benzodiazepines. Our initial drug lists came from Le et al.20 and the WHO’s Anatomical Therapeutic Chemical Classification System21. Recognizing that Reddit users often use non-standard drug vocabularies, we used the RedMed word embedding model, which was trained on a health-oriented subset of Reddit, and which provides a lexicon of misspellings and synonyms for different drugs22. We initialized our opioid and benzodiazepine term list using terms linked to these two drug classes within the RedMed lexicon. To avoid the inclusion of ambiguous terms, we manually filtered the RedMed term list. We curated our opioid mention dataset from Pushshift data by selecting all comments that contained at least one of our opioid and benzodiazepine search terms.
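The comment selection step can be illustrated with a short sketch. This is a minimal example under stated assumptions, not the pipeline used in the study: the term list below is a tiny hypothetical stand-in for the manually filtered RedMed terms, comments are assumed to be dictionaries with a "body" field, and matching on word boundaries is our assumption about how terms are matched.

```python
import re

# Hypothetical, tiny stand-in for the manually filtered search term list.
SEARCH_TERMS = ["heroin", "oxycodone", "oxy", "fentanyl", "xanax"]

# One case-insensitive pattern with word boundaries, so that a short term
# such as "oxy" does not match inside an unrelated word such as "oxygen".
TERM_PATTERN = re.compile(
    r"\b("
    + "|".join(re.escape(t) for t in sorted(SEARCH_TERMS, key=len, reverse=True))
    + r")\b",
    re.IGNORECASE,
)


def find_terms(comment_body):
    """Return the lowercased search terms found in a comment, in order."""
    return [m.group(1).lower() for m in TERM_PATTERN.finditer(comment_body)]


def select_comments(comments):
    """Keep only comments whose body contains at least one search term."""
    return [c for c in comments if TERM_PATTERN.search(c["body"])]
```

In this sketch a single comment can yield several mentions, which is consistent with a dataset in which the number of mentions exceeds the number of comments.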
Estimating geo-location and creating a user cohort
Neither Reddit nor the Pushshift API provides location metadata for Reddit users, so we created a simple proxy to estimate user location. We extracted an initial set of location-specific subreddits from a Redditor-curated list (https://www.reddit.com/r/LocationReddits/wiki/faq/northamerica). We manually mapped these subreddits to their city, state, and region. We identified all Reddit users who posted in a location-based subreddit during the observation period. We then filtered out users who had posted in multiple location-based subreddits. Our final cohort consisted of all Reddit users who had posted in a single location-based subreddit. We note that the only inclusion criterion for this cohort was our ability to get a proxy location for an individual user; inclusion was agnostic to whether the user had ever mentioned opioids.
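The cohort construction described above can be sketched as follows. This is an illustrative example only: the subreddit-to-location mapping is a hypothetical three-entry stand-in for the manually curated mapping, and the input is assumed to be a sequence of (username, subreddit) pairs.

```python
from collections import defaultdict

# Hypothetical stand-in for the manual subreddit -> (city, state, region) mapping.
SUBREDDIT_LOCATIONS = {
    "nyc": ("New York City", "NY", "Northeast"),
    "chicago": ("Chicago", "IL", "Midwest"),
    "austin": ("Austin", "TX", "South"),
}


def build_cohort(posts, subreddit_locations=SUBREDDIT_LOCATIONS):
    """Assign a location to each user seen in exactly one location subreddit.

    posts: iterable of (username, subreddit) pairs.
    Returns a dict mapping username -> (city, state, region).
    """
    seen = defaultdict(set)
    for user, subreddit in posts:
        key = subreddit.lower()
        if key in subreddit_locations:
            seen[user].add(key)
    # Users active in more than one location subreddit are dropped.
    return {
        user: subreddit_locations[next(iter(subs))]
        for user, subs in seen.items()
        if len(subs) == 1
    }
```

Note that activity in non-location subreddits neither includes nor excludes a user, which mirrors the statement that cohort membership does not depend on opioid mentions.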
Opioid receptor activity based classes
The opioids labeled as full agonists were morphine, codeine, oxycodone, pethidine, diamorphine, hydromorphone, levorphanol, methadone, fentanyl, sufentanil, remifentanil, tramadol, tapentadol, oxymorphone, and hydrocodone. The opioids labeled as partial agonists were buprenorphine, meptazinol, and loperamide. Mixed agonist opioids were nalorphine, pentazocine, nalbuphine, butorphanol, and dezocine. Opioid antagonists were naloxone, naltrexone, nalmefene, and diprenorphine. Heroin and kratom were maintained as independent opioid classes.
Opioid synthesis based classes
Opioids labeled as synthetic were tramadol, fentanyl, and meperidine. The class of natural and semi-synthetic opioids consisted of morphine, codeine, hydrocodone, oxycodone, oxymorphone, hydromorphone, naloxone, buprenorphine, and naltrexone. Heroin was maintained as an independent opioid class and the remaining opioids listed above and in Table 1 were grouped into the Opioid class.
Benchmark datasets
CDC vital statistics overdose data
The CDC vital statistics unit has published trailing monthly drug overdose data for 2015-2018 for a subset of states and cities in the United States. As of this writing, the CDC has also published preliminary data for 2019 through April 2020; these counts are underreported because reporting is incomplete3. The data consists of 12-month trailing provisional monthly overdose death counts for several different opioid categories (all opioids, heroin, semi-synthetic opioids, and synthetic opioids). To convert the overdose data to overdose death rates, we normalized the opioid death counts by the size of the relevant population (i.e. location and year) based on census data.
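The population normalization can be sketched in a few lines. This is an illustration under assumptions: the text states only that counts were divided by the census population for the matching location and year, so the per-100,000 scaling, the function names, and the key layout are hypothetical choices made for the example.

```python
def death_rate(deaths, population, per=100_000):
    """Overdose deaths per `per` residents (the scaling factor is an assumption)."""
    if population <= 0:
        raise ValueError("population must be positive")
    return deaths * per / population


def normalize_counts(counts, populations, per=100_000):
    """Convert trailing death counts to rates.

    counts: dict keyed by (location, year, month) -> 12-month trailing deaths.
    populations: dict keyed by (location, year) -> census population estimate.
    """
    return {
        key: death_rate(n, populations[(key[0], key[1])], per)
        for key, n in counts.items()
    }
```

Because the scaling factor is constant, it does not affect correlations computed on the resulting rates.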
NFLIS data
The NFLIS monitors drug use trends in different communities in the U.S., through reports from state and local forensic labs. These reports represent ∼98% of the data from the 1.5 million annual U.S. drug cases. The laboratory network includes data from 50 states and 104 local forensic labs. NFLIS provides drug identification rates for 25 drugs that are most commonly identified in national laboratory reports. This data is reported on a semi-annual basis and is available from 2010 to 2018 for the entire U.S. and for each region of the U.S.: south, midwest, west, and northeast. In order to convert the drug report data to drug report rates, we normalized the opioid report counts by the size of the relevant population (i.e. location and year) based on census data.
Calculation of Reddit and Benchmark statistics over time
Summary statistics and unnormalized opioid mention counts
We counted the total number of mentions in our opioid database for each drug to create summary statistics for the opioid discussion on Reddit. We counted the number of comments by month for each drug and opioid category, and visualized these counts over time.
Normalized rates: comparing Reddit comment rates with CDC overdose data
In order to compare the Reddit opioid conversation with CDC overdose data, we calculated 12-month trailing opioid comment rates within our cohort for the geographies we covered, and for the categories the CDC reported: all opioids, heroin, semi-synthetic opioids, and synthetic opioids. The numerator for the Reddit opioid comment rate was the 12-month trailing total number of comments (based on the opioid category) for the geography (U.S., region, state, city) and the denominator was the 12-month trailing total number of comments made by users within our location cohort for the geographic area (U.S., region, state, city).
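The rate defined above is a ratio of two trailing sums. A minimal sketch, assuming monthly counts are supplied as equal-length lists in chronological order (the function names are ours, for illustration only):

```python
def trailing_sum(values, window=12):
    """Sum of the current and previous window-1 periods; None until a full window exists."""
    out = []
    running = 0
    for i, v in enumerate(values):
        running += v
        if i >= window:
            running -= values[i - window]  # drop the period that left the window
        out.append(running if i >= window - 1 else None)
    return out


def trailing_rate(category_counts, total_counts, window=12):
    """Trailing category comments divided by trailing total cohort comments."""
    num = trailing_sum(category_counts, window)
    den = trailing_sum(total_counts, window)
    return [
        n / d if n is not None and d else None
        for n, d in zip(num, den)
    ]
```

The same functions cover the national, regional, state, and city series; only the input counts change. Passing a different `window` value gives shorter trailing periods.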
We compared the 12-month trailing comment rates and the 12-month trailing overdose rates for each category for the entire U.S., for regions in the U.S., and for states/cities that the CDC reported data on. We calculated the Pearson correlation between the 12-month trailing comment rates and the 12-month trailing overdose rates for these geographies and categories. We visualized the 12-month trailing opioid comment rates and the 12-month trailing opioid overdose rates side by side for the entire U.S., and for different regions in the U.S. for 2015-April 2020.
We also compared the 12-month trailing comment rates and the 12-month trailing overdose rates for each category, for the entire U.S. and for regions in the U.S., using cross-correlation analysis, i.e. calculating correlations at different leading and lagging time intervals, as provided by the ccf function in the stats package in R23.
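The idea behind the lead/lag analysis can be sketched in Python. This is not a reimplementation of R's ccf: the sketch below recomputes the Pearson correlation on the overlapping portion of the two series at each shift, whereas ccf scales by the variance of the full series, so values at nonzero lags can differ. Function names and the sign convention for lags are our assumptions.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("need two equal-length sequences with at least 2 points")
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        raise ValueError("correlation is undefined for a constant series")
    return sxy / sqrt(sxx * syy)


def lagged_correlations(comments, deaths, max_lag=6):
    """Correlate comments[t] with deaths[t + lag] for each lag in [-max_lag, max_lag].

    A positive lag means the comment series leads the death series.
    Both inputs must have the same length.
    """
    result = {}
    for lag in range(-max_lag, max_lag + 1):
        if lag >= 0:
            x, y = comments[: len(comments) - lag], deaths[lag:]
        else:
            x, y = comments[-lag:], deaths[: len(deaths) + lag]
        result[lag] = pearson(x, y)
    return result
```

The lag with the highest correlation gives a rough indication of whether one series tends to move ahead of the other.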
Normalized rates: comparing Reddit comment rates with NFLIS drug report data
To compare the Reddit opioid conversation with the NFLIS drug report rates, we started by calculating the total number of semi-annual opioid comments for the geographies and select drugs: heroin, oxycodone, hydrocodone, buprenorphine, and fentanyl. The numerator for this normalized rate was the total number of semi-annual comments for the drug and geography (U.S., region), and the denominator for this was the total number of semi-annual comments made by users with location data for the geographic area (U.S., region).
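The semi-annual rate can be sketched as follows. We assume, for illustration, that each comment is represented by its (year, month) and that the two halves of the year are January to June and July to December; the text does not specify the exact half-year boundaries.

```python
from collections import Counter


def half_year(year, month):
    """Map a calendar month to a (year, half) period; half is 1 or 2."""
    return (year, 1 if month <= 6 else 2)


def semiannual_rates(drug_comment_months, all_comment_months):
    """Share of cohort comments that mention the drug, per half-year.

    Each argument is an iterable of (year, month) pairs, one pair per comment.
    """
    num = Counter(half_year(y, m) for y, m in drug_comment_months)
    den = Counter(half_year(y, m) for y, m in all_comment_months)
    return {
        period: num[period] / total
        for period, total in den.items()
        if total > 0
    }
```

Periods with cohort activity but no mentions of the drug receive a rate of zero rather than being dropped.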
We then compared the semi-annual comment rate and the semi-annual drug report rate for these drugs for the entire U.S., and for NFLIS reporting regions in the U.S.. We calculated the Pearson correlation between the semi-annual comment rate and the semi-annual drug report rate for these geographies and drugs. We also visualized the semi-annual drug comment rate and the semi-annual drug report rate side by side for the entire U.S., and for different regions in the U.S. for 2010 to 2018.
Kratom trends
We identified kratom comments using manually filtered kratom terms from the RedMed resource, through the same process used in the creation of the Reddit opioid comment database outlined above. Reddit comment counts for the top 15 subreddits were the sum of all comments made within each subreddit over the January 1st, 2010 to July 31st, 2020 observation period. To create word clouds for the “r/kratom”, “r/quittingkratom”, and “r/Drugs” subreddits, we first counted all tokens in these three subreddits, as well as in the remaining comments in the overall Reddit opioid comment database that did not mention kratom, splitting on spaces and stripping non-alphanumeric characters from the ends of each token prior to counting. We then calculated a probability for each token by dividing its count by the total number of tokens within the respective corpus. The log probability of a token in the non-kratom corpus was then subtracted from the log probability of that token in each of the kratom corpora, after filtering out stopwords according to the spaCy v3.0.5 English language stop word list, “spacy.lang.en.stop_words”24. Word clouds were rendered with the wordcloud2 R package25, using the top 150 words ranked by the difference in log probability, with each word sized by that difference.
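The token scoring used for the word clouds can be sketched as follows. This is a minimal illustration, with several assumptions that are not stated in the text: tokens are lowercased, tokens absent from the background corpus are skipped to avoid taking the logarithm of zero, and a short hypothetical stop word set stands in for the spaCy list.

```python
import math
import re
from collections import Counter

# Hypothetical short stand-in for the spaCy English stop word list.
STOPWORDS = {"the", "a", "and", "i", "to", "of"}

# Matches runs of non-alphanumeric characters at either end of a token.
_EDGES = re.compile(r"^[^0-9A-Za-z]+|[^0-9A-Za-z]+$")


def tokenize(text):
    """Split on whitespace and strip non-alphanumeric characters from token ends."""
    tokens = (_EDGES.sub("", raw) for raw in text.lower().split())
    return [t for t in tokens if t]


def log_prob_differences(target_texts, background_texts, stopwords=STOPWORDS):
    """log P(token | target corpus) - log P(token | background corpus)."""
    target = Counter(t for text in target_texts for t in tokenize(text))
    background = Counter(t for text in background_texts for t in tokenize(text))
    n_target, n_background = sum(target.values()), sum(background.values())
    diffs = {}
    for token, count in target.items():
        if token in stopwords or token not in background:
            continue  # assumption: unseen background tokens are skipped
        diffs[token] = math.log(count / n_target) - math.log(
            background[token] / n_background
        )
    return diffs


def top_tokens(diffs, k=150):
    """Tokens with the largest log probability difference, largest first."""
    return sorted(diffs, key=diffs.get, reverse=True)[:k]
```

A positive score means the token is relatively more frequent in the kratom corpus than in the background corpus, so the highest-scoring tokens are the ones most characteristic of that subreddit.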
Recent changes in opioid comment rates during the COVID-19 pandemic
The numerator for this normalized rate was the 3-month trailing total number of comments (based on the opioid category) for the cohort users assigned to each of the four U.S. regions. The denominator was the 3-month trailing total number of comments made by the same users in each regional subset.
Results
Summary stats
Our final comment dataset included 6,038,907 opioid mentions from 4,348,244 comments. Heroin (and synonyms from RedMed) is the most frequently mentioned opioid on Reddit, followed by morphine, fentanyl, and oxycodone (Table 1). Naloxone was the most frequently mentioned opioid antagonist. Partial agonists and mixed agonists were among the least mentioned opioids.
Cohort stats
We mapped 1,689,039 unique users to 46 states (Supp. Table 1 contains state-level user count statistics). Within this cohort, 258,591 users have mentioned an opioid on Reddit at some point in their comment history.
Kendall’s Tau correlation between a state’s estimated 2019 population and our number of observed users was 0.626, after replacing the observed user count with zero for the four states with no observed users (Supp. Figure 1). We followed this group from 2009 to September 2020, during which time they made ∼1.7 billion comments on Reddit.
Benchmark Opioid Statistics
Benchmark: CDC vital statistics overdose data
We benchmarked 12-month trailing U.S. opioid comment rates on Reddit against 12-month trailing opioid overdose death rates for different geographies (U.S., region, state) and drug categories, and found strong correlations between the two in every drug category and for all the available geographical regions. We plotted 12-month trailing opioid overdose death rates in the U.S. and 12-month trailing U.S. opioid comment rates on Reddit, along with the Pearson cross-correlation for a 6-month lead/lag window (Figure 2).
The Pearson correlations between 12-month trailing opioid overdose death rates in the U.S. and 12-month trailing U.S. opioid comment rates were 0.90, 0.98, 0.99, and 0.87 for heroin, opioids, synthetic opioids, and natural/semi-synthetic opioids, respectively (Table 2). The corresponding Pearson correlations for different regions are also shown in Table 2. The correlations between 12-month trailing opioid overdose death rates and 12-month trailing opioid comment rates on Reddit are similarly strong for heroin, opioids, and synthetics in the Northeast and South, with weaker correlations in the Midwest and West regions. Natural/semi-synthetic overdose death rates had weaker correlations across all regions.
The Pearson correlations between 12-month trailing opioid overdose death rates and 12-month trailing opioid comment rates on Reddit persist at the state level as well. As Table 3 shows, for the states that provided more than 35 months’ worth of overdose data and had Reddit activity, we find that in most states there is a high correlation (> 0.7) for opioids and synthetic opioids.
As the CDC provides overdose data for New York City (NYC), and as there is a highly active NYC subreddit, we were able to perform the same analysis at the city level for NYC. In NYC, the Pearson correlations were 0.813, 0.966, 0.934, 0.965 for heroin, natural & semi-synthetic opioids, opioids, and synthetic opioids, respectively.
Benchmark: NFLIS data
We compared Reddit opioid activity with NFLIS drug report rates over time. In particular, we compared semi-annual Reddit opioid activity with semi-annual NFLIS drug report rates for 2010-2018 for several key drugs: heroin, oxycodone, hydrocodone, buprenorphine, and fentanyl. For the entire U.S., the correlation between the semi-annual NFLIS drug report rate and the semi-annual Reddit comment rate was −0.31, 0.78, 0.94, 0.95, and 0.87 for heroin, oxycodone, hydrocodone, fentanyl, and buprenorphine, respectively. For every drug, except heroin, the shapes of NFLIS report rates over time and Reddit comment rates over time were very similar (Figure 3). As Table 4 shows, the Pearson correlations between the semi-annual NFLIS drug report rates and semi-annual drug comment rates on Reddit for 2010-2018 are good on the regional level, as well. Similar to the entire U.S., heroin report rates are not consistently correlated with Reddit comment rates at the regional level. The most correlated drug across regions is fentanyl, which experienced a steady ascent in drug reports and comment rates over this time period across every region.
Kratom trends
We graphed mentions of kratom on Reddit over time from January 2010 to August 31st, 2020 (Figure 4). We observed a strong increase in kratom mentions from 2010 until ∼2017, when the comment rate begins to plateau. On August 31st, 2016, we observed a distinct spike in kratom mentions, the same day the Drug Enforcement Administration (DEA) announced its intent to temporarily schedule kratom as a schedule I drug (Figure 4A, dashed line). Over this period, kratom has been mentioned on 5,185 distinct subreddits by 117,658 distinct users. The majority of kratom-mentioning comments were contributed to the “r/kratom”, “r/quittingkratom”, and “r/Drugs” subreddits (Figure 4B). Word clouds for these 3 subreddits show distinct patterns of word usage corresponding to the intended topic of those subreddits (Figure 4C). Kratom and its synonyms appear strongly in each word cloud along with several weaker opioids. “r/kratom” shows symptoms and responses discussed in that subreddit. “r/quittingkratom” mentions many opioids that are partial agonists, different terms for withdrawals, as well as various symptoms of withdrawal. The “r/Drugs” word cloud shows the greatest variety of drugs and various adjectives used to describe different kratom strains. All three word clouds contain various measurement quantities and units of measurement.
Recent opioid comment rates during the COVID-19 pandemic
We graphed regional 3-month trailing cohort mention rates of opioids grouped by synthesis class on Reddit over time from January 1st, 2018 to August 31st, 2020 (Figure 5). We used a 3-month trailing window in order to be more responsive to short-term changes in comment rate, given the shortness of the observation period. We observed a relatively steady decline in opioid mention rate from 2018 to 2019, with the exception of synthetic opioids, which mostly remained flat with a bump in activity in the winter of 2019. We observed a change in trend from decreasing or stable to increasing for the heroin, opioid, and synthetic opioid classes starting in May of 2020. Many states implemented COVID-19 control measures in March and/or April of 2020, resulting in increased unemployment and time spent at home26. For convenience, we have marked April of 2020 with a dashed vertical line in Figure 5.
Discussion
We are in the midst of an opioid epidemic in the United States and there are no signs of the crisis disappearing1. The COVID-19 pandemic has made the opioid epidemic worse, potentially due to increased unemployment and social isolation as a result of policies implemented to combat the pandemic27. The emergence and ubiquity of social media platforms, where people discuss drug abuse, can provide a lens into the drug use and abuse of millions of people. The continued growth of opioid related discussions on Reddit demonstrates the degree to which this is an ongoing phenomenon (Figure 1). In this paper, we introduce a method for monitoring drug epidemics spatiotemporally using the Reddit platform. Reddit can be a useful, alternative data stream as it correlates well with other opioid statistics, while also providing granularity, continuity, and a unique scale of ethnographic information.
As Table 2 and Table 3 show, drug comment rates are strongly correlated with CDC overdose death rates for each of the opioid drug classes we analyzed, at the U.S., regional, state, and city level. At the individual drug level, Table 4 shows good correlations between semi-annual drug comment rates and semi-annual drug report rates, at the U.S. and regional levels for all opioids except heroin. We extend previous work by showing strong correlations with benchmarks at the national, regional, and city level using Reddit data. Additionally, correlations between drug chatter on Reddit and individual drug report rates in the NFLIS benchmark data indicate that single drug resolution can be achieved for drug prevalence rates at a national level. This potentially allows for the expansion of a similar system to monitor for drugs of interest not currently monitored by the NFLIS.
Kratom is a drug largely unmonitored by our benchmark data sources, whose activity is pronounced on Reddit. Not only does kratom emerge as a drug commonly co-mentioned with many other opioids and other drugs, as seen in the word cloud analysis, it is also the 2nd most discussed opioid on Reddit. Our analysis showed a spike in kratom activity on August 31st, 2016, the day the DEA released an announcement of their intent to temporarily schedule kratom as a schedule I drug. In general, drug scheduling seeks to categorize drugs into one of five schedules, according to their acceptable medical uses and their potential for abuse and/or dependency. The DEA’s estimate of the abuse potential of a given drug decreases as the schedule number increases, with schedule I drugs having the highest risk of abuse, and schedule V drugs having the lowest risk. Therefore, classifying kratom as a schedule I drug would define it as having “no currently accepted medical use and a high potential for abuse”28. The announcement of intent was met with strong public opposition and the DEA eventually reversed course, a historic first for the administration29. As such, kratom remains unscheduled but illegal in many states. Further analysis of the comments contained within this activity spike could yield information on the public reaction and response to the DEA announcement. Additionally, the plateauing of the kratom activity curve could be a result of many states independently acting to make kratom illegal. The distribution of kratom mentions across subreddits indicates that much of the online discussion focuses on usage of kratom, on using kratom in combination with other prescription and recreational substances, and on quitting kratom. The word clouds associated with these subreddits give insights into the particulars of activities surrounding these three domains.
Use and development of more sophisticated methods to summarize and visualize the kratom comment corpora could contextualize kratom usage further and empower healthcare providers, as well as regulators, with needed information about kratom usage. One future area of research that could leverage Reddit for drug surveillance includes the surveillance of adverse drug reactions (ADRs) associated with different drugs. Reddit has already been used to successfully quantify the severity of ADRs30, so it is known that there is discussion of ADRs on the social media platform.
Increased temporal resolution is one of the potential benefits of using social media data to monitor changes in drug usage rates. Our analysis of regional changes in opioid discussion rates during the COVID-19 pandemic demonstrates the temporal resolution and responsiveness of monitoring social media conversations for pattern changes. The patterns observed in our Reddit opioid chatter follow those discussed in a CDC press release on a rise in opioid overdose deaths during the COVID-19 pandemic31. The CDC notes a rise in overdose deaths in 2019, potentially matching the bump in drug chatter rates we observe in our data. Additionally, the CDC notes that the rate of opioid overdoses has drastically accelerated during the COVID-19 pandemic, with fentanyl being a key driver of that change. Our analysis correspondingly shows a strong upward trend in three different opioid classes during the COVID-19 pandemic time period. We observed the strongest upward trend among the synthetic opioids, principally driven by increased discussion of fentanyl. While this case study is limited, it provides evidence of the potential for real-time drug trend monitoring with social media.
There are critical ethical considerations when using online discussions to monitor drug abuse, including ensuring that user anonymity is preserved. While Reddit is a pseudo-anonymous platform (some handles contain no identifying information, while others may be identifying to some degree), we do not share any usernames in this study, and we do not analyze any user on the individual level, in accordance with published ethical guidance32. Our study relies on the creation of a cohort that we follow over more than a decade, and all analyses are performed on data aggregated by location. We believe a formal system implemented to monitor social media data in aggregate and in an ongoing manner could balance the protection of user anonymity against the societal benefit of improved surveillance of drug abuse rates.
Our approach yielded good correlations despite our relatively simplistic location mapping pipeline. It is based on the assumption that Reddit users are most likely to comment on the location subreddit for their own location. While this assumption may be reasonable, and our results correlate with more accurate geolocation data, we believe the creation of geo-located cohorts with more sophisticated location mapping techniques could yield a low-effort drug abuse monitoring system capable of quickly detecting the emergence of new drugs of abuse. Because part of our results involves the geographical surveillance of drug comment rates, noisy location information would likely negatively affect our results. Additionally, we believe that combining Reddit data with other social media data streams, such as Twitter, would likely further improve the system.
A key limitation of this study is that we do not distinguish between drug mentions and probable drug use, as our calculation of drug comment rates uses the number of mentions of a drug, not the number of drug mentions indicative of drug use. Future work could make use of methods that have been developed to differentiate abuse from discussion, potentially improving the accuracy of the overall system33. It is also possible that, since we are not selecting only comments regarding drug usage, the rates of drug discussion we observe are driven by news cycles. While there is certainly an interplay between news coverage and online discussions of drugs, the prevalence of kratom discussion on Reddit indicates that discourse can evolve without mainstream news coverage, as kratom is a drug with very low media coverage but high rates of discussion. Additionally, during the COVID-19 pandemic of the past year, changes in opioid overdose rates have received relatively little press coverage. Therefore, we view it as unlikely that news coverage drove the observed recent changes in opioid discussions or the emergence of kratom-related activity.
There is potential for interplay and response from the drug abuse community to the existence of a future social media surveillance system. If the drug abuse community sought to avoid surveillance efforts, those communities could find or create new platforms in which to converse, which could potentially facilitate private discussions. Indeed, there has already been activity on Reddit to move such discussions to private chat rooms (such as Discord) which would prevent surveillance by third parties. The degree to which these discussions occur in public versus private domains will be a key component of potential performance of a future surveillance system.
In conclusion, we introduce a cohort-based system for monitoring geographical rates of opioid mentions on Reddit over time, which are strongly correlated with established monitoring systems. Our work demonstrates that Reddit can be used to extract valuable public health information that is difficult to find in traditional sources. Our system has strong signal and correlation with existing surveillance and ground-truth data relating to opioid overdose deaths, from the CDC, and specific opioid drug reports, as reported to the NFLIS. We believe the strength of the correlations between Reddit data and existing standards of surveillance adds to the growing evidence of the utility of social media for this type of public health surveillance8,14,15. Social media has the potential to provide real-time information about evolving changes in drug abuse and could allow for ethnographic information that contextualizes new drug trends as they emerge. These advantages suggest Reddit data could be used to supplement information from existing surveillance systems. With further development, social media surveillance systems based on these ideas could assist in monitoring and predicting future drug epidemics.
Data Availability
The underlying data is not provided for ethical reasons.