Abstract
Background Pre-exposure prophylaxis (PrEP) is highly effective at preventing the acquisition of human immunodeficiency virus (HIV). There is a substantial gap, however, between the number of people in the United States who have indications for PrEP and the number who are prescribed it. While Twitter content has been analyzed as a source of PrEP-related data (e.g., barriers), methods have not been developed to enable the use of Twitter as a platform for implementing PrEP-related interventions.
Objective Men who have sex with men (MSM) are the population most affected by HIV in the United States. Therefore, the objective of this study was to develop and assess an automated natural language processing (NLP) pipeline for identifying men in the United States who have reported on Twitter that they are gay, bisexual, or MSM.
Methods Between September 2020 and January 2021, we used the Twitter Streaming Application Programming Interface (API) to collect more than 3 million tweets containing keywords that men may include in posts reporting that they are gay, bisexual, or MSM. We applied handwritten, high-precision regular expressions, designed to filter out noise and identify actual self-reports, to the tweets and their user profile metadata. We identified 10,043 unique users geolocated in the United States, and drew upon a validated NLP tool to automatically identify their ages.
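As a rough illustration of the regular-expression filtering step described in the Methods, the following minimal sketch checks a tweet or a profile description for a first-person self-report and rejects matches preceded by negation or reported speech. The patterns and the names `SELF_REPORT`, `NOISE`, `is_self_report`, and `user_self_reports` are hypothetical and written for this example only; they are not the expressions used in the study, which are not given here.

```python
import re

# Hypothetical self-report pattern (an assumption, not the study's expression):
# first-person statement followed by an identity term and a male noun.
SELF_REPORT = re.compile(
    r"\b(?:i\s+am|i['’]?m)\s+(?:a\s+)?(?:proud\s+)?"
    r"(?:gay|bi|bisexual)\s+(?:man|guy|male|dad)\b",
    re.IGNORECASE,
)

# Hypothetical noise cue: a conditional, negation, or reported-speech word
# shortly before the match, within the same sentence.
NOISE = re.compile(
    r"\b(?:if|not|wish|said|says|thinks?|assumed?)\b[^.!?]{0,20}$",
    re.IGNORECASE,
)


def is_self_report(text: str) -> bool:
    """Return True if the text contains a likely first-person self-report."""
    match = SELF_REPORT.search(text)
    if match is None:
        return False
    # Inspect only the text that precedes the match for noise cues.
    prefix = text[: match.start()]
    return NOISE.search(prefix) is None


def user_self_reports(tweet_text: str, profile_description: str) -> bool:
    """Check both the tweet and the user's profile description."""
    return is_self_report(tweet_text) or is_self_report(profile_description)
```

In this sketch, a text such as "I'm a gay man living in Philly" is accepted, while "He said I'm a gay man" is rejected because of the reported-speech cue. Favoring narrow patterns like these trades recall for precision, which matches the high-precision goal stated above.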
Results Based on manually distinguishing true and false positive self-reports in the tweets or profiles of 1000 of the 10,043 users identified by our automated pipeline, our pipeline has a precision of 0.85. Among the 8756 users for whom a United States state-level geolocation was detected, 5096 (58.2%) are in the 10 states with the highest numbers of new HIV diagnoses. Among the 6240 users for whom a county-level geolocation was detected, 4252 (68.1%) are in counties or states considered priority jurisdictions by the Ending the HIV Epidemic (EHE) initiative. Furthermore, the majority of the users are in the same two age groups as the majority of MSM in the United States with new HIV diagnoses.
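The percentages reported in the Results can be reproduced from the stated counts with a few lines of arithmetic. The sketch below uses only the numbers given above; the helper name `share` is introduced here for illustration.

```python
def share(part: int, whole: int) -> float:
    """Return part/whole as a percentage rounded to one decimal place."""
    return round(100 * part / whole, 1)


# Users in the 10 states with the highest numbers of new HIV diagnoses,
# out of users with a detected state-level geolocation.
state_share = share(5096, 8756)

# Users in EHE priority jurisdictions, out of users with a detected
# county-level geolocation.
county_share = share(4252, 6240)

print(f"state-level: {state_share}%")    # 58.2%
print(f"county-level: {county_share}%")  # 68.1%
```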
Conclusions Our automated NLP pipeline can be used to identify MSM in the United States who may be at risk for acquiring HIV, laying the groundwork for using Twitter on a large scale to target PrEP-related interventions directly at this population.
Competing Interest Statement
The authors have declared no competing interest.
Funding Statement
This research was supported by a grant from the Penn Center for AIDS Research (CFAR), an NIH-funded program (P30 AI 045008).
Author Declarations
I confirm all relevant ethical guidelines have been followed, and any necessary IRB and/or ethics committee approvals have been obtained.
Yes
The details of the IRB/oversight body that provided approval or exemption for the research described are given below:
Institutional Review Board (IRB) of the University of Pennsylvania
All necessary patient/participant consent has been obtained and the appropriate institutional forms have been archived.
Yes
I understand that all clinical trials and any other prospective interventional studies must be registered with an ICMJE-approved registry, such as ClinicalTrials.gov. I confirm that any such study reported in the manuscript has been registered and the trial registration ID is provided (note: if posting a prospective study registered retrospectively, please provide a statement in the trial ID field explaining why the study was not registered in advance).
Yes
I have followed all appropriate research reporting guidelines and uploaded the relevant EQUATOR Network research reporting checklist(s) and other pertinent material as supplementary files, if applicable.
Yes
Data Availability
The annotated Twitter data used to validate our automated natural language processing pipeline will be made available by request.