RT Journal Article
SR Electronic
T1 Developing an automatic pipeline for analyzing chatter about health services from social media: A case study for Medicaid
JF medRxiv
FD Cold Spring Harbor Laboratory Press
SP 2020.06.12.20129593
DO 10.1101/2020.06.12.20129593
A1 Yang, Yuan-Chi
A1 Al-Garadi, Mohammed Ali
A1 Hogg-Bremer, Whitney
A1 Zhu, Jane M.
A1 Grande, David
A1 Sarker, Abeed
YR 2020
UL http://medrxiv.org/content/early/2020/06/13/2020.06.12.20129593.abstract
AB Objective: Social media can be an effective but challenging resource for conducting close-to-real-time assessments of consumers’ perceptions about health services. Our objective was to develop and evaluate an automatic pipeline, involving natural language processing and machine learning, for automatically characterizing user-posted Twitter data about Medicaid. Material and Methods: We collected Twitter data via the public API using Medicaid-related keywords (Corpus-1), and the website’s search option using agency-specific handles (Corpus-2). We manually labeled a sample of tweets into five pre-determined categories or other, and artificially increased the number of training posts from specific low-frequency categories. We trained and evaluated several supervised learning algorithms using manually-labeled data, and applied the best-performing classifier to collected tweets for post-classification analyses assessing the utility of our methods. Results: We collected 628,411 and 27,377 tweets for Corpus-1 and -2, respectively. We manually annotated 9,571 (Corpus-1: 8,180; Corpus-2: 1,391) tweets, using 7,923 (82.8%) for training and 1,648 (17.2%) for evaluation. A BERT-based (bidirectional encoder representations from transformers) classifier obtained the highest accuracies (83.9%, Corpus-1; 86.4%, Corpus-2), outperforming the second-best classifier (SVMs: 79.6%; 76.4%).
Post-classification analyses revealed differing inter-corpora distributions of tweet categories, with political (63%) and consumer-feedback (43%) tweets being most frequent for Corpus-1 and -2, respectively. Discussion and Conclusion: The broad and variable content of Medicaid-related tweets necessitates automatic categorization to identify topic-relevant posts. Our proposed pipeline presents a feasible solution for automatic categorization, and can be deployed/generalized for health service programs other than Medicaid. Annotated data and methods are available for future studies (LINK_TO_BE_AVAILABLE). Competing Interest Statement: The authors have declared no competing interest. Funding Statement: Research reported in this publication was supported by Robert Wood Johnson Foundation (RWJF) under award number 76158 (JMZ, DG). The content is solely the responsibility of the authors and does not necessarily represent the official views of the RWJF. Author Declarations: I confirm all relevant ethical guidelines have been followed, and any necessary IRB and/or ethics committee approvals have been obtained. Yes. The details of the IRB/oversight body that provided approval or exemption for the research described are given below: It's a non-human-subjects research classified by OSHU. All necessary patient/participant consent has been obtained and the appropriate institutional forms have been archived. Yes. I understand that all clinical trials and any other prospective interventional studies must be registered with an ICMJE-approved registry, such as ClinicalTrials.gov.
I confirm that any such study reported in the manuscript has been registered and the trial registration ID is provided (note: if posting a prospective study registered retrospectively, please provide a statement in the trial ID field explaining why the study was not registered in advance). Yes. I have followed all appropriate research reporting guidelines and uploaded the relevant EQUATOR Network research reporting checklist(s) and other pertinent material as supplementary files, if applicable. Yes. Annotated data and methods will be available for future studies once the manuscript is published.