RT Journal Article
SR Electronic
T1 Developing an automatic system for classifying chatter about health services from Twitter: A case study for Medicaid
JF medRxiv
FD Cold Spring Harbor Laboratory Press
SP 2020.06.12.20129593
DO 10.1101/2020.06.12.20129593
A1 Yang, Yuan-Chi
A1 Al-Garadi, Mohammed Ali
A1 Hogg-Bremer, Whitney
A1 Zhu, Jane M.
A1 Grande, David
A1 Sarker, Abeed
YR 2020
UL http://medrxiv.org/content/early/2020/12/19/2020.06.12.20129593.abstract
AB Background The wide adoption of social media in daily life makes it a rich and effective resource for close-to-real-time assessments of consumers’ perceptions of health services. Such assessment is challenging, however, because of the vast amount of data and the diverse content of social media chatter. Objectives To develop and evaluate a system, based on natural language processing and machine learning, for automatically characterizing user-posted Twitter data about health services, using Medicaid, the single largest insurance program in the United States, as an example. Methods We collected data from Twitter in two ways: (i) via the public streaming API using Medicaid-related keywords (Corpus-1), and (ii) by using the website’s search option for tweets mentioning the agency-specific handles (Corpus-2). We manually labeled a sample of tweets into five predetermined categories or "other", and artificially increased the number of training posts from specific low-frequency categories. Using the manually labeled data, we trained and evaluated several supervised learning algorithms, including Support Vector Machine, Random Forest (RF), Naïve Bayes, shallow Neural Network (NN), k-Nearest Neighbor, Bi-directional Long Short-Term Memory, and Bidirectional Encoder Representations from Transformers (BERT).
We then applied the best-performing classifier to the collected tweets for post-classification analyses assessing the utility of our methods. Results We manually annotated 11,379 tweets (Corpus-1: 9,179; Corpus-2: 2,200), using 7,930 (69.7%) for training, 1,449 (12.7%) for validation, and 2,000 (17.6%) for testing. A BERT-based classifier obtained the highest accuracies (81.7%, Corpus-1; 80.7%, Corpus-2) and the highest F1-scores on Consumer Feedback (0.58, Corpus-1; 0.90, Corpus-2), outperforming the second-best classifiers in accuracy (74.6%, RF on Corpus-1; 69.4%, RF on Corpus-2) and in F1-score on Consumer Feedback (0.44, NN on Corpus-1; 0.82, RF on Corpus-2). Post-classification analyses revealed differing inter-corpora distributions of tweet categories, with political (64%) and consumer-feedback (55%) tweets being the most frequent for Corpus-1 and Corpus-2, respectively. Conclusions The broad and variable content of Medicaid-related tweets necessitates automatic categorization to identify topic-relevant posts. Our proposed system presents a feasible solution for automatic categorization, and can be deployed or generalized for health service programs other than Medicaid. Annotated data and methods are available for future studies (https://yyang60@bitbucket.org/sarkerlab/medicaid-classification-script-and-data-for-public). Competing Interest Statement The authors have declared no competing interest. Funding Statement Research reported in this publication was supported by the Robert Wood Johnson Foundation (RWJF) under award number 76158 (JMZ, DG).
The content is solely the responsibility of the authors and does not necessarily represent the official views of the RWJF. Author Declarations I confirm all relevant ethical guidelines have been followed, and any necessary IRB and/or ethics committee approvals have been obtained. Yes. The details of the IRB/oversight body that provided approval or exemption for the research described are given below: This is non-human-subjects research, as classified by OSHU. All necessary patient/participant consent has been obtained and the appropriate institutional forms have been archived. Yes. I understand that all clinical trials and any other prospective interventional studies must be registered with an ICMJE-approved registry, such as ClinicalTrials.gov. I confirm that any such study reported in the manuscript has been registered and the trial registration ID is provided (note: if posting a prospective study registered retrospectively, please provide a statement in the trial ID field explaining why the study was not registered in advance). Yes. I have followed all appropriate research reporting guidelines and uploaded the relevant EQUATOR Network research reporting checklist(s) and other pertinent material as supplementary files, if applicable. Yes. Annotated data and methods will be available for future studies once the manuscript is published. Abbreviations: API, Application Programming Interface; BERT, Bidirectional Encoder Representations from Transformers; BLSTM, Bi-directional Long Short-Term Memory; KNN, K-Nearest Neighbor; MA, Medicaid agency; MCO, Managed Care Organization; NB, Naïve Bayes; NLP, Natural Language Processing; NN, Shallow Neural Networks; RF, Random Forest; SVM, Support Vector Machine; TFIDF, Term-Frequency-Inverse-Document-Frequency; US, United States.