PT - JOURNAL ARTICLE
AU - Guo, Yuting
AU - Ge, Yao
AU - Yang, Yuan-Chi
AU - Al-Garadi, Mohammed Ali
AU - Sarker, Abeed
TI - Comparison of pretraining models and strategies for health-related social media text classification
AID - 10.1101/2021.09.28.21264253
DP - 2021 Jan 01
TA - medRxiv
PG - 2021.09.28.21264253
4099 - http://medrxiv.org/content/early/2021/09/30/2021.09.28.21264253.short
4100 - http://medrxiv.org/content/early/2021/09/30/2021.09.28.21264253.full
AB - Motivation: Pretrained contextual language models proposed in the recent past have been reported to achieve state-of-the-art performance in many natural language processing (NLP) tasks. There is a need to benchmark such models for targeted NLP tasks, and to explore effective pretraining strategies to improve machine learning performance. Results: In this work, we addressed the task of health-related social media text classification. We benchmarked five models (RoBERTa, BERTweet, TwitterBERT, BioClinical_BERT, and BioBERT) on 22 tasks. We attempted to boost performance for the best models by comparing distinct pretraining strategies: domain-adaptive pretraining (DAPT), source-adaptive pretraining (SAPT), and topic-specific pretraining (TSPT). RoBERTa and BERTweet performed comparably in most tasks, and better than the other models. For pretraining strategies, SAPT performed better than or comparably to the off-the-shelf models, and significantly outperformed DAPT. SAPT+TSPT showed consistently high performance, with statistically significant improvement in one task. Our findings demonstrate that RoBERTa and BERTweet are excellent off-the-shelf models for health-related social media text classification, and that extended pretraining using SAPT and TSPT can further improve performance. Availability and implementation: Source code for our model and data preprocessing is available in the GitHub repository https://github.com/yguo0102/transformer_dapt_sapt_tapt.
Datasets must be obtained from original sources, as described in the supplementary material.
Supplementary information: Supplementary data are available at Bioinformatics online.
Competing Interest Statement: The authors have declared no competing interest.
Funding Statement: Research reported in this publication was supported in part by the National Institute on Drug Abuse (NIDA) of the National Institutes of Health (NIH) under award number R01DA046619. The content is solely the responsibility of the authors and does not necessarily represent the official views of the NIH.
Author Declarations:
I confirm all relevant ethical guidelines have been followed, and any necessary IRB and/or ethics committee approvals have been obtained. Yes.
The details of the IRB/oversight body that provided approval or exemption for the research described are given below: Emory University (exemption category 4).
All necessary patient/participant consent has been obtained and the appropriate institutional forms have been archived. Yes.
I understand that all clinical trials and any other prospective interventional studies must be registered with an ICMJE-approved registry, such as ClinicalTrials.gov. I confirm that any such study reported in the manuscript has been registered and the trial registration ID is provided (note: if posting a prospective study registered retrospectively, please provide a statement in the trial ID field explaining why the study was not registered in advance). Yes.
I have followed all appropriate research reporting guidelines and uploaded the relevant EQUATOR Network research reporting checklist(s) and other pertinent material as supplementary files, if applicable. Yes.
Data availability: Data for the 22 classification problems are available from their original sources.