PT  - JOURNAL ARTICLE
AU  - Towler, Lauren
AU  - Bondaronek, Paulina
AU  - Papakonstantinou, Trisevgeni
AU  - Amlôt, Richard
AU  - Chadborn, Tim
AU  - Ainsworth, Ben
AU  - Yardley, Lucy
TI  - Applying machine-learning to rapidly analyse large qualitative text datasets to inform the COVID-19 pandemic response: Comparing human and machine-assisted topic analysis techniques
AID  - 10.1101/2022.05.12.22274993
DP  - 2022 Jan 01
TA  - medRxiv
PG  - 2022.05.12.22274993
4099  - http://medrxiv.org/content/early/2022/06/01/2022.05.12.22274993.short
4100  - http://medrxiv.org/content/early/2022/06/01/2022.05.12.22274993.full
AB  - Background Machine-assisted topic analysis (MATA) uses artificial intelligence methods to assist qualitative researchers to analyse large amounts of textual data. This could allow qualitative researchers to inform and update public health interventions ‘in real-time’, to ensure they remain acceptable and effective during rapidly changing contexts (such as a pandemic). In this novel study we aimed to understand the potential for such approaches to support intervention implementation, by directly comparing MATA and ‘human-only’ thematic analysis techniques when applied to the same dataset (1472 free-text responses from users of the COVID-19 infection control intervention ‘Germ Defence’).Methods In MATA, the analysis process included an unsupervised topic modelling approach to identify latent topics in the text. The human research team then described the topics and identified broad themes. In human-only codebook analysis, an initial codebook was developed by an experienced qualitative researcher and applied to the dataset by a well-trained research team, who met regularly to critique and refine the codes. To understand similarities and difference, formal triangulation using a ‘convergence coding matrix’ compared the findings from both methods, categorising them as ‘agreement’, ‘complementary’, ‘dissonant’, or ‘silent’.Results Human analysis took much longer (147.5 hours) than MATA (40 hours). Both human-only and MATA identified key themes about what users found helpful and unhelpful (e.g. Boosting confidence in how to perform the behaviours vs Lack of personally relevant content). Formal triangulation of the codes created showed high similarity between the findings. All codes developed from the MATA were classified as in agreement or complementary to the human themes. Where the findings were classified as complementary, this was typically due to slightly differing interpretations or nuance present in the human-only analysis.Conclusions Overall, the quality of MATA was as high as the human-only thematic analysis, with substantial time savings. For simple analyses that do not require an in-depth or subtle understanding of the data, MATA is a useful tool that can support qualitative researchers to interpret and analyse large datasets quickly. These findings have practical implications for intervention development and implementation, such as enabling rapid optimisation during public health emergencies.Contributions to the literatureNatural language processing (NLP) techniques have been applied within health research due to the need to rapidly analyse large samples of qualitative data. However, the extent to which these techniques lead to results comparable to human coding requires further assessment.We demonstrate that combining NLP with human analysis to analyse free-text data can be a trustworthy and efficient method to use on large quantities of qualitative data.This method has the potential to play an important role in contexts where rapid descriptive or exploratory analysis of very large datasets is required, such as during a public health emergency.Competing Interest StatementThe authors have declared no competing interest.Funding StatementThe study was funded by United Kingdom Research and Innovation Medical Research Council (UKRI MRC) Rapid Response Call: UKRI CV220-009. The Germ Defence intervention was hosted by the Lifeguide Team, supported by the NIHR Biomedical Research Centre, University of Southampton. LY is a National Institute for Health Research (NIHR) Senior Investigator and team lead for University of Southampton Biomedical Research Centre. LY is affiliated to the National Institute for Health Research Health Protection Research Unit (NIHR HPRU) in Behavioural Science and Evaluation of Interventions at the University of Bristol in partnership with Public Health England (PHE). The views expressed are those of the author(s) and not necessarily those of the NIHR, the Department of Health or PHE. The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript.Author DeclarationsI confirm all relevant ethical guidelines have been followed, and any necessary IRB and/or ethics committee approvals have been obtained.YesThe details of the IRB/oversight body that provided approval or exemption for the research described are given below:Ethics committee of University of Southampton gave ethical approval for this workI confirm that all necessary patient/participant consent has been obtained and the appropriate institutional forms have been archived, and that any patient/participant/sample identifiers included were not known to anyone (e.g., hospital staff, patients or participants themselves) outside the research group so cannot be used to identify individuals.YesI understand that all clinical trials and any other prospective interventional studies must be registered with an ICMJE-approved registry, such as ClinicalTrials.gov. I confirm that any such study reported in the manuscript has been registered and the trial registration ID is provided (note: if posting a prospective study registered retrospectively, please provide a statement in the trial ID field explaining why the study was not registered in advance).YesI have followed all appropriate research reporting guidelines and uploaded the relevant EQUATOR Network research reporting checklist(s) and other pertinent material as supplementary files, if applicable.YesThe datasets generated and/or analysed during the current study are available in the figshare repository, https://doi.org/10.6084/m9.figshare.19514305. https://doi.org/10.6084/m9.figshare.19514305 AIArtificial intelligenceIPAInterpretative phenomenological analysisMATAMachine-assisted topic analysisNLPNatural language processingPBAPerson based approachSTMStructural topic modelTAThematic analysis