Abstract
Background & Objective During infectious disease outbreaks, health agencies often share information about cases and deaths as free text that is rarely machine-readable, creating challenges for outbreak researchers. Here, we introduce a generalizable data assembly algorithm that automatically curates text-based, outbreak-related information and demonstrate its performance across three outbreaks.
Methods After developing an algorithm with regular expressions, we automatically curated data from health agencies via three information sources: formal reports, email newsletters, and Twitter. A validation data set was also curated manually for each outbreak.
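As an illustration only, the kind of regular-expression curation described in the Methods might be sketched as follows. This is not the study's algorithm: the patterns, the function names (`extract_counts`, `to_int`), and the sample sentences are hypothetical and chosen only to show how case and death counts could be pulled from free text.

```python
import re

# Hypothetical pattern for a count: digits with optional thousands separators.
# The lookbehind keeps the match from starting in the middle of a number.
NUMBER = r"(?<![\d,.])(\d{1,3}(?:,\d{3})+|\d+)"

# Allow up to two qualifier words (e.g. "new", "confirmed") before the keyword.
CASES_RE = re.compile(NUMBER + r"\s+(?:\w+\s+){0,2}?cases?\b", re.IGNORECASE)
DEATHS_RE = re.compile(NUMBER + r"\s+(?:\w+\s+){0,2}?deaths?\b", re.IGNORECASE)


def to_int(token):
    """Convert a matched count such as '1,234' to an integer."""
    return int(token.replace(",", ""))


def extract_counts(text):
    """Return the first case and death counts found in text, or None if absent."""
    cases = CASES_RE.search(text)
    deaths = DEATHS_RE.search(text)
    return {
        "cases": to_int(cases.group(1)) if cases else None,
        "deaths": to_int(deaths.group(1)) if deaths else None,
    }


if __name__ == "__main__":
    # Made-up sentences standing in for a formal report and a tweet.
    report = "As of 12 March, 1,234 confirmed cases and 56 deaths have been reported."
    tweet = "Update: 87 new cases, 3 new deaths."
    print(extract_counts(report))
    print(extract_counts(tweet))
```

A real pipeline would likely need source-specific patterns for dates, locations, and case definitions; the sketch above handles only the simplest phrasing.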
Findings When compared against the validation data sets, the overall cumulative missingness and misidentification of the algorithmically curated data were ≤2% and ≤1%, respectively, for all three outbreaks.
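The comparison against a validation data set can be sketched as below. The abstract does not define missingness or misidentification, so the definitions here are assumptions: missingness is taken as the share of validation records for which the algorithm produced no value, and misidentification as the share for which it produced a value that differs from the manually curated one. The function name `compare_to_validation` and the sample records are hypothetical.

```python
def compare_to_validation(algorithmic, validation):
    """Compare algorithmically curated counts with manually curated ones.

    Both arguments are dicts mapping a record key (e.g. a report date)
    to a count. Assumed definitions, not taken from the source:
      missingness       = validation records with no algorithmic value / total
      misidentification = records whose algorithmic value differs / total
    """
    total = len(validation)
    if total == 0:
        raise ValueError("validation data set is empty")

    missing = 0
    wrong = 0
    for key, true_value in validation.items():
        value = algorithmic.get(key)
        if value is None:
            missing += 1
        elif value != true_value:
            wrong += 1

    return {
        "missingness": missing / total,
        "misidentification": wrong / total,
    }


if __name__ == "__main__":
    # Made-up records keyed by report date.
    validation = {"03-01": 10, "03-02": 15, "03-03": 21, "03-04": 30}
    algorithmic = {"03-01": 10, "03-02": 51, "03-04": 30}  # one wrong, one missing
    print(compare_to_validation(algorithmic, validation))
```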
Conclusions Within the context of outbreak research, our work successfully addresses the need for generalizable tools that can transform text-based information into machine-readable data across varied information sources and infectious diseases.
Competing Interest Statement
The authors have declared no competing interest.
Funding Statement
Research reported in this work was supported by the National Institutes of Health through an NIH Director's New Innovator Award DP2-MD012722. The funder had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript.
Author Declarations
I confirm all relevant ethical guidelines have been followed, and any necessary IRB and/or ethics committee approvals have been obtained.
Yes
The details of the IRB/oversight body that provided approval or exemption for the research described are given below:
This manuscript uses publicly available data exclusively and thus did not require IRB approval.
All necessary patient/participant consent has been obtained and the appropriate institutional forms have been archived.
Yes
I understand that all clinical trials and any other prospective interventional studies must be registered with an ICMJE-approved registry, such as ClinicalTrials.gov. I confirm that any such study reported in the manuscript has been registered and the trial registration ID is provided (note: if posting a prospective study registered retrospectively, please provide a statement in the trial ID field explaining why the study was not registered in advance).
Yes
I have followed all appropriate research reporting guidelines and uploaded the relevant EQUATOR Network research reporting checklist(s) and other pertinent material as supplementary files, if applicable.
Yes
Data Availability
Please refer to the manuscript for a link to the study's GitHub repository.