ABSTRACT
Cough is a common and commonly ignored symptom of lung disease. Cough is often perceived as difficult to quantify, frequently self-limiting, and non-specific. However, cough has a central role in the clinical detection of many lung diseases including tuberculosis (TB), which remains the leading infectious disease killer worldwide. TB screening currently relies on self-reported cough which fails to meet the World Health Organization (WHO) accuracy targets for a TB triage test. Artificial intelligence (AI) models based on cough sound have been developed for several respiratory conditions, with limited work being done in TB. To support the development of an accurate, point-of-care cough-based triage tool for TB, we have compiled a large multi-country database of cough sounds from individuals being evaluated for TB. The dataset includes more than 700,000 cough sounds from 2,143 individuals with detailed demographic, clinical and microbiologic diagnostic information. We aim to empower researchers in the development of cough sound analysis models to improve TB diagnosis, where innovative approaches are critically needed to end this long-standing pandemic.
BACKGROUND AND SUMMARY
Tuberculosis remains the leading infectious disease killer globally, partly due to public health systems’ inability to accurately diagnose millions of infected individuals every year.1 Insufficient access to high-quality TB screening and diagnosis is recognized as one of the most important gaps in the cascade of care.2 Here we describe a cough sound database including detailed demographic, clinical and microbiologic information for the development of AI-based sound classification TB triage models. As the WHO’s End TB Strategy calls for intensified research and innovation including the discovery of new tools for community-based screening, digital cough monitoring and Acoustic Epidemiology could represent new tools that can help bend the TB pandemic curve and accelerate the achievement of global TB elimination goals.3–5
The “missing millions” of undiagnosed patients living with active TB disease represent an heterogeneous group including those who did not access triage or diagnosis testing or weren’t appropriately referred for effective treatment. Improving the accuracy, portability, point-of-care amenability and connectivity of diagnostic tools and algorithms would have significant value. Most health systems build their TB programs on a combination of complementary screening followed by diagnostic tests. The WHO’s target product profile (TPP) for a community-based TB triage test suggests that it should be at least 90% sensitive and 70% specific.6 According to the 2021 WHO TB screening guidelines, symptom-based screening with questionnaires, including cough, is 42% sensitive.7 Besides having poor accuracy these guidelines have operational challenges that impede its sustained and uniform implementation within resource-challenged TB programs. Other tools such as digital chest X-rays combined with computer-aided detection (CAD) algorithms have also been evaluated in the context of TB triage. This approach was shown to be highly sensitive but had variable specificity and remains difficult to deploy due to limited availability of chest X-ray platforms at primary-level health facilities.8 Whether in the context of community-based outreach screening or healthcare facility-based evaluation prior to confirmatory testing, cough classification models could complement or replace other triage strategies including symptom-based screening.
We historically have been unable to objectively monitor cough sounds and consequently reduced this data-rich symptom into subjective and dichotomous information (e.g., cough versus no cough, chronic versus acute, better versus worse). Advances in acoustics and machine learning (ML) have enabled the identification and recording of human coughs in real-world acoustic environments (cough detection) as well as differentiation of coughs from patients with distinct clinical conditions or at different stages of disease (cough classification). As part of the emerging field of Acoustic Epidemiology, this has the potential to develop novel screening or diagnostic assays with simple digital recording devices, such as a smartphone, tablet or watch.5 Proof-of-concept studies previously showed that cough associated with TB contains a specific acoustic signature which can be recognized by ML models. A study by Pahar et al. suggests that a cough-based TB screening model can discriminate TB cough sounds from those associated with other lung conditions with 93% sensitivity and 95% specificity, exceeding the WHO TPPs.9 In a study combining cough sound analysis and patients’ clinical characteristics, Yellapu et al. report that ML can be used to detect TB with 90% sensitivity and 85% specificity.10 Those pilot studies report on ML models which were designed on small datasets and were not validated in external populations. Given the potential impact on performance of local disease epidemiology and population ethnicity among other confounders, large and diverse cough datasets are needed to replicate those studies.
We collected and are here releasing a dataset including 733,756 cough sounds from 2,143 patients across 7 countries with accurately annotated demographic, clinical and microbiologic diagnostic information. These data were initially used to enable and evaluate the CODA TB DREAM Challenge which invited participants to develop algorithms for prediction of TB diagnosis. The training data are now available for general use, and researchers are invited to leverage acoustic and clinical data to further develop and evaluate sound classification models for TB screening against a held-out test partition.11 We aim to enable the development of models which could achieve the WHO TPP performance targets for the current ‘community-based TB triage test’ or the forthcoming TPP for a TB screening test.6,12 This data set has limitations which include some selection bias since it was collected from a symptomatic presumptive TB population. The developed models which will be developed may hence not perform as well if used for asymptomatic screening at population level. Accordingly, more data should be collected from community screening activities.
METHODS
Participants
A total of 2,143 participants were recruited from two parent studies described below. To be eligible, participants had to be 18 years or older and have a new or worsening cough for at least two weeks. took place at outpatient clinics in India, Madagascar, the Philippines, South Africa, Tanzania, Uganda, and Vietnam. All participants provided informed consent. A summary of participant demographics and country distribution are available in Table 1.
Rapid Research in Diagnostic Development TB Network (R2D2 TB Network) study
The R2D2 TB Network study evaluates novel TB diagnostics in various stages of development among people with presumptive TB in five low- and middle-income countries: Uganda, South Africa, Vietnam, the Philippines and India.13 Ethical approval for this study was obtained from institutional review boards (IRB) in the US and in each study site. In the US, approval was obtained from the University of California San Francisco IRB (# 20-32670). In Vietnam, approval was obtained from the Ministry of Health Ethical Committee for National Biological Medical Research (94/CN-HĐĐĐ), the National Lung Hospital Ethical Committee for Biological Medical Research (566/2020/NCKH) and the Hanoi Department of Health, Hanoi Lung Hospital Science and Technology Initiative Committee (22/BVPHN). In India, approval was obtained from Christian Medical College IRB (13256). In South Africa, approval was obtained from Stellenbosch University Health Research Ethics Committee (17047). In Uganda, approval was obtained from Makerere University, College of Health Sciences, School of Medicine, Research Ethics Committee (2020-182). In the Philippines, approval was obtained from De La Salle Health Sciences Institute Independent Ethics Committee (2020-33-02-A).
The Digital Cough Monitoring for screening, diagnosis and clinical follow-up of tuberculosis and other respiratory diseases project
This project was designed to embed digital cough monitoring within existing health facility-based TB diagnostic cohorts in Madagascar and Tanzania. Ethical approval for this study was obtained from institutional review boards (IRB) in Canada and in each study site. In Canada, approval was obtained from the Centre de Recherche du Centre Hospitalier de l’Université de Montréal IRB (# 2021-9270, 20.226). In Madagascar, approval was obtained from the Comité d’Éthique à la Recherche Biomédicale (IORG0000851 - N°051-MSANP/SG/AMM/CERBM).). In Tanzania, approval was obtained from the Ifakarah Health Institute IRB (31-2021) and the National Institute for Medical Research (NIMR/HQ/R.8a/Vol IX/3805).
Data Collection
Demographic and clinical data
At enrollment into the parent studies, participants underwent a baseline questionnaire, clinical examination, and sputum collection for TB testing. Study staff also recorded participants’ age, gender, height, weight, smoking status and duration of cough. HIV diagnosis was made either based on participant self-report of a positive HIV diagnosis or a positive test result. A summary of the available variables is shown in Table 2.
TB Reference Standard Testing
Both Xpert MTB/RIF Ultra PCR and mycobacterial culture (Lowenstein-Jensen solid medium or MGIT liquid medium) were performed on sputum collected from all participants. Any participant whose first sputum Xpert MTB/RIF Ultra result was indeterminate or trace-positive, received a second sputum Xpert MTB/RIF Ultra test. Results from those assays were combined to determine TB status according to two reference standards: a microbiologic reference standard and a sputum Xpert reference standard. The sputum Xpert reference standard is restricted to Xpert MTB/RIF Ultra results on sputum samples. The microbiologic reference standard includes culture results, allowing for more individuals to be classified as TB positive. The microbiologic reference standard is considered the primary reference standard. Full details of the reference standards are described in Table 3.
Cough Recording
Cough sounds were collected using smartphones loaded with the Hyfe research app.14 Specific phone models used in the different participating sites are presented in Supplementary Materials 1. Hyfe research app is designed to listen for explosive sounds and record ∼0.5 seconds sound fragments corresponding to putative cough sounds. Hyfe research app uses a server-based convolutional neural network (CNN) model to classify explosive sounds as coughs and recordings of these cough sounds are saved on a protected health information (PHI)-regulated server for analysis. This model has been shown to be 96% sensitive and 96% specific for cough detection using human-labeled sounds as a reference standard.15 Smartphones were positioned on tripods in rooms within the clinic. Participants were asked to cough five times (solicited cough) while standing 60-90 cm from the tripod; participants who managed to produce at least three coughs were retained in the dataset. Some participants produced more than five coughs due to a triggered coughing fit and those additional coughs were also collected and included in the dataset. Solicited and triggered coughs could not be labeled distinctively and are treated the same in the dataset. After enrollment and onboarding, a subset of participants (n = 565) were also asked to carry a study phone for two weeks and collect longitudinal coughs sounds in an outpatient setting. Those sounds are labeled as longitudinal and made available within the dataset. A tally of solicited and longitudinal cough sounds per data partition are available in Table 4.
Data Partitioning
The dataset was split into a training (n=1,105) and validation set (n=1,038). The dataset was randomly partitioned evenly between the training and testing set at the level of the participant (i.e., all of a participant’s cough sounds are in either the training or validation set).
Data Pre-Processing
Cough sounds
The sound recordings available in this dataset have not undergone pre-processing beyond their identification as a cough sound by the Hyfe research app CNN model.
Clinical Data
Data from all participating sites were collected with standardized questionnaires and definitions. Data formatting was harmonized in the open access database.
DATA RECORDS
De-identified participant demographic and clinical data, including TB reference standard results, cough sound WAV files, and a datafile linking participant IDs to sound file IDs were exported to a dedicated project in Synapse. Synapse is a general-purpose data and analysis sharing service where members can work collaboratively, analyze data, share insights, and have attributions and provenance of those insights to share with others. Synapse is developed and operated by Sage Bionetworks16. A total of 1,105 participants’ data are made available for access and download as a training dataset. The validation set is withheld, but models can be evaluated against the validation set via the instructions provided in the Synapse project.
All training set files are stored and are accessible via the Synapse platform with associated metadata and documentation and can be accessed at the following URL: www.synapse.org/TBcough - https://doi.org/10.7303/syn31472953.
TECHNICAL VALIDATION
All cough collection periods were observed by study staff and cough sounds were spot-checked for accurate recording. Patient metadata was reviewed by study staff for accuracy. The data described in this article were collected using the Hyfe Research app which uses a proprietary algorithm to identify cough sounds. We used a prediction score of 0.8 from this algorithm to filter potential non-cough sounds. To validate the precision of the Hyfe algorithm, a standalone computer vision and deep learning model was trained using Log Mel spectrogram images from ESC-50 and Coswara datasets.17,18 The “VGG16” CNN based pre-trained model was trained for accurate classification of cough sounds and achieved a model accuracy of approximately 96% on Hyfe cough recordings.19 Most of the recordings that were classified incorrectly had a Hyfe prediction score less than 0.8.
USAGE NOTES
Users can register to evaluate predictive models of TB diagnosis against the held-out test partition via the instructions on the Synapse project. The scoring mechanism can evaluate two different types of models: (1) those that use only cough sounds, or (2) those which also incorporate clinical metadata variables which have been provided in the training dataset (sex, age, height, weight, reported duration of cough, prior TB diagnosis and type, hemoptysis, heart rate, temperature, weight loss, smoking in the last week, fever and night sweats). Models are submitted to the scoring queues as Docker images. Full instructions and example code is available on Synapse project website (www.synapse.org/TBcough).
Downloading the Data
Given the number of files represented in the data, users should consider downloading the data via one of the programmatic Synapse clients (available in R or Python). For convenience, Python code for downloading the data is provided in the Synapse project wiki. The training dataset size is (0.43 GB) for the solicited coughs and (31.6 GB) for the longitudinal coughs.
Data Use Agreement
To access the data, individuals must become Certified and Validated users of Synapse and maintain an active account on Synapse: http://www.synapse.org. They must also submit an Intended Data Use Statement and agree to the Terms of Use of the dataset. Terms of Use are summarized in Supplementary Materials 2.
CODE AVAILABILITY
No additional data processing was conducted other than what has been described above.
DATA AVAILABILITY
All training set files are stored and are accessible via the Synapse platform with associated metadata and documentation and can be accessed at the following URL: www.synapse.org/TBcough.
Specific acoustic and clinical metadata as well as dataset information can be found under the following doi references.
Clinical data: https://doi.org/10.7303/syn53710097
Cough metadata: https://doi.org/10.7303/syn53710098
Solicited Coughs: https://doi.org/10.7303/syn40358494
Longitudinal Coughs: https://doi.org/10.7303/syn40358476
Data Dictionary: https://doi.org/10.7303/syn41743692
Data sharing and model benchmarking website: https://doi.org/10.7303/syn31472953
SUPPLEMENTARY MATERIALS
Supplementary materials 1 Phone models used in the different participating sites
India
- Redmi 9 Prime
- Realme Narzo20
Madagascar
- Motorola G9 play
Philippines
- Myphone myWX2 Pro
- Xiaomi 9C
South Africa
- Nokia 3.1
- Nokia 5.4
- Xiaomi Redmi 9A
Tanzania
- Nokia 3.4 Ta-1288
Uganda
- Motorola G16
- Samsung M11
- Nokia model 5.3
Vietnam
- OPPOA54
Supplementary materials 2 Data Use Agreement
Researchers wishing to access the data must:
⍰ You must reaffirm your commitment to the Synapse Pledge and must abide by the guiding principles for responsible research use and data handling within the Synapse Commons Platform as described in the Synapse Governance documents.
⍰ You will not attempt to establish the identity of, or attempt to contact any of the subjects included in the data.
⍰ You confirm that if you inadvertently receive identifiable information or otherwise identify a subject, you will promptly notify the ACT by emailing act{at}sagebase.org.
⍰ You agree to establish appropriate administrative, technical, and physical safeguards to prevent unauthorized use of or access to the Data.
⍰ You will report any data misuse or breach of data security to ACT by emailing act{at}sagebase.org.
⍰ You will use the data only as identified in your intended data use statement (IDU), submitted through Synapse. The IDU should be written in English and must describe the objectives of the proposed research and study design and analysis plan (500 word maximum).
⍰ Data accessors must acknowledge the following in all publications or presentations as follows:
“The datasets used for the analyses described were contributed by Dr. Adithya Cattamanchi at UCSF and Dr. Simon Grandjean Lapierre at University of Montreal and were generated in collaboration with researchers at Stellenbosch University (PI Grant Theron), Walimu (PIs William Worodria and Alfred Andama); De La Salle Medical and Health Sciences Institute (PI Charles Yu), Vietnam National Tuberculosis Program (PI Nguyen Viet Nhung), Christian Medical College (PI DJ Christopher), Centre Infectiologie Charles Mérieux Madagascar (PIs Mihaja Raberahona & Rivonirina Rakotoarivelo), and Ifakara Health Institute (PIs Issa Lyimo & Omar Lweno) with funding from the U.S. National Institutes of Health (U01 AI152087), The Patrick J. McGovern Foundation and Global Health Labs.”
Funders / operations
Data from the CODA TB DREAM Challenge was generated with support from Global Health Labs, the Patrick J. McGovern Foundation, and the National Institute of Allergy and Infectious Diseases of the US National Institutes of Health under award number U01AI152087. The CODA TB DREAM Challenge and post-challenge evaluation was funded in part by the Bill & Melinda Gates Foundation.
Direct salary support
SGL is supported by a Junior 1 Salary Award from the Fonds de Recheche Santé Québec. DJ is supported by funding by the National Institutes of Health
AUTHOR CONTRIBUTION
Conception (SKS, LO, CB, PD, PMS, AC, SGL), data acquisition (MR, RR, IL, OL, DJC, NVN, GT, WW, CY), data analysis (SH, VY, SKS, SK, DJ), data interpretation (N/A), drafting the work (SH, SGL), reviewing the work critically for important intellectual content and final approval of the version to be published (all co-authors).
Agreement to be accountable for all aspects of the work in ensuring that questions related to the accuracy or integrity of any part of the work are appropriately investigated and resolved (SH, SKS, AC, SGL).
COMPETING INTERESTS
The authors declare no competing interest.
ACKNOWLEDGMENT
We thank all patients and families who participated in this study. We thank healthcare providers, research personnel, laboratory technicians involved in patient recruitment and data collection.