Abstract
Taxonomic classification is an essential step in the analysis of microbiome data that depends on a reference database of whole genome sequences. Taxonomic classifiers are built on established reference species, such as the Human Microbiome Project database, that is growing rapidly. While constructing a population wide pangenome of the bacterium Hungatella, we discovered that the Human Microbiome Project reference species Hungatella hathewayi (WAL 18680) was significantly different to other members of this genus. Specifically, the reference lacked the core genome as compared to the other members. Further analysis, using average nucleotide identity (ANI) and 16s rRNA comparisons, indicated that WAL18680 was misclassified as Hungatella. The error in classification is being amplified in the taxonomic classifiers and will have a compounding effect as microbiome analyses are done, resulting in inaccurate assignment of community members and will lead to fallacious conclusions and possibly treatment. As automated genome homology assessment expands for microbiome analysis, outbreak detection, and public health reliance on whole genomes increases this issue will likely occur at an increasing rate. These observations highlight the need for developing reference free methods for epidemiological investigation using whole genome sequences and the criticality of accurate reference databases.
Competing Interest Statement
The authors have declared no competing interest.
Funding Statement
This work was supported by the 100K Pathogen Genome Project.
Author Declarations
All relevant ethical guidelines have been followed and any necessary IRB and/or ethics committee approvals have been obtained.
Yes
All necessary patient/participant consent has been obtained and the appropriate institutional forms have been archived.
NA
Any clinical trials involved have been registered with an ICMJE-approved registry such as ClinicalTrials.gov and the trial ID is included in the manuscript.
NA
I have followed all appropriate research reporting guidelines and uploaded the relevant Equator, ICMJE or other checklist(s) as supplementary files, if applicable.
NA
Data Availability
The whole genome sequences are available now via the SRA for all bu, except BCW8888. It will be publically available within 90 days.