ABSTRACT
Cholera remains a significant public health burden in many countries in sub-Saharan Africa, though the exact mechanisms of bacterial emergence and spread remain largely undefined. We generated genomic data from 728 Vibrio cholerae O1 isolates predominantly collected between 2019-2024 to create the largest dataset of V. cholerae genomes sequenced locally in Africa. This dataset enabled us to interrogate recent patterns of spread, including the rapid circulation of the AFR15 lineage associated with unusually large outbreaks in Southern Africa. We provide evidence for the movement of the AFR15 lineage into new African Member States and confirm previously observed differences in V. cholerae transmission dynamics in West versus East Africa, though cross-border transmission is prevalent on both sides of the continent. Despite observed differences, evolutionary processes are similar across lineages and we find no evidence for significant changes in antimicrobial resistance genotypes. Overall, our findings emphasize the importance of regionally coordinated cross-border surveillance and interventions, while also demonstrating the critical role of locally generated genomic data in understanding the spread of cholera in Africa.
INTRODUCTION
Cholera, a diarrheal disease caused by the bacterium Vibrio cholerae, poses a significant public health threat globally. There have been seven major global outbreaks (pandemics) of cholera, with the current outbreak primarily affecting and persisting in Africa since its introduction from Asia in 1970 [1,2]. Seventh pandemic cholera is typically caused by the V. cholerae O1 serogroup and El Tor biotype (referred to as 7PET V. cholerae), and several prior studies have documented the intermittent and often seasonally driven outbreaks observed in many African countries [3–5]. These recurring outbreaks underscore the ongoing challenges in controlling and preventing cholera, particularly in resource-limited settings, and emphasize the need for comprehensive control strategies that use all available tools to better understand the epidemiology and dynamics of cholera transmission.
Severe, protracted, and persistent cholera outbreaks in several countries necessitate improved cholera surveillance and control [6]. Over 92,000 cholera cases were reported across 16 African countries from January to May 2024 [7], with Zambia (20,113 cases), the Democratic Republic of the Congo (DRC; 16,539 cases), and Mozambique (7,762 cases) among the most affected [8]. Mozambique and Malawi also experienced severe outbreaks in 2022 [9], with the outbreak in Malawi recorded as one of the largest (over 56,000 cases) and deadliest (>3% case fatality rate) outbreaks in history [10,11]. Extreme weather events and the COVID-19 pandemic, with its disruptions to healthcare systems and changes in societal behaviors, likely contributed to the high morbidity and mortality of these outbreaks. However, simultaneous outbreaks in neighboring countries have been previously observed in this region and across Africa, suggesting that cholera transmission is regional and crosses borders, which has critical implications for its control and elimination [12,13].
Investigating transmission dynamics and unusual disease presentation (for example, high case fatality rates) requires an in-depth look at not only epidemiological and clinical data, but at the V. cholerae bacterium itself. Whole genome sequencing provides information about the specific strain(s) circulating, which can determine whether outbreaks in neighboring countries are connected, or if recent changes in the bacterial genome may explain observed trends. As a result, genomic analysis has recently emerged as an important tool in understanding the spread of cholera in sub-Saharan Africa [2]. Using whole genome sequences generated from V. cholerae isolates around the continent, previous studies first identified at least three waves of global transmission from Asia to Africa during the seventh pandemic [14], which was later divided into at least 17 independent introductions of 7PET into Africa, deemed the AFR1-AFR17 (or equivalently known as T1-T17) lineages [2,10,15–17]. These lineages have since been used to connect recent African and Middle Eastern outbreaks [16], to demonstrate regional transmission patterns in both East [17–19] and West Africa [13], and to suggest a possible transmission route underlying recent cases [10].
Although genomic analyses have been effectively used to better understand cholera transmission on the African continent, previous work has also highlighted the gaps in our current understanding and the need for additional surveillance and analyses. In particular, most published work focuses on outbreaks prior to 2020, when genomic surveillance infrastructure in Africa was limited. Therefore, an updated picture of V. cholerae diversity is needed, including how transmission patterns have or have not changed in recent years, the role of environmental reservoirs in transmission, and whether changes in the V. cholerae bacterium may have contributed to the worst cholera outbreaks in decades. There are also growing concerns over the emergence of antibiotic resistance to drugs used to manage severe cases of cholera and other infectious diseases [20], especially because bacteria can share resistance genes via horizontal gene transfer [21]. Finally, questions remain about the ability of genomic data to provide information that can meaningfully be used to inform and prioritize intervention approaches such as organizing vaccination campaigns.
In response to these unanswered questions, the Africa Centres for Disease Control and Prevention (Africa CDC), through the Africa Pathogen Genomic Initiative, formed the Cholera Genomics Consortium in Africa (“CholGEN”) in collaboration with national public health institutes and national reference laboratories from seven African Union (AU) Member States (Cameroon, DRC, Malawi, Mozambique, Nigeria, Uganda, and Zambia) and international partners [22]. These Member States were selected based on public health needs, represent distinct cholera transmission scenarios (primarily endemic versus primarily caused by imports), and have collectively observed all known lineages in Africa. Additionally, several of these countries share borders, which will aid in understanding the role of cross-border transmission. As part of CholGen, laboratorians and bioinformaticians from each of these member states have been working to leverage local genomic sequencing capacity built during the COVID-19 pandemic to sequence bacterial isolates from recent cholera outbreaks. The resulting data provide an updated picture of circulating cholera diversity, point to specific instances of likely cross-border transmission, and highlight the role of genomic data in understanding different transmission scenarios. These results, made possible by a highly collaborative effort, underscore the importance of multi-country solutions to cholera control, which should include coordinated genomic surveillance that can reveal both cross-border transmission events and notable changes in the bacterial genome.
RESULTS
Through CholGEN, the seven AU Member States identified, processed, and sequenced 1,288 isolates in-country (Table 1, Figure 1A), resulting in a final dataset of 728 high-quality genomes included in downstream analyses (Supp. Data 1; see Methods for exclusion criteria). Each participating Member State selected cholera samples for sequencing based on geographical distribution and completeness of metadata, to obtain a representative picture of national and regional cholera spread. The isolates sequenced by CholGEN represent the vast majority of cholera genomes sequenced in Africa since 2019 and the largest dataset of V. cholerae genomes sequenced locally in Africa (Figure 1B). This density of recent data enabled the subsequent analyses presented below, which may help guide targeted public health interventions tailored to the transmissibility and persistence of specific outbreaks.
To assess whether the most recent cholera outbreaks were derived from the known diversity of V. cholerae in Africa or novel introduction events, we performed a Bayesian phylogenetic reconstruction of the third wave of 7PET, which includes all lineages known to be currently circulating in sub-Saharan Africa (AFR9-AFR17 [2]). The reconstruction included 728 high-quality genomes in this dataset and 1,764 publicly available whole genome sequences from Africa and elsewhere (Figure 2A, Supp. Data 2). We assigned newly generated sequences to lineages using their phylogenetic placement and found that they all descended from previously described introductory events (Figure 2B). Consistent with prior analyses, our data indicated that different regions of Africa have distinct transmission patterns. In Western and Central African Member States—Cameroon, DRC, and Nigeria—our data showed that lineages AFR10 (DRC) and AFR12 (Nigeria, Cameroon) were still circulating, consistent with previous reports [13,23]. This pushes the persistence of AFR10 and AFR12 in the region to at least 29 and 15 years, respectively. Our data also supported previous studies showing that Eastern and Southern Africa continue to maintain a greater diversity of circulating lineages [17,24,25]. We observed AFR10, AFR11, AFR13, and AFR15 circulating in the region, with two or more lineages often present in a country over a five-year period.
Many of the newly generated genomes were classified as AFR15 and may therefore provide updated insight into the emergence of this lineage, which was previously linked to the unusually large outbreak in Malawi and major outbreaks in the Middle East, South Africa, and Zimbabwe [10,26]. In addition to these locations, we identified AFR15 isolates in DRC, Mozambique, and Zambia, indicating that AFR15 continues to spread rapidly across Southern Africa and has now been introduced into Central Africa (Figure 2B). The newly identified presence of this lineage in DRC is likely linked to the outbreak in Zambia, since all AFR15 sequences from DRC were collected in the province of Haut Katanga, a province on the Zambian border. We also observed cholera transmission between these two countries elsewhere in the phylogeny (AFR10), suggesting that movement across national boundaries plays an important role in the maintenance of cholera in this region [27].
Despite the likely importance of cross-border transmission, our phylogenetic analysis also suggested that there have been multiple introductions of AFR15 into Africa, rather than a single introduction identified by prior investigations [10,26,28] (Figure 2A). Our analysis showed that most AFR15 isolates descended from an introduction from Southeast Asia in late-2019 (95% highest posterior density [HPD]: February 2019 to June 2020), while a smaller subset of 11 isolates originated from an introduction from the Middle East in mid-2022 (95% HPD: December 2021 to December 2022; inset of Figure 2A), coinciding with outbreaks in Iraq, Lebanon, Syria, and Pakistan [26,28–30]. Isolates from all six AU Member States associated with the AFR15 lineages were linked to the earlier and larger introduction, while only Mozambique and Zimbabwe isolates descended from the later introduction. Generation and analysis of additional sequences from these AFR15 sub-clusters will reveal if major outbreaks can be attributed to one or multiple introductions.
Limited genotypic differences between lineages
We observed that the ancestral branch leading to the AFR15 lineage was long, not just in terms of time but also in terms of genetic changes (Figure 2A & Supp. Figure 1), calling into question whether large mutational changes or altered substitution rates were associated with the emergence of new strains in Africa (Figure 2A). To determine whether high rates of mutation accumulation accompanied the emergence of any of the lineages, we assessed whether any lineages were more divergent than expected given their sampling dates, and found that genomes from lineages observed to be currently circulating in CholGEN Member States (AFR10, AFR11, AFR12, AFR13, and AFR15) did not deviate significantly from the expected number of substitutions (Supp. Figure 2). We also estimated the mean substitution rate for each lineage using an uncorrelated relaxed clock rate model (see Methods), and found that all lineages share a similar substitution rate distribution (Supp. Figure 3). Therefore, we concluded that evolutionary processes are similar across lineages despite differences in geographic range, and that other factors must be responsible for the rapid spread of the AFR15 lineage.
The spread of antibiotic resistance may also be associated with changes in the epidemiological patterns of bacterial spread [31] and is essential to informing best practices for cholera treatment when outbreaks occur. To provide an updated picture of the spread of resistance, we examined all isolate genomes for the presence of AMR genes and found that the AMR profiles of isolates from most CholGEN Member States did not change throughout our sampling period (Figure 2C), even when we looked at a fine temporal resolution (Supp. Figure 4). Certain resistance-associated genes, particularly those that are canonically present in the V. cholerae core genome (almE, almF, almG, catB9, and varG), were present in all isolates from all countries. Other genes were associated with the lineage circulating in a region, including floR, sul2, and aminoglycoside-phosphotransferase genes that are all absent in AFR12 and AFR13 strains [32]. We did not observe any genes associated with resistance to tetracyclines, which is the primary class of antibiotics used for the treatment of cholera [33]. However, several countries observed phenotypic changes in resistance during the time period associated with these samples [21,34], suggesting the need for additional studies comparing genotypic and phenotypic AMR results [35], as well as continued monitoring of resistance in affected regions.
We found a single exception to the above in our dataset. We observed that isolates from Uganda gained several AMR genes from 2020 to 2023, including aad2, blaPER-7, mph(A), mph(E), mrs(E), and sul1 (Figure 2C). This was not due to the introduction of a new lineage because only AFR13 was observed in Uganda during this period (Figure 2B). AMR in cholera is often associated with the acquisition of plasmids containing multidrug resistance elements, so we cross-referenced the AMR genes to determine if they were present in the V. cholerae genome or on mobile genetic elements. We found that the AMR genes acquired by the isolates in Uganda by 2023 were not on the core genome, and were all found on the IncA/C plasmid, which is known to be a primary source of multidrug resistance for cholera [36,37]. This plasmid has been found sporadically in historical 7PET isolates (Supp. Figure 5) and was not found in isolates from Uganda collected in 2018-2019, though it was observed in isolates from the 2018-2019 AFR13 outbreak in Yemen and in three Lebanese isolates that phylogenetically clustered with the 2023 isolates from Uganda [28,38]. Phylogenetic placement of 2023 sequences from Uganda containing the plasmid (which were collected from districts along the shore of Lake Victoria) suggest a recent acquisition (Figure 2A). These findings point to possible intercontinental spread of cholera strains carrying IncA/C and support the connection between outbreaks in the Middle East and Africa.
Cross-border transmission is frequent but rarely observed
The results above suggest that transmission patterns, rather than genotypic differences, may be crucial for understanding and preventing future cholera spread. The recency of collected isolates and close proximity of the CholGEN AU Member States, including those with ongoing AFR15 outbreaks, enabled us to take a close look at the location and frequency of transmission of AFR15 and other lineages across international borders. These results, which include identification of numerous examples of cross-border transmission across all lineages currently circulating in Africa, point to the importance of regional coordination for both surveillance efforts and outbreak interventions such as vaccination campaigns.
To closely examine cross-border transmission, we quantified the timing and number of geographic transitions (also called Markov jumps [39]), between AU Member States across the full posterior of a Bayesian phylogeographic reconstruction. While we found multiple prominent examples of cross border transmission in the phylogeny (Figure 3A), our initial analysis suggested that these transitions were infrequent during the third wave of 7PET (average rate of 0.04 location transitions per year; 95% HPD: 0.03-0.05 transitions per year). However, historical sampling of isolates has been inconsistent and likely biased, and we found that location transitions were unevenly distributed over time (Figure 3B). Closer inspection revealed that the timing of location transitions was moderately correlated with sampling (R2 value = 0.48; p-value < 0.001; Figure 3C). This result suggests that international transmissions may be more frequent and consistent than what was captured in the dataset, but only observed whenever genomic surveillance was adequate. Therefore, we expect that increased surveillance may identify locations with yet-unsampled outbreaks, allowing for more effective targeting of containment efforts to locations that are continuously reseeding transmission.
Future surveillance should be directed towards high diversity locations
Our results indicate that sampling plays an important role in observing cholera transmission, and suggest that moving towards routine genomic surveillance may support international cholera elimination goals by revealing transmission patterns that are currently missed. Therefore, we attempted to estimate un-captured transmission and determine if additional surveillance is equally needed everywhere or if there are certain lineages, countries, or regions that are most in need of additional sampling. To do so, we developed a metric to assess the value of generating a new genome that involves evaluating the sampling strategy, phylogenetic diversity, and total information captured by sequencing.
We took this holistic approach to assessing the value of generating new sequences because both sampling strategy and underlying transmission patterns influence what can be learned from sequencing. To evaluate the impact of the sampling strategy, we built on a technique commonly used in ecology [40,41] and estimated how many new mutations would be expected when adding a new genome to the analysis, referred to here as phylogenetic diversity [42,43]. While this value provides an estimate of the potential information gained from additional sequencing, it must also be considered in the context of the total number of sequences previously generated. We also recognize that sequencing can still be useful in cases of limited phylogenetic diversity if there is other information to be gained, for example if additional sequencing may reveal new introductions into a country or region. Therefore, we used a rarefaction approach [44,45] to quantify how much we would likely learn about introductions from additional sequencing. We applied this comprehensive metric to data from each CholGEN Member State and saw that the amount of diversity and information gained per new genome varied between locations and surveillance intensities (Figure 4A-G). Below, we focus on three examples—Uganda, DRC, and Zambia—that illustrate the range of phylogenetic diversity captured and additional information revealed given the differing surveillance strategies employed.
First, new genomes from Uganda revealed marked phylogenetic diversity and numerous introductions (Figure 4D). This result is expected, as CholGEN samples from Uganda cover multiple years and longer sampling periods should capture more diversity assuming a constant clock rate [46]. It also aligns with transmission dynamics previously described in Uganda, where repeated introductions (rather than local spread) sustain cholera and thereby increase phylogenetic diversity [24,47]. These findings suggest that continued surveillance using a similar strategy, where genomic data is consistently collected over time, would likely reveal further introductions and greater diversity. This could help identify additional sources of cholera or new transmission routes that could be targeted for intervention.
In contrast to Uganda, new genomes from DRC revealed relatively less phylogenetic diversity and fewer introductions, though we estimated that there are far fewer introductions into DRC than into Uganda (Figure 4C). This is likely due to distinct transmission patterns in DRC, where cholera is highly persistent (Figure 2B). These results suggest that further surveillance may uncover more phylogenetic diversity, but it is unlikely to reveal new introductions without a change in the surveillance strategy or overall transmission pattern within the country. Therefore, routine surveillance will most likely provide a representative view of the situation in DRC without requiring intensive sampling of every outbreak. However, it will be important to ensure that sampling in DRC is representative across space in addition to over time. For instance, if surveillance is not geographically representative, it may be useful to first capture geographic gaps to ensure lineages are not being missed before finalizing a surveillance strategy.
In contrast to Uganda and DRC, new genomes from Zambia showed comparatively low levels of phylogenetic diversity. However, Zambia is an example of a situation in which phylogenetic diversity was low but the potential to capture additional information (for instance, number of introductions) was very high, as new data uncovered many additional introductions despite genetic similarity (Figure 4E). This finding aligns with the relatively limited range of collection dates for Zambia sequences generated by CholGEN, as well as with evidence for the close transmission relationships of Zambia with Malawi, DRC, and other locations. To better capture the diversity of circulating lineages and the locations that seed cholera in Zambia, additional surveillance would need to be conducted over a longer period. Additionally, during periods of rapid spread by a single lineage, resources may be better spent capturing temporal or geographic diversity rather than sequencing large numbers of likely identical sequences from concentrated areas or clonal outbreaks.
DISCUSSION
In this manuscript, we highlight how the CholGEN initiative led to the enhancement of local cholera genomic surveillance in several AU Member States, leading to the largest dataset of high-quality V. cholerae genomes sequenced locally in Africa. These genomes fill critical gaps in our understanding of recent cholera circulation, revealing distinct transmission patterns across different African regions and highlighting the rapid spread of the AFR15 lineage, including its expansion into additional Member States like the DRC. Despite the association of AFR15 with unusually large outbreaks in Malawi, Zambia, and beyond, we did not identify significant genotypic differences compared to other lineages previously identified in the region. Antimicrobial resistance signatures were also consistent across lineages and outbreaks. We found evidence of cross-border transmission between neighboring countries, and our data suggests that these events may be more frequent than currently recognized, which could have significant implications for how local experts develop surveillance strategies and cholera control efforts.
The proximity of CholGEN Member States and their distribution across Africa allowed us to compare the spread of cholera within and between different regions, particularly for the most recent outbreaks on the continent. This work provides further evidence that cholera readily spreads within certain regions of Africa, indicating that cross-border transmission has a prominent role in the maintenance of cholera on the continent. That said, these analyses clearly show that sampling strategy plays a significant role in the interpretation of phylogenetic analyses, and that our ability to comment on the directionality or absolute number of international transmissions is limited. Refining our understanding of cholera transmission will require continuous and routine genomic surveillance, both during and between cholera outbreaks. This will both fill in gaps in our current understanding and provide a more systematic dataset that can inform future prioritization through the sequencing information metrics described in Figure 4 above. By evaluating these metrics in the context of the transmission patterns present in a country, surveillance efforts can be more effectively tailored to meet specific needs and optimally allocate resources.
It is important to recognize, however, that the real barriers to sampling are often not in the strategy, but in the realities of sample collection, transport, sequencing, and pairing with epidemiological metadata. Although all sequencing described in this manuscript was performed from cultured isolates, we were only 60% effective in generating high-quality whole genome sequences. This is likely due to a combination of isolate contamination, storage conditions not being conducive to preserving bacterial cultures, and other challenges associated with multi-step processes like whole genome sequencing. To enable better surveillance of cholera and other local pathogens, we need to improve the full pathway leading to sequencing, from sample collection and V. cholerae identification through sample transport and laboratory processing. Additionally, it may be important to explore sequencing methods that do not rely on bacterial culture, such as direct sequencing from stool [17,48]. This will increase the total number of samples that can be selected for sequencing and open the door to improved environmental surveillance including viable but nonculturable specimens that may play an important role in understanding transmission [49].
Even once sequencing challenges are addressed, there are important next steps to take in terms of translating findings to public health impact. For example, we found that isolates in Uganda acquired a plasmid associated with AMR, but without phenotypic data paired to these samples, it is difficult to understand the implications of such an acquisition. Further in-depth studies comparing antimicrobial resistance phenotype to genotype [35] are needed to assess which genotypes correlate with actual antibiotic resistance, and which phenotypic changes are caused by de novo evolution of resistance-conferring mutations versus the acquisition of mobile genetic elements from other bacterial organisms (V. cholerae and otherwise) through horizontal gene transfer. This information would provide a clear and actionable benefit to public health, as genotypic surveillance could therefore also serve the role of current antimicrobial susceptibility assays, enabling the use of one holistic assay (whole genome sequencing) to obtain the results of what is currently at least two separate experiments.
Additionally, while our work provides further evidence of cross-border transmission of cholera, coordination with local epidemiology and policy teams is needed to figure out how this transmission is happening and how best to stop it. Regional analyses, as demonstrated here, can identify broad transmission patterns, such as our finding that transmissions mostly occur between neighboring countries. However, more local, targeted investigations using sequencing to confirm epidemiological hypotheses are needed to determine fine-scale patterns that can more directly enhance containment and intervention strategies. Beyond transmission, future studies could also address the effectiveness of interventions directly, for example by looking more closely at mutations in V. cholerae genes implicated in vaccine effectiveness.
Through the creation of CholGEN, we have laid the foundations for genomics-informed decision-making for cholera containment and elimination. In this manuscript, we showcase how developing and extending local sequencing and analytical capacity resulted in the discovery of new findings related to V. cholerae spread. Crucially, we have also created a collaborative, multicountry group working to advance cholera containment and elimination together. Our next steps include more in-depth analyses of the collected data, including exploration of virulence factors, vaccine targets, and within-country cholera spread. We will also use the information gathered here to inform prospective surveillance in all seven AU Member States. Coupled with effective data sharing and collaboration, we hope that these data, and the conclusions we draw from them, will bring us one step closer to the goal of global cholera elimination by 2030.
METHODS
Ethics statement
This research utilized samples collected through routine national surveillance programs and all data and samples were anonymized to ensure the privacy and confidentiality of the individuals involved. The use of these samples was approved by respective Ethical Review Boards/Institutions in each country, ensuring that the research adhered to the highest ethical standards and legal requirements. In Cameroon, ethical approval was granted by the National Committee on Ethics in Research for Human Health (2024/02/1640/CE/CNERSH/SP). In DRC, ethical approval was granted by the Board of the Ethics Committee of the School of Public Health at the University of Kinshasa (ESP/CE/148/2023 and ESP/CE/149/2023). In Malawi, this work was approved by the National Health Sciences Research Committee (Protocol #867). In Mozambique, ethical approval was granted by the National Bioethics Committee for Health (335/CNBS/23). In Nigeria, ethical approval was not required as it is based on data from Nigeria’s national surveillance program, collected by the Nigeria Centre for Disease Control. In Uganda, ethical approval was granted by the Uganda Ministry of Health National Health Laboratory Services (UNHL-2024-88).
Sample collection, cholera confirmation, and bacterial culture
V. cholerae isolates were collected from clinically suspected cholera cases in CholGEN Member States between 2018 and 2024. In each Member State, isolates came from patients from cholera treatment centers during outbreaks or from endemic areas. In all seven countries, the Cholera Rapid Diagnostic Test positive samples were transported in Cary-Blair media to laboratories then cultured on Thiosulfate-citrate-bile salts-sucrose (TCBS) agar medium and incubated for 20–24 hours. Phenotypic identification of V. cholerae colonies was based on morphology, motility, and biochemical characteristics (positive oxidase, saccharose, indole, and gelatinase). V. cholerae isolates were confirmed with agglutination tests with anti-O1 or anti-O139 serum (WHO antisera).
Genomic DNA extraction and quantification
DNA extraction was performed at multiple laboratories across the CholGEN participating countries: NPHL (Yaounde, Cameroon), CPHL (Kampala, Uganda), INS (Maputo, Mozambique), ZNPHI (Lusaka, Zambia), PHIM (Lilongwe, Malawi), INRB (Kinshasa, DRC), and NCDC (Abuja, Nigeria). Confirmed V. cholerae isolates were retrieved from storage and subcultured on selective TCBS, MH, and/or HCK agar plates and incubated at 37°C overnight. In all countries, V. cholerae DNA was extracted using the standard protocol from the Qiagen QIAamp DNA Mini Kit, with a final elution volume of 200 µL as per the manufacturer’s instructions. Genomic DNA concentrations were measured with the Qubit Fluorometer 4.0 (Thermo Fisher), following the dsDNA HS assay standard protocol and stored at 4°C.
Library preparation and sequencing
Illumina library preparation and sequencing was performed by laboratories in CholGEN participating countries. DNA samples were normalized to 0.6 ng/µL from which 30 µL was used as input for the Illumina DNA Prep Kit (Illumina). Library concentrations were assessed using the Qubit High Sensitivity DNA Kit (Invitrogen), and library size distributions were measured using a BioAnalyzer High Sensitivity DNA Kit (Agilent). Genomic libraries from individual samples that had expected size distributions and had a concentration greater than 1 ng/µL were normalized and pooled in equimolar amounts at 2 nM. The 2 nM library pool was sequenced on either an Illumina MiSeq or NextSeq 2000 using 300 cycles kits. Quality assessments of short reads were performed using FastQC [50].
Reference-based Genome Assembly
We developed a pathogen-agnostic bioinformatics pipeline to generate consensus sequences from raw sequencing reads, called bacpage, which is available as a local command line tool (available on Github: https://github.com/CholGen/bacpage; with an online manual: https://cholgen.github.io) on the cloud-based Terra platform [51], where a majority of analyses were performed. Briefly, as part of the pipeline, paired-end reads were aligned against the Vibrio cholerae N16961 isolate (accession AE003852/AE003853) using bwa-mem v0.7.17 [52]. Variants compared to the reference were identified using BCFTools v1.20 [53]. Variants were retained if they met the following criteria: (1) variant quality score of at least 20, (2) mapping score of at least 30, (3) supported by at least 15 reads, (4) present in at least 50% of reads covering a position, and (5) supported from both strands. A consensus sequence for each sample was generated by applying the supported variants to a concatenated reference genome.
Consensus sequences for each sample were kept for subsequent analyses if at least 90% of their reads mapped to the reference genome, had median coverage across all positions of the reference genome of at least 15, and had less than 10% ambiguous nucleotides.
Vibrio cholerae Genomic Data
The sequences generated in this study were combined with 1772 publicly available third wave (AFR9 onwards) genomes that were found by searching literature, Sequence Read Archives (SRA), and VibrioWatch for V. cholerae isolates from Africa and Asia (Supp. Data 2). Where raw sequencing data was available, we generated reference-based assemblies using the same pipeline and reference as described above. In other cases, where only assembled contigs were available, variants compared to the N16961 reference were identified using snippy v4.6.0 (https://github.com/tseemann/snippy) with the --ctgs flag and applied to the reference genome using BCFTools.
Antimicrobial Resistance Profiling
Short reads were assembled de novo using the TheiaProk Illumina paired-end sequencing workflow publicly available on DockStore [54]. Briefly, the workflow trims low quality reads and portions of reads using fastp [55] and removes sequencing adaptor sequences using BBDuk [56]. Assembly was performed on high-quality reads with the Shovill pipeline (https://github.com/tseemann/shovill) which uses the SKESA assembler [57]. We then used AMRFinderPlus v3.11.20 with database version 2023-09-26.1 to identify antimicrobial resistance genes and point mutations in the assemblies [58].
Epidemiological Data
We obtained yearly case counts for each country participating in CholGEN from data curated by the cholera taxonomy project at the Johns Hopkins Bloomberg School of Public Health (https://cholera-taxonomy.middle-distance.com/). Data contributions to the team have come from many sources, though the annual case counts are heavily dependent on the WHO Annual Cholera Reports and the Global Health Observatory data repository.
Phylogenetic Analysis
We used the phylogeny subcommand of bacpage to generate a recombination-masked maximum likelihood phylogeny. Briefly, the pipeline performs the following steps. The complete genomic dataset was assembled by concatenating the individual reference-based assemblies along with the N16961 reference into a pseudo-alignment. Problematic sites including known recombinant regions were masked using a custom GFF file and script. Novel recombinant regions were masked using gubbins v2.3.4 [59]. To reduce the computational burden of the phylogenetic analysis, we filtered the pseudo-alignment to only include variable positions using SNP-sites v2.5.1 [60]. We constructed a maximum likelihood phylogenetic tree for the dataset using IQ-TREE2 and an GTR substitution model [61,62].
The resulting phylogeny was rooted on the reference genome. To further reduce computation burden of the analysis we extracted the subtree descending from the most recent common ancestor of lineages AFR9-15, which corresponds to the third wave of 7PET. The subtree was time-resolved using Treetime v0.11.3 [63]. Taxa deviating more than three interquartile distances from the clock-rate regression were pruned from the phylogeny. Additionally, we randomly resolved polytomies in the phylogeny by adding zero-length branches with gotree [64].
We refined the time-resolved phylogeny using BEAST v1.10.5 [65]. We specified a GTR substitution model with gamma distributed rate heterogeneity under an uncorrelated relaxed molecular clock and a constant coalescent tree prior. For the uncorrelated relaxed molecular clock, we specified an informative lognormal distribution with a mean of 8.4*10-7 substitutions/site/year and a standard deviation of 1.4*10-6 consistent with Weill et al. 2017 [2]. To account for utilizing an SNP alignment rather than a full-genome alignment, we additionally supplied the number of constant sites in the cholera reference genome.
We ran an MCMC chain of 300 million steps utilizing the BEAGLE computational library [66]. Parameters and trees were sampled every 10,000 and 100,000 steps, respectively. Convergence and mixing of the MCMC chains were assessed with Tracer v1.7.2 and Beastiary v1.8.3 [67,68]. We generated a second chain by restarting the initial chain at the 200 millionth state with a different random seed and running it for an additional 100 million states, resulting in two chains of 100 million states each after discarding burnin. After combining chains, all estimated continuous parameters were determined to have effective sample sizes of greater than 100. BEAST XML and log files for the phylogenetic analysis are available on GitHub (https://github.com/CholGen/RegionalAnalysis-2024).
Lineage Assignment
Sequences collected from Africa were assigned to the lineages initially described by Weill et al. 2017 [2]. Assignment was performed by phylogenetic placement using a custom script. Briefly, for each lineage, we identified the node corresponding to the most recent common ancestor (MRCA) of lineage representatives collected from prior publications [2,16,26,28,69]. All sequences descending from the MRCA were assigned to the lineage. Sequences that were collected from Africa but did not descend from a previously described lineage were assigned a value of “sporadic.”
Phylogeographic Reconstruction
We performed a discrete state ancestral reconstruction on geographic states using BEAST [39]. This analysis reconstructed location-transition history across an empirical distribution of 2000 time-calibrated trees samples from the posterior tree distribution estimated above [70,71]. To conduct the analysis, we assigned each taxon a discrete location state based on the country where it was collected. To limit the number of transition rates needing to be estimated, we binned non-African sequences into a single “Other” state, and African countries with less than 20 reported sequences into a single “Other-Africa” state. Ultimately we used 14 distinct locations states. We assumed that geographic transition rates were reversible and used a symmetric substitution model. We used Bayesian stochastic search variable selection to infer non-zero migration rates [39]. The MCMC algorithm was run for 500,000 generations and sampled every 500 generations. We used the TreeMarkovJumpHistoryAnalyzer from the pre-release version of BEAST v1.10.5 to obtain the Markov jump estimates and their timings from the posterior tree distribution and assumed that they are a suitable proxy for the transmission between two locations. We used TreeAnnotator v1.10 to construct a maximum clade credibility (MCC) tree which we visualized with baltic (https://github.com/evogytis/baltic). BEAST XML and log files for the discrete phylogeographic analysis are available on GitHub (https://github.com/CholGen/RegionalAnalysis-2024).
Introduction Identification
We identified country-specific strains from the posterior tree distribution of the discrete phylogeographic reconstruction using a custom script. Country-specific strains were defined as two or more taxa from the same country that descend from a single, shared introduction of cholera into the country from another country. Descendant taxa from outside the specific country, do not belong to the strain. This concept and the algorithm for identifying strains is derived from the definition of Du Plessis et al. 2021 [72] (see the citation for an illustration of this concept, wherein the term transmission lineage is used in place of strain).
Phylogenetic diversity
We calculated the amount of evolutionary history contributed by a set of genomes as their phylogenetic diversity [43,73,74]. For each tree in the posterior distribution of the phylogenetic analysis, we converted the units of branch lengths from years to substitutions/site by multiplying branch lengths by the estimated substitution rate for each branch. For each country, we marked taxa that were collected from the country and were generated by CholGEN participants. We then calculated the phylogenetic diversity contributed by CholGEN for a given country as the difference between the total branch length of the tree and the total branch length of the tree after pruning the marked taxa. Lastly, this value was divided by the total number of sequences marked for a given country.
Rarefaction analysis
For each country, we identified unique introductions into each country and identified genomes that descended from those introductions for each tree in the posterior. For varying sample sizes, ranging from 1 to the total number of genomes collected for a country, genomes were randomly drawn without replacement and the number of unique introductions in the sample was computed. To incorporate uncertainty from the discrete state reconstruction, we performed this rarefaction analysis 1000 times, using a randomly sampled tree each time. For each trial, we parametrically estimated the total number of introductions into each country by fitting the rarefaction curve with a Michaelis-Menten equation. The Michaelis-Menten equation is a two parameter model with the following form: where x is the number of genomes sampled, y is the number of introductions observed, V is the asymptotic estimate of the total number of introductions, and K is a shape parameter describing the rate at which new genomes reveal introductions.
Data Availability
Raw sequencing reads are available on NCBI under the BioProject accession ID PRJNA1145341 and Sequence Read Archive accession IDs are provided in Supplemental Data 1. Accession IDs for the publicly available sequences acquired from NCBI or VibrioWatch are provided in Supplemental Data 2.
Data availability
Raw sequencing reads are available on NCBI under the BioProject accession ID PRJNA1145341 and Sequence Read Archive accession IDs are provided in Supp. Data 1. Accession IDs for the publicly available sequences acquired from NCBI or VibrioWatch are provided in Supp. Data 2.
Code availability
Code for all analyses and figure generation, as well as XMLs and log file for BEAST analyses are available at: https://github.com/CholGen/RegionalAnalysis-2024.
AUTHOR CONTRIBUTIONS
G.M., N.L.M., C.K.T., S.W., and S.K.T. conceptualized the study. G.M., N.L.M., A.S.A., and S.W. contributed the methodology. N.L.M. provided software. G.M. and N.L.M. conducted the formal analysis. M.K., G.K.K., O.A.A., J.J.C., Mathews Kagoli, R.G.E., A.A., I.S., V.J.N., A.A.A., O.A.B., B.M.A., A.B., E.B.M., F.O., C.E.C., N.I., O.K., O.L., P.K.M., G.A.E., I.C.M., A.A.M., G.N., J.M., K.M., A.M.N., M.E.N., M.P., D.M.S., C.M., S.M.S., D.W.W., P.O.W., M.Y., S.Y., L.Z., and CholGEN Consortium authors provided resources. G.M., M.K., G.K.K., O.A.A., J.J.C., Mathews Kagoli, R.G.E., A.A., I.S., V.J.N., A.A.A., O.A.B., B.M.A., A.B., E.B.M., F.O., C.E.C., N.I., O.K., O.L., P.K.M., G.A.E., I.C.M., A.A.M., G.N., J.M., K.M., A.M.N., M.E.N., M.P., D.M.S., C.M., S.M.S., D.W.W., P.O.W., M.Y., S.Y., L.Z., and CholGEN Consortium authors curated data. G.M., N.L.M., C.K.T., S.W., and S.K.T. wrote the original draft of the manuscript. All authors reviewed and edited the manuscript. N.L.M. performed visualization. J.E.B., R.C., H.H., J.I., J.P.L., D.M., S.N., A.K.D., D.A.S., J.K., Y.K.T., S.W., and S.K.T. supervised the study. G.M., C.K.T., S.W., and S.K.T. undertook project administration. A.K.D., D.A.S., and S.K.T. acquired funding.
COMPETING INTERESTS
The authors declare no competing interests.
SUPPLEMENTAL FIGURES
SUPPLEMENTARY INFORMATION
Supplementary Note 1. CholGEN Consortium Authors.
Supplementary Data 1. Details of V. cholerae O1 isolates generated by this study. Includes metadata, sequencing and assembly statistics, and antimicrobial resistance profiling data for all sequences included in analyses.
Supplementary Data 2. List of publicly available V. cholerae O1 isolates used in this study. Metadata including collection location, year, assigned lineage, and accession number for all sequences included in the background dataset for this study.
ACKNOWLEDGEMENTS
We extend our sincere gratitude to the Africa Centres for Disease Control and Prevention (Africa CDC), Africa Public Health Foundation (APHF), and Broad Institute of MIT and Harvard, for their invaluable support and collaboration. This work was supported by the Bill & Melinda Gates Foundation (INV-047157 and INV-047156).
Footnotes
↵* Jointly supervised this work