Abstract
Nipah virus (NiV), a highly lethal virus in humans, circulates silently in Pteropus bats throughout South and Southeast Asia. Difficulty in obtaining genomes from bats means we have a poor understanding of NiV diversity, including how many lineages circulate within a roost and the spread of NiV over increasing spatial scales. Here we develop phylogenetic approaches applied to the most comprehensive collection of genomes to date (N=257, 175 from bats, 73 from humans) from six countries over 22 years (1999-2020). In Bangladesh, where most human infections occur, we find evidence of increased spillover risk from one of the two co-circulating sublineages. We divide the four major NiV sublineages into 15 genetic clusters (emerged 20-44 years ago). Within any bat roost, there are an average of 2.4 co-circulating genetic clusters, rising to 5.5 clusters at areas of 1,500-2,000 km2. Using Approximate Bayesian Computation fit to a spatial signature of viral diversity, we estimate that each genetic cluster occupies an average area of 1.3 million km2 (95%CI: 0.6-2.3 million), with 14 clusters in an area of 100,000 km2 (95%CI: 6-24). In the few sites in Bangladesh and Cambodia where genomic surveillance has been concentrated, we estimate that most of the genetic clusters have been identified, but only ∼15% of overall NiV diversity has been uncovered. Our findings are consistent with entrenched co-circulation of distinct lineages, even within individual roosts, coupled with slow migration over larger spatial scales.
Introduction
Nipah virus (NiV) is a bat-borne zoonotic virus, classified by the World Health Organization as a priority pathogen (1). The majority of infected humans die following infection and while most infections occur following zoonotic spillover events, human-to-human transmission is responsible for around a third of known cases (2). There is currently no available treatment or vaccine. NiV was first identified in 1999 in Malaysia and has since been detected almost annually throughout South Asia (3–5). Pteropus fruit bats are the reservoir hosts for NiV (6). Detected spillovers in humans primarily occur when individuals drink raw date palm sap from trees upon which infected bats have fed (4). However, infection through intermediary species, such as pigs in the 1999 outbreak in Malaysia and horses in 2014 in the Philippines, has also been registered (4, 7).
Despite the substantial risk to human health, very little is known about the underlying genetic diversity of NiV. We do know that Pteropus bats are found throughout South and Southeast Asia and that NiV infection in Pteropus bats is common, with serostudies from across the region identifying 3-83% of adult bats having antibodies to NiV (3, 8, 9). However, the patterns of infection within bat populations remain unclear, including the number of discrete lineages circulating within any roost, the extent to which different lineages spread through the region or the ability for NiV to freely transmit between different species within the Pteropus genus. We also do not know if specific lineages are linked to increased spillover risk, including the specific routes of transmission. These critical knowledge gaps are problematic for assessments of spillover risk, and vaccine development.
Characterizing the genetic diversity of NiV is difficult as many isolates of NiV are from human cases, which remain rare events and are only detected in a few places. Even in Bangladesh, the country with most spillover events, there are only estimated to be ∼15 spillovers into humans per year (10). The representativeness of viruses obtained from human infections is also unclear. This has motivated efforts to instead obtain NiV from bat populations, but this is very difficult as bats appear to be largely asymptomatic and virus shedding is detected infrequently (9). These efforts have also been concentrated to a few locations within Thailand, Bangladesh and Cambodia (11–13). While it is difficult to make inferences about NiV diversity using sequences obtained from any one location, by pooling information across these different locations combined with appropriate analytical methods, we can obtain a more complete picture of diversity. Here we provide a comprehensive assessment of the diversity of NiV using all publicly available sequences coming from six countries, along with several previously unpublished sequences. We develop methods that are robust to strong biases in where and when sequences are obtained to track the diversity of NiV across spatial scales (within roost, within district, country and international) and quantify the extent to which diversity has been fully identified in locations that have implemented extensive virus surveillance efforts.
Results
We analyzed 257 sequences from six countries covering a 22-year time period (1999-2020, Figure 1A, Figure S1). 73 (28%) of the sequences came from human infections, 175 (68%) from bats and five (4%) from other sources (Figure 1B). Using Bayesian Evaluation of Temporal Signal (BETS) analysis (14), we found there was strong support for a model in which the sampling dates of the sequences were included over a model in which they were considered contemporaneous (Bayes Factor of 415). We therefore built a time-resolved phylogeny and found that sequences could be broadly characterized into two major genotypes: genotype I and genotype II, previously reported (15), and four minor genotypes I.A, I.B, II.A, II.B (Figure 1C). We estimated a mean nucleotide substitution rate of 4.5 x 10-4 subs/site/year (95%CI: 2.9 x 10-4, 6.0 x 10-4), consistent with previous estimates (12). We found that genotypes I and II diverged in 1937 (95%CI: 1838-1983), and the minor genotypes diverged in the 1970s. There was broad spatial structure in these genotypes, with genotype I being dominant in the west of the region and genotype II in the east of the region (Figure 1D). We found that countries on the Eastern (Indonesia/Malaysia) and Western edges (India) only had a single circulating sublineage. By contrast, Cambodia, with a central position in the region, had sequences from all four sublineages, consistent with the center of this region being the original location of NiV emergence.
Sequences obtained from human cases were all from Bangladesh, which had two circulating sublineages (IA and IB). Lineage IA was found throughout the period 2004-2018, whereas IB was found in the period 2006-2009 only (Figure S2). We found that all human sequences in Bangladesh (N=55/55) came from genotype IA, with none from genotype IB (N=0/55). By contrast, 64.9% (N=24/37) of the bat sequences come from IA and 35.1% (N=13/37) come from IB. These observations suggest that some genotypes may have an increased probability of spilling over into human populations (p-value of 0.007 using Fisher exact test).
To assess the extent of spatial structure among the sequences, we used the time-resolved phylogeny to compare the total evolutionary time that separated each pair of viruses with their spatial separation. To limit the effects of sampling bias from the sequences obtained from bat roosts we randomly sampled a single sequence from each roost and each year and sampled a single sequence from each human case cluster. We found that each 10-year increase in evolutionary time was associated with an average 186 km (95% CI 161 - 212) increase in the distance that separates them, equivalent to 19 km a year (Figure 2).
In order to provide a finer scale characterization of genetic diversity, we adapted a genetic clustering method to sort the sequences into different genetic clusters based on the distribution of evolutionary distances between pairs in the phylogeny (16). This approach resulted in 15 unique clusters (Figure 3A-D). The dates of divergence of the genetic clusters ranged from 1978 to 2002, with the mean evolutionary time from a cluster to the next-closest cluster of 15 years. We found that genetic clusters tended to aggregate within the same country (Figure 3A). The clusters were also separated by bat species with eleven of the genetic clusters exclusively found within one species (Figure 3B). Bat species themselves were also spatially structured at a country level, as all of the sequences from four (67%) countries came from one single bat species (Figure 3C). We estimated that 96.8% (95%CI: 96.4-97.2) of bat roosts separated by <100 km are of the same species, dropping to 53.4% (95%CI 47.9-58.8) for roosts separated by 500-1,000 km (Figure S3). On average, each 100 km increase in distance between bat roosts is associated with 0.62 (95%CI 0.61 - 0.63) times the odds of the roosts coming from the same species, providing a crude quantification in the spatial overlap of bat species in the region.
We found that there were an average of 2.41 (95%CI: 1.92-2.94) different genetic clusters found within the same bat roost. The probability that a pair of sequences belonged to the same genetic cluster fell from 36.6% (95%CI: 30.6 - 45.1) when they were found within <100 km of each other to 5.7% (95%CI: 2.5 - 9.7) when they were 500-1,000 km apart. Using logistic regression, we found that each additional 100 km in spatial distance separating roosts was associated with 0.75 (95%CI: 0.73-0.77) times the odds of being part of the same genetic cluster (Figure 3E). As the probability of being from the same bat species is strongly linked to the distance between locations, we could not disentangle the role of spatial segregation between lineages from being solely a spatial effect or due to the different Pteropus species occupying different locations. However, on average, we found that pairs of sequences coming from the same bat species had 19.3 (95%CI 15.9 - 23.6) times the odds of belonging to the same genetic cluster (Figure 3F). Importantly, this characterization of the changing probability of being from the same genetic cluster as a function of distance is robust to biased observation processes, and therefore can be considered a spatial signature of NiV ecology. Our estimates of spatial dependence were robust to broad variation in the definition of a genetic cluster that results in greater or fewer clusters (Figure S5).
To estimate the average number of genetic clusters circulating within an area, we used Approximate Bayesian Computation to fit our observed distribution of spatial clustering of genetic clusters (Figure 4A). We made a simplifying assumption that the spatial footprint covered by a genetic cluster is equivalent throughout the region, and that it can be captured by a multivariable normal distribution. We estimate that, on average, 95% of the infections from a genetic cluster were found within an area of 1.3 million km2 (95%CI: 0.6-2.3) and that there were an average of 14 discrete genetic clusters per land area of 100,000 km2 (95%CI: 6 - 24). We explore different ways to approximate the circulation area of Pteropus bats. For each country in the region, we estimate NiV diversity using existing estimates of Pteropus bats’ geographic range (from the International Union for Conservation of Nature [IUCN]). We also consider the area covered by tree cover (available from Global Forest Watch) as an alternative, indirect marker of the area covered by Pteropus bats. For this second approach, we separately consider over 10%, 30%, or 50% of tree cover as a proxy of the area covered by Pteropus bats (Figure 4B, Table 1, Figure S4, Table S6) (17–19). For most countries, we found little variation among estimates of diversity using these different approaches. For example, in Thailand, we estimate 15 unique genetic clusters (95%CI: 6 - 26) using the IUCN’s geographic range, and 17 unique genetic clusters (95%CI: 7 - 29) using the area with over 10% of tree cover. In Bangladesh, these same options suggest the presence of 15 genetic clusters (95%CI: 7 - 29), or of 11 genetic clusters (95%CI: 4 - 20), respectively. Taking South and Southeast Asia as a whole, we estimate between 66 and 118 separate NiV genetic clusters using the different approaches (range of confidence intervals of 28-211), compared to 15 currently detected, suggesting ∼80-90% of circulating genetic clusters remain undetected.
Finally, we consider the extent to which genetic diversity has been fully uncovered in the six long- term established surveillance sites in the region (three in Bangladesh, two in Thailand, and one in Cambodia), and whether there exist predictors of the number of genetic clusters in any place. There have been between 14 and 76 sequences obtained in these surveillance sites, resulting in the detection to date of between 2 and 7 different genetic clusters (Figure 4C). We estimated the number of new lineages that would be detected with additional sampling. This approach assumes that all clusters are equally likely to be detected and that all clusters circulating in an area are present at approximately equal levels. It also assumes stability in the clusters circulating in a location over time. We estimated that all the circulating genetic clusters have been identified in five of the six locations (Figure 4C, Figure S6). This can be directly inferred from the saturation in the number of observed clusters in those locations. The estimated total number of clusters circulating within a subnational division was not significantly associated with the size of the study area within the province, or with human population density. There was some evidence of an increase in diversity with percentage of forest cover (p-value of 0.015) (Figure S7).
Discussion
We have used NiV sequences, alongside information on location and host from multiple countries to generate a comprehensive characterization of the underlying genetic diversity of a pathogen that poses a major risk to human health. We found that NiV is strongly spatially structured. Overall, viral movement across the region is slow, with limited genetic similarity in the viruses circulating in different countries. These findings are consistent with a previous analysis using data from Bangladesh only (20). The evolutionary time separating sampled viruses in the two extremes of the Nipah region (i.e., India to Malaysia) was over 140 years. These patterns suggest that the virus is largely endemically entrenched within individual communities, with greatest diversity observed in the middle of this region. Cambodia was the only country in our dataset with all four sublineages detected. It remains unclear the extent to which the spatially structured nature of bat species, mixing patterns of bats across roosts and pre-existing immunity contribute to these observations.
The transmission dynamics of NiV within bat populations remains poorly understood, including whether infection results in long term immunity. In our study, we have found substantial overlap in the spatial footprint of genetic clusters, to the extent that even individual bat roosts host more than one distinct genetic cluster at any time. Pteropus bat roost size is highly dependent on the species, typically hosting around hundreds to thousands of bats at a time (13, 21–23). In order for roosts to maintain a sufficiently large susceptible population to sustain multiple independent transmission chains in populations of this size will require long durations of viral shedding, frequent reinfection or coinfection as suggested through the modeling of bat immune profiles (24). As movement between bat roosts is common, the wider population across multiple roosts may also facilitate the maintenance of multiple independent lineages. In support of a key role of roost population size in maintaining diversity, it is notable that the P. lylei roosts in Cambodia, which had the greatest diversity with the co-circulation of three NiV sublineages within each roost, have typically thousands of bats in each roost, many more than other locations (13).
The evolutionary separation between NiV lineages has previously been suggested as one possible explanation for the differences in the case fatality rate in Bangladesh (CFR of ∼70%) and Malaysia (CFR of ∼40%) (2, 25, 26). However, it remains difficult to disentangle differences in the virus from human behavior differences or differences in transmission routes. Spillover from bats into pig populations through the consumption of NiV infected fruit drove the outbreak in Malaysia whereas date palm sap consumption by humans appears key in Bangladesh (27). Viral loads, inoculation routes and sites in these two transmission modes are likely to be very different, which could affect subsequent probabilities of death. Animal models such as primate, Syrian Hamsters and ferrets have not been able to provide conclusive evidence of pathological or transmission differences between the two major clades, however, their relevance to human situation remains unclear (28–30). Here we found evidence of differences in spillover or disease risk within a lineage from a single setting. Genotype IB was found in three different years in bat roosts in an area of Bangladesh where human spillovers are frequently identified and where date palm sap consumption is common, however, no human cases were linked to this sublineage. It remains possible that underlying year- by-year variability in spillover risk, linked to temperature, may explain these findings, especially in the context when only a subset of human NiV cases are ever detected and have their viruses sequenced (8, 10, 31).
Using the correlation of genetic clusters over distance as a spatial signature of NiV, we were able to estimate the spatial footprint of individual genetic clusters and the number of circulating clusters. While these are crude estimates, they do provide a ballpark estimate of the expected number of NiV genetic clusters, which is especially useful in countries where little or no sampling has been conducted. We estimated that across the entire region there are ∼100 genetic clusters. This suggests the vast majority (∼80-90%) of genetic diversity remains undetected. In particular, India is likely to harbor significant diversity. Pteropus bats are found throughout the country and have been found to be positive in serology and PCR detection to NiV throughout (3). Increased sampling across South and Southeast Asia will help uncover additional lineages that have not yet been observed. Extending the sampling to new locations will expand the observed diversity with limited new diversity likely to be found from additional sampling in the established long-term sampling locations in Cambodia, Thailand and Bangladesh. Long-term surveillance in the same locations remains a critical source of information to characterize long-term evolutionary dynamics of NiV within roosts. Whole-genome sequencing should be prioritized to maximize the signal that can be captured through established and additional surveillance.
Obtaining NiV sequences from bats is difficult. The most common method requires the collection of bat urine under roosts using plastic or tarpaulin sheets. The urine can then be tested for evidence of virus using PCR tests. However, this approach makes it impossible to link urine to an individual bat. It is also impossible to be sure that multiple positive samples don’t all come from the same bat. Most viral isolates come from urine samples taken from individual bats, which makes them very rare. For this reason most available NiV sequences from bats are short sequences from the PCR process. Despite their frequent short nature, we were still able to consistently place sequences in different clades. In particular, in Cambodia, where most sequences were very short (<400 nucleotides in length), we were able to identify multiple clades circulating within the same roosts. As with other phylogenetic studies, this study is reliant on the available sequences, which have been collected in a spatially and temporally biased manner. However, the geographic clustering of genetic clusters is robust to these biases. There also remain uncertainties in phylogenetic estimation processes itself. We note that including a Hendra virus sequence to the phylogeny, results in major differences to the estimated clock rates and the dates of divergence (Figure S8). We preferred the model without this outgroup, as it is undesirable for a single sequence to affect the estimates to that extent and inference of time-resolved trees does not require an outgroup and the root location can be estimated under such a model from the in-group data. The classification of tips into genetic clusters necessarily relies on thresholds of evolutionary distance, however, in sensitivity analyses, changes in these thresholds resulted in minimal changes to the overall inferences on the genetic diversity of viruses circulating within any area. Finally, NiV sequences are ultimately reliant on the PCR primers used to detect the virus in the original sample. If the PCR primers are overly specific, they may systematically miss some viruses (32). Future efforts may want to consider using broader primer sets.
This project has demonstrated that even sparsely sampled genetic data, including many short sequences, from large areas can provide meaningful characterization of underlying diversity in populations when considered together. We have shown that even individual roosts typically have multiple circulating transmission chains, but with each genetic lineage covering a large spatial footprint, probably driven by bat mobility patterns. While most NiV diversity remains undetected, the surveillance sites which have been established appear to have uncovered a substantial proportion of the diversity in those locations.
Materials and Methods
Data collection and alignment
We collected all available NiV genomes in GenBank (n=301) (33) and compiled their date and place of collection, and the host species they were sampled from. We also included several previously unpublished sequences (n=26), collected between 2013 and 2016 in two bat roosts in Cambodia. The sampling and screening approach for NiV for these additional sequences is explained elsewhere (13).
Sequence alignment was performed using MUSCLE and manually corrected using MEGA-X (34). 70 sequences were excluded because of poor quality, because they were referenced on GenBank twice, or because they were different sequenced genes from one single sample (in which case they were combined into one genome). The resulting data set included 257 sequences from 6 countries (India, Bangladesh, Thailand, Cambodia, Indonesia, and Malaysia, see Figure 1A), spanning over a range of 22 years (1999–2020) (Table S1) (5, 12, 20, 29, 35–54). 175 sequences were sampled from six different bat host species: Pteropus lylei (n=120), Pteropus medius (formerly Pteropus giganteus, n=41), Pteropus vampyrus (n=6), Pteropus hypomelanus (n=6), Hipposideros larvatus (n=1), and one sequence came from a bat in the Taphozous genus, though its exact species is undetermined (Figure 1B) (54). The rest of the sequences were sampled from humans (n=73), pigs (n=7), and one sequence from a dog. 1 sequence had an uncertain host. Sequence length varies from 153 bp to 18.2k bp (full genome). In total, there are 64 full-length genomes, 185 genomes covering (at least partially) the nucleocapsid (N) gene, and the rest cover different parts of the NiV genome. Sequence lengths are detailed in Supplementary Table S2.
Preliminary Analyses
We ran a preliminary analysis using IQ-TREE (v 1.6.12) (55) in order to select the best-fitting nucleotide substitution model. The model was selected through a bootstrap with 1000 resampling events. According to the Bayesian Information Criterion (BIC), the TIM2 model was selected.
We performed Bayesian Evaluation of Temporal Signal (BETS) (56, 57) to determine if the temporal resolution of our NiV alignment was sufficient to calibrate a molecular clock. We compared two models: one in which the sequences were considered heterochronous, using their real sampling dates, and another one in which the sequences were considered isochronous, with a fixed evolutionary rate. The marginal log-likelihoods for both models were estimated using stepping- stone sampling (56), and they were compared to obtain a Bayes factor.
Phylogenetic reconstruction
We reconstructed a time-resolved phylogeny using BEAST v.1.10.4 (58), which implements MCMC to analyze time-explicit genetic sequences. We used a subset of 172 sequences, comprising only unique sequences per year and country. For each sequence, we specified the sampling date in the decimal year format, using the most precise information available (in GenBank or the corresponding publication). For sequences that we only knew the year of sampling (n=29), we added a one-year uncertainty around the specified time. We implemented three parallel runs of six consecutive chains of 200M states each. We used the generalized time reversible substitution model (GTR), as it was the only evolutionary model proposed by BEAST that offered at least as many degrees of freedom as the TIM2 model. We used a lognormal uncorrelated relaxed molecular clock, and the Bayesian Skygrid (59) coalescent tree prior. The chains were sampled every 20,000 states. For each of the three parallel runs, the six consecutive chains were combined into one and resampled with a frequency of 320,000 states using the LogCombiner software (60) from the BEAST suite.
We visualized the posterior distribution using Tracer (61) and obtained satisfactory ESS values (>200) for all estimated parameters. Based on the tree files combined in the same way as described in the previous paragraph, we summarized a maximum clade credibility (MCC) tree using TreeAnnotator (62). We visualized the MCC tree using the FigTree software (63) and the ggtree package in R v.4.2.0 (64–68).
Initial phylogenetic classification
Based on the reconstructed phylogeny’s structure, and on previous phylogenetic analyses of NiV genomes (15), we classified the NiV genomes into two major lineages (lineages I and II, corresponding to the lineages that have been previously described as the Bangladeshi and the Malaysian lineages, respectively), and four sublineages (I.A, I.B, II.A, II.B, see Figure 1C). Using this broad genetic classification, we analyzed NiV’s spatial structure across South and Southeast Asia by looking at the sublineage distribution per country, organizing countries according to their longitude (Figure 1D).
Distribution of human cases in Bangladesh
The distribution of human sequences in the reconstructed phylogeny is of interest to estimate if there are sublineages that are more prone to spillover events. We investigated the distribution of human sequences coming from Bangladesh and compared it to the distribution of bat sequences (Figure S2), because this is the only country where human behavior results in regularly documented spillovers (2). For sequences coming from Bangladesh, we considered only one human sequence per case cluster, and only one bat sequence per district-year. Sequences sampled from humans collected in 2004 were considered as part of the same cluster if they were sampled in the same district and the same year, otherwise they were considered from different case clusters. For human sequences sampled in 2008 and 2010, outbreak information was extracted from (5). For human sequences sampled in 2011, information about case clusters was extracted from (69). For human sequences sampled from 2012 onwards and included in (12) (n=19), the publication indicates if the human case was primary or not, if it was clustered with other cases and, in most cases, which other sequences belonged to the same case cluster. Two sequences from 2014 that were marked as clustered, with no specification about whether they belonged to the same cluster or not, and that were sampled in the same subdistrict on the same day, were considered as part of the same case cluster and only one of them was randomly selected and included at this stage of the analysis. The rest of human sequences sampled after 2012 either did not have any precise location information, in which case they were excluded from this part of the analysis (n=3), or they were published in (20) (n=14), and they were treated in the same way as 2004 sequences.
This resulted in a dataset of 38 sequences, with 30 human sequences, all of which came from sublineage I.A, and 8 bat sequences, out of which 5 (62.5%) came from sublineage I.A and 3 (37.5%) came from sublineage I.B. The purpose of this data subsetting is to avoid multiple sequences that are identical or too closely related from one single outbreak or roost visit. Due to the small size of this subset, we implemented Fisher’s exact test to analyze the significance of the contingency between human cases and each of the aforementioned sublineages.
Spatial spread
In order to analyze the speed at which NiV has spread across South and Southeast Asia, we computed the mean spatial pairwise distance as a function of the pairwise evolutionary distance for NiV sequences. We used a subset in which, for sequences sampled from bats, we only included one sequence per roost, per year (n=44). For sequences sampled from human infections, we only included one sequence per outbreak (n=34). This resulted in a subset of 78 sequences. For each sequence, we used the coordinates of the most precise sampling location available. For bat sequences, this was either the bat roost location (n=32), or the centroid of the district where the sequences were collected (n=12). We computed the geodesic pairwise spatial distances using the geodist package in R (70). We computed the pairwise evolutionary distance for all pairs of sequences using the cophenetic.phylo function from the ape package on the reconstructed phylogeny (71). We then computed the mean pairwise spatial distance for pairs of sequences separated by different cumulative evolutionary distances (Figure 2), implementing a non-parametric bootstrap with 1000 resampling events.
Phylogenetic clustering using a finer genetic resolution
We used the PhyCLIP module in Python to cluster the sequences in the tree into different lineages (Figure 3D) (16). PhyCLIP aims to do phylogenetic clustering based on statistically informed rather than user-defined divergence thresholds, with a linear integer optimization approach. The module takes a phylogeny as input, along with three parameters:
- 𝑆, the minimum number of sequences that can constitute a cluster,
- 𝛾, the multiple of deviations from the grand median of the mean pairwise distance that defines the within-cluster divergence limit, and
- 𝐹𝐷𝑅, a false discovery rate to ensure that every pair of clusters are significantly different from one another.
We ran PhyCLIP using a combination of different values for each of these parameters (S ∈ {1,2,3}, 𝛾 ∈ {1, …, 6}, 𝐹𝐷𝑅 ∈ {0.1, 0.15, 0.2, 0.25, 0.3}), using the module’s optimisation criteria to find an optimal combination of parameter values. With the aim of doing a comprehensive clustering of all available NiV sequences, we chose PhyCLIP’s intermediate optimization resolution (as opposed to high resolution), which optimizes the values of the parameters based on the maximum percentage of sequences assigned to a cluster, the minimum grand mean of within-cluster divergence, and the maximum mean inter-cluster divergence. Then, for sets of parameters that resulted in more than 98% sequences belonging to a cluster (Supplementary Table S3), we manually corrected PhyCLIP’s output to assign a cluster to all outliers, and to ensure that all clusters were monophyletic. We discarded options with fewer than 10 clusters, as these did not increase the resolution in comparison to the four sublineages described previously. Finally, we compared the resulting options for the classification of the NiV phylogeny, by computing the proportion of pairs of sequences belonging to the same cluster in function of their spatial distance (Supplementary Figure S5). Since there was little variation between these different options, we selected the classification into 15 clusters, as this corresponded to the highest genetic resolution (Figure 3D).
NiV cluster distribution
We analyzed the spatial and host species structure for sequences sampled from bats (n=175) using this finer phylogenetic classification (Figures 3A-C). Duplicate sequences that were not included in the BEAST or the PhyCLIP analyses were assigned the same cluster as the sequence that they were identical to from their country and year. For each sequence cluster, we analyzed its spatial distribution at a country scale (Figure 3A), and its distribution across all bat host species (Figure 3B). We then analyzed the relationship between these by looking, for each country, at the bat species distribution in our dataset (Figure 3C).
Models of cluster correlation
We developed a set of models to investigate the contribution of different covariates to the probability that a pair of sequences belong to the same genetic cluster. For pairs of sequences, we analyzed the effects of coming from the same location, of distance, and of coming from the same host species or not. Due to high correlation between the spatial and the host species covariates, we chose to separate them into two distinct models. We show the relationship between these two covariates by fitting a logistic model of the probability that a pair of sequences comes from the same host species to their spatial distance (Figure S3).
Spatial model
For all pairs of bat sequences with roost (n=132) or district information (n=36), we considered:
- Whether the sequences came from the same location or not (𝑙),
- For sequences coming from different locations, we also considered their spatial distance (𝑑),
As shown in Equation 1: We then used these parameters to fit a logistic regression to estimate the probability P that a pair of sequences comes from the same genetic cluster (Equation 2, see Figure 3E):
Host species model
Separately, for all pairs of bat sequences, we fit a logistic model of the probability that a pair of sequences comes from the same genetic cluster depending on whether the sequences came from the same host species or not (Figure 3F).
Assessment and prediction of diversity
We investigated the way in which the observation of different genetic clusters evolved with sampling in specific areas of South and Southeast Asia. We focused on divisions with at least 10 sequences, and in which at least 2 clusters have been observed: Dhaka, Rajshahi, and Rangpur divisions in Bangladesh, Kandal in Cambodia, and Eastern and Central Thailand. We used the iNEXT package (72, 73), with which we implemented a rarefaction analysis using Hill numbers of order 0 (equivalent to Species Diversity) to interpolate how many clusters had been observed in each division since sampling began until the current number of observations in each place, and to extrapolate how many more we could expect to observe if more NiV sequences were sampled in these regions. Average estimates can be seen on Figure 4C, and estimates with 95% Confidence Intervals can be seen on Supplementary Figure S6.
We investigated the relationship between the estimated number of clusters in these six regions and several ecological variables, with the aim of exploring if any of these covariates could be used to provide an estimation of the number of NiV genetic clusters in regions where sampling was not sufficient to conduct a rarefaction analysis. We looked at the mean pairwise geographic distance between sequences, at the human population density for each region (74–76), and at the percentage of tree coverage for each region (Supplementary Table S4, Supplementary Figure S7). For the tree coverage estimation, we averaged the estimates of coverage over a threshold of 30% for 2000 and 2010 from the Global Forest Watch (18, 19). We explored these measures as potential proxies for bat mobility, for human pressure on bat roosts, and for bat population size in each of these provinces, respectively. We plotted the estimated number of clusters for each of the six regions as a function of these parameters and fit a generalized Poisson regression for each of them.
Prediction of NiV diversity using Approximate Bayesian Computation
Approximate Bayesian Computation (ABC) is a method used when no explicit likelihood function can be written for a model, or when such a function would be too costly in terms of time or computational resources (77). Considering the number of underlying unknowns in terms of bat mobility and of interactions between different bat roosts and bat species, we adopted this approach to further investigate ways in which the spatial footprint of NiV genetic clusters can be captured with the aim of providing a rough estimate of how many different genetic clusters can be found in a specific area.
We computed a set of summary statistics on the NiV data set. For a specific spatial window 𝑤𝑖, we compute the median proportion 𝑖 of sequence pairs, within that window, that belong to the same cluster: We developed a simulation framework for the spatial footprint of NiV genetic clusters in a specific area. The input for a simulation was the cluster density 𝑐𝑙𝑢𝑠𝑡𝑑𝑒𝑛𝑠(in clusters·km-2), and the standard deviation 𝑠𝑑𝑐𝑙𝑢𝑠𝑡, a measure of the spread of each genetic cluster. In each simulation:
- We generated a polygon (square) as our area of study, of width 𝑤𝑖𝑑𝑡ℎ𝑜𝑢𝑡𝑒𝑟.
- Using the size of the polygon and the cluster density 𝑐𝑙𝑢𝑠𝑡𝑑𝑒𝑛𝑠, we computed the number of clusters to simulate.
- We randomly sampled the x and y coordinates of each cluster “center” within the initial polygon using a uniform distribution (Figure S9A).
- For each cluster, we randomly sampled 𝑛𝑜𝑏𝑠 points. The distance from each point to its corresponding cluster center was sampled using a truncated normal distribution of mean 0 and of standard deviation 𝑠𝑑𝑐𝑙𝑢𝑠𝑡, with a truncation at two times the standard deviation, allowing for 95% of the normal distribution to be sampled. The angle was sampled using a uniform distribution between 0 and 2𝜋 (Figure S9B).
- Then, we subset these simulated points to keep only those points within a sampling square of width 𝑤𝑖𝑑𝑡ℎ𝑖𝑛𝑛𝑒𝑟 and centered inside the initial square (Figure S9C).
- Finally, we computed the summary statistics 𝜏𝑠𝑖𝑚,𝑖 for each spatial window in the same way as with the NiV dataset.
We implemented this framework using Python 3.8 (78). We optimized the computation of the summary statistics for each simulation by partitioning the sampling square into a grid. Thus, with a grid width greater than the maximum distance for which 𝜏𝑠𝑖𝑚,𝑖 is computed, points within a specific cell only needed to be compared to points within their own or neighboring cells (see Figure S9D). Distance computation was performed using a Nearest Neighbors algorithm from the scikit-learn library v.1.2.1 (79). Parameter prior distributions and hyperparameter values are detailed in Table S5.
We computed the residual sum of squares for each of the 𝑛𝑠𝑖𝑚simulations, weighting each spatial window by the number of pairs within it in our actual NiV dataset. We then kept the 2% closest simulations. Model fit can be seen in Figure 4A.
We used this final set of simulations to compute the average number of unique observed clusters per area. For each simulation, and for each specific area, we randomly placed 1000 circles of the corresponding area in the simulated space, we counted the number of clusters that were contained within or that intersected each circle, and we computed the average of observed clusters per area across all simulations (Figure 4B).
Finally, we used this way of estimating diversity in function of area to compute rough estimates of the number of clusters that can be observed in the countries represented in our dataset. We explored different options for the area of circulation of Pteropus bats. We downloaded the geographic range estimates from the International Union for Conservation of Nature’s Red List of Threatened Species (IUCN) (17), and we computed the coverage in km2 of Pteropus bats per country, for all countries with Pteropus bat circulation. We also used an estimation of tree cover area for each country, since this represents the ecosystem where Pteropus bats live (8). For each country, we computed the mean tree cover in km2 over a threshold of 10%, 30%, and 50% for 2000 and 2010 from the Global Forest Watch (Table S6, Figure S9) (18, 19).
Ethics statement
This project was conducted using publicly available sequence data with no identifiable information and therefore did not require ethical approval.
Acknowledgments and Funding
This work was performed using resources provided by the Cambridge Service for Data Driven Discovery (CSD3) operated by the University of Cambridge Research Computing Service (www.csd3.cam.ac.uk), provided by Dell EMC and Intel using Tier-2 funding from the Engineering and Physical Sciences Research Council (capital grant EP/T022159/1), and DiRAC funding from the Science and Technology Facilities Council (www.dirac.ac.uk). We would also like to acknowledge funding from the European Research Council (804744), the National Institutes of Health (U01AI168287), the European Commission Innovate program (ComAcross project, grant no. DCI-ASIE/2013/315–047), and the DARPA PREEMPT program Cooperative Agreement (D18AC00031). The views, opinions and/or findings expressed are those of the authors and should not be interpreted as representing the official views or policies of the Department of Defense or the U.S. Government.
Data Availability
Most data referred to in the manuscript are available online on GenBank (accession numbers are provided in Supplementary Table S1). This study includes some previously unpublished sequences, which will be made available on GenBank upon acceptance of the manuscript.
https://www.ncbi.nlm.nih.gov/nuccore/?term=nipah+henipavirus
Supplementary Information
Footnotes
↵† Joint senior authors