Abstract
During the 2015-2017 Zika epidemic, dengue and chikungunya – two other viral diseases with the same vector as Zika – were also in circulation. Clinical presentation of these diseases can vary from person to person in terms of symptoms and severity, making it difficult to differentially diagnose them. Under these circumstances, it is possible that numerous cases of Zika could have been misdiagnosed as dengue or chikungunya, or vice versa. Given the importance of surveillance data for informing epidemiological analyses, our aim was to quantify the potential extent of misdiagnosis during this epidemic. Using basic principles of probability and empirical estimates of diagnostic sensitivity and specificity, we generated revised estimates of Zika incidence that accounted for the accuracy of diagnoses made on the basis of clinical presentation with or without molecular confirmation. Applying this method to weekly incidence data from 43 countries throughout Latin America and the Caribbean, we estimated that 1,062,821 (95% CrI: 1,014,428-1,104,794) Zika cases occurred during this epidemic, which is 56.4% (95% CrI: 49.3-62.6%) more than the 679,414 cases diagnosed as Zika. Our results imply that misdiagnosis was more common in countries with proportionally higher incidence of dengue and chikungunya, such as Brazil.
Consistent and correct diagnosis is important for the veracity of clinical data used in epidemiological analyses (1–3). Diagnostic accuracy can, however, depend strongly on the uniqueness of a disease’s symptomatology. On the one hand, diagnosis can be straightforward when there are clearly differentiable symptoms, such as the hallmark rash of varicella (4). On the other hand, with symptoms that are common to many diseases, such as malaise, fever, and fatigue, it can be more difficult to ascertain a disease’s etiology (5–7). Further complicating clinical diagnosis is person-to-person variability in apparent symptoms and their severity (8,9). In many cases, symptoms are self-assessed by the patient and communicated verbally to the clinician, introducing subjectivity and resulting in inconsistencies across different patients and clinicians (10,11).
When they are used, molecular diagnostics are thought to greatly enhance the accuracy of a diagnosis, as they involve less subjectivity and can confirm that a given pathogen is present (12). Even so, molecular diagnostics do have limitations, particularly for epidemiological surveillance. As molecular diagnostics are often not the standard protocol, an infected person first has to present with symptoms in a medical setting and the clinician has to decide to use a molecular diagnostic. This is particularly unlikely to happen for emerging infectious diseases, as clinicians may not be aware of the pathogen or that it is in circulation (13). In this context, molecular diagnostics may also suffer from low sensitivity and specificity, high cost, or unavailability in settings with limited resources (12,14). As a consequence of factors such as these, retrospective analyses of the 2003 SARS outbreak in China identified SARS cases that were clinically misdiagnosed as atypical pneumonia or influenza up to four months before the first laboratory-diagnosed cases of SARS (15).
Challenges associated with disease diagnosis are magnified in scenarios with co-circulating pathogens, particularly when the diseases that those pathogens cause are associated with similar symptoms (16,17). Influenza and other respiratory pathogens, such as Streptococcus pneumoniae and respiratory syncytial virus (RSV), co-circulate during winter months in the Northern Hemisphere. The difficulty of correctly ascribing an etiology in this setting is so widely accepted that clinical cases caused by a variety of pathogens are often collated for surveillance purposes as “influenza-like illness” (18). Similar issues occur in malaria-endemic regions (16,19). One study in India found that only 5.7% of commonly diagnosed “malaria-infected” individuals actually had this etiology, while 25% had dengue instead (16).
One set of pathogens with potential for misdiagnosis during co-circulation includes three viruses transmitted by Aedes aegypti and Ae. albopictus mosquitoes: dengue virus (DENV), chikungunya virus (CHIKV), and Zika virus (ZIKV). Some symptoms of the diseases they cause can facilitate differential diagnosis, such as joint swelling and muscle pain with CHIKV infection (20,21) and a unique rash with ZIKV infection (22,23). Other symptoms, such as malaise and fever, could result from infection with any of these viruses (20–25). In one region of Brazil with co-circulating DENV, CHIKV, and ZIKV, Braga et al. (25) empirically estimated the accuracy of several clinical case definitions of Zika by ground truthing clinical diagnoses against molecular diagnoses. They found that misdiagnosis based on clinical symptoms was common, with sensitivities (true-positive rate) and specificities (true-negative rate) as low as 0.286 and 0.014, respectively.
Although the estimates by Braga et al. (25) provide valuable information about misdiagnosis at the level of an individual patient, they do not address how these individual-level errors might have affected higher-level descriptions of Zika’s epidemiology during its 2015-2017 epidemic in the Americas. The Pan American Health Organization (PAHO) reported 169,444 confirmed and 509,970 suspected cases of Zika across 43 countries between September, 2015 and July, 2017 (26). Meanwhile, PAHO reported 675,476 and 2,339,149 confirmed and suspected cases of dengue and 180,825 and 499,479 confirmed and suspected cases of chikungunya, respectively, during the same timeframe in the same region (27,28). The substantial errors in clinical diagnosis reported by Braga et al. (25), combined with the large number of cases lacking a molecular diagnosis (26–28), leave open the possibility that a considerable number of cases could have been misdiagnosed during the 2015-2017 Zika epidemic.
Our goal was to quantify the possible extent of misdiagnosis during the 2015-2017 Zika epidemic by leveraging the full extent of passive surveillance data for dengue, chikungunya, and Zika from 43 countries in the Americas in conjunction with empirical estimates of sensitivity and specificity. Our methodology was flexible enough to use either or both of suspected and confirmed cases, given that their availability varied and they both offered information about the incidence of these diseases. To account for variability in diagnostic accuracy, we made use of joint probability distributions of sensitivity and specificity, one for clinical diagnostics and one for molecular diagnostics, informed by empirical estimates. Using this approach, we updated estimates of Zika incidence during its 2015-2017 epidemic across the Americas.
Methods
To quantify the degree of misdiagnosis during the Zika epidemic, we leveraged the full extent of passive surveillance data on Zika, dengue, and chikungunya for 43 countries in the Americas and formulated a Bayesian model of the passive surveillance observational process. Our observation model was informed by the observed proportion of Zika and empirically estimated misdiagnosis rates (Fig. 1). We used the model to generate revised estimates of the number of Zika cases that occurred during the 2015-2017 Zika epidemic across the Americas (Fig. 1).
Data
We used suspected and confirmed case data for dengue, chikungunya, and Zika from PAHO for 43 countries in the Americas. We differentiated between confirmed and suspected cases on the basis of laboratory diagnosis versus clinical diagnosis (29). For confirmed and suspected cases of chikungunya, we used manual extraction and text parsing algorithms in Perl to automatically extract data from epidemiological week (EW) 42 of 2013 through EW 51 of 2017 (28). For confirmed and suspected cases of Zika, we used the skimage (30) and numpy (31) packages in Python 3.6 to automatically extract incidence data from epidemic curves for each country from PAHO from EW 39 of 2015 to EW 32 of 2017 (26). For confirmed and suspected cases of dengue, we downloaded weekly case data available from PAHO from EW 42 of 2013 to EW 51 of 2017 (27). We restricted analyses to EW 42 of 2015 (the beginning of the fourth quarter of 2015) to EW 32 of 2017 (the last week with Zika data in our dataset) to focus our analysis on the timeframe of the Zika epidemic (Fig. 2).
Probabilistic estimates of sensitivity and specificity
Due to variability in the sensitivity and specificity of different diagnostic criteria, we treated se and sp as jointly distributed random variables informed by empirical estimates. To describe variability in misdiagnosis for molecular diagnostic criteria, we included two empirical estimates of molecular sensitivity and specificity that were found using ZIKV RT-PCR on a panel of samples with known RNA status for ZIKV, DENV, CHIKV, or yellow fever virus (32). To describe variability in misdiagnosis for clinical diagnostic criteria, we included six empirical estimates of sensitivity and specificity that were measured in a region of Brazil with co-circulating ZIKV, DENV, and CHIKV (25). These empirical estimates of sensitivity and specificity were derived by clinically diagnosing a patient with Zika, dengue, or chikungunya based on different clinical case definitions, and then ground truthing against the case’s etiology determined by RT-PCR (25). We used the sample mean, μ, and sample variance-covariance matrix, Σ, for the molecular and clinical misdiagnosis rates as the mean and covariance in two independent, multivariate normal distributions, such that
$$(se, sp) \sim \mathrm{MVN}(\mu, \Sigma) \tag{1}$$
for each of the molecular and clinical diagnostic distributions.
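As a minimal sketch of this step, the Python snippet below computes a sample mean and variance-covariance matrix from a set of (sensitivity, specificity) pairs and draws from the resulting bivariate normal distribution. The six pairs are hypothetical placeholders, not the empirical estimates from refs. 25 or 32, and discarding draws that fall outside [0, 1] is an assumption of this sketch rather than a documented part of the analysis.

```python
import numpy as np

# Hypothetical (sensitivity, specificity) pairs used only for illustration;
# they are NOT the empirical estimates reported in refs. 25 or 32.
estimates = np.array([
    [0.30, 0.98],
    [0.45, 0.90],
    [0.60, 0.80],
    [0.75, 0.65],
    [0.85, 0.45],
    [0.95, 0.20],
])

mu = estimates.mean(axis=0)              # sample mean vector (se, sp)
sigma = np.cov(estimates, rowvar=False)  # 2x2 sample variance-covariance matrix

rng = np.random.default_rng(seed=1)
draws = rng.multivariate_normal(mu, sigma, size=1000)

# A normal distribution has unbounded support, so some draws can fall outside
# [0, 1]. Keeping only valid probabilities is an assumption of this sketch.
valid = draws[np.all((draws >= 0.0) & (draws <= 1.0), axis=1)]
se, sp = valid[:, 0], valid[:, 1]
```

With placeholder pairs like these, the off-diagonal element of `sigma` is negative, which encodes a trade-off between sensitivity and specificity across case definitions.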
Probabilistic estimates of the proportion of Zika
We used the proportion of cases that were diagnosed as confirmed or suspected Zika, $\hat{p}_{Z,c}$ and $\hat{p}_{Z,s}$, where c and s refer to confirmed and suspected cases and the hat notation refers to observed data. Rather than using the point estimate for $\hat{p}_{Z,c}$ or $\hat{p}_{Z,s}$, we made use of Bayesian posterior estimates of these variables obtained directly from reported Zika cases, ĈZ,c and ĈZ,s, and reported dengue and chikungunya cases, ĈO,c and ĈO,s, using the beta-binomial conjugate relationship (33). This assumed that the number of Zika cases was a binomial draw from the total number of cases of these three diseases, with a beta-distributed probability of success, $\hat{p}_{Z,c}$ or $\hat{p}_{Z,s}$. We assumed uninformative priors on $\hat{p}_{Z,c}$ and $\hat{p}_{Z,s}$; i.e., $\hat{p}_{Z,c} \sim \mathrm{Beta}(1, 1)$ and $\hat{p}_{Z,s} \sim \mathrm{Beta}(1, 1)$. Therefore, 1 + ĈZ,c and 1 + ĈZ,s were the alpha parameters of the two beta distributions and 1 + ĈO,c and 1 + ĈO,s were the beta parameters of the two beta distributions. For confirmed cases, this resulted in a posterior estimate of
$$\hat{p}_{Z,c} \sim \mathrm{Beta}(1 + \hat{C}_{Z,c},\ 1 + \hat{C}_{O,c}) \tag{2}$$
and for suspected cases,
$$\hat{p}_{Z,s} \sim \mathrm{Beta}(1 + \hat{C}_{Z,s},\ 1 + \hat{C}_{O,s}). \tag{3}$$
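The conjugate update described here can be sketched in a few lines of Python. The counts are the region-wide cumulative PAHO totals quoted earlier in the text (refs. 26–28); applying the update to these cumulative totals, rather than to weekly country-level counts, is a simplification made for this illustration.

```python
import numpy as np

# Region-wide cumulative PAHO totals quoted earlier in the text.
zika = {"confirmed": 169_444, "suspected": 509_970}
other = {  # dengue + chikungunya
    "confirmed": 675_476 + 180_825,
    "suspected": 2_339_149 + 499_479,
}

rng = np.random.default_rng(seed=1)
posterior = {}
for kind in ("confirmed", "suspected"):
    alpha = 1 + zika[kind]   # Beta(1, 1) prior plus reported Zika cases
    beta = 1 + other[kind]   # Beta(1, 1) prior plus dengue and chikungunya cases
    posterior[kind] = rng.beta(alpha, beta, size=1000)
    print(kind, "posterior mean:", round(alpha / (alpha + beta), 4))
```

Because the counts are large, the posterior is tightly concentrated around the observed proportion; with the much smaller weekly counts used in the country-specific analysis, the same update would yield wider posteriors.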
Observation model of misdiagnosis
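To calculate the updated proportion of Zika among the total of Zika, dengue, and chikungunya cases, pZ, we mathematically related $\hat{p}_Z$ to pZ using diagnostic sensitivity and specificity, such that
$$\hat{p}_Z = p_Z \, se + (1 - p_Z)(1 - sp). \tag{4}$$
We then rearranged Eq. 4 to solve for pZ,
$$p_Z = \frac{\hat{p}_Z + sp - 1}{se + sp - 1}. \tag{5}$$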
From Eq. 5, we determined two constraints for how se, sp, and $\hat{p}_Z$ can relate to one another. The first was $1 - sp \le \hat{p}_Z \le se$, which follows from 0 ≤ pZ ≤ 1, or $0 \le \frac{\hat{p}_Z + sp - 1}{se + sp - 1} \le 1$, and then simplifying the inequality. The second was se + sp ≠ 1, as this would lead to zero in the denominator of Eq. 5. Eqs. 4, 5, and subsequent constraints were applied independently to confirmed and suspected cases.
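The relationship between the observed and revised proportions can be sketched as follows, assuming the standard misclassification identity in which the observed proportion equals true positives plus false positives, i.e., se·pZ + (1 − sp)·(1 − pZ). The function names are hypothetical, and returning NaN when a constraint is violated is a convention chosen for this sketch.

```python
import math

def observed_proportion(p_z, se, sp):
    """Proportion diagnosed as Zika given the true proportion p_z."""
    return se * p_z + (1.0 - sp) * (1.0 - p_z)

def revised_proportion(p_hat, se, sp):
    """Invert observed_proportion; return NaN when a constraint is violated."""
    denom = se + sp - 1.0
    if math.isclose(denom, 0.0, abs_tol=1e-12):
        return math.nan  # se + sp = 1: the denominator is zero
    p_z = (p_hat + sp - 1.0) / denom
    if not 0.0 <= p_z <= 1.0:
        return math.nan  # the implied proportion is not a valid probability
    return p_z

# Round trip: a true proportion of 0.30 is recovered from its observed value.
p_hat = observed_proportion(0.30, se=0.6, sp=0.9)
print(p_hat, revised_proportion(p_hat, se=0.6, sp=0.9))
```

Note that simplifying 0 ≤ pZ ≤ 1 to 1 − sp ≤ observed proportion ≤ se implicitly takes se + sp > 1; when se + sp < 1 the inequalities reverse, which corresponds to a diagnostic that performs worse than chance.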
We used samples of pZ,c and pZ,s estimated from Eq. 5 to define a single distribution of pZ. As estimates of pZ,c and pZ,s were between 0 and 1, we approximated beta distributions for each using the fitdistr function in the MASS package in R (34) fitted to posterior samples of pZ,c and pZ,s. We then reconciled differences in estimates of pZ,c and pZ,s with an estimate of pZ, defined as
We then applied pZ to ĈZ,c + ĈZ,s + ĈO,c + ĈO,s to obtain revised Zika cases, CZ, and dengue and chikungunya cases, CO.
Applying the observation model
To apply our observation model to empirical data, we first drew 1,000 samples from the beta distributions of $\hat{p}_{Z,c}$ and $\hat{p}_{Z,s}$ and 1,000 samples from the multivariate normal distributions describing sensitivities and specificities of molecular and clinical diagnostics. We applied our observation model to four distinct scenarios with different spatial and temporal aggregations to assess the sensitivity of our results to different ways of aggregating incidence data: country-specific, temporal data (4,214 data points); country-specific, cumulative data (43 data points); region-wide, temporal data (98 data points); and region-wide, cumulative data (1 data point). Under each of these scenarios, we quantified posterior distributions of pZ, drew 1,000 Monte Carlo samples of pZ, and obtained distributions of CZ and CO.
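A simplified version of this Monte Carlo procedure, restricted to suspected cases at the region-wide, cumulative level, might look like the sketch below. The mean vector and covariance matrix for clinical sensitivity and specificity are placeholders rather than values fitted to ref. 25, the step that reconciles confirmed and suspected estimates is omitted, and converting pZ to case counts by simple multiplication is an assumption of this sketch.

```python
import numpy as np

rng = np.random.default_rng(seed=1)
n = 1000

# Suspected cases: region-wide cumulative PAHO totals quoted in the text.
zika_s = 509_970
other_s = 2_339_149 + 499_479  # dengue + chikungunya
p_hat = rng.beta(1 + zika_s, 1 + other_s, size=n)

# Placeholder mean and covariance for clinical (se, sp); NOT fitted to ref. 25.
mu = np.array([0.6, 0.8])
sigma = np.array([[0.04, -0.03], [-0.03, 0.04]])
se, sp = rng.multivariate_normal(mu, sigma, size=n).T

# Keep draws that are valid probabilities and satisfy both constraints.
keep = (
    (se >= 0.0) & (se <= 1.0) & (sp >= 0.0) & (sp <= 1.0)
    & (1.0 - sp <= p_hat) & (p_hat <= se)  # implies 0 <= p_Z <= 1
    & (se + sp > 1.0)                      # excludes se + sp = 1
)
p_z = (p_hat[keep] + sp[keep] - 1.0) / (se[keep] + sp[keep] - 1.0)

total = zika_s + other_s
c_z = p_z * total          # revised Zika cases
c_o = (1.0 - p_z) * total  # revised dengue and chikungunya cases
lo, mid, hi = np.percentile(c_z, [2.5, 50.0, 97.5])
```

Only a fraction of the joint draws survive the constraints, so the retained sample reflects which sensitivity and specificity values are compatible with the observed proportion.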
Results
Illustrative example
We constructed a simple example with two generic diseases, A and B, to illustrate the relationship between reported cases and revised cases under different misdiagnosis scenarios. For these generic diseases, we varied the total cases of A and B such that the proportion of cases diagnosed as A, $\hat{p}_A$, varied from high to low. We used combinations of sensitivity and specificity that spanned all combinations of low, intermediate, and high misdiagnosis scenarios. Using the same methods applied to reallocate Zika, dengue, and chikungunya cases, we revised estimates of incidence of disease A in light of misdiagnosis with disease B.
Incidence of disease A was not revised when sensitivity and specificity were both low (Fig. 3, bottom left), which was due in some cases to the constraint of $1 - sp \le \hat{p}_A \le se$ not being met and in other cases to the sum of sensitivity and specificity equaling 1. When $\hat{p}_A$ was high (Fig. 3, pink lines), revised incidence of A was similar to observed incidence of A, as only high sensitivities were possible across a range of specificities (Fig. 3, top row). With high sensitivities (Fig. 3, top row), misdiagnosis only occurred with B misdiagnosed as A. When $\hat{p}_A$ was low (Fig. 3, purple lines), revised incidence of A was higher than observed incidence, as only high specificities were possible across a range of sensitivities (Fig. 3, right column). With high specificities (Fig. 3, right column), misdiagnosis only occurred with A misdiagnosed as B. When $\hat{p}_A$ was intermediate (Fig. 3, green lines), misdiagnosis occurred both ways, as a range of sensitivity and specificity values were possible. This resulted in scenarios in which incidence of A was higher or lower than the observed incidence.
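The qualitative patterns described for diseases A and B can be reproduced with a short sketch. The sensitivity, specificity, and observed-proportion values below are arbitrary illustrative choices, not those underlying Fig. 3, and the function name is hypothetical.

```python
def revise(p_hat, se, sp):
    """Return the revised proportion of A, or None if a constraint fails."""
    if se + sp <= 1.0 or not (1.0 - sp <= p_hat <= se):
        return None
    return (p_hat + sp - 1.0) / (se + sp - 1.0)

# (label, observed proportion of A, sensitivity, specificity)
scenarios = [
    ("high observed proportion, high se", 0.9, 0.99, 0.70),
    ("low observed proportion, high sp", 0.1, 0.70, 0.99),
    ("intermediate, se = sp", 0.5, 0.80, 0.80),
    ("both low, se + sp = 1", 0.5, 0.50, 0.50),
    ("observed proportion below 1 - sp", 0.1, 0.95, 0.60),
]
for label, p_hat, se, sp in scenarios:
    p_a = revise(p_hat, se, sp)
    if p_a is None:
        print(f"{label}: not revised (constraint violated)")
    else:
        print(f"{label}: observed {p_hat:.2f}, revised {p_a:.2f}")
```

In this sketch, the high-sensitivity scenario revises the proportion of A downward (B misdiagnosed as A), whereas the high-specificity scenario revises it upward (A misdiagnosed as B), in line with the patterns described above.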
Misdiagnosis through time
We estimated the revised proportion of Zika cases among all cases of Zika, dengue, and chikungunya at each time point for each country. As $\hat{p}_Z$ was low early in the epidemic, we estimated that there were 113,459 (95% CrI: 74,595-148,575) disease episodes caused by ZIKV that were misdiagnosed as dengue or chikungunya cases in the fourth quarter of 2015, prior to the start of reporting of Zika in most countries (Fig. 4). Then, as Zika incidence increased and peaked in 2016, the intensity of misdiagnosis increased (Fig. 4), but the direction of misdiagnosis differed by country, depending on how much $\hat{p}_Z$ increased. Drastic jumps and dips in the number of Zika cases misdiagnosed as dengue or chikungunya (Fig. 4) were a consequence of chikungunya cases not being reported on a continuous basis (Fig. 2D).
Revising cumulative estimates of the epidemic
We aggregated revised Zika incidence to estimate the cumulative size of the epidemic and to compare our estimate to that based on surveillance reports. In countries and territories with relatively high Zika incidence ($\hat{p}_Z$ close to 1), such as Suriname, Martinique, and the U.S. Virgin Islands, our revised estimates of pZ closely matched $\hat{p}_Z$ (Fig. 5, bottom). In countries with relatively low Zika incidence ($\hat{p}_Z$ close to 0), such as Brazil and Bolivia, our revised estimates of pZ were higher than $\hat{p}_Z$ (Fig. 5, bottom). In those countries that reported no Zika incidence (i.e., $\hat{p}_Z = 0$), such as Bermuda and Chile, our estimates of pZ were much more uncertain (Fig. 5, bottom).
According to the PAHO reports that we used, the Zika epidemic totaled 679,414 confirmed and suspected cases throughout 43 countries in the Americas. When we accounted for misdiagnosis among Zika, dengue, and chikungunya, we estimated that the Zika epidemic totaled 1,062,821 (95% CrI: 1,014,428-1,104,794) cases across the Americas.
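The comparison between reported and revised totals is simple arithmetic, sketched below using only figures quoted in the text. Treating the revised total as a single central estimate ignores the uncertainty expressed by the credible interval.

```python
# Reported Zika cases from PAHO: confirmed + suspected.
reported = 169_444 + 509_970

# Central revised estimate quoted in the text.
revised = 1_062_821

additional = revised - reported                   # cases attributed to misdiagnosis
percent_increase = 100.0 * additional / reported  # relative to reported cases

print(reported, additional, round(percent_increase, 1))
```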
Estimates of epidemic size using different aggregations of data
We applied our observation model to different temporal and spatial aggregations of the PAHO data, wherein we used temporal incidence data for the region as a whole (Fig. S4), cumulative incidence data for each country (Fig. S5), or cumulative incidence data for the region as a whole (Table 2). When using temporal incidence data for the region as a whole, our estimate of the overall size of the Zika epidemic was 1,073,593 (95% CrI: 1,049,660-1,101,840). Under this spatially aggregated scenario, the majority of misdiagnosis occurred during the height of the epidemic (Fig. S2). In our analysis using cumulative incidence data for each country, our estimate of the overall size of the epidemic was 844,623 (95% CrI: 724,421-957,294), with country-specific estimates of pZ not well-aligned with $\hat{p}_Z$ (Fig. S3). When using cumulative incidence for the region as a whole, our estimate of the overall size of the Zika epidemic was 227,568 (95% CrI: 139,782-319,684).
Discussion
We leveraged empirical estimates of sensitivity and specificity for both clinical and molecular diagnostics to revise estimates of the 2015-2017 Zika epidemic in 43 countries across the Americas. After applying our methods to data from PAHO, we found that more than 300,000 disease episodes diagnosed as chikungunya or dengue from September, 2015 through July, 2017 may have been caused by ZIKV instead. Our revised estimates of the Zika epidemic suggest that the epidemic was more than 50% larger than case report data alone would suggest. Additionally, our estimates show that nearly a third of these instances of misdiagnosis occurred in 2015, prior to many countries reporting Zika cases to PAHO (26). An illustrative example of our method showed that these results were driven by the relative incidence of Zika and the two other diseases. Hence, differences in our results over time, across countries, and with respect to level of data aggregation resulted from differences in relative incidence of Zika and these other diseases across the different ways of viewing the data that we considered.
Some countries appeared to have higher Zika incidence than surveillance data alone suggest, such as Brazil and Bolivia, while others appeared to have lower Zika incidence than surveillance data alone suggest, such as Venezuela and Jamaica. In Brazil and Bolivia, our country-specific cumulative estimates of the epidemic were 90% and 180% larger than case report totals, respectively. In Venezuela and Jamaica, our country-specific estimates of the epidemic were 29% and 17% smaller than case report totals, respectively. These differences across countries can be explained by differences in the proportions of suspected Zika cases, $\hat{p}_{Z,s}$, through time. In Brazil and Bolivia, $\hat{p}_{Z,s}$ was less than 0.2 at nearly every time point, whereas it mostly ranged 0.2-0.8 in Venezuela and Jamaica. When $\hat{p}_{Z,s}$ was low, as in Brazil and Bolivia, the constraint that $\hat{p}_{Z,s} \le se$ allowed sensitivities to span a larger range, including lower sensitivities that would have resulted in the inference that more cases diagnosed as dengue or chikungunya were caused by ZIKV. When $\hat{p}_{Z,s}$ was moderate to high, as in Venezuela and Jamaica, the constraint that $\hat{p}_{Z,s} \le se$ limited sensitivities to higher values, resulting in the inference that fewer cases diagnosed as dengue or chikungunya were caused by ZIKV. Similarly, because of a trade-off between sensitivity and specificity for clinical diagnoses, these constraints on sensitivity also imposed constraints on specificity.
Using different aggregations of data in our analysis led to different conclusions in multiple respects. Using cumulative data for the region as a whole led to the inference that the Zika epidemic was smaller than suggested by surveillance data, whereas using cumulative data at a country level led to the inference that the epidemic was larger than suggested by surveillance data, but with variation across countries. Using temporally explicit data led to the inference that the epidemic was even larger, regardless of whether data were aggregated at a country or regional level. Overall, these similarities and differences suggest greater consistency temporally than spatially in the relative incidence of Zika, chikungunya, and dengue across countries. At least in the case of an emerging disease such as Zika, this suggests that it may be most important to prioritize temporal data when inferring patterns of misdiagnosis. With respect to the timing of inferred misdiagnoses, there were more visible differences between scenarios in which temporal data were aggregated at a country or regional level. When temporal data were aggregated at a country level, we inferred that the majority of misdiagnosis occurred prior to 2016. When temporal data were aggregated at the regional level, we inferred that the majority of misdiagnosis occurred during the epidemic in 2016.
Our observation model incorporated basic features of how passive surveillance data for diseases caused by multiple, co-circulating pathogens are generated, including the potential for misdiagnosis and differences in misdiagnosis rates by data type. With respect to other features of how data such as these are generated, there were some limitations of our approach. First, we did not specify a process model of transmission dynamics. For example, a multiple-pathogen transmission model could be fitted to passive surveillance data on Zika, dengue, and chikungunya, using our observation model to relate the model’s predictions to the data (33). Accounting for misdiagnosis in this way could improve a transmission model’s ability to make inferences about drivers of transmission (35–37) or interactions among pathogens (38,39). Second, we aggregated chikungunya and dengue case data, meaning that we were unable to explore the potential for differences in the extent to which misdiagnosis occurs between each of these diseases and Zika. If additional studies resolve differences in diagnostic sensitivity and specificity of Zika compared to each of these diseases separately, our observation model could be extended to account for this. Third, our observation model relied on a limited set of empirical estimates of diagnostic sensitivity and specificity. Given that the use of different diagnostics could vary spatially or temporally, as could their sensitivities and specificities (40–42), incorporating more detailed information about diagnostic use and characteristics could improve future estimates using our observation model.
Although passive surveillance data have been central for understanding many aspects of the 2015-2017 Zika epidemic, our finding that there may have been 56% more Zika cases than described in PAHO case reports underscores the need to consider the observation process through which passive surveillance data are collated. Here, we accounted for misdiagnosis in the observation process to revise estimates of the passive surveillance data on which numerous analyses depend (35,43,44). The advancements made here contribute to our understanding of which pathogen may be circulating at a given time and place. By better accounting for the etiology of reported cases, it could become more feasible to implement pathogen-specific response measures, such as proactively testing pregnant women for ZIKV during a Zika epidemic (45,46). Given the potential for synchronized epidemics of these and other co-circulating pathogens in the future (47), continuing to develop methods that disentangle which pathogen is circulating at a given time will be important in future epidemiological analyses based on passive surveillance data.
Data Availability
Data are available on GitHub at https://github.com/roidtman/zika_misdiagnosis.
SUPPLEMENTAL FIGURES
Acknowledgments
RJO acknowledges support from an Arthur J. Schmitt Leadership Fellowship in Science and Engineering and an Eck Institute for Global Health Fellowship. GE acknowledges support from Grant TL1TR00107 (A Shekhar, PI) from the National Institutes of Health, National Center for Advancing Translational Sciences, Clinical and Translational Sciences Award (https://www.indianactsi.org). TAP acknowledges support from a Young Faculty Award from the Defense Advanced Research Projects Agency (D16AP00114), a RAPID Award from the National Science Foundation (DEB 1641130), and Grant P01AI098670 (T Scott, PI) from the National Institutes of Health, National Institute of Allergy and Infectious Disease. The authors thank Dr. Ann Raiho for helpful comments on the manuscript.