Abstract
Dengue fever, a tropical vector-borne disease, is a leading cause of hospitalization and death in many parts of the world, especially in Asia and Latin America. In places where timely and accurate dengue activity surveillance is available, decision-makers possess valuable information that may allow them to better design and implement public health measures, and improve the allocation of limited public health resources. In addition, robust and reliable near-term forecasts of likely epidemic outcomes may further help anticipate increased demand on healthcare infrastructure and may promote a culture of preparedness. Here, we propose ensemble modeling approaches that combine forecasts produced with a variety of independent mechanistic, statistical, and machine learning component models to forecast reported dengue case counts 1-, 2-, and 3-months ahead of current time at the province level in multiple countries. We assess the ensemble and each component models’ monthly predictive ability in a fully out-of-sample and retrospective fashion, in over 180 locations around the world — all provinces of Brazil, Colombia, Malaysia, Mexico, and Thailand, as well as Iquitos, Peru, and San Juan, Puerto Rico — during at least 2-3 years. Additionally, we evaluate ensemble approaches in a multi-model, real-time, and prospective dengue forecasting platform — where issues of data availability and data completeness introduce important limitations — during an 11-month time period in the years 2022 and 2023. We show that our ensemble modeling approaches lead to reliable and robust prediction estimates when compared to baseline estimates produced with available information at the time of prediction. This can be contrasted with the high variability in the forecasting ability of each individual component model, across locations and time. Furthermore, we find that no individual model leads to optimal and robust predictions across time horizons and locations, and while the ensemble models do not always achieve the best prediction performance in any given location, they consistently provide reliable disease estimates — they rank in the top 3 performing models across locations and time periods — both retrospectively and prospectively.
1 Introduction
Dengue fever is a tropical vector-borne disease threatening an estimated 3.9 billion people [62] over 141 countries [2], with cases doubling every ten years since 1990 [26]. Over 390 million infections arise each year worldwide, with severe dengue causing 25,000 deaths annually, mostly in children [2, 3]. For many parts of Asia and Latin America, dengue is a leading cause of hospitalization and death, especially among children [16]. It also causes more morbidity and mortality than any other arthropod-borne virus [9]. While dengue infections often presents with flu-like or otherwise mild symptoms [1], about 1 in 20 people infected with dengue will develop severe dengue, which can further develop into dengue hemorrhagic fever [3] — a very serious condition marked by capillary leakage leading to potentially significant organ impairment, multiorgan failure, and death [1, 3, 21, 28]. The ability to forecast dengue fever cases can provide public health officials with a more accurate picture of future disease dynamics, empowering decision-makers to implement public health measures and better allocate limited resources.
Over the past few decades, multiple approaches have been developed for dengue forecasting. Dynamic, mathematical models that incorporate knowledge of dengue virus transmission biology, historical incidence, and climatological factors have been developed to predict the evolution of dengue epidemics [11, 64, 20, 51]. However, the intricate nature of dengue dynamics poses a significant challenge, as mechanistic assumptions usually remain unclear — or hard to quantify —, and acquiring the data to parameterize models often proves impossible. More recently, data-driven methodologies to predict the severity of an upcoming seasonal outbreak — a classification problem — have experienced a surge in popularity due to the increasing availability of epidemiological and exogenous dengue-related data. These include methods such as k-Nearest-Neighbors, Logistic Regression, and various boosting methods [44]; or the identification of weather (temperature, rain frequency) patterns that may help anticipate years with high incidence [35, 57, 56]. Additionally, classical time series methods like SARIMA (and its variants) have been widely explored to forecast confirmed case counts over time[6, 18, 46, 59], as well as more complex, non-linear methods such as generalized additive models, artificial neural networks, and exponential smoothing approaches [5, 6, 29, 46]. Other studies have explored the feasibility of leveraging Internet-based data sources such as Dengue-related Google search data [27, 50, 62], social media data [17, 33], and Wikipedia access logs [33] as additional predictors of Dengue activity.
The abundance of dengue forecasting methodologies presents a significant challenge for decisionmakers and stakeholders who ultimately need a single set of reliable predictions to make informed decisions to protect their communities. Factors such as data availability, computational resources, and the desired level of accuracy further complicate the decision-making process.
Therefore, navigating the array of forecasting methodologies requires careful consideration and expertise to ensure effective dengue prediction and response. Additionally, each model has its own limitations [18, 43], including robustness to variability in data quality and sensitivity to different outbreak phases, which can lead to inconsistent predictions.
Ensemble methods that intelligently and adaptively combine the predictions of multiple component models into a single prediction may be more robust alternatives for disease forecasting. For example, previous studies have used averaged and weighted-averaged ensembles combining models such as SARIMA, vector autoregression, neural networks, and linear regression to nowcast and forecast dengue in Brazil and India [23, 53]. Similar ensembling approaches combining models such as the Method of Analogues, Holt-Winter models, and Bayesian generalized linear mixed models, among other historical models, have shown promising results in Iquitos, Peru [8], and Vietnam [12]. Chakraborty et al. directly combined ARIMA and neural networks to forecast dengue in San Juan, Iquitos, and the Philippines as a composition of linear and non-linear signals [10]. Mahajan et al. use a gradient-boosting super-ensemble to combine ARIMA, exponential smoothing, and neural network for forecasting dengue in Hong Kong [32]. Ensemble models intentionally involving strong and weak learners to reduce over-fitting have also shown good predictive performance in Bangkok and Chiang Mai, Thailand [24].
However, the studies referenced above are (1) individually limited in their geographic scope –to study a few locations within a country at a time–, (2) they only focus on a few model choices as potential ensemble components, and (3) they were in all cases implemented retrospectively. The latter implies that issues of data availability, data completeness, and other challenges that emerge in real-time forecasting efforts were not at all considered [34]. As such, the approaches described are not guaranteed to generalize as robust, ready-to-use methodologies globally, in real-time and prospective forecasting efforts.
We would like to highlight that the challenges of the real-time implementation of disease forecasting systems have been documented extensively, and they are still the backbone that motivate multiple research studies ([37, 31, 54]). In fact, public health systems typically experience severe lags in reporting [13, 36], and reported case counts for a given month may be updated many times in the following months (this is commonly referred to as “backfill”) [50]. While some methods for addressing these data quality and backfill issues have been proposed [15], for example — methods that attempt to learn backfill patterns [22, 37] and methods that introduce auxiliary data sources [45] to improve forecasts; we did not include in our real-time prediction pipeline a comprehensive set of approaches to address these issues. Instead, we evaluated the ability of our machine learning-based ensemble methods to lead to improved or consistent forecasts in the presence of these challenges.
The primary prediction task addressed by this manuscript is the short-term forecast of reported dengue fever case counts one to three months ahead in province-sized localities. The main contributions of this work are twofold. First, we retrospectively formulate a family of ensemble system pipelines — comprised of multiple structurally heterogeneous individual component models — that generate more robust and accurate forecasts compared to their individual component models. Specifically, we include the following 11 diverse classes of individual component models as potential inputs to our ensembling pipelines: autoregressive (AR); autoregressive with Google Trends data as exogenous covariates (ARGO [63]); three variants of vector autoregression (NetModel and two variants of VAR [41] — VAR (Reg.) and VAR (Clust., Reg.); a combination of ARGO and NetModel (ARGONet [31]); a novel, mechanistic, repurposed dynamically-trained SIR; a classical error-trend-seasonality model (ETS); a mini-ensemble of machine learning methods (Stacked ML); a baseline naive persistence model; and a baseline seasonal model. We exhaustively demonstrate across 180+ province-sized locations that our ensemble models not only incur lower percent absolute errors on average compared to their individual component models (though such improvements may be modest), but also, and more importantly, they produce top performing forecasts (typically in the top three) more consistently than any individual model.
The second contribution of this study is the evaluation of a real-time dengue activity forecasting platform implemented as a prospective tool to identify when and where an upcoming dengue outbreak would be experienced 1, 2 and 3 months into the future. These predictive efforts were implemented as a decision-making support tool to guide the allocation of resources for prospective clinical trials in all provinces of Brazil, Colombia, Malaysia, Mexico, and Thailand. The ensemble techniques were assessed using a different set of component models that were chosen based on their suitability to be implemented in the presence of multiple data availability and data incompleteness issues. These models included: KNN, VAR (Reg.), Support Vector Machines (SVM), and SARIMA. The scope of our study and the consistency of our analyzes suggests that our ensemble approaches are generalizable across geographically diverse locations and individual component models, and thus may become a reliable first choice for forecasting teams who are interested in communicating concisely with public health officials and other decision makers[54].
Results
We evaluated the performances of eleven optimized component models and two selected ensemble models on forecasting reported dengue cases in over 180 province-sized locations worldwide in Brazil, Colombia, Malaysia, Mexico, and Thailand. Specifically, we retrospectively assessed each model’s ability to forecast reported dengue cases 1-, 2-, and 3-months ahead into the future, with performance quantified using Percent Absolute Error (see Supplementary Materials).
For each location, we partitioned the available data into distinct training and test periods. During the training phase, we engaged in a hyperparameter optimization for each component model, and two versions of optimal hyperparameters for our ensembles: a ‘Country’ set of hyperparameters (using the same set of hyperparameter with best on-average performance across all locations within a country), and a ‘Overall’ (using the set of hyperparameters that, on average, best performed within all the locations in the study). Following this stage, we proceeded to generate out-of-sample forecasts utilizing the designated test period. It is important to note that due to the varied availability of epidemiological data across different regions, the time periods for analysis varied from country to country. For specific details on the analysis period for each country, please consult Table 2.
We present our main results in the following way: First, we observe the heterogeneity in the performance of each of our component models in Figure 1, which presents a summary of the performance of each model across each country, and each horizon of prediction. Then, in Figure 2 we focus specifically on the capacity of the ensemble models to consistently succeed in generating reliable forecasts, independent of the location where they are trained on. Finally, we present an overview of the error reduction of each component model, emphasizing the consistency of the ensemble techniques to reduce error across locations.
Component model performance were significantly dependent on the location. Our results, summarized in Figure 1, present a table with the percent absolute error ranking (PAE) distribution for each model within Sections B through F (one for each country). Each row within the table represents a distinct model, while columns denote their respective rankings (1st, 2nd, 3rd, etc.). For instance, for Brazil, the first entry of the table indicates that the model ‘Ensemble (Country, EW)’ secured the lowest PAE, achieving first place in 7 out of 27 locations across the country. The tables showcase the top-performing model in terms of this ranking, listed in descending order. We found that our ensembles, including the ‘Country’ and ‘Optimized’ versions of the ‘Equal Weights’ (EW) ensemble, and ‘Country’ and ‘Overall’ ‘Performance Based Weights’ (PBW) tended to be in the top positions within our ranking tables across all horizons for Brazil, Colombia, Mexico, and Thailand (see the Methods section for details on the ensemble approaches). In the case of Malaysia, the EW ensemble appeared in the 6th position (almost half of the participating models). The Vector Autoregression (VAR) also consistently appeared in top positions for each of the ranking matrices for each country. On the other hand, the positioning of each of our component models varied from country to country. A country- by-country description, along with additional results for San Juan, Puerto Rico, and Iquitos, Peru can be found in the Supplementary Materials. We relegate Puerto Rico and Peru to the Supplementary Materials, since we only have one location in each of these regions, which does not facilitate interprovince analyses.
Ensemble models consistently achieve the most top performance rankings compared to any individual model
Our second main result is that the country-specific and overall ensembles consistently perform among the top 3 relative to the individual component models. Moreover, averaged across all locations, two ensemble variants incurred the lowest prediction error compared to all other component models. Fig. 2 shows the forecasting performance of both individual and ensemble models across all 187 tested locations. In panel (A), we provide one example emphasizing the advantages of using ensembles over component models. For the 1- and 2-month ahead horizons, the best country-specific ensemble used equal weights (EW) for all component models when producing an ensemble forecast at each timestep, while at the 3-month ahead horizon, the best country-specific ensemble used performance-based weights (PBW), where the weightings of the component models were determined based on which component models performed the best in the recent past. Specifically, at the 1- and 2-month ahead forecast horizons, the ensemble model predicted the ground truth (grey silhouette) much more closely and with less variance than that of its component models, which tended to display more significant oscillation relative to the ground truth. At the 3-month ahead horizon, the ensemble and its component models tended to produce less accurate predictions in this specific example. Given the broad geographical scope of our study, these findings suggest that ensemble models are a suitable default choice for generalizable and robust forecasts.
The heatmaps in Fig. 2 (B) show that the country-specific and overall ensembles had the most locations where they performed in the top 3 rankings in terms of percent absolute error, followed by regularized VAR and clustered + regularized VAR, out of the 187 tested locations. At the 2-month ahead horizon, country-specific ensembles still garnered the most top 3 rankings, but regularized VAR seemed to garner more top 3 rankings than the overall ensemble. At the 3-month ahead horizon, the country-specific and overall ensembles again attained the greatest number of top 3 rankings, followed by regularized VAR. While ensembles may not always be the absolute winner in every tested location, overall, they consistently placed on the top 3 rankings, being reliable and robust model choices. From the geographical maps in panel (C), ensemble models performed in the top 3 rankings in a vast majority of all 187 locations tested, as indicated by most of the maps being colored yellow. We found only a few consistent exceptions to this rule primarily in central Brazil and Colombia. These maps again corroborate the finding that ensembles are an effective option for stakeholders.
Ensembling also yields improvement in error distributions across locations
We analyzed the potential error reduction of our ensemble models with respect to individual models. Our results are displayed in 3, which shows the percent absolute error distributions for the individual and the ensemble models in the 187 locations tested. Fig. 4 shows the same percent absolute error distributions of the individual and ensemble models within each country and forecast horizon.
Overall performances
Despite the minor reduction in some cases, the country-specific and overall ensembles had the lowest mean percent absolute errors compared to all other models across all forecast horizons. Fig. 3 (A) to (C) displays the error distributions for each model, across the 3 different horizons of forecast. In all cases, we can observe that the Ensemble models were placed as the top performers, reaching values of 38.5%, 54.5% and 62.7% in terms of Percent Absolute Error (% AE). The next best model were there VAR variants (one incorporating regularization, and other implementing both regularization and clustering), reaching values of 40%, 55% 64%, for each respective task.
Performance by country
Fig. 4 shows the performance per country. Notably, an ensemble variant achieved the lowest mean percent absolute error in 14 out of 15 tested combinations of forecast horizon and country. The sole exception was Colombia at 2-months ahead, with clustered + regularized VAR having achieved the lowest mean percent absolute error. However, the mean percent absolute errors and the shapes of the errors’ distributions are very similar for the top-performing models. Nonetheless, our analyses shown in this figure still confirm that the ensembles were greater than the sum of their component models, even when looking only within a particular country.
Prospective Study: a real-life application and analysis of our methodology
In this section, we evaluate the performance of our forecasts generated by a real-time dengue activity forecasting platform. This platform was designed as a prospective tool to predict where dengue outbreak activity would occur 1, 2, and 3 months into the future. There are two primary differences between our retrospective and prospective studies. First, the prospective study was conducted in a real-world scenario where future ground truth data was unknown, whereas, in the retrospective study, we had access to the entire time series beforehand. Second, due to reporting delays in dengue case data, our prospective predictions did not utilize the most up-to-date epidemiological data. In contrast, the retrospective predictions were made using the most complete datasets available. This evaluation covered various locations in Brazil, Colombia, Mexico, Thailand, and Panama from May 2022 to July 2023. The forecasting models included Support Vector Machine (SVM), K-Nearest Neighbors (KNN), a regularized version of Vector Autoregression (VAR), and Seasonal Autoregressive Integrated Moving Average (SARIMA). Our ensemble models comprised a weighted ensemble, which assigned weights based on the mean squared error score of each model over the past six months, and a winner-takes-all ensemble, which selected the prediction from the ‘best’ base model with the least mean squared error over the past 3 months. Additionally, we used Persistence as our baseline model.
Figure 5 and Table 1 show a summary of the number of times our ensemble’s Mean Squared Error was among the top 3 best over all the locations within a Country (the analysis was repeated over each forecasting horizon). We also present the overall error reduction of each model with respect to persistence using a set of violin plots. We conducted the analysis for each location and each country (Figure 6).
Top 3 Analysis
Figure 5 shows the number of times a model reached within the top 3 Mean Squared Error reductions, for each country and each horizon. Each barplot represents the performance of a model within a Country, for a different prediction horizon. We can see VAR appearing eight times on the leftmost side (40%), and our Weighted Ensemble appearing seven times(35%). Our Winner-Takes-all approach appeared only two times on the leftmost side (10%), but had a comparable count with the top model whenever it was on second position (see Thailand in horizon 1, Panama in Horizon 1 and 4, Brazil in horizon 1, and Mexico in Horizon 1 and 2). Overall, most of our base models and ensembles had a higher count than persistence, with exception to Colombia in horizon 1, Panama in horizon 2, and Mexico in Horizon 1. Table 1 shows a total count over all locations, and all horizons. Our weighted ensemble had a total of 3318 counts, followed by VAR with 2989 and Winner-Takes-All with 2310 counts.
1.0.1 Overall Error reduction
Figure 6 exhibits a summary of the error reduction for the analyzed models with respect to the Persistence model . The violin plots are ordered so that the best model is to the rightmost side, and a dashed horizontal line plotted at y = 1 (y = 1 means the error of our model is equal to the error of Persistence) serves as a reference to know if a model consistently beat persistence or not.
The weighted ensemble (WE), vector autoregression and the winner-takes-all (WTA) ensemble were the three models that most frequently scored within top-3 error reduction. We observe that the weighted ensemble had median error reduction within the top 3 at every location and time horizon, except Panama in horizon 1. Regularized VAR scored the biggest error reduction in Colombia and Thailand for horizons 1,2, and 3, 4. Although less frequently, the Winner- Takes-All ensemble also remained within the top 3 performances with exception to Thailand in horizon 4, Panama in horizon 3 and 4, and Mexico in horizon 3 and 4.
Impact of reporting delays in our model’s performance
In conducting the prospective analysis, our forecasts were generated in a real-time scenario where the ground truth for each location was not fully reported at the time of prediction. The performance of our models were therefore likely affected by backfill issues in the data. Figure 7 illustrates our forecasts within the region of Sergipe, Brazil. At the time of prediction, the available information on confirmed cases (depicted in dark gray) differed from the most recently reported data (shown in light gray), which we employed as our ground truth for final metrics and error scores. Such backfill issues significantly impact real world applications as our models are trained solely on the information available at the given point in time.
Discussion
Dengue is a leading cause of hospitalization and death for many people around the world [16], and with cases doubling every ten years [26]. An essential component of dengue control is disease forecasting. Enhancing the accuracy and robustness of predictive models, particularly across multiple diverse geographical localities, empowers public health institutions to adopt a more proactive approach towards curbing the spread of the disease.
We tackled the task of predicting reported dengue fever case counts one, two, and three months ahead in various province-sized locations around the worldwide. Since individual model performances typically fluctuate across locations, there is a need for more robust and generalizable forecast models. In this context, our first contribution was the development of a family of ensemble models that retrospectively produced more accurate forecasts than their components across a broad range of geographically and socially diverse locations. Specifically, we investigated eleven types of data-driven, statistical, and mechanistic models as potential components of our ensemble. We also explored three ensembling mechanisms — performance- based weights, winner-takes-all, and simple average — and found that our ensemble models achieved lower percent absolute errors across 180+ geographically diverse locations. Our second contribution was a real-time dengue forecasting platform to predict when and where outbreaks will occur 1, 2, and 3 months in advance. Our forecasts were made in real-time without complete ground truth for each location. In fact, this task is not the same as performing retrospective studies, since dengue case data are typically updated months later (a problem commonly referred to as “backfill”). In this challenging scenario, our ensemble models still emerged as the top performers, producing better forecasts and reducing error compared to individual components.
Our ensemble models were robust predictors of dengue across the world. This fact is especially relevant because, as shown in Fig. 1, no individual model consistently achieved the lowest error. By contrast, while country-specific and overall ensembles were not always the best-performing models at a given location and forecast horizon, they almost always incurred the most top 3 performers compared to any of the component models. As shown in Fig. 3, our ensemble variants incurred the lowest error averaged across all 180+ tested locations in terms of percent absolute error compared to the component models. Even looking within a particular country and forecast horizon, as shown in Fig 4, ensemble models achieved the lowest error averaged across all locations within that country in 13 out of 15 combinations of country and forecast horizon. The two exceptions were Colombia at 2-months and 3-months ahead. Furthermore, in 9 out of 15 country-horizon combinations, both the country-specific and overall ensembles achieved the top 2 lowest errors averaged across locations. In our prospective study, the weighted ensemble (WE) model was overall the top performer in terms of low mean squared error, followed by VAR and the Winner Takes All (WTA) ensemble (Table 1). In terms of error reduction with respect to the persistence model, WE and WTA also consistently performed within top-3 performers across locations.
It is worth addressing the excellent performance of our VAR model in both retrospective and prospective studies. Examining our results more granularly, from Fig. 4, we observe that in Colombia, at the 2-month ahead horizon, VAR (Clust., Reg.) achieved more top 3 rankings than both ensemble variants. At the 3-month ahead horizon, VAR (Clust., Reg.) not only outperforms the overall ensemble in terms of the number of top 3 rankings but achieves more top 1 rankings than both ensemble variants. Similarly, at least one VAR model also outperforms at least one ensemble variant in terms of the number of top 3 rankings in all three forecast horizons of both Malaysia and Thailand. We hypothesize that VAR’s stellar performance in Colombia, Malaysia, and Thailand can be significantly attributed to the fact that these three countries are “province-dense” in the sense that individual provinces are relatively geographically small and, by extension, extremely close to each other. For example, Thailand has 77 provinces compacted into a relatively-small total surface area. In contrast, Brazil has 27 provinces spread out across a much larger area. The consequence of this geographical difference is that population centers between Brazilian provinces are much farther apart, and thus network effects are much weaker than their Thai counterparts. As such, VAR is much more effective in Thailand than Brazil because there are significantly stronger network effects between provinces to capture in our models.
We also investigated component models that were not exhaustively hyperparameter tuned but rather deployed straight out of the box, which we refer to as “standard” models. For details, we refer the reader to our Supplementary Information. As shown in Figs. 12 and 14, while our standard component models are nearly all unable to outperform our naive persistence baseline in terms of percent absolute error as averaged across locations, our ensemble models comprised of these standard component models outperformed the naive persistence baseline consistently. As such, our ensembling approach can take relatively weak, unoptimized learners and output a much stronger and more robust prediction. In this sense, the ensemble still performs better than its components.
From Tables 6 - 8 in the Supplemental Materials, we observe that when forecasting 1-month and 2-months ahead, the ensembling method most commonly employed (albeit plurality, not majority) was the equal weights method, followed by the performance-based weights model. At 3-months ahead, however, the performance-based weights mechanism was employed in most countries, including the overall ensemble. There does not appear to be a clear trend with respect to the ensemble training window sizes used to fit the performance-based weights and winner-takes-all ensembles.
Since the success of our forecasts is measured by achieving a lower percent absolute error than the naive persistence baseline model, ensembling enables us to include the naive persistence model itself as a component. As shown in our standard model results in the Supplementary Materials, we observe that when working with standard, non-fine-tuned component models, nearly all of the best country-specific and overall ensembles were comprised of the naive and seasonal basic models, coupled with one or two other models. From a bias-variance tradeoff perspective, the naive persistence model has very low variance, given its absence of tunable parameters. While other component models may overfit to noise and thus incur large errors, the naive persistence, by being simple, provides a stable component to the ensemble and thus allows the ensemble to outperform the other models, including its components.
One limitation of our work is that of the eight non-basic models that we include as potential components into the ensemble, six of them — AR, ARGO, ARGONet, NetModel, VAR (regularized), VAR (clustered + regularized) — can be interpreted as belonging to a common family tree of linear models involving autoregressive terms. In fact, we did not include many models in our analyses that were non-linear with respect to historical (logged) reported case counts, and our resultant ensembles may not be expressive enough. In the future, one could consider including more expressive but also more heavily-parameterized models such as Random Forests [65] and neural networks [4, 65] into our ensemble lineup to potentially increase performance. However, as explored in [4], heavily-overparameterized neural networks may underperform compared to simpler regression models at short-term disease forecasting tasks. One could also include additional traditional time series forecasting techniques like Holt-Winter, as explored in [52], into the ensemble lineup for extra non-linear models. With the exception of ARGO and ARGONet, all of our models were trained exclusively using historical dengue-reported case counts. Future work could include models that leverage climate data and earth observations into our ensemble lineup, as explored in [12].
Despite our efforts to forecast dengue cases in o ver 180 locations worlwide, there are still many other countries, especially in the Americas and Asia, where we can retrospectively and prospectively test our individual and ensemble models. Our methodology can also be easily extended to support uncertainty quantification. Please see our Supplementary Materials for additional details and proofs-of-concept of such an extension. Future studies could explore classification tasks of predicting whether a given location will experience an outbreak by thresholding our case count predictions. Methods like DT-SIR, while prone to over-predicting at outbreak peaks, are still excellent at capturing the outbreak progression trend. Another interesting research avenue involves combining regression and classification components together within ensembles. For example, one can consider an ensemble setup containing both regression models (predicting the number of dengue reported case counts) and classification models (predicting whether an outbreak will occur in the next months). Future work could also involve combining ensembles together into superensembles to further reduce variance.
2 Methods
2.1 Data Sources
In this section, we present our data collection and processing routines.
Reported Case Counts
We used two primary modalities of data to train our models. Raw weekly and monthly reported dengue case counts at the city and province levels were obtained from the following sources. For Brazil, data was obtained from SINAN (using the Datasus package in R) and from the Info Dengue website API (at https://info.dengue.mat.br). Data from both sources were reformatted and merged into a single data set. Area codes were translated to state and municipality names. For Thailand, National Disease Surveillance Reports were downloaded from http://doe.moph.go.th. Separate files are available for dengue fever, DHF, and DHF shock syndrome. Files were processed in R, reformatted and combined into a single data set. For all other countries, PDF-formatted reports were downloaded from the Ministry of Health websites (for Colombia: https://www.ins.gov.co; for Malaysia: https://www.moh.gov.my; for Mexico: https://www.gob.mx; for Peru: https://www.dge.gob.pe and for Puerto Rico: https://www.salud.gov.pr). Tables containing dengue case data on a regional level were identified and extracted using ABBYY FineReader PDF software, applying OCR where required, and saved to excel. Extracted tables were then processed in R, checking extracted region names and count values using regular expressions and verifying table totals where available. All tables were time-stamped with the date of the report they were extracted from, reformatted and combined into a single data set per country. Data at the city level were aggregated via summation into province-level resolution before input into our model training pipeline. All data at the weekly level were also aggregated via summation into monthly values before the start of model training.
Google Trends
We used the Google Health Trends API to obtain monthly dengue-related search terms’ frequencies at the province level for all of our locations. For a small number of locations where Google Trends data was not available at the provincial level, we used the Google Trends data of the dengue-related search terms at the country level as a proxy. Within each country, we used the same set of dengue-related search terms for each province. Country-specific lists of all the dengue-related search terms we used can be found in our Supplementary Materials.
To maintain consistency, we chose only the top 10 most useful dengue-related search terms in each country as input into the ARGO and ARGONet models that required Google Trends data. Specifically, we determined each term’s “usefulness” in a particular location by computing the Pearson correlation of these term’s search frequencies with the reported case counts on a time window directly preceding our model evaluation time window to avoid signal leakage. We ordered the Google Trends terms in decreasing order of Pearson correlation and used this ordered data as input into the ARGO and ARGONet models. Country-specific time windows for the Pearson correlation analyses can be found in our Supplementary Materials under the “Training Period” column of Table 2.
2.2 Fitting Methods
Fig. 8 illustrates our two methods for fitting our individual models at each timestep.
Fig. 8 shows our two mechanisms for model-fitting. Panel (A) shows the sliding window mechanism for model fitting. For example, if the current time is February 2019, our prediction task is to forecast one month ahead — March 2019. And suppose, for example, that we are using a 9-month sliding window. Then, when fitting a given model to forecast for March 2019, a 9-month sliding window fitting method implies that we will only use the reported case counts data between June 2018 and February 2019 (inclusive) as target variables in our training set, as indicated by the green shaded region inside of the two dotted green lines. After March 2019, our new prediction task will be to forecast for April 2019, and we will slide our fitting window to fit our model using only the reported case counts data from July 2018 to March 2019 (now observed) as the target variables in our training set, as indicated by the blue shaded region inside of the two solid blue lines. To emphasize, because of the “sliding” operation, June 2018 is no longer contributing to our model fit. The assumption behind the sliding window mechanism is that infectious disease dynamics change across time, and thus, relationships between reported case counts in the distant past are likely not very informative of the current dynamics of the disease.
In contrast, panel (B) shows the expanding window mechanism for model fitting. Suppose that the earliest date in our dataset of reported case counts for which we can assemble a full set of covariate features is April 2018. Suppose that the current time is December 2018, and our goal is to forecast one month ahead — January 2019. Then, when fitting the model using an expanding window mechanism, we will use all of the reported case counts between April 2018 and December 2018 as target variables in our training set, as indicated by the region shaded in green inside of the two green dotted lines. After January 2019, our new prediction task will be to forecast for February 2019, and we will “expand” our fitting window to use all reported case counts from April 2018 to January 2019 as target variables in the training set, with the expansion region shaded in blue and bound by the solid blue line. In contrast to the sliding window, April 2018 is still in our training set. The underlying assumption behind the expanding window mechanism is that there exists some stationary distribution / ground-truth autoregressive data-generating process that holds across all time.
2.3 Individual Models
In this section, we describe the individual, fine-tuned component models that we will later ensemble together for more robust forecasts.
2.3.1 Autoregression (AR)
As explained by [55], a k-month-ahead autoregressive model reported case counts in a specific location at month t + k as a linear combination of reported case counts at months t through t − L + 1 in said location, with a bias term. The hyperparameter L is the number of lags that we are using when forecasting. Initial experiments suggested that autoregressive models experienced significant performance boosts when working with log-transformed case counts, so we define yt as the log-transformed reported case counts of dengue in a given location at month t. An AR(L) model for k-month-ahead forecasting can thus be expressed as the following, where ɛ is an irreducible error term:
Standard Variant
For the standard variant, at all forecast horizons and locations, we used a generic AR(4) model fitted using Ordinary Least Squares, with no log-transformation of the reported case counts. We use an expanding window for model-fitting.
Optimized Variant
For the optimized variant, at all forecast horizons, we found that setting L = 24 yielded the best performance. We also trained all AR models in all locations and horizons using an expanding window approach. Computationally, we used the glmnet package [14] to minimize the L2 error subject to LASSO regularization (with regularization strength determined using cross- validation) to fit our model at each simulated month. From initial testing, we did not regularize the first two lags. For 2- and 3-months ahead forecasting, in addition to our choice of L = 24, we manually chose the specific lags of 1, 2, 3, 12, 13, 14, 15, and 24.
2.3.2 AutoRegression with Google Search Data (ARGO)
As introduced in [63], ARGO builds on the classical AR model and incorporates the most- recently-available Google Trends search frequencies of dengue-related keywords as covariates into the linear model architecture. While ARGO was originally designed for flu incidence forecasting, it has also been adopted by [27] for dengue forecasting in 20 Brazilian cities. Let xm,t be the log-transformed Google Trends search frequency of term m (for m = 1 to m = 10) at month t. Then, the ARGO model with L epidemiological lags can be given by
Standard Variant
For all forecast horizons, we used AR(4) coupled with the most recently observed Google Trends data as our features. We did use LASSO regularization via glmnet [14], but we did not protect any features from regularization. We also did not log-transform either the autoregressive features or the Google Trends features. We used an expanding window approach for model fitting.
Optimized Variant
Just as in AR, for all forecast horizons, we set L = 24 for maximum performance and use the same expanding window approach. We also use glmnet [14] with the same settings for model-fitting at each simulated month, with the first two epidemiological lags not subject to regularization. Unlike AR, we did not perform any manual selection of autoregressive lags and simply used all L = 24 lags as covariates.
2.3.3 NetModel
To model the network effects of dengue spreading between nearby provinces, we extend the original AR linear model by modeling log-transformed reported case counts for location j during month t + k as a linear combination of only recent reported case counts in location j, but also recently reported case counts for all provinces j′ ∈ J, where J is the set of all provinces in a given country. A similar approach was implemented for flu forecasting in [31].
Let La be the number of lags from location j itself (local lags) and Lb be the number of lags from each of the other locations j′ ∈ J (neighbor lags) that we will be adding into our linear model. Let yj,t be the log-transformed reported case counts in location j at month t. Then, our NetModel is given by
Standard Variant
For all forecast horizons, we used La = 4 and Lb = 1. We used LASSO via glmnet for model- fitting, but did not protect any lags from regularization. We used the expanding window fitting scheme for all locations and horizons. We did not log-transform the local lags nor the neighbor lags.
Optimized Variant
For 1-month ahead forecasting, we chose La = 24, with manual feature-selection of the lags 1, 2, 3, 12, 13, 14, 15, and 24, and chose Lb = 1, log-transforming both the local and neighbor lag features. We used glmnet with LASSO to fit our model, using an expanding window scheme, and refrained from regularizing the first two local lags.
For 2-months ahead forecasting, we chose La = 12 and Lb = 1, log-transforming both the local and neighbor lag features. We also used glmnet with LASSO to fit our model, using an expanding window scheme, and refrained from regularizing the first two local lags.
For 3-months ahead forecasting, we chose La = 12, with manual feature-selection of the lags 1, 2, 3, and 12, and chose Lb = 2, log-transforming both the local and neighbor lag features. We used glmnet with LASSO to fit our model, using an expanding window scheme, and refrained from regularizing only the first local lags.
2.3.4 ARGONet
As introduced in [31], we also include ARGONet with two-component models — ARGO and NetModel — into our list of component models. Specifically, across all forecast horizons and locations, ARGONet returns the mean of the individual ARGO and NetModel predictions as its ensemble prediction. While ARGONet was also implemented in [31] with a winner-takes-all approach, we found that for the prediction task of forecasting monthly dengue-reported case counts, such a scheme was not as effective as simply returning the mean.
We implemented the standard / optimized variants of ARGONet by taking the mean of the corresponding standard / optimized ARGO and NetModel predictions, respectively.
2.3.5 Exponential Smoothing
To implement exponential smoothing (ETS), we used the sktime [30] toolbox in Python. We used the pipeline functionality available within sktime with three different input-transforming preprocessors. The preprocessors we used were a power transformation, a robust scaler, and a min-max scaler. We then used AutoETS with a period of 12 (the number of months in a season) as the model in the pipeline. The pipeline was grid searched for the optimal input transformation using cross-validation with an expanding window splitter on the training data. After the pipeline hyperparameters were trained, it was used to make out-of-sample predictions in the usual manner with sequential addition of data, retraining parameters (not hyperparameters), and subsequent prediction.
2.3.6 Vector Autoregression
We implemented multiple vector autoregression (VAR) models with varying degrees of regularization and clustering. Prior to modeling, the data were transformed with a standard log transformation. As with other components of the manuscript, we tested several transformation approaches, but none were consistently better than a log transformation. As the nature of VAR requires that data be available for all time points in every time series included in the model, we decided to individually implement a different VAR model for each country in the data. This allowed us to use nearly the entire time series available in each country. The alternative approach would be to combine all countries into a single model. Unfortunately, this would have seriously compromised the length of the time series in several countries as historical data varies considerably by country.
For modeling, we began with the most common method implemented in R through the VARS package [49]. Unfortunately, with the amount of data available, a standard VAR model in some countries (e.g., Thailand) would not converge even with a lag order of 1. However, much of the diminished performance could be resolved by implementing geographic clustering (discussed below), suggesting that the issue was primarily a result of some degree of underdetermination.
Nevertheless, the regularized models uniformly outperformed their unregularized counterparts, whether clustered or unclustered, so we proceeded with exclusively regularized models. In an effort to simplify the process of lag selection, we decided to standardize all regularized VAR models to use a lag order of 4. Any selection greater than 4 seemed to make no difference to the out-of-sample accuracy of the model.
For regularized VAR, we used primarily the BigVAR package made available in R [42]. We trialed several regularization frameworks, including (in the terminology of the package) Basic Lasso, Basic Elastic Net, Lag, Own/Other, Sparse Lag, Sparse Own/Other, Hierarchical Componentwise, Hierarchical Elementwise, and Hierarchical Own/Other. We used the standard rolling cross-validation method, expanding window, and adjusted the train/test window. However, there was no benefit to altering these from the default selections. We tested several lambda grid depth values and selected 100 as a good tradeoff between sufficient depth and reasonable compute time. We refer to this model in our figures and tables as “VAR (Reg.).”
2.3.7 Vector Autoregression with Geographic Clustering
Next, we investigated to what extent geographic clustering could improve predictions. To that end, we obtained the latitude and longitude of every city in the data. We used the sp [7, 47] and geosphere packages available in R to compute the within-country distance matrix between all cities using the Haversine distance. The cities were then grouped with hierarchical clustering. The tree was produced using hclust and the clusters were produced using cutree as the highest number of clusters that allowed two or more cities in every cluster (which is a requirement for the application of VAR.
After within-country geographic clustering, VAR and regularized VAR were applied in the same manner as above. In all figures and tables, we refer to this model using the abbreviation “VAR (Clust., Reg.).”
2.3.8 Stacked Machine Learning
We produced a stacked regression (stackedML) model using sktime [30] and scikit-learn [48]. We constructed it as an ensemble of several available machine-learning models in the toolboxes.
Given the challenges posed by underdetermination for these data when using higher lag orders on univariate time series, before implementing the model, we utilized a simple pipeline with a preprocessor and an elastic network (EN) base model that we optimized for the best L1 ratio. We then used this model to trim higher lag orders that seemed to improve this simple model.
After trimming the higher lag orders, we implemented the stackedML model again as a pipeline to allow comprehensive hyperparameter selection via grid-searched cross-validation. For preprocessing, we included the standard scaler, the min-max scaler, and a log transformation. For base models, we included an EN model, a k-nearest neighbors model (KNN), a support vector machine model (SVM), and a gradient-boosted machine (GBM) model. The pipeline was ensembled with an independently cross-validated elastic network model allowing only positive covariates. Hyperparameters in the base models were optimized via a grid search of the entire pipeline at once; the hyperparameters that we tuned included the number of neighbors in the KNN base model, the C-parameter in the SVM base model, and the number of leaves in the base GBM model.
After extensive testing, there were a few obvious limitations. First, optimizing the pipeline was extremely computationally intensive and became exponentially complex as more hyperparameters were searched. Second, the first issue was particularly challenging when combined with rolling optimization and leave-one-out (out-of-sample) cross-validation. Third, the data of a single-city univariate time series was clearly insufficient to tune an extremely large hyperparameter space. As a result, we pared the base models to remove the GBM. The GBM alone requires tuning of so many hyperparameters for optimal predictions that it was not improving the model with its inclusion. Then, we observed that the KNN was essentially never used by the ensembling model, so it was removed as well. As a result, we were left with an EN and an SVM model stacked together.
2.3.9 Dynamically-Trained SIR
We introduce a novel, dynamically-trained SIR (DT-SIR) interpolator model for forecasting reported dengue cases. This model borrows the mathematical behaviors and properties of the traditional SIR dynamical system as introduced by Kermack and McKendrick [25] but re-purposes it for forecasting tasks.
As presented in [39], the traditional SIR model is governed by the following system of differential equations, parameterized by time t, where S is the number of susceptible individuals, I is the number of currently-infected individuals, R is the number of recovered (including deceased) individuals in a given population: Here, N is the total number of people in the population, which we assume to be fixed. The parameter β governs the rate at which susceptible individuals (in S) become infected, and the parameter γ governs the rate at which infected individuals (in I) become recovered (or deceased). Mathematically, we can define any set of solution trajectory curves for S(t), I(t), R(t) for any interval of time starting with t = 0 by specifying β and γ, as well as specifying our initial conditions S0 and I0 for the numbers of susceptible and infected individuals at our initial point of reference t = 0. Since the SIR model assumes for all timesteps t that S + I + R = N, R0 is always uniquely determined by S0 and I0.
We acknowledge that the SIR model, originally designed for modeling direct transmission-type diseases, is an oversimplified model for dengue, given the complex combination of mosquito vector-borne transmission dynamics and intricate systems of partial immunity acquisition to different serotypes. However, for the sole purpose of forecasting case counts, our empirical results suggest that the basic SIR model is still sufficient. In fact, the mechanisms of the mosquito intermediary between infected humans for dengue transmission can be absorbed into / suitably approximated by the basic SIR model. It is also worth noting that using more complex models with more compartments and parameters risks over parameterization and overfitting, which could be undesirable regarding bias-variance tradeoff considerations.
It is important to note that our public health data provides new infected cases per month and not the total infected number of infected people within a population at a given time. However, because the recovery period of dengue is almost always less than a month [1], we can assume that all individuals who become newly infected in month m will also become recovered in the same month m. With this train of thought, it follows that we can treat new infected cases at month m as interchangeable with total infected people within the population at month m. However, from previous works in the literature, we know that dengue case reporting rates are very low. This is due in part to the reality that dengue fever oftentimes manifests no symptoms and that even when symptoms are present, they are oftentimes similar to that of the common flu. As such, reporting rates tend to be low. To accommodate this underreporting, we introduce a learnable report rate parameter r ∈ (0, 1) that captures the proportion of infected individuals who are recorded by public health authorities. Given S(t), I(t), R(t), let us define C(t) = r×I(t) to represent the number of reported infectious people within the population at month t, which we clarified above is interchangeable with the reported number of new infections at month t.
For each month m, DT-SIR outputs reported case count predictions for the month m + h at a single location through the following algorithm. For simplicity, suppose we are forecasting 1- month ahead, with h = 1. But, we can output forecasts for the 2- and 3-month-ahead horizons analogously.
We query the historical dengue reported case counts for the past T months: m − T + 1, m − T + 2, …, m − 1, and m. In practice, we found through extensive testing that T = 5 performed the best across all locations and horizons, and thus we set T = 5 for both our standard and optimized DT-SIR variants. Let us denote these historical case counts as ym−T +1, ym−T +2, …, ym−1, and ym.
Using the scipy.integrate.odeint numerical integration package from SciPy [60] and the non-linear least-squares curve-fitting package lmfit [40], we find the best set of parameters β, γ, S0, I0, r such that the resultant integrated C(t) curve (always calibrated to start mathematically from t = 0, corresponding to month m − T + 1) best fits the historical dengue reported case counts in the past T months. To emphasize, the solution curves to our SIR differential equations systems will always be plotted mathematically starting with t = 0, regardless of what month m we are in. We can do this because our initial conditions of S0 and I0 render our resultant trajectory curves agnostic to the realtime month/phase of an outbreak that we are in. We define “best fit” as minimizing the RMSE of the resultant C(t) curve (evaluated at t = 0, 1, 2, …, T − 1) with respect to the historical dengue case counts observed at months m − T + 1, m − T + 2, …, m − 1, and m. Because this objective function is almost certainly non-convex, we cannot guarantee that our curve-fitting algorithm will find the global minima. The best we can do is find a very good local optimum.
Given that we are only using the most recent T months’ observations as input data for fitting our SIR system parameters, in other words, we are using a sliding window approach and shifting our sliding window after each prediction timestep. The reason for using a sliding window as opposed to an expanding window that we use for the other component models is because the SIR model, by nature, can only model one outbreak peak at a time. If we used an expanding window approach, we may have multiple peaks in our fitting data, and our resultant SIR parameter fit would be very poor. The sliding window approach is especially attractive if we accept the assumption that disease outbreak dynamics vary significantly across time and various historical outbreak cycles. As such, we must re-estimate our model parameters at each timestep to remain up-to-date with current transmission dynamics.
When fitting our C(t) curve to the observed monthly data, it should be noted that we limit the plausible range for β ∈ [0, 500] and γ ∈ [0, 8.5]. We limited β to a still sizeable range mainly for practicality and reproducibility. From existing literature [1], we know that the average human infectious period for dengue fever is 4-5 days. Canonically, we know that γ represents . As such, converting to months, we find corresponding γ values of 6 and 7.5. We expand the upper bound of our parameter interval by 1 to account for some possible anomalies. Of course, we also restrict S0 and I0 to never exceed N, the total population. We restrict r ∈ [0, 1].
To avoid being confined to one local optima, we perform 500 independent fits for β, γ, S0, I0, r, selecting the estimated parameters corresponding to our ”best” fit (in terms of RMSE) for our forecasting purposes. For 250 of these fits, we randomly initialize our parameter guesses across their entire permitted intervals. For the other 250 fits, we initialize the starting parameter guesses to be distributed uniformly on an interval that is within 20% of the previous timestep’s fitted optimal parameter values, to enforce some “continuity” of our disease dynamic parameters over time. For S0 and I0, we initialize S0 during each fit to a starting guess of N, the total population, because intuitively, the proportion of people with dengue in a population is relatively low.
Using our best-estimated parameters of β, γ, S0, and I0, and r, we evaluate C(t) at t = T − 1 + h to produce our provisional prediction for the number of reported dengue case counts at month m + h. Let us call our provisional prediction ŷm+h.
However, as observed in [58], SIR-type models are prone to “overshooting,” or significantly overestimating the number of reported dengue cases at outbreak peaks. Our DTSIR, without modification, also experiences such limitations. To mitigate this potential in-accuracy, we implement an anti-overshooting mechanism at each prediction timestep m, comprising of a ”threshold” and a ”compensator”. We will explain this anti-overshooting mechanism for the 1-month-ahead forecasting task (i.e., h = 1), but the 2-month and 3- month-ahead setups are analogous.
(a) Threshold: We calculate yt − yt−h, the observed h-month-apart differences between reported dengue case counts, for the past n months. Next, we calculate the mean observed historical differences in these past n months and add s standard deviations to this value. This computed value is our threshold k. Formally, we have
where ȳ is the sample mean of the yt−yt−h, for t = m to t = m−n+1. From extensive testing, we found that n = 24 and s = 4 were the most suitable and generalizable hyperparameter settings across all of our locations and forecast horizons.
(b) Compensator: If the difference between our provisional prediction for month m + h, ŷm+h, and the true reported case count for our most recently observed month, ym, is greater than our threshold k, then our threshold is triggered and we adjust our prediction. Define ȳ+ to be the mean positive h-month-apart differences between reported dengue case counts for the past n months:
where 1 is an indicator function. Our compensator value is formulated as the sum of ȳ+ and s times the standard deviation of the positive h-month-apart differences between reported dengue case counts for the past n months. Let δ be our compensator value, which is formally defined as
With our compensator value computed, we output the following adjusted prediction for month m + h, ŷ′m+h:
Intuitively, it would make sense to set δ = k. However, this alternative adjustment is suboptimal because setting it renders it very difficult for our model to predict an outbreak that is truly significantly more intense (in terms of peak reported case counts) than it has historically seen in its training data. In contrast, using our compensator formulation with the historical positive differences allows us to better capture the intuition that cases will indeed increase rapidly during outbreak peaks.
4. Finally, to predict reported case counts for the next month, month m + h + 1, we slide our fitting window forward by one month and repeat the steps enumerated above.
We use similar DT-SIR model settings in both our standard and optimized ensembling tests, for all locations and all forecast horizons — specifically, we use the most recent T = 5 months of observations as our sliding training window. However, for the standard model setting, we disable the anti-overshooting mechanism.
2.3.10 Basic Models
In addition to the more complex individual models described above, we also include two relatively-simple baselines as potential component models in our ensemble.
Naive Persistence
If we are simulating being in month t and forecasting k months ahead, the naive persistence model will return the number of reported cases currently observed in month t as its forecast for month t + k. We use the same naive persistence model in both our standard and optimized ensembling tests.
Seasonal
Suppose we are simulating being in month t and forecasting k months ahead. Without loss of generality, suppose that month t + k is January. Then, the seasonal model will query our historical reported case counts for all observed January reported case counts, and return the mean of all the historical January dengue case counts as its forecast for month t + k. The idea behind the seasonal model is that dengue has been found to be seasonal in many locations around the world (see [19], [38], [61]). We use the same seasonal model in both our standard and optimized ensembling tests.
2.4 Ensemble Systems
Fig. 9 illustrates the three ensembling methods that we tested. Let ŷi,t refer to the prediction generated by model i for time t, ŷt represent our component models’ predictions for time t stored as a vector, and yt be the ground-truth reported case counts at time t. Panel (B) illustrates the Performance-Based Weights (PBW) ensemble. In our toy example shown in Fig. 9, for this prediction timestep and our choice of a 5-month ensemble fitting window, the PBW ensemble finds the optimal set of non-negative weights w that minimizes the following objective function, if we calibrate t = 5 to correspond to the last observed timestep of April 2021: , subject to the constraint that 1T w = 1. Such optimization was performed using the scikit-learn [48] package. To make our ensemble prediction for time t+1, we output said prediction as wT ŷt+1, where ŷt+1 is the vector of our individual models’ predictions for time t + 1. Panel (C) illustrates the Winner-Takes-All (WTA) ensemble. Extending the notation from our discussion of the PBW ensemble, the WTA ensemble outputs ŷi∗,t+1 as our ensemble prediction, where encodes the index corresponding to the component model that outputted the most accurate predictions during the ensemble fitting window. Panel (D) represents the Equal Weights (EW) ensemble, which simply outputs the unweighted mean of the components’ predictions at time t + 1 as the ensemble prediction.
Please see Tables 3-5 and 6-8 in the Supplementary Materials for the best ensemble variants using the standard and optimized individual models, respectively, at each forecast horizon and within each country.
2.5 Model Evaluation
All individual and ensemble models were evaluated on the time periods under “Test Period” in Table 2 in the Supplementary Materials. While available data timeframes differed significantly across countries, we aimed to secure at least 3 years (the most recent years) of evaluation data points for each country. The remainder of the data were binned as training data, whether that be used directly for model-fitting, and/or for feature engineering.
To compare model performances, we used percent absolute error (PAE) on the predicted versus ground-truth reported case counts as our evaluation metrics. The “percent” implies that we divided the raw mean absolute errors for each model in a given location by the mean number of cases present in said location during the evaluation period.
3 Supplementary Materials
3.1 Extension of methodology for uncertainty quantification
Though not the focus of our main manuscript, our methodology can be easily extended to incorporate uncertainty quantification via approximate 95% predictive intervals for our forecasts. Our predictive interval generation algorithm is as follows:
1. Suppose today is month m and we would like to provide uncertainty quantification intervals for our h-month-ahead ensemble forecast. In practice, the ensemble can be replaced with any individual component model, too.
2. We can compute the predictive residuals ɛt = ŷt − yt for previously-observed months t = 1 through t = m, where yt was the true reported dengue case count at month t and ŷt was our h-month-ahead dengue forecast for that month (i.e., generated in month t − h). Let us name our expanding-each-month vector of residuals at month m as ɛm.
3. Using our ensemble, we can generate our point-forecast for month m + h, and denote it as ŷm+h. To provide uncertainty quantification, we can compute the standard deviation of this model’s historical residuals stored in ɛm and, assuming approximate Normality, multiply by 1.96 to obtain the width of an approximate 95% predictive interval.
4. Then, our uncertainty-quantified ensemble forecast for month m + h would be
5. Because case counts cannot be negative, we may also clip our forecast intervals to be strictly non-negative.
Below, we demonstrate this uncertainty quantification algorithm on 1-, 2-, and 3-month country- specific ensemble forecasts in Ceara, Brazil, with approximate 95% predictive intervals generated for January, May, and September of 2020 and 2021.
No one individual model is best across all locations: model performance is significantly dependent on location
Fig. 1 summarizes the comparative performances of our optimized individual and ensemble models’ within each of the five countries, tested across the three prediction tasks of forecasting 1-, 2-, and 3-months ahead. The reader can find additional details on our fine-tuning and optimization processes in the Supplementary Information.
Panel (A) illustrates our three forecasting tasks of forecasting 1-, 2-, and 3-months ahead. Panels (B) to (F) show the models’ performances within each country in two ways. On the left of each panel, we present a heatmap where each row represents a model, and each column encodes the number of locations within the country of interest where a model achieved a certain rank in terms of percent absolute error (PAE) compared to the other individual component and ensemble models (1st through 13th rankings). The models in each heatmap are listed in decreasing order by the sum of the number of locations where each model performed in the first, second, or third ranks. As an example, “Ensemble (Country, EW)” having a value of 7 corresponding to Ranking 1 in Brazil (1-Month Ahead) means that the country-specific ensemble incurred the lowest (best) PAE compared to all other individual and ensemble models in 7 out of the 27 provinces of Brazil. On the right of each panel, we have a geographical map where each province is colored according to the model that incurred the lowest PAE in that province, with the legend displayed at the bottom of the overall figure. Overall, ensemble models were robust within any of our tested countries. They demonstrated the most or nearly the most top-3 rankings compared to other models.
Formally, we define percent absolute error (PAE) on the predicted versus ground-truth reported case counts as the raw mean absolute error divided by the mean number of monthly reported cases observed in a given location during the evaluation period. Mathematically, let y1 … yT be the ground-truth reported case counts, and ŷ1 … ŷT be our predicted case counts. Then, On the right of panels (B) - (F), we plot a map of each country along with the best-performing model for each province. We observed that no particular model consistently performs the best across all locations and prediction tasks within a country. Our findings thus suggest that, in practice, it is not feasible to train, evaluate, and find the best-performing models for every single province in every single country. Instead, using an out-of-the-box ensemble setup would require significantly less computation and optimization while still effectively guaranteeing strong performance. In what follows, we present the results for each country.
Brazil
Fig. 1 (B) shows that, within the 27 provinces of Brazil, the country-specific ensemble achieved the most top 3 rankings compared to any other model, with the overall ensemble having reached the second most top 3 rankings at the 1-month and 2-month horizons. However, at the 3-month horizon, ARGONet and AR outperformed the overall ensemble in the number of top 3 rankings. From the accompanying color-coded geographical maps, we observed that at any forecast horizon for Brazil, the ensemble did not rank first in many locations - the winning model varied widely across sites.
Colombia
Fig. 1 (C) shows our 13 models’ ranking in the 33 provinces of Colombia. While the country- specific and overall ensembles achieved the most top 3 rankings at the 1-month ahead forecast horizon, clustered + regularized VAR achieved the most top 3 rankings, superseding the two ensemble variants, at 2-month ahead. At 3-months ahead, the country-specific ensemble achieved the top 3 rankings, but the overall ensemble was still outperformed by clustered + regularized VAR.
Malaysia
Fig. 1 (D) shows our results for the 15 provinces in Malaysia. While the country-specific ensemble achieved the most top 3 rankings at the 1-month and 3-month ahead horizons, clustered + regularized VAR achieved the most top 3 rankings at the 2-month horizon. Notably, the overall ensemble did not generalize very well in Malaysia, being ranked even below the naive persistence baseline at the 1-month ahead horizon. We note, however, that the two easternmost (and largest) Malaysian provinces both saw the country-specific ensemble performing the best out of all 13 models at the 1-month and 3-month horizons.
Mexico
Fig. 1(E) shows our 13 models’ rankings in the 32 provinces of Mexico. Regularized VAR achieved the top 3 rankings at the 1-month ahead forecast horizon, followed by the country-specific ensemble. At the 2-month ahead horizon, the two ensembles achieved the top 3 rankings, though at 3-months ahead. In contrast, the overall ensemble maintained the top 3 rankings, regularized VAR overtook the country-specific ensemble for the second most top 3 rankings. From the geographical maps, we see that no individual model achieved the top rank across most of locations.
Thailand
In Fig. 1(F), we show the results in the 77 provinces of Thailand. From the heatmaps, we observed that the country-specific ensemble achieved the most top 3 rankings at the 1-month and 2-month horizons. By comparison, regularized VAR superseded both ensembles for the most top 3 rankings at the 3-month horizon. Notably, at the 2-month horizon, regularized VAR achieved more first-place rankings than both ensembles. At the 1-month horizon, regularized VAR achieved the same number of first-place rankings as the overall ensemble and significantly more first-place rankings than the country-specific ensemble. The geographical maps show that no particular model performed the best across most Thailand provinces.
No one individual model is best across all locations: model performance is significantly dependent on location
Fig. 1 summarizes the comparative performances of our optimized individual and ensemble models’ within each of the five countries, tested across the three prediction tasks of forecasting 1-, 2-, and 3-months ahead. The reader can find additional details on our fine-tuning and optimization processes in the Supplementary Information.
Panel (A) illustrates our three forecasting tasks of forecasting 1-, 2-, and 3-months ahead. Panels (B) to (F) show the models’ performances within each country in two ways. On the left of each panel, we present a heatmap where each row represents a model, and each column encodes the number of locations within the country of interest where a model achieved a certain rank in terms of percent absolute error (PAE) compared to the other individual component and ensemble models (1st through 13th rankings). The models in each heatmap are listed in decreasing order by the sum of the number of locations where each model performed in the first, second, or third ranks. As an example, “Ensemble (Country, EW)” having a value of 7 corresponding to Ranking 1 in Brazil (1-Month Ahead) means that the country-specific ensemble incurred the lowest (best) PAE compared to all other individual and ensemble models in 7 out of the 27 provinces of Brazil. On the right of each panel, we have a geographical map where each province is colored according to the model that incurred the lowest PAE in that province, with the legend displayed at the bottom of the overall figure. Overall, ensemble models were robust within any of our tested countries. They demonstrated the most or nearly the most top-3 rankings compared to other models.
Formally, we define percent absolute error (PAE) on the predicted versus ground-truth reported case counts as the raw mean absolute error divided by the mean number of monthly reported cases observed in a given location during the evaluation period. Mathematically, let y1 … yT be the ground-truth reported case counts, and ŷ1 … ŷT be our predicted case counts. Then, On the right of panels (B) - (F), we plot a map of each country along with the best-performing model for each province. We observed that no particular model consistently performs the best across all locations and prediction tasks within a country. Our findings thus suggest that, in practice, it is not feasible to train, evaluate, and find the best-performing models for every single province in every single country. Instead, using an out-of-the-box ensemble setup would require significantly less computation and optimization while still effectively guaranteeing strong performance. In what follows, we present the results for each country.
Brazil
Fig. 1 (B) shows that, within the 27 provinces of Brazil, the country-specific ensemble achieved the most top 3 rankings compared to any other model, with the overall ensemble having reached the second most top 3 rankings at the 1-month and 2-month horizons. However, at the 3-month horizon, ARGONet and AR outperformed the overall ensemble in the number of top 3 rankings. From the accompanying color-coded geographical maps, we observed that at any forecast horizon for Brazil, the ensemble did not rank first in many locations - the winning model varied widely across sites.
Colombia
Fig. 1 (C) shows our 13 models’ ranking in the 33 provinces of Colombia. While the country- specific and overall ensembles achieved the most top 3 rankings at the 1-month ahead forecast horizon, clustered + regularized VAR achieved the most top 3 rankings, superseding the two ensemble variants, at 2-month ahead. At 3-months ahead, the country-specific ensemble achieved the top 3 rankings, but the overall ensemble was still outperformed by clustered + regularized VAR.
Malaysia
Fig. 1 (D) shows our results for the 15 provinces in Malaysia. While the country-specific ensemble achieved the most top 3 rankings at the 1-month and 3-month ahead horizons, clustered + regularized VAR achieved the most top 3 rankings at the 2-month horizon. Notably, the overall ensemble did not generalize very well in Malaysia, being ranked even below the naive persistence baseline at the 1-month ahead horizon. We note, however, that the two easternmost (and largest) Malaysian provinces both saw the country-specific ensemble performing the best out of all 13 models at the 1-month and 3-month horizons.
Mexico
Fig. 1(E) shows our 13 models’ rankings in the 32 provinces of Mexico. Regularized VAR achieved the top 3 rankings at the 1-month ahead forecast horizon, followed by the country- specific ensemble. At the 2-month ahead horizon, the two ensembles achieved the top 3 rankings, though at 3-months ahead. In contrast, the overall ensemble maintained the top 3 rankings, regularized VAR overtook the country-specific ensemble for the second most top 3 rankings. From the geographical maps, we see that no individual model achieved the top rank across most of locations.
Thailand
In Fig. 1(F), we show the results in the 77 provinces of Thailand. From the heatmaps, we observed that the country-specific ensemble achieved the most top 3 rankings at the 1-month and 2-month horizons. By comparison, regularized VAR superseded both ensembles for the most top 3 rankings at the 3-month horizon. Notably, at the 2-month horizon, regularized VAR achieved more first-place rankings than both ensembles. At the 1-month horizon, regularized VAR achieved the same number of first-place rankings as the overall ensemble and significantly more first-place rankings than the country-specific ensemble. The geographical maps show that no particular model performed the best across most Thailand provinces.
3.2 Forecasting performance of ensemble models built from standard non-optimized individual components
It is not always feasible to fine-tune the hyperparameters of each individual component model into their most optimized, highest-performing variants, as we did in this study. In many situations, the lack of high-quality epidemiological data and/or computational resources may impose significant challenges, likely forcing users to apply standard, off-the-shelf models and systems without significant fine-tuning. In this section, we demonstrate that even in such a situation, ensembling is a very effective and robust solution for short-term forecasting.
We present the performances of 11 standard component models and two ensemble variants on forecasting dengue in our tested province-level locations. The forecasting tasks, evaluation ranges and metrics, data sources, and training processes are identical to those of the results that we present in the main manuscript for the optimized models. We also refer the reader to our Methods section and the Additional Methods Details in our Supplementary Information for specific details on our standard individual component models, intended to replicate off-the- shelf deployment.
Our key findings from this section not only corroborate but also, in fact, enhance our reported findings in the main manuscript. First, we find that there does not exist one standard model that consistently outperforms all others in all locations, reinforcing our corresponding finding in the main manuscript with optimized models. Second, and most importantly, even though our standard individual component models are overall markedly inferior to the naive persistence baseline model, combining such weak individual component models together produces ensemble models that consistently and significantly outperform the naive persistence baseline. Indeed, even when given weak individual learners, our ensemble methods are robust and generalizable forecasting tools. Fig. 11 summarizes our standard individual and ensemble models’ comparative performances within each of the five tested countries, across our three prediction tasks of forecasting 1-, 2-, and 3-months ahead. The individual models presented here are deployed with standard off-the-shelf settings and are not fully-optimized.
The heatmaps on Fig. 11 represent the distribution of rankings for each model in each country, with models listed in decreasing order by the number of top 3 rankings accrued. We observe that in all combinations of country and forecast horizon presented, an ensemble model achieved the most top 3 rankings. This corroborates our finding in the main manuscript that the ensembles, while not always the top 1 ranked model, are generally very high-performing and robust.
From the geographical color-coded maps encoding the top 1 model in each province, we observe that no one model — standard individual nor ensemble — consistently outperformed the rest of the models. This result mirrors our finding in the main manuscript that no individual model performs the best across all locations consistently. Finally, the heatmaps and geographical maps emphasize that our standard individual component models are indeed very weak learners, being mostly outperformed by the naive persistence model.
Fig. 12 shows the forecasting performances of our standard individual and ensemble models across all 187 tested locations. Panel (A) emphasizes the primary advantage of ensemble models over their individual component models: while the individual component models fluctuate wildly in underpredicting and overpredicting, the ensemble model is much more invariant to such fluctuations and much more closely matches the ground truth.
From panel (B), we observe that for all three forecast horizons, both ensemble models achieved the most top 3 rankings compared to any other model, including the naive persistence baseline, which is ranked higher than all other standard individual component models. Indeed, one main advantage of the ensemble models is that they can take in the naive persistence models’ inputs as a component model, absorbing the robustness of the naive persistence model in times when the more complex data-driven models fail to perform well.
Panel (C) corroborates the main message in panel (B): the presence of only a few sparse patches of grey indicates that the ensemble models performed in the top 3 rankings for almost all tested locations. It should be mentioned that there are more yellow patches on this grid of maps than the corresponding grid presented in the main manuscript (see Fig. 2) — which is to be expected, as the standard individual models are consistently weaker than their optimized counterparts.
Fig. 13 displays the percent absolute error distributions for all of our standard individual component and ensemble models across all 187 tested locations. The ensemble models incurred the best mean percent absolute error across all locations compared to all other models. If compared to the optimized model results in the main manuscript, the improvements from individual component models to ensemble models were much larger when working with standard models. Finally, while all of the individual component models, at any forecast horizon, incurred significantly worse errors than the naive persistence model, the resultant ensemble models — taking as input these very same weak learners — were generally able to outperform the naive persistence baseline.
Fig. 14 provides more granular, country-specific summaries of the standard individual and component models’ percent absolute error distributions within each country and forecast horizon. Corroborating our findings in the main manuscript, an ensemble model achieved the lowest mean percent absolute error in 12 out of 15 tested combinations of forecast horizon and country. The only exceptions were Malaysia at 1- and 2-months ahead, and Mexico at 1-month ahead, where the ensembles were still unable to outperform the naive persistence model in terms of mean percent absolute error. Nonetheless, even when the naive persistence model ranks higher, the error distributions of the two ensembles are visually very similar to that of the naive persistence.
Compared to Fig. 4, we observe that the differences in variance between the two ensembles and the standard individual component models, as measured through the spread of their ridgeline distributions, were much more pronounced. This corroborates our continual finding that ensembles yield relatively larger performance improvements when provided with weaker learners as input.
3.3 Additional Methods Details
3.4 Optimized Models’ PAE by Country
3.4.1 1-Month Ahead
3.4.2 2-Month Ahead
3.4.3 3-Month Ahead
3.5 PAE of Optimized Models versus Naive Persistence Baseline
3.6 Standard Models’ PAE by Country
3.6.1 1-Month Ahead
3.6.2 2-Month Ahead
3.6.3 3-Month Ahead
3.7 PAE of Standard Models versus Naive Persistence Baseline
Data Availability
All data produced in the present study are available upon reasonable request to the authors
References
- [1].↵
- [2].↵
- [3].↵
- [4].↵
- [5].↵
- [6].↵
- [7].↵
- [8].↵
- [9].↵
- [10].↵
- [11].↵
- [12].↵
- [13].↵
- [14].↵
- [15].↵
- [16].↵
- [17].↵
- [18].↵
- [19].↵
- [20].↵
- [21].↵
- [22].↵
- [23].↵
- [24].↵
- [25].↵
- [26].↵
- [27].↵
- [28].↵
- [29].↵
- [30].↵
- [31].↵
- [32].↵
- [33].↵
- [34].↵
- [35].↵
- [36].↵
- [37].↵
- [38].↵
- [39].↵
- [40].↵
- [41].↵
- [42].↵
- [43].↵
- [44].↵
- [45].↵
- [46].↵
- [47].↵
- [48].↵
- [49].↵
- [50].↵
- [51].↵
- [52].↵
- [53].↵
- [54].↵
- [55].↵
- [56].↵
- [57].↵
- [58].↵
- [59].↵
- [60].↵
- [61].↵
- [62].↵
- [63].↵
- [64].↵
- [65].↵