Abstract
Combining predictions from multiple models into an ensemble is a widely used practice across many fields with demonstrated performance benefits. The R package hubEnsembles provides a flexible framework for ensembling various types of predictions, including point estimates and probabilistic predictions. A range of common methods for generating ensembles are supported, including weighted averages, quantile averages, and linear pools. The hubEnsembles package fits within a broader framework of open-source software and data tools called the “hubverse”, which facilitates the development and management of collaborative modelling exercises.
1 Introduction
Predictions of future outcomes are essential to planning and decision making, yet generating reliable predictions of the future is challenging. One method for overcoming this challenge is combining predictions across multiple, independent models. These combination methods (also called aggregation or ensembling) have been repeatedly shown to produce predictions that are more accurate (Clemen 1989; Timmermann 2006) and more consistent (Hibon and Evgeniou 2005) than individual models. Because of the clear performance benefits, multimodel ensembles are commonplace across fields, including weather (Alley, Emanuel, and Zhang 2019), climate (Tebaldi and Knutti 2007), and economics (Aastveit et al. 2018). More recently, multi-model ensembles have been used to improve predictions of infectious disease outbreaks (Viboud et al. 2018; Johansson et al. 2019; McGowan et al. 2019; Reich et al. 2019; Cramer et al. 2022).
In the rapidly growing field of outbreak forecasting, there are many proposed methods for generating ensembles. Generally, these methods differ in at least one of two ways: (1) the function used to combine or “average” predictions, and (2) how predictions are weighted when performing the combination. No one method is universally “the best”; a simple average of predictions works surprisingly well across a range of settings (McGowan et al. 2019; Paireau et al. 2022; Ray et al. 2023) for established theoretical reasons (Winkler 2015). However, more complex approaches have also been shown to have benefits in some settings (Yamana, Kandula, and Shaman 2016; Ray and Reich 2018; Reich et al. 2019; Colón-González et al. 2021). Here, we present the hubEnsembles package, which provides a flexible framework for generating ensemble predictions from multiple models. Complementing other software for combining predictions from multiple models (e.g., Pedregosa et al. (2011); Weiss, Raviv, and Roetzer (2019); Bosse et al. (2023); Couch and Kuhn (2023)), hubEnsembles supports multiple types of predictions, including point estimates and different kinds of probabilistic predictions. Throughout, we will use the term “prediction” to refer to any kind of model output that may be combined including a forecast, a scenario projection, or a parameter estimate.
The hubEnsembles package is part of the “hubverse” collection of open-source software and data tools. The hubverse project facilitates the development and management of collaborative modelling exercises (Consortium of Infectious Disease Modeling Hubs 2024). The broader hub-verse initiative is motivated by the demonstrated benefits of collaborative hubs (Reich et al. 2022; Borchering et al. 2023), including performance benefits of multi-model ensembles and the desire for standardization across such hubs. In this paper, we focus specifically on the functionality encompassed in hubEnsembles. We provide an overview of the methods implemented, including mathematical definitions and properties (Section 2) as well as implementation details (Section 3), a basic demonstration of functionality with simple examples (Section 4), and a more in-depth analysis using real influenza forecasts (Section 5) that motivates a discussion and comparison of the various methods (Section 6).
2 Mathematical definitions and properties of ensemble methods
The hubEnsembles package supports both point predictions and probabilistic predictions of different formats. A point prediction gives a single estimate of a future outcome while a probabilistic prediction provides an estimated probability distribution over a set of future outcomes. We use N to denote the total number of individual predictions that the ensemble will combine. For example, these predictions will often be produced by different statistical or mathematical models, and N is the total number of models that have provided predictions. Individual predictions will be indexed by the subscript i. Optionally, the package allows for calculating ensembles that use a weight wifor each prediction; we define the set of model-specific weights as w = {wi|i ∈ 1, …, N}. Informally, predictions with a larger weight have a greater influence on the ensemble prediction, though the details of this depend on the ensemble method (described further below).
For a set of N point predictions, p = {pi|i ∈ 1, …, N}, each from a distinct model i, the hubEnsembles package can compute an ensemble of these predictions using any function C and any set of model-specific weights w. For example, an arithmetic average of predictions yields , where the weights are non-negative and sum to 1. If wi = 1/N for all i, all predictions will be equally weighted. This framework can also support more complex functions for aggregation, such as a (weighted) median or geometric mean.
For probabilistic predictions, there are two commonly used classes of methods to average or ensemble multiple predictions: quantile averaging (also called a Vincent average (Vincent 1912)) and probability averaging (also called a distributional mixture or linear opinion pool (Stone 1961)) (Lichtendahl, Grushka-Cockayne, and Winkler 2013). To define these two classes of methods, let F (x) be a cumulative density function (CDF) defined over values x of the target variable for the prediction, and F −1(θ) be the corresponding quantile function defined over quantile levels θ ∈ [0, 1]. Throughout this article, we may refer to x as either a ‘value of the target variable’ or a ‘quantile’ depending on the context, and similarly we may refer to θ as either a ‘quantile level’ or a ‘(cumulative) probability’. Additionally, we will use f(x) to denote a probability mass function (PMF) for a prediction of a discrete variable or a discretization (such as binned values) of a continuous variable.
The quantile average combines a set of quantile functions, , with a given set of weights, w, as This computes the average value of predictions across different models for each fixed quantile level θ. For a normal distribution or any distribution with a shape and scale parameter, the resulting quantile average will be the same distribution, and the shape and scale parameters will be the average of the shape and scale parameters from the individual distributions (Figure 1, panel B). In other words, this method inteprets the predictive probability distributions that are being combined as uncertain estimates of a single true distribution. It is also possible to use other combination functions, such as a weighted median, to combine quantile predictions.
The probability average or linear pool is calculated by averaging probabilities across predictions for a fixed value of the target variable, x. In other words, for a set ℱ = {Fi(x)|i ∈ 1, …, N} containing the values of CDFs at the point x and weights w, the linear pool is calculated as For a set of PMF values, {fi(x)|i ∈ 1, …, N}, the linear pool can be equivalently calculated: . Statistically this amounts to a mixture of the probability distributions, and the resulting probability distribution can be interpreted as one where the constituent probability distributions represent alternative predictions of the future, each of which has a probability wi of being the true one. For a visual depiction of these equations, see Figure 1 below.
The different averaging methods for probabilistic predictions yield different properties of the resulting ensemble distribution. For example, the variance of the linear pool is where μi is the mean and is the variance of individual prediction i, and although there is no closed-form variance for the quantile average, the variance of the quantile average will always be less than or equal to that of the linear pool (Lichtendahl, Grushka-Cockayne, and Winkler 2013). Both methods generate distributions with the same mean, , which is the mean of individual model means (Lichtendahl, Grushka-Cockayne, and Winkler 2013). The linear pool method preserves variation between individual models, whereas the quantile average cancels away this variation under the assumption it constitutes sampling error (Howerton et al. 2023).
3 Model implementation details
To understand how these methods are implemented in hubEnsembles, we first must define the conventions employed by the hubverse and its packages for representing and working with model predictions. We begin with a short overview of concepts and conventions needed to utilize the hubEnsembles package, supplemented by example predictions provided by the hubverse, then explain the implementation of the two ensembling functions included in the package, simple_ensemble() and linear_pool().
3.1 Hubverse terminology and conventions
A central concept in the hubverse effort is “model output”. Model output is a specially formatted tabular representation of predictions. Each row represents a single, unique prediction with each column providing information about what is being predicted, its scope, and its value. Per hubverse convention, each column serves one of three purposes: (i) denote which model has produced the prediction (called the “model ID”), (ii) provide details about what is being predicted (called the “task IDs”), or (iii) specify the prediction itself and how it is represented (called the “model output representation”) (Consortium of Infectious Disease Modeling Hubs 2024).
Predictions are assumed to be generated by distinct models, typically developed and run by a modeling team of one or more individuals. Each model should have a unique identifier that is stored in the model_id column. Then, the details of the outcome being predicted can be stored in a series of task ID columns, the second type of column. These task ID columns may also include additional information, such as any conditions or assumptions that were used to generate the predictions (Consortium of Infectious Disease Modeling Hubs 2024). For example, short-term forecasts of incident influenza hospitalizations in the US at different locations and amounts of time in the future might represent this information using a target column with the value “wk inc flu hosp”, a location column identifying the location being predicted, a reference_date column with the “starting point” of the forecasts (not shown), and a horizon column with the number of steps ahead that the forecast is predicting relative to the reference_date (Table 1). All these variables make up the task ID columns.
Alternatively, longer-term scenario projections may require different task ID columns. For example, projections of incident COVID-19 cases in the US at different locations, amounts of time in the future, and under different assumed conditions may use a target column of “inc case”, a location column specifying the location being predicted, an origin_date column specifying the date on which the projections were made (not shown), a horizon column describing the number of steps ahead that the projection is predicting relative to the origin_date, and a scenario_id column denoting the future conditions that were modeled and are projected to result in the specified number of incident cases (Table 2). Different modeling efforts may use different sets of task ID columns and values to specify their prediction goals, or may simply choose distinct names to represent the same concept (e.g., reference_date versus origin_date for a date task ID). Additional examples of task ID variables are available on the hubverse documentation website (Consortium of Infectious Disease Modeling Hubs 2024).
The third group of columns in model output specify the model predictions and details about how the predictions are represented. This “model output representation” includes the pre-dicted values along with metadata that specifies how the predictions are conveyed and always consists of the same three columns: (1) output_type, (2) output_type_id, and (3) value. The output_type column defines how the prediction is represented and may be one of "mean" or "median" for point predictions, or "quantile", "cdf", "pmf", or "sample" for probabilistic predictions (although the sample output type is not yet directly supported by the hubEnsembles package and should be converted to another output type before computing an ensemble). The output_type_id provides additional identifying information for a prediction and is specific to the particular output_type (see Table 3). For quantile predictions, the output_type_id is a numeric value between 0 and 1 specifying the cumulative probability associated with the quantile prediction. In the notation we defined above, the output_type_id corresponds to θ and the value is the quantile prediction F −1(θ). For CDF or PMF predictions, the output_type_id is the target variable value x at which the cumulative distribution function or probability mass function for the predictive distribution should be evaluated, and the value column contains the predicted F (x) or f(x), respectively. Requirements for the values of the output_type_id and value columns associated with each valid output type are summarized in Table 3.
This representation of predictive model output is codified by the model_out_tbl S3 class in the hubUtils, package, one of the foundational hubverse packages. Although this S3 class is required for all hubEnsembles functions, model predictions in other formats can easily be transformed using the as_model_out_tbl() function from hubUtils. An example of this transformation is provided in Section 5.
3.2 Ensemble functions in hubEnsembles
The hubEnsembles package includes two functions that perform ensemble calculations: simple_ensemble(), which applies some function to each model prediction, and linear_pool(), which computes an ensemble using the linear opinion pool method. In the following sections, we outline the implementation details for each function and how these implementations correspond to the statistical ensembling methods described in Section 2. A short description of the calculation performed by each function is summarized by output type in Table 4.
3.2.1 Simple ensemble
The simple_ensemble() function directly computes an ensemble from component model out-puts by combining them via some function (C) within each unique combination of task ID variables, output types, and output type IDs. This function can be used to summarize predictions of output types mean, median, quantile, CDF, and PMF. The mechanics of the ensemble calculations are the same for each of the output types, though the resulting statistical ensembling method differs for different output types (Table 4).
By default, simple_ensemble() uses the mean for the aggregation function C and equal weights for all models. For point predictions with a mean or median output type, the resulting ensemble prediction is an equally weighted average of the individual models’ predictions. For probabilistic predictions in a quantile format, by default simple_ensemble() calculates an equally weighted average of individual model target variable values at each quantile level, which is equivalent to a quantile average. For model outputs in a CDF or PMF format, by default simple_ensemble() computes an equally weighted average of individual model (cumulative or bin) probabilities at each target variable value, which is equivalent to the linear pool method.
Any aggregation function C may be specified by the user. For example, a median ensemble may also be created by specifying median as the aggregation function, or a custom function may be passed to the agg_fun argument to create other ensemble types. Similarly, model weights can be specified to create a weighted ensemble.
3.2.2 Linear pool
The linear_pool() function implements the linear opinion pool (LOP) method for ensembling predictions. Currently, this function can be used to combine predictions with output types mean, quantile, CDF, and PMF. Unlike simple_ensemble(), this function handles its computation differently based on the output type. For the CDF, PMF, and mean output types, the linear pool method is equivalent to calling simple_ensemble() with a mean aggregation function (see Table 4), since simple_ensemble() produces a linear pool prediction (an average of individual model cumulative or bin probabilities).
However, implementation of LOP is less straightforward for the quantile output type. This is because LOP averages cumulative probabilities at each value of the target variable, but the predictions are quantiles (on the scale of the target variable) for fixed quantile levels. The value for these quantile predictions will generally differ between models, and as a result we are typically not provided cumulative probabilities at the same values of the target variable for all component predictions. This lack of alignment between cumulative probabilities for the same target variable values impedes computation of LOP from quantile predictions and is illustrated in panel A of Figure 1.
Given that LOP cannot be directly calculated from quantile predictions, we must first obtain an estimate of the CDF for each component distribution from the provided quantiles, combine the CDFs, then calculate the quantiles using the ensemble’s CDF. We perform this calculation in three main steps, assisted by the distfromq package (Ray and Gerding 2024) for the first two:
Interpolate and extrapolate from the provided quantiles for each component model to obtain an estimate of the CDF of that particular distribution.
Draw samples from each component model distribution. To reduce Monte Carlo variability, we use quasi-random samples corresponding to quantiles of the estimated distribution (Niederreiter 1992).
Pool the samples from all component models and extract the desired quantiles.
For step 1, functionality in the distfromq package uses a monotonic cubic spline for interpolation on the interior of the provided quantiles. The user may choose one of several distributions to perform extrapolation of the CDF tails. These include normal, lognormal, and cauchy distributions, with “normal” set as the default. A location-scale parameterization is used, with separate location and scale parameters chosen in the lower and upper tails so as to match the two most extreme quantiles. The sampling process described in steps 2 and 3 approximates the linear pool calculation described in Section 2.
4 Basic demonstration of functionality
In this section, we provide a simple example to illustrate the two main functions in hubEnsembles, simple_ensemble() and linear_pool().
4.1 Example data: a forecast hub
We will use an example hub provided by the hubverse to demonstrate the functionality of the hubEnsembles package (Consortium of Infectious Disease Modeling Hubs 2024). This example hub was generated with modified forecasts from the FluSight forecasting challenge, which will be discussed in detail in Section 5. The example hub includes both example model output data and target data (sometimes known as “truth” data), which are stored in the hubExamples package as data objects named forecast_outputs and forecast_target_ts. Note that the toy model outputs contain predictions for only a small subset rows of select dates, locations, and output type IDs, far fewer than an actual modeling hub would typically collect.
The model output data includes quantile, mean and median forecasts of future incident influenza hospitalizations and PMF forecasts of hospitalization intensity. Each forecast is made for five task ID variables, including the location for which the forecast was made (location), the date on which the forecast was made (reference_date), the number of steps ahead (horizon), the date of the forecast prediction (a combination of the date the forecast was made and the forecast horizon, target_end_date), and the forecast target (target). Table 5 provides an example set of quantile forecasts included in this example model output. In Table 5, we show only the median, the 50%, and 90% prediction intervals, although other intervals and mean forecasts are included in the example model output data.
We also have corresponding target data included in the hubExamples package (Table 6). The example target data provide observed incident influenza hospitalizations (observation) in a given week (date) and for a given location (location). This target data could be used as calibration data for generating forecasts or for evaluating the forecasts post hoc. The forecast-specific task ID variables reference_date and horizon are not relevant for the target data.
We can plot these forecasts and the target data using the plot_step_ahead_model_output() function from hubVis, another package for visualizing model outputs from the hubverse suite (Figure 2). We subset the model output data and the target data to the location and time horizons we are interested in.
Next, we examine the PMF target in the example model output data. For this target, teams forecasted the probability that hospitalization intensity will be “low”, “moderate”, “high”, or “very high”. These hospitalization intensity categories are determined by thresholds for weekly hospital admissions per 100,000 population. In other words, “low” hospitalization intensity in a given week means few incident influenza hospitalizations per 100,000 population are predicted, whereas “very high” hospitalization intensity means many hospitalizations per 100,000 population are predicted. These forecasts are made for the same task ID variables as the quantile forecasts of incident hospitalizations, other than the target, which is “wk flu hosp rate category” for these categorical predictions.
We show a representative example of the hospitalization intensity category forecasts in Table 7. Because these forecasts are PMF output type, the output_type_id column specifies the bin of hospitalization intensity and the value column provides the forecasted probability of hospitalization incidence being in that category. Values sum to 1 across bins. For the MOBS-GLEAM_FLUH and PSI-DICE models, incidence is forecasted to decrease over the horizon (Figure 2), and correspondingly, there is lower probability of “high” and “very high” hospitalization intensity for later horizons (Figure 3).
4.2 Creating ensembles with simple_ensemble
Using the default options for simple_ensemble(), we can generate an equally weighted mean ensemble for each unique combination of values for the task ID variables, the output_type and the output_type_id. Recall that this means different ensemble methods will be used for different output types: for the quantile output type in our example data, the resulting ensemble is a quantile average, while for the PMF output type, the ensemble is a linear pool.
The resulting model output has the same structure as the original model output data (Table 8), with columns for model ID, task ID variables, output type, output type ID, and value. We also use model_id="simple-ensemble-mean" to change the name of this ensemble in the resulting model output; if not specified, the default will be “hub-ensemble”.
4.2.1 Changing the aggregation function
We can change the function that is used to aggregate model outputs. For example, we may want to calculate a median of the component models’ submitted values for each quantile. We do so by specifying agg_fun=median.
Custom functions can also be passed into the agg_fun argument. We illustrate this by defining a custom function to compute the ensemble prediction as a geometric mean of the component model predictions. Any custom function to be used must have an argument x for the vector of numeric values to summarize, and if relevant, an argument w of numeric weights.
As expected, the mean, median, and geometric mean each give us slightly different resulting ensembles. The median point estimates, 50% prediction intervals, and 90% prediction intervals in Figure 4 demonstrate this.
4.2.2 Weighting model contributions
We can weight the contributions of each model in the ensemble using the weights argument of simple_ensemble(). This argument takes a data.frame that should include a model_id column containing each unique model ID and a weight column. In the following example, we include the baseline model in the ensemble, but give it less weight than the other forecasts.
4.3 Creating ensembles with linear_pool
We can also generate a linear pool ensemble, or distributional mixture, using the linear_pool() function; this function can be applied to predictions with an output_type of mean, quantile, CDF, or PMF. Our example hub includes median output type, so we exclude it from the calculation.
As described above, for quantile model outputs, the linear_pool function approximates the full probability distribution for each component prediction using the value-quantile pairs provided by that model, and then obtains quasi-random samples from that distributional estimate. The number of samples drawn from the distribution of each component model defaults to 1e4, but this can be changed using the n_samples argument.
In Figure 5, we compare ensemble results generated by simple_ensemble() and linear_pool() for model outputs of output types PMF and quantile. As expected, the results from the two functions are equivalent for the PMF output type: for this output type, the simple_ensemble() method averages the predicted probability of each category across the component models, which is the definition of the linear pool ensemble method. This is not the case for the quantile output type, because the simple_ensemble() is computing a quantile average.
5 Example: in-depth analysis of forecast data
To further demonstrate the utility of the hubEnsembles package and the differences between the two ensembling functions, we examine a more complex example. Unlike the previous section’s basic showcase of functionality, we use this case study to provide a more complete analysis that compares and evaluates ensemble model performance using real forecasts collected by a modeling hub, with an overarching goal of choosing a single best ensembling approach for the application.
Since 2013, the US Centers for Disease Control and Prevention (CDC) has been soliciting forecasts of seasonal influenza from modeling teams through a collaborative challenge called FluSight (CDC 2023). We use a subset of these predictions to create four equally-weighted ensembles with simple_ensemble() and linear_pool() and compare the resulting ensembles’ performance. The ensembling methods chosen for this case study consist of a quantile (arithmetic) mean, a quantile median, a linear pool with normal tails, and a linear pool with log-normal tails. Note that only a select portion of the code is shown in this manuscript for brevity, although all the functions and scripts used to generate the case study results can be found in the associated GitHub repository (https://github.com/hubverse-org/hubEnsemblesManuscript). More specifically, the figures and tables supporting this analysis are generated reproducibly using data from rds files stored in the analysis/data/raw-data directory and scripts in the inst directory of the repository.
5.1 Data and Methods
We begin by querying the component forecasts used to generate the four ensembles from Zoltar (Reich et al. 2021), a repository designed to archive forecasts created by the Reich Lab at UMass Amherst. For this analysis we only consider FluSight predictions in a quantile format from the 2021-2022 and 2022-2023 seasons. These forecasts were stored in two data objects, split by season, called flu_forecasts-zoltar_21-22.rds and flu_forecasts-zoltar_22-23.rds, and a subset is shown below in Table 9.
Forecasts must conform to hubverse standards to be fed into either of the ensembling functions, so we first transform the raw forecasts using the as_model_out_tbl() 1 function from the hubUtils package. Here, we specify the task ID variables forecast_date (when the forecast was made), location, horizon, and target.
Prior to ensemble calculation (shown later in this section), we filter out any predictions (defined by a unique combination of task ID variables) that did not include all 23 quantiles specified by FluSight (θ ∈ {.010, 0.025, .050, .100, …, .900, .950, .990}). The FluSight baseline and median ensemble models generated by the FluSight hub are also excluded from the component forecasts. We chose to remove the baseline to match the composition of models used to create the official FluSight ensemble.
With these inclusion criteria, the final data set of component forecasts consists of predictions from 25 modeling teams and 42 distinct models, 53 forecast dates (one per week), 54 US locations, 4 horizons, 1 target, and 23 quantiles. In the 2021-2022 season, 25 models made predictions for 22 weeks spanning from late January 2022 to late June 2022, and in the 2022-2023 season, there were 31 models making predictions for 31 weeks spanning mid-October 2022 to mid-May 2023. Fourteen of the 42 total models made forecasts for both seasons.
In both seasons, forecasts were made for the same locations (the 50 US states, Washington DC, Puerto Rico, the Virgin Islands, and the US as a whole), horizons (1 to 4 weeks ahead), quantiles (the 23 described above), and target (week ahead incident flu hospitalization). The values for the forecasts are always non-negative. In Table 10, we provide an example of these predictions, showing select quantiles from a single model, forecast date, horizon, and location.
Next, we combine the component model outputs to generate predictions from each ensemble model. We begin by excluding the baseline model from the set of predictions that will be combined. Then, we create one object to store the ensemble results generated from each method we are interested in comparing.
We evaluate the performance of these ensembles using scoring metrics that measure the accuracy and calibration of their forecasts. Here, we choose several common metrics in forecast evaluation, including mean absolute error (MAE), weighted interval score (WIS) (Bracher et al. 2021), 50% prediction interval (PI) coverage, and 95% PI coverage. MAE measures the average absolute error of a set of point forecasts; smaller values of MAE indicate better forecast accuracy. WIS is a generalization of MAE for probabilistic forecasts and is an alternative to other common proper scoring rules which cannot be evaluated directly for quantile forecasts (Bracher et al. 2021). WIS is made up of three component penalties: (1) for over-prediction, (2) for under-prediction, and (3) for the spread of each interval (where an interval is defined by a symmetric set of two quantiles). This metric weights these penalties across all prediction intervals provided. A lower WIS value indicates a more accurate forecast (Bracher et al. 2021). PI coverage provides information about whether a forecast has accurately characterized its uncertainty about future observations. The 50% PI coverage rate measures the proportion of the time that 50% prediction intervals at that nominal level included the observed value; the 95% PI coverage rate is defined similarly. Achieving approximately nominal (50% or 95%) coverage indicates a well-calibrated forecast.
We also use relative versions of WIS and MAE (rWIS and rMAE, respectively) to understand how the ensemble performance compares to that of the FluSight baseline model. These metrics are calculated as where model m is any given model being compared against the baseline. For both of these metrics, a value less than one indicates better performance compared to the baseline while a value greater than one indicates worse performance. By definition, the FluSight baseline itself will always have a value of one for both of these metrics.
Each unique prediction from an ensemble model is scored against target data using the score_forecasts() 2 function from the covidHubUtils package, as a hubverse package for scoring and evaluation has not yet been fully implemented. This function outputs each of the metrics described above. We use median forecasts taken from the 0.5 quantile for the MAE evaluation.
5.2 Performance results across ensembles
The quantile median ensemble has the best overall performance in terms of WIS and MAE (and the relative versions of these metrics), and has coverage rates that were close to the nominal levels (Table 11). The two linear opinion pools have very similar performance to each other. These methods have the second-best performance as measured by WIS and MAE, but they have the highest 50% and 95% coverage rates, with empirical coverage that was well above the nominal coverage rate. The quantile mean performs the worst of the ensembles with the highest MAE, which is substantially different from that of the other ensembles.
Plots of the models’ forecasts can aid our understanding about the origin of these accuracy differences. For example, the linear opinion pools consistently have some of the widest prediction intervals, and consequently the highest coverage rates. The median ensemble, which has the best WIS, balanced interval width with calibration best overall, with narrower intervals than the linear pools that still achieved near-nominal coverage on average across all time points. The quantile mean’s interval widths vary, though it usually has narrower intervals than the linear pools. However, this model’s point forecasts have a larger error margin compared to the other ensembles, especially at longer horizons. This pattern is demonstrated in Figure 6 for the 4-week ahead forecast in California following the 2022-23 season peak on December 5, 2022. Here the quantile mean predicted a continued increase in hospitalizations, at a steeper slope than the other ensemble methods.
Averaging across all time points, the median model can be seen to have the best scores for every metric. It outperforms the mean ensemble by a similar amount for both MAE and WIS, particularly around local times of change (see Figure 7 and Figure 8). The median ensemble also has better coverage rates than the mean ensemble in the tails of the distribution (95% intervals, see Figure 9) and similar coverage in the center (50% intervals). The median model also outperforms the linear pools for most weeks, with the greatest differences in scores being for WIS and coverage rates (Figure 8 and Figure 9). This seems to indicate that the linear pools’ estimates are usually too conservative, with their wide intervals and higher-than-nominal coverage rates being penalized by WIS. However, during the 2022-2023 season there are several localized times when the linear pools showcased better one-week-ahead forecasts than the median ensemble (Figure 8). These localized instances are characterized by similar MAE values (Figure 8) for the two methods and poor median ensemble coverage rates (Figure 9). In these instances, the wide intervals from the linear pools were useful in capturing the eventually-observed hospitalizations, usually during times of rapid change.
In this analysis, all of the ensemble variations outperform the baseline model; yet, different ensembling methods perform best under different circumstances. While the quantile median has the best overall results for WIS, MAE, 50% PI coverage, and 95% PI coverage, other models may perform better from week-to-week for each metric. Around the 2022-2023 season’s peak in early December, the remaining four models (including the baseline) each have instances in which they achieve the lowest WIS, like the linear pool ensembles for the one week ahead horizon over several weeks of this period.
The choice of an appropriate ensemble aggregation method may depend on the forecast target, the goal of forecasting, and the behavior of the individual models contributing to an ensemble. One case may call for prioritizing high coverage rates while another may prioritize accurate point forecasts. The simple_ensemble() and linear_pool() functions and the ability to specify component model weights and an aggregation function for simple_ensemble() allow users to implement a variety of ensemble methods.
6 Summary and discussion
Ensembles of independent models are a powerful tool to generate more accurate and more reliable predictions of future outcomes than a single model alone. Here, we have demonstrated how to utilize hubEnsembles, a simple and flexible framework to combine individual model predictions into an ensemble.
The hubEnsembles package is situated within the larger hubverse collection of open-source software and data tools to support collaborative modeling exercises (Consortium of Infectious Disease Modeling Hubs 2024). Collaborative hubs offer many benefits, including serving as a centralized entity to guide and elicit predictions from multiple independent models (Reich et al. 2022). Given the increasing popularity of multi-model ensembles and collaborative hubs, there is a clear need for generalized data standards and software infrastructure to support these hubs. By addressing this need, the hubverse suite of tools can reduce duplicative efforts across existing hubs, support other communities engaged in collaborative efforts, and enable the adoption of multi-model approaches in new domains.
When using hubEnsembles, it is important to carefully choose an ensemble method that is well suited for the situation. Although there may not be a universal “best” method, matching the properties of a given ensemble method with the features of the component models will likely yield best results (Howerton et al. 2023). Our case study on seasonal influenza forecasts in the US demonstrates this point. The quantile median ensemble performs best overall for a range of metrics, including weighted interval score, mean absolute error, and prediction interval coverage. Yet, the linear pool method, which generates an ensemble with wider prediction intervals, demonstrates performance advantages during periods of rapid change, when outlying component forecasts are likely more important. Notably, all ensemble methods outperform the baseline model. The performance improvements from ensemble models motivate the use of a “hub-based” approach to prediction for infectious diseases and in other fields.
Ongoing development of the hubEnsembles package and the larger suite of hubverse tools will continue to support multi-model predictions in new ways, including for example supporting additional types of predictions, enabling scoring and evaluation of those predictions, and allowing for cloud-based data storage. All such infrastructure will ultimately provide a comprehensive suite of open-source software tools for leveraging the power of collaborative hubs and multi-model ensembles.
Data Availability
All code and data necessary to reproduce this manuscript can be found at https://github.com/hubverse-org/hubEnsemblesManuscript. The related R package can be found at https://github.com/hubverse-org/hubEnsembles.
Consortium of Infectious Disease Modeling Hubs
Consortium of Infectious Disease Modeling Hubs authors include Alvaro J. Castro Rivadeneira (University of Massachusetts Amherst), Lucie Contamin (University of Pittsburgh), Sebastian Funk (London School of Hygiene & Tropical Medicine), Aaron Gerding (University of Massachusetts Amherst), Hugo Gruson (data.org), Harry Hochheiser (University of Pittsburgh), Emily Howerton (The Pennsylvania State University), Melissa Kerr (University of Massachusetts Amherst), Anna Krystalli (R-RSE SMPC), Sara L. Loo (Johns Hopkins University), Evan L. Ray (University of Massachusetts Amherst), Nicholas G. Reich (University of Massachusetts Amherst), Koji Sato (Johns Hopkins University), Li Shandross (University of Massachusetts Amherst), Katharine Sherratt (London School of Hygene and Tropical Medicine), Shaun Truelove (Johns Hopkins University), Martha Zorn (University of Massachusetts Amherst)
Acknowledgements
The authors thank all members of the hubverse community; the broader hubverse software infrastructure made this package possible. L. Shandross, A. Krystalli, N. G. Reich, and E. L. Ray were supported by the National Institutes of General Medical Sciences (R35GM119582) and the US Centers for Disease Control and Prevention (U01IP001122 and NU38FT000008). E. Howerton was supported by the Eberly College of Science Barbara McClintock Science Achievement Graduate Scholarship in Biology at the Pennsylvania State University. L. Contamin and H. Hochheiser were supported by NIGMS grant U24GM132013. The content is solely the responsibility of the authors and does not necessarily represent the official views of NIGMS, the National Institutes of Health, or CDC.